xmlcanon

Name

xmlcanon - convert XML documents into canonical form [version 1.2]

Synopsis

xmlcanon [ OPTION ...] [URL...]

Usage

xmlcanon --validate foo.xml

Description

xmlcanon is a command-line utility for transforming XML files into canonical form. It is produced and maintained by ElCel Technology.

For each URL, xmlcanon transforms the contents into canonical form using either the algorithm published in the W3C Canonical XML Version 1.0 recommendation or James Clark's original Canonical XML algorithm.

By default, the canonical results are written to the standard output. The --out option may be used to redirect the output to a file of your choice.

The canonical form of an XML document is a normalization of the input file together with any information required from the DTD, such as attribute defaults and entities. For this reason, xmlcanon can be used to transform XML documents containing DTDs into standalone documents without DTDs. (The James Clark algorithm does use a limited DTD in some circumstances, but this DTD consists only of an internal subset.)

Each filename passed on the command line is treated as a Uniform Resource Locator (URL). If no protocol is present in the URL, it is assumed to refer to a local file. For example, 'c:\test.xml' is treated as equivalent to the URL 'file:///c:\test.xml'. To read from the standard input, specify a URL of '-'.
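The rule above can be sketched in Python. This is an illustration only, not xmlcanon's actual code: the helper name classify_argument and the assumption that a protocol name has at least two characters (so that a Windows drive letter such as 'c:' is not mistaken for one) are hypothetical.

```python
import re

# Assumption: a protocol name has two or more characters, so "c:" is a drive
# letter, while "file:" and "http:" are protocols.
_PROTOCOL = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]+:")


def classify_argument(arg):
    """Classify one command-line argument as 'stdin', 'url' or 'local-file'."""
    if arg == "-":
        return "stdin"          # '-' means read from the standard input
    if _PROTOCOL.match(arg):
        return "url"            # an explicit protocol is present
    return "local-file"         # no protocol: assumed to be a local file
```

With these assumptions, 'c:\test.xml' and 'foo.xml' are classified as local files, while 'file:///c:\test.xml' and 'http://example.com/foo.xml' are classified as URLs.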

xmlcanon can be configured to route HTTP network requests via a proxy server. See Network Access Options for further details.

On Microsoft Windows platforms, to compensate for the lack of shell file name expansion, xmlcanon automatically expands file names containing wild-card characters ('*' and '?') into a list of matching files. This expansion only occurs when the URL(s) look like file names, i.e. they do not contain a protocol such as "file:". Beware that even though xmlcanon will happily process multiple input files, the results will be concatenated into a single (badly formed) output file.
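One way to avoid the concatenation described above is to run xmlcanon once per input file, each with its own --out target. The Python sketch below only builds the command lines and does not run anything. The helper name build_commands and the '.canon.xml' output suffix are illustrative assumptions, not part of xmlcanon.

```python
from pathlib import Path


def build_commands(inputs, out_dir):
    """Return one xmlcanon command line (as an argv list) per input file."""
    commands = []
    for name in inputs:
        src = Path(name)
        # Hypothetical naming scheme: foo.xml -> <out_dir>/foo.canon.xml
        dst = Path(out_dir) / (src.stem + ".canon.xml")
        # Options come before the URL, as in the Synopsis
        commands.append(["xmlcanon", "--out", str(dst), str(src)])
    return commands
```

Each resulting list could then be passed to subprocess.run(cmd, check=True), provided xmlcanon is on the search path.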

Messages produced by xmlcanon can be translated into your native language. This is described in the section titled Native Language Support.

General Options

Single character option names may be concatenated together. POSIX-style option names (those beginning with --) must be specified separately, but may be abbreviated.
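The two conventions can be illustrated with Python's standard getopt module, which follows similar rules. This is an analogy only: it shows how getopt behaves, and it is an assumption that xmlcanon's own parser resolves abbreviations in the same way. Only a few of the flag-style options are declared.

```python
import getopt

# A subset of the flag-style options described in this manual
SHORT_OPTS = "cdlvw"
LONG_OPTS = ["nocomments", "nonsdecl", "newline", "validate", "warnings"]


def parse(argv):
    """Split argv into (options, remaining URLs) using Python's getopt."""
    return getopt.getopt(argv, SHORT_OPTS, LONG_OPTS)


# "-cv" is "-c" and "-v" concatenated; "--warn" abbreviates "--warnings"
options, urls = parse(["-cv", "--warn", "foo.xml"])
```

In getopt, an abbreviation must be a unique prefix: "--warn" is accepted, whereas "--n" is rejected because it matches more than one of the declared long options.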

-c --nocomments

Disable comment processing when using the W3C canonicalization method. This has no effect when using the James Clark method because comments are never included.

-d --nonsdecl

Disable the validation of namespace declaration attributes when using XML namespaces. Namespaces are enabled for the W3C canonicalization method. If this flag is not specified, namespace declarations (e.g. xmlns:foo="xxx") are validated against the DTD. This is the correct behaviour according to the XML 1.0 and XML namespace recommendations. This option only takes effect when the --validate option is specified.

-e --encoding ENCODING

Specify the encoding to use for the output file. The default (and recommended) encoding is UTF-8, but you may prefer the canonical output in an alternative encoding. If you specify an encoding other than UTF-8 or UTF-16, an XML declaration will be written to the canonical output file. A Byte Order Mark (BOM) may also be written if the encoding requires it.

-h --help

Display a brief help page with available options and exit.

-i --interop

Enable tests that check the input XML for interoperability with SGML-based systems.

-l --newline

Write an extra newline (linefeed) at the end of processing each input file. This option may be necessary on some UNIX-based systems that expect text files to have a terminating linefeed.

-m --method METHOD

The canonicalization method to be used. This can be either "W3C" for the W3C Canonical XML 1.0 recommendation or "JClark" for James Clark's Canonical XML method. The James Clark method is used extensively in XML conformance test suites.

-o --out FILE

Direct the canonical results to the specified FILE rather than STDOUT. You may wish to do this when writing UTF-16 encoded files to avoid corruption due to native line-endings on Windows-based systems.

-v --validate

Perform DTD validation of the input file(s). xmlcanon does not perform XML schema validation.

-V --version

Display the version number and exit.

-w --warnings

Enable warning tests. The XML 1.0 recommendation specifies a number of conditions that XML processors may report as warnings but that are not errors. These tests are not performed by default, but can be enabled by specifying this option.

Network Access Options

xmlcanon uses the capabilities of the ElCel Technology library to access files from the Internet. In some organizations access to the Internet is provided via a proxy server, sometimes requiring authentication.

The following options can be used to control how xmlcanon accesses network resources.

--httpproxy SERVER[:PORT]

This option, or the use of the ET_HTTP_PROXY environment variable, causes xmlcanon to use the specified HTTP proxy server to satisfy HTTP network requests. If a port number is not specified then 8080 is used by default.

-p --password PASSWORD

This option, or the use of the ET_HTTP_PASSWORD environment variable, specifies the password to send to origin HTTP servers for authentication.

-P --proxypassword PASSWORD

This option, or the use of the ET_HTTP_PROXY_PASSWORD environment variable, specifies the password to send to the HTTP proxy server for authentication.

-u --user USER

This option, or the use of the ET_HTTP_USER environment variable, specifies the user name to send to origin HTTP servers for authentication.

-U --proxyuser USER

This option, or the use of the ET_HTTP_PROXY_USER environment variable, specifies the user name to send to the HTTP proxy server for authentication.
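As an illustration of the SERVER[:PORT] form accepted by --httpproxy, the following Python sketch splits such a value and applies the documented default port of 8080. The function name parse_proxy is hypothetical, and the sketch does not handle IPv6 literal addresses.

```python
def parse_proxy(value, default_port=8080):
    """Split 'SERVER[:PORT]' into (server, port)."""
    server, sep, port = value.rpartition(":")
    if not sep:
        # No ':' present, so the whole value is the server name
        return (value, default_port)
    return (server, int(port))
```

For example, parse_proxy("proxy.example.com") yields port 8080, while parse_proxy("proxy.example.com:3128") yields port 3128. Both host names are made up for the example.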

XML Catalog Options

The DTD and other external entities referenced within the input file(s) can be resolved using an XML Catalog. This is further described in the section titled Entity Resolution.

-g --catalog CATALOG

This option, or the use of the ET_XMLCAT_CATALOG environment variable, causes xmlcanon to use the specified XML catalog entry file for Entity Resolution.

--nocatalogpis

Disable the processing of <?oasis-xml-catalog?> processing instructions.

--prefer [system|public]

This option, or the ET_XMLCAT_PREFER environment variable, is used to set the application preference for system or public identifiers. The default value is 'public', which means that public catalog entries may be used to resolve external entities even when a system identifier exists for the resource. Note that system catalog entries still take precedence over public catalog entries even when this option is set to 'public'.

Return Code

0 - Success, 1 - Failure

Entity Resolution

xmlcanon can use an XML catalog to look up and resolve public and system identifiers. This important feature is present in SGML systems but was originally absent from most XML tools. The XML catalog file is specified with the --catalog option, the ET_XMLCAT_CATALOG environment variable or the <?oasis-xml-catalog?> processing instruction embedded in the prolog (before the DOCTYPE declaration) of your XML document.

The format and semantics of the XML catalog entries follow the OASIS XML Catalog specification.

Control over whether public or system identifiers are preferred is provided by means of the --prefer option or the ET_XMLCAT_PREFER environment variable.

The OASIS XML Catalog specification describes a powerful set of features which cannot adequately be described here. However, a brief example of a valid catalog entry file is shown below:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 <public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
         uri="docbookx.dtd"/>
 <system systemId="docbookx.dtd" uri="docbookx.dtd"/>
 <delegateSystem systemIdStartString="doc" 
                 catalog="http://www.acme.org/DocBook/catalog"/>
 <delegatePublic publicIdStartString="-//OASIS" 
                 catalog="file:///usr/doc/oasis/catalog.xml"/>
</catalog>

Note how the example makes use of the default namespace. All catalog elements must be within the urn:oasis:names:tc:entity:xmlns:xml:catalog namespace.

If no catalog is specified (either using the --catalog option, the ET_XMLCAT_CATALOG environment variable or the <?oasis-xml-catalog?> processing instruction) or a catalog match fails to occur, then xmlcanon will read external entities by dereferencing the system identifiers. The <?oasis-xml-catalog?> processing instruction can be disabled by specifying the --nocatalogpis option.
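The lookup order described in this section and under --prefer can be sketched as follows. This is a simplified Python model, not xmlcanon's implementation: it covers only direct system and public entries (no delegation or rewriting), and the function name resolve and the dictionary-based catalog are assumptions made for the example.

```python
def resolve(public_id, system_id, system_entries, public_entries, prefer="public"):
    """Return the URI to read for an external entity."""
    # 1. System catalog entries take precedence, whatever 'prefer' is set to.
    if system_id is not None and system_id in system_entries:
        return system_entries[system_id]
    # 2. Public catalog entries are used unless prefer is 'system' and the
    #    document supplied a system identifier.
    if public_id is not None and public_id in public_entries:
        if prefer == "public" or system_id is None:
            return public_entries[public_id]
    # 3. No catalog match: fall back to dereferencing the system identifier.
    return system_id
```

With a public entry mapping "-//OASIS//DTD DocBook XML V4.1.2//EN" to "docbookx.dtd" and a document that also supplies a system identifier, prefer='public' returns "docbookx.dtd", whereas prefer='system' returns the document's own system identifier.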

Native Language Support

By default, xmlcanon produces error messages in the English language but these can be replaced with messages written in a native language of your choice.

The ElCel Technology Native Language Authoring Kit contains message catalogs translated into other languages. You may be lucky and find that messages for your native language have already been translated. More likely, if you wish to use native language messages, you will need to undertake the translation work yourself. The Kit contains pro-forma message catalogs written in English which form the basis for the native language versions. It is not necessary to translate all messages for the exercise to be useful; translating just the common messages is feasible.

Once you have the necessary message catalogs, it is quite straightforward to configure xmlcanon to use them. This is achieved by setting two environment variables: LANG and ET_MSG_DIR.

LANG

This is used by many programs and utilities to determine the locale category for native language, local customs and coded characters. It normally contains a language and region code such as "en_GB", "en_AU" or "fr_FR".

ET_MSG_DIR

This is used to specify the base directory under which the native language message catalogs are located. On UNIX systems this is commonly /usr/share or /usr/share/locale but may be any valid directory name.

When searching for message catalogs, xmlcanon concatenates the environment variables like this: $ET_MSG_DIR/elcel/$LANG/. Within the message directory, the message catalog files have a suffix of .msg and are named in accordance with the library or application to which they refer.

For example, if ET_MSG_DIR=/usr/share and LANG=fr_FR, the message catalog file containing XML validation messages (in the French language) would be /usr/share/elcel/fr_FR/xml.msg. These messages may be shared by other tools built with the same library. The catalog file containing messages specific to xmlcanon would be named /usr/share/elcel/fr_FR/xmlcanon.msg.
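The path construction described above can be written out as a short Python sketch. The function name message_catalog_path is hypothetical, and the sketch assumes that the value of LANG is used verbatim as the directory name.

```python
import posixpath


def message_catalog_path(name, env):
    """Build $ET_MSG_DIR/elcel/$LANG/<name>.msg from an environment mapping."""
    return posixpath.join(env["ET_MSG_DIR"], "elcel", env["LANG"], name + ".msg")


# The values used in the manual's own example
example_env = {"ET_MSG_DIR": "/usr/share", "LANG": "fr_FR"}
xml_catalog = message_catalog_path("xml", example_env)
tool_catalog = message_catalog_path("xmlcanon", example_env)
```

In a real script, os.environ could be passed in place of example_env.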

Feedback

We welcome feedback about our products. If you have a bug report, a suggested enhancement or simply enjoy using xmlcanon, please let us know at support@elcel.com.

xmlcanon version 1.2, March 2003