xmlcanon — convert xml documents into canonical form [version 1.2]
xmlcanon --validate foo.xml
xmlcanon is a command-line utility for transforming
XML files into canonical form.
It is produced and maintained by ElCel Technology.
For each URL, xmlcanon transforms the contents into canonical
form using either the algorithm recommended and published by the W3C
(Canonical XML Version 1.0) or James Clark's original
Canonical XML algorithm.
By default, the canonical results are written to the standard output. The --out
option may be used to redirect the output to a file of your choice.
The canonical form of an XML document is
a normalization of the input file together with any information required
from the DTD, such as attribute defaults and entities. For this reason xmlcanon can be
used to transform XML documents containing DTDs into standalone documents without
DTDs. (The James Clark algorithm does use a limited DTD in some circumstances, but this
DTD consists only of an internal subset.)
Each filename passed on the command line is treated as a Uniform Resource Locator (URL).
If no protocol is present in the URL, it is assumed to refer to a local file.
For example 'c:\test.xml' is treated as being equivalent to the URL 'file:///c:\test.xml'.
To read from the standard input specify a URL of '-'.
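The argument handling described above can be sketched in Python. This is only an illustration, not xmlcanon's actual code: the function name classify_argument and the rule used to detect a protocol (two or more scheme characters before the colon, so that a Windows drive letter such as 'c:' is not mistaken for one) are assumptions.

```python
import re

# Assumption: a "protocol" is a scheme of two or more characters followed by
# a colon, so a one-letter Windows drive prefix such as "c:" does not count.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]+:")


def classify_argument(arg: str) -> str:
    """Return 'stdin', 'url' or 'local-file' for one command-line argument."""
    if arg == "-":
        return "stdin"        # '-' means read from the standard input
    if _SCHEME.match(arg):
        return "url"          # an explicit protocol is present
    return "local-file"       # no protocol: assumed to be a local file


if __name__ == "__main__":
    for example in ["-", r"c:\test.xml", r"file:///c:\test.xml", "foo.xml"]:
        print(example, "->", classify_argument(example))
```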
xmlcanon can be configured to route HTTP network requests
via a proxy server. See Network Access Options for further details.
On Microsoft Windows platforms, to compensate for the lack of shell file name expansion, xmlcanon
automatically expands file names containing wild-card characters ('*' and '?') into a list of matching files.
This expansion only occurs when the URL(s) look like file names, i.e. they do not contain a
protocol such as "file:".
Beware that even though xmlcanon
will happily process multiple input files, the results will be concatenated into a single (badly formed) output file.
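As a rough sketch of the wild-card behaviour described above, the following Python function expands only those arguments that look like file names. It is not taken from xmlcanon: the name expand_arguments, the protocol test, and the decision to pass an unmatched pattern through unchanged are all assumptions.

```python
import glob
import re

# Assumption: same protocol test as a scheme of two or more characters.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]+:")


def expand_arguments(args):
    """Expand '*' and '?' in arguments that do not contain a protocol."""
    expanded = []
    for arg in args:
        looks_like_file = not _SCHEME.match(arg)
        if looks_like_file and any(ch in arg for ch in "*?"):
            matches = sorted(glob.glob(arg))
            # Assumption: a pattern with no matches is kept as typed.
            expanded.extend(matches or [arg])
        else:
            # URLs with a protocol, '-' and plain names are left alone.
            expanded.append(arg)
    return expanded
```

Because every matching file becomes a separate input, the concatenation warning above applies to the expanded list as well.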
Messages produced by xmlcanon can be translated into your native language.
This is described in the section titled Native Language Support.
Single character option names may be concatenated together. POSIX-style option
names (those beginning with --) must be specified separately, but may be abbreviated.
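The same two conventions exist in Python's standard argparse module, which makes them easy to try out. The sketch below defines three of the flags documented in this page on a stand-in parser; it illustrates the convention only and is not how xmlcanon itself parses its command line. The program name xmlcanon-sketch and the function parse are hypothetical.

```python
import argparse

# A stand-in parser with three flags from this page (not xmlcanon itself).
parser = argparse.ArgumentParser(prog="xmlcanon-sketch")
parser.add_argument("-c", "--nocomments", action="store_true")
parser.add_argument("-l", "--newline", action="store_true")
parser.add_argument("-v", "--validate", action="store_true")
parser.add_argument("urls", nargs="*")


def parse(argv):
    """Parse an argument list and return the resulting namespace."""
    return parser.parse_args(argv)


if __name__ == "__main__":
    # "-cl" concatenates -c and -l; "--val" abbreviates --validate.
    print(parse(["-cl", "--val", "foo.xml"]))
```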
- -c --nocomments
Disable comment processing when using the W3C canonicalization method.
This has no effect when using the James Clark method because comments
are never included.
- -d --nonsdecl
Disable the validation of namespace declaration attributes when using
XML namespaces. Namespaces are enabled for the W3C canonicalization
method. If this flag is not specified, namespace declarations (e.g. xmlns:foo="xxx") are validated
against the DTD. This is the correct behaviour according to the XML 1.0
and XML namespace recommendations.
This option only takes effect when the --validate option is specified.
- -e --encoding ENCODING
Instruct xmlcanon which encoding to use for the
output file. The default (and recommended) encoding is UTF-8. However
it is possible that you would like the canonical output to use an
alternative encoding. If you specify an encoding other than UTF-8/UTF-16
an XML declaration will be written to the canonical output file. It is
possible that a Byte Order Mark (BOM) will also be written if the encoding
- -h --help
Display a brief help page with available options and exit.
- -i --interop
Enable tests that check the input XML for interoperability with SGML-based systems.
- -l --newline
Write an extra newline (linefeed) at the end of processing each input file.
This option may be necessary on some UNIX-based systems that expect
text files to have a terminating linefeed.
- -m --method METHOD
The canonicalization method to be used. This can be
either "W3C" for the W3C Canonical XML 1.0 recommendation or "JClark"
for James Clark's Canonical XML method. The James Clark method
is used extensively in XML conformance test suites.
- -o --out FILE
Direct the canonical results to the specified FILE rather than STDOUT.
You may wish to do this when writing
UTF-16 encoded files to avoid corruption due to native line-endings on
- -v --validate
Perform DTD validation of the input file(s).
xmlcanon does not perform XML schema validation.
- -V --version
Display the version number and exit.
- -w --warnings
Enable warning tests. The XML 1.0 recommendation specifies a
number of conditions that XML processors may report as warnings but that
are not errors. These tests are not performed by default, but can be
enabled by specifying this option.
Network Access Options
xmlcanon uses the capabilities of the ElCel Technology library
to access files from the Internet. In some organizations access to the Internet
is provided via a proxy server, sometimes requiring authentication.
The following options can be used to control how xmlcanon
accesses network resources.
- --httpproxy SERVER[:PORT]
This option, or the use of the ET_HTTP_PROXY environment variable,
causes xmlcanon to use the specified HTTP proxy server
to satisfy HTTP network requests. If a port number is not specified then 8080
is used by default.
- -p --password PASSWORD
This option, or the use of the ET_HTTP_PASSWORD environment variable,
specifies the password to send to origin HTTP servers for authentication.
- -P --proxypassword PASSWORD
This option, or the use of the ET_HTTP_PROXY_PASSWORD environment variable,
specifies the password to send to the HTTP proxy server for authentication.
- -u --user USER
This option, or the use of the ET_HTTP_USER environment variable,
specifies the user name to send to origin HTTP servers for authentication.
- -U --proxyuser USER
This option, or the use of the ET_HTTP_PROXY_USER environment variable,
specifies the user name to send to the HTTP proxy server for authentication.
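The SERVER[:PORT] form and its default port can be sketched as follows. The function names are hypothetical, and the rule that the command-line option wins over ET_HTTP_PROXY when both are present is an assumption; this page does not state which takes precedence.

```python
def parse_proxy(value, default_port=8080):
    """Split 'SERVER[:PORT]' into (server, port), defaulting the port to 8080."""
    server, sep, port = value.partition(":")
    if sep:
        return (server, int(port))
    return (server, default_port)


def effective_proxy(option_value, environ):
    """Pick the proxy from --httpproxy or, failing that, ET_HTTP_PROXY.

    Assumption: the option is consulted before the environment variable.
    """
    value = option_value or environ.get("ET_HTTP_PROXY")
    return parse_proxy(value) if value else None
```

For example, parse_proxy("proxy.example.com") yields the server name paired with port 8080, while an explicit ":3128" suffix replaces the default.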
XML Catalog Options
The DTD and other external entities referenced within the input file(s) can be
resolved using an XML Catalog. This is described further later in this document.
- -g --catalog CATALOG
This option, or the use of the ET_XMLCAT_CATALOG environment variable,
causes xmlcanon to use the specified XML catalog entry file
for Entity Resolution.
- --nocatalogpis
Disable the processing of <?oasis-xml-catalog?> processing instructions.
- --prefer [system|public]
This option, or the ET_XMLCAT_PREFER environment variable,
is used to set the application
preference for system or public identifiers. The default value is 'public',
which means that public catalog entries may be used to resolve
external entities even when a system identifier exists for the resource. Note
that system catalog entries still take precedence over public catalog entries
even when this option is set to 'public'.
Exit status:
0 - Success
1 - Failure
xmlcanon can use an XML catalog to look up and
resolve public and system identifiers. This important feature
is present in SGML systems but was originally absent from most
XML tools. The XML catalog file is specified with the
--catalog option, the
ET_XMLCAT_CATALOG environment variable or the
<?oasis-xml-catalog?> processing instruction embedded
in the prolog (before the DOCTYPE declaration) of your XML document.
The format and semantics of the XML catalog entries follow the
OASIS XML Catalog specification.
Control over whether public or system identifiers are preferred is provided by
means of the --prefer option or the
ET_XMLCAT_PREFER environment variable.
The OASIS XML Catalog specification
describes a powerful set of features that cannot adequately be described here. However,
a brief example of a valid catalog entry file is shown below:
<public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
<system systemId="docbookx.dtd" uri="docbookx.dtd"/>
Note how the example makes use of the default namespace. All catalog elements
must be within the urn:oasis:names:tc:entity:xmlns:xml:catalog namespace.
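The effect of the namespace requirement can be seen by reading a catalog with Python's standard xml.etree.ElementTree module. This is a sketch for illustration, not part of xmlcanon: the catalog text is a hypothetical complete file, its uri values are assumptions, and load_entries is an invented helper that handles only public and system entries.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalog file; the uri values are illustrative assumptions.
CATALOG_XML = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN" uri="docbookx.dtd"/>
  <system systemId="docbookx.dtd" uri="docbookx.dtd"/>
</catalog>
"""


def load_entries(xml_text):
    """Return (public, system) dicts mapping identifiers to URIs."""
    root = ET.fromstring(xml_text)
    public, system = {}, {}
    for elem in root.iter():
        # Only elements in the catalog namespace are recognised.
        if elem.tag == f"{{{CATALOG_NS}}}public":
            public[elem.get("publicId")] = elem.get("uri")
        elif elem.tag == f"{{{CATALOG_NS}}}system":
            system[elem.get("systemId")] = elem.get("uri")
    return public, system
```

If the xmlns attribute is left off the root element, the same entries are no longer in the catalog namespace and this helper finds nothing.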
If no catalog is specified (either using the --catalog option, the
ET_XMLCAT_CATALOG environment variable or the <?oasis-xml-catalog?>
processing instruction) or a catalog match fails to occur, then xmlcanon will read external entities by dereferencing
the system identifiers. The <?oasis-xml-catalog?> processing instruction
can be disabled by specifying the --nocatalogpis option.
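The lookup order described in this section, together with the --prefer setting, can be summarised in a short Python sketch. It follows the general OASIS XML Catalog rules for public and system entries only; the function name resolve and the use of plain dictionaries for the catalog are assumptions, and xmlcanon's own implementation may differ in detail.

```python
def resolve(public_id, system_id, public_map, system_map, prefer="public"):
    """Return the URI to read for an external entity.

    public_map and system_map are hypothetical dicts built from a catalog.
    """
    # System entries are tried first, whatever the preference is.
    if system_id is not None and system_id in system_map:
        return system_map[system_id]
    # Public entries are ignored when prefer is 'system' and a system
    # identifier was supplied; otherwise they may be used.
    if public_id is not None and public_id in public_map:
        if prefer == "public" or system_id is None:
            return public_map[public_id]
    # No catalog match: fall back to dereferencing the system identifier.
    return system_id
```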
Native Language Support
By default, xmlcanon produces error messages in the
English language but these can be replaced with messages written in a
native language of your choice.
The ElCel Technology Native Language Authoring Kit
contains message catalogs translated into
other languages. You may be lucky and find that messages for your native language
have already been translated. More likely, if you wish to use native language messages
you will need to undertake the translation work yourself. The Kit contains
pro-forma message catalogs written in English which form the basis for the
native language versions. It is not necessary to translate all messages for the
exercise to be useful; translating just the common messages is feasible.
Once you have the necessary message catalogs, it is quite straightforward to configure
xmlcanon to use them. This is achieved by setting
two environment variables: LANG and ET_MSG_DIR.
- LANG
This is used by many programs and utilities to determine the locale category for
native language, local customs and coded characters. It normally contains
a language and region code such as "en_GB", "en_AU" or "fr_FR".
- ET_MSG_DIR
This is used to specify the base directory under which the
native language message catalogs are located. On UNIX systems
this is commonly /usr/share or /usr/share/locale but may be
any valid directory name.
When searching for message catalogs, xmlcanon concatenates
the environment variables like this: $ET_MSG_DIR/elcel/$LANG/.
Within the message directory, the message catalog files have a suffix of
.msg and are named in accordance with
the library or application to which they refer.
For example, if ET_MSG_DIR=/usr/share
and LANG=fr_FR the message catalog file
containing XML validation messages (in the French language) would be
/usr/share/elcel/fr_FR/xml.msg. These messages may be shared
by other tools built on the same library. The catalog file containing messages
specific to xmlcanon would be named after the application in the same way.
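The search path described above, $ET_MSG_DIR/elcel/$LANG/ followed by a name and the .msg suffix, can be written out as a small Python helper. The function name message_catalog_path is hypothetical, and the sketch assumes both environment variables are set.

```python
import posixpath


def message_catalog_path(name, environ):
    """Build $ET_MSG_DIR/elcel/$LANG/<name>.msg from an environment mapping."""
    return posixpath.join(
        environ["ET_MSG_DIR"],  # base directory for message catalogs
        "elcel",
        environ["LANG"],        # language and region code, e.g. fr_FR
        name + ".msg",
    )


if __name__ == "__main__":
    env = {"ET_MSG_DIR": "/usr/share", "LANG": "fr_FR"}
    print(message_catalog_path("xml", env))
```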
We welcome feedback about our products. If you have a bug report or a suggested
enhancement, or simply enjoy using xmlcanon, please
let us know: email@example.com.
xmlcanon version 1.2, March 2003