Re: xerces always escapes ampersands

Andy Clark 4 Aug 2003 20:07:41 -0000

Williams, Erskine BGI SF wrote:

I'm finding that xerces always escapes ampersands, even when they are
part of a character reference. For example, if I define a text
element like so: <someText>&#x20AC;</someText> (where "&#x20AC;" is the
hexadecimal reference for the euro "EUR" sign), when xerces writes
this out to a file I invariably get: "<someText>&amp;#x20AC;</someText>".
Xerces is always escaping ampersands into the entity ref "&amp;".


Xerces is not escaping the character incorrectly -- it
is working correctly. Your code is inserting literal
ampersands into the text which MUST be escaped when
output.

Perhaps my confusion arises out of poor understanding of xml, but I should
think that xerces would only escape ampersands that aren't a part of a valid
entity reference, i.e., if an ampersand is immediately followed by a pound
(#) sign, it should leave it alone. Is there a more reliable way to
reference extended ascii characters in xml, so that they will pass through
xerces unmolested?


That is not correct. A character reference (e.g. "&#32;" or
"&#x20;" -- a single space character specified in decimal
and hexadecimal, respectively) is read by the parser and
converted to the equivalent Unicode character (assuming
that it is a legal XML character). By the time your
application sees the text, the reference is gone and only
the character remains.
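
To make that concrete, here is a minimal sketch using the JAXP
parser that ships with the JDK (an assumption on my part; your
code may reach the parser through another API). The class and
method names are made up for the example. It parses an element
containing "&#x20AC;" and checks what the application receives.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefParseDemo {
    // Parses an XML string and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = rootText("<someText>&#x20AC;</someText>");
        // The parser has already replaced the reference with one character.
        System.out.println(text.length());          // 1
        System.out.println(text.equals("\u20AC"));  // true
    }
}
```

The string handed to the application is a single euro character,
not the eight characters of the reference.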

    Content c = new Content();
    c.addPara("&#xA3; &#xA9; &#xAE;");


Here's what's wrong.

You are adding literal text to the content of the document.
You are not adding a string made up of a pound sterling
symbol, a space, a copyright symbol, a space, and a registered
trademark symbol. (In fact, this may be an indicator of
another problem but I'll get to that in a minute.) What you
are adding is literally the string "&#xA3; &#xA9; &#xAE;".

When adding text programmatically to an XML document, the
text is NOT parsed as if it is XML content. It is added
as-is to the content.
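
A minimal sketch of this behavior, assuming the JDK's built-in
DOM and Transformer rather than the classes in your code (the
class and method names here are hypothetical). It adds the
literal text to an element and serializes the result.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class LiteralTextDemo {
    // Builds <para>text</para> programmatically and serializes it to a string.
    static String serializePara(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element para = doc.createElement("para");
            // The text is stored as-is; it is never parsed as XML.
            para.appendChild(doc.createTextNode(text));
            doc.appendChild(para);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializePara("&#xA3; &#xA9; &#xAE;"));
    }
}
```

The serialized element contains "&amp;#xA3;" because the
ampersand in the stored text is an ordinary character and has
to be escaped to keep the output well-formed.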

      Marshaller m = new Marshaller(fw);
      m.setEncoding("iso-8859-1");


This may be another problem.

In your previous code, you were trying to append certain
characters to the content. Even though you're specifying
your output encoding as ISO Latin 1, do not try to append
ISO Latin 1-encoded characters to the text. You may not
get what you want.

When dealing with an XML document programmatically, you
need to work with only Unicode characters. (Note: this
applies only to Unicode-based systems, which covers MOST
but not all XML parsers and tools -- e.g. expat sends byte
sequences which retain the original encoding.)

It just so happens that the ISO Latin 1 codes for the
characters you were appending are the same as their
Unicode code points. But don't confuse this with the
encoding of these characters! The &#xAE; character is
encoded as the byte 0xAE in ISO Latin 1 but would be
encoded differently in a Unicode encoding such as UTF-8.
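
The difference between a character and its encoding can be
checked directly. This is a small sketch using only the JDK;
the class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "C2 AE".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String registered = "\u00AE"; // one Unicode character, U+00AE
        System.out.println(hex(registered.getBytes(StandardCharsets.ISO_8859_1))); // AE
        System.out.println(hex(registered.getBytes(StandardCharsets.UTF_8)));      // C2 AE
    }
}
```

The same single character becomes one byte in ISO Latin 1 and
two bytes in UTF-8, which is why the in-memory text should be
thought of as Unicode characters and not as encoded bytes.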

<?xml version="1.0" encoding="iso-8859-1"?>
<factsheet>
  <content>
    <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
  </content>
</factsheet>


Given your code, this is what I would expect. If you
really want to append the characters in question, you
should do something like the following:

    Content c = new Content();
    // Java Unicode escapes for the pound, copyright, and registered symbols
    c.addPara("\u00A3 \u00A9 \u00AE");

    Document document = DocumentHelper.createDocument();
    Element root = document.addElement("root");
    Element test = root.addElement("test").addText("&#xA3;,&#xAE;");


The same comments apply here as well.
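
As a rough illustration of the fix for a DOM-style API, the
sketch below uses the JDK's built-in DOM and Transformer in
place of dom4j (an assumption, chosen so the example needs no
extra libraries; the class and method names are hypothetical).
It adds the real characters via Unicode escapes, writes the
document as ISO-8859-1, and reads it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RoundTripDemo {
    // Writes <root><test>text</test></root> as ISO-8859-1 encoded bytes.
    static byte[] write(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            Element test = doc.createElement("test");
            test.setTextContent(text);
            root.appendChild(test);
            doc.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the bytes back and returns the text of the <test> element.
    static String read(byte[] xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getElementsByTagName("test").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00A3,\u00AE";
        System.out.println(read(write(original)).equals(original)); // true
    }
}
```

Because the text holds the actual characters and no literal
ampersands, nothing needs to be escaped as "&amp;", and the
characters survive the trip through the chosen output encoding.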

Does that clear things up? Or did I just make it more
confusing?

--
Andy Clark * [EMAIL PROTECTED]

