RE: xerces always escapes ampersands

Williams, Erskine BGI SF 15 Aug 2003 22:05:28 -0000

Andy, 
Thanks very much - that clarifies things considerably. 

My problem is coming into sharper focus now, so forgive me if the
discussion veers away from Xerces proper. I'm working on a servlet
application (a Struts application, to be specific), and when I post a
character string to Struts, the euro symbol arrives as the extended-ASCII
value '\x0080' rather than as the Unicode character. When I try to write
this string to XML, I get the dreaded '?' symbol.

What I need is a way to make sure the request parameters are decoded as
Unicode, so that they can be written into XML correctly, per your
discussion below. Do you have any ideas on how to control the encoding of
the post data? I've tried putting

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

into the head of my web form, but still no dice.
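
For what it's worth, here is a minimal sketch of one way the symptom above
can arise. It rests on an assumption I have not verified: that the browser
submits the euro as the single Windows-1252 byte 0x80 and the container
decodes the form data as ISO-8859-1. The class and method names are made
up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EuroMisdecodeDemo {
    // Decodes a single byte using the given charset.
    static String decode(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs);
    }

    public static void main(String[] args) {
        // Assumption: the browser encoded the form data as Windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");

        // Decoded with the charset it was written in, 0x80 is the euro sign.
        System.out.println(decode(0x80, cp1252).equals("\u20AC"));                      // true

        // Decoded as ISO-8859-1, the same byte becomes the control character U+0080.
        System.out.println(decode(0x80, StandardCharsets.ISO_8859_1).equals("\u0080")); // true

        // Encoding a real euro sign into ISO-8859-1 has no mapping, so Java
        // substitutes the replacement byte '?'.
        byte[] latin1 = "\u20AC".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) latin1[0]);                                           // ?
    }
}
```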

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, August 04, 2003 1:07 PM
To: [EMAIL PROTECTED]
Subject: Re: xerces always escapes ampersands

Williams, Erskine BGI SF wrote:
> I'm finding that xerces is always escaping ampersands, even when they are a
> part of a character reference. For example, if I want to define a text
> element like so: <someText>&#x20AC;</someText>, (where "&#x20AC;" is the
> hexadecimal entity reference for the euro "EUR" sign) when xerces writes
> this out to a file, I invariably get: "<someText>&amp;#x20AC;</someText>"
> Xerces is always escaping ampersands into the entity ref "&amp;"

Xerces is not escaping the character incorrectly -- it
is working correctly. Your code is inserting literal
ampersands into the text which MUST be escaped when
output.

> Perhaps my confusion arises out of poor understanding of xml, but I should
> think that xerces would only escape ampersands that aren't a part of a valid
> entity reference, i.e., if an ampersand is immediately followed by a pound
> (#) sign, it should leave it alone. Is there a more reliable way to
> reference extended ascii characters in xml, so that they will pass through
> xerces unmolested?

That is not correct. A character reference (e.g. "&#32;" or
"&#x20;" -- a single space character specified in decimal
and hexadecimal, respectively) is read by the parser and
converted to the equivalent Unicode character (assuming
that it is a legal XML character).
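
To make that concrete, here is a small sketch using the JAXP DOM parser
that ships with the JDK (an assumption on my part; your setup may call
Xerces directly). The class and method names are invented for the
example. It parses a document containing a character reference and shows
that the application only ever sees the resulting character.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharRefDemo {
    // Parses a small XML string and returns the text content of the root element.
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = rootText("<someText>&#x20AC;</someText>");
        // The parser has already replaced the reference with the character itself.
        System.out.println(text.length());         // 1
        System.out.println(text.equals("\u20AC")); // true
    }
}
```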

>     Content c = new Content();
>     c.addPara("&#xA3; &#xA9; &#xAE;");

Here's what's wrong.

You are adding literal text to the content of the document.
You are not adding the string of an English pound symbol,
a space, a copyright symbol, a space, and a registered
trademark symbol. (In fact, this may be an indicator of
another problem but I'll get to that in a minute.) What you
are adding is literally the string "&#xA3; &#xA9; &#xAE;".

When adding text programmatically to an XML document, the
text is NOT parsed as if it is XML content. It is added
as-is to the content.
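
A rough illustration, assuming the plain DOM and JAXP Transformer classes
from the JDK rather than the Content and Marshaller classes in your code
(the helper name serializePara is hypothetical). Text handed to
createTextNode is treated as characters, so any ampersand in it gets
escaped when the document is written out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class LiteralTextDemo {
    // Builds <para>text</para> through the DOM API and serializes it to a string.
    static String serializePara(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element para = doc.createElement("para");
        para.appendChild(doc.createTextNode(text)); // added as-is, never parsed as XML
        doc.appendChild(para);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Six literal characters; the serializer escapes the ampersand.
        System.out.println(serializePara("&#xA3;"));
        // One actual pound-sign character; there is no ampersand to escape.
        System.out.println(serializePara("\u00A3"));
    }
}
```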

>       Marshaller m = new Marshaller(fw);
>       m.setEncoding("iso-8859-1");

This may be another problem.

In your previous code, you were trying to append certain
characters to the content. Even though you're specifying
your output encoding as ISO Latin 1, do not try to append
ISO Latin 1-encoded characters to the text. You may not
get what you want.

When dealing with an XML document programmatically, you
need to work with only Unicode characters. (Note: this
only applies to Unicode-based systems, which covers MOST
but not all XML parsers and tools -- e.g. expat passes
along byte sequences that retain the original encoding.)

It just so happens that the ISO Latin 1 characters for
the characters you were appending match directly to the
Unicode characters. But don't confuse this with the
encoding of these characters! The &#xAE; character is
encoded as the byte 0xAE in ISO Latin 1 but would be
encoded differently in a Unicode encoding such as UTF-8.
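
A quick sketch of the difference in plain Java (the class name and the hex
helper are just for illustration): the same character, U+00AE, yields
different bytes depending on the encoding chosen at output time.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "C2 AE".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String registered = "\u00AE"; // one Unicode character in memory
        System.out.println(hex(registered.getBytes(StandardCharsets.ISO_8859_1))); // AE
        System.out.println(hex(registered.getBytes(StandardCharsets.UTF_8)));      // C2 AE
    }
}
```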

> <?xml version="1.0" encoding="iso-8859-1"?>
> <factsheet>
>   <content>
>     <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
>   </content>
> </factsheet>

Given your code, this is what I would expect. If you
really want to append the characters in question, you
should do something like the following:

    Content c = new Content();
    c.addPara("\u00A3 \u00A9 \u00AE");

>     Document document = DocumentHelper.createDocument();
>     Element root = document.addElement("root");
>     Element test = root.addElement("test").addText("&#xA3;,&#xAE;");

The same comments apply here as well.

Does that clear things up? Or did I just make it more
confusing?

-- 
Andy Clark * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

