Re: xerces always escapes ampersands

Volker Witzel 18 Aug 2003 08:45:19 -0000

Dear Erskine,

I watched this discussion thread with interest, as we had a similar problem. The problem you describe seems to be a browser/encoding problem rather than a Java/Xerces one. I found the following article enlightening: http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
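To illustrate the kind of mismatch I mean, here is a minimal sketch. It rests on an assumption I have not checked against your setup: that the browser posts the form as windows-1252 (where the euro is the single byte 0x80) and the container decodes the bytes as ISO-8859-1. The class name EuroMismatch and the helper redecode are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroMismatch {
    // Hypothetical helper: recover the raw bytes of a string that was
    // wrongly decoded as ISO-8859-1 and decode them with the charset
    // that was (by assumption) really used by the sender.
    public static String redecode(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // What the browser is assumed to send for the euro sign.
        byte[] posted = "\u20AC".getBytes(cp1252);

        // What the server sees if it decodes those bytes as ISO-8859-1:
        // the control character U+0080, not the euro sign U+20AC.
        String seen = new String(posted, StandardCharsets.ISO_8859_1);
        System.out.println(Integer.toHexString(seen.charAt(0)));

        // Decoding the same bytes with the assumed real charset
        // gives the euro sign back.
        System.out.println(redecode(seen, cp1252).equals("\u20AC"));
    }
}
```

If that is what happens in your case, the value you see as '\x0080' is simply the euro byte decoded with the wrong charset, which would explain the '?' once it is written out.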

We changed our forms to use multipart/form-data as the content type and use the com.oreilly.servlet (COS) API's MultipartRequest ( http://servlets.com/cos/javadoc/com/oreilly/servlet/MultipartRequest.html ) to extract the relevant information. We are using "simple" servlets and a custom framework, but I don't believe it would interfere with Struts.

Hope that helps,
Volker.

Williams, Erskine BGI SF wrote:

Andy, Thanks very much - that clarifies things considerably.

My problem is now coming into more focus, so forgive me if the discussion starts to veer away from strictly xerces. I'm dealing with a servlet application (it's a struts application, to be specific), and I'm finding that when I post a character string to struts, the euro symbol is being encoded using extended ascii in hex notation: '\x0080'. When I try to write this string to xml, I get the dreaded '?' symbol.

What I need is a way to make sure the request parameters get encoded using
Unicode and thus can be written into xml correctly according to your
discussion below. Do you have any ideas on how to control the encoding of
the post data? I've tried putting "<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8"/>" into the head of my web form, but
still no dice.

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, August 04, 2003 1:07 PM
To: [EMAIL PROTECTED]
Subject: Re: xerces always escapes ampersands


Williams, Erskine BGI SF wrote:

I'm finding that xerces is always escaping ampersands, even when they are
part of a character reference. For example, if I want to define a text
element like so: <someText>&#x20AC;</someText>, (where "&#x20AC;" is the
hexadecimal entity reference for the euro "EUR" sign) when xerces writes
this out to a file, I invariably get: "<someText>&amp;#x20AC;</someText>"
Xerces is always escaping ampersands into the entity ref "&amp;"

Xerces is not escaping the character incorrectly -- it
is working correctly. Your code is inserting literal
ampersands into the text which MUST be escaped when
output.

Perhaps my confusion arises out of poor understanding of xml, but I should
think that xerces would only escape ampersands that aren't a part of a valid
entity reference, i.e., if an ampersand is immediately followed by a pound
(#) sign, it should leave it alone. Is there a more reliable way to
reference extended ascii characters in xml, so that they will pass through
xerces unmolested?

That is not correct. A character entity (e.g. "&#32;" or
"&#x20;" -- a single space character specified in decimal
and hexadecimal, respectively), is read by the parser and
converted to the equivalent Unicode character (assuming
that it is a legal XML character).
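
As a rough illustration of that conversion, here is a small
sketch using the JAXP DOM parser that ships with the JDK
(not necessarily the parser setup you have). The class name
CharRefParse and the helper textOf are made up for the
example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefParse {
    // Hypothetical helper: parse an XML string and return the text
    // content of its root element.
    public static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The character reference is resolved while parsing, so the
        // application sees one character, U+20AC, not eight.
        String text = textOf("<someText>&#x20AC;</someText>");
        System.out.println(text.length());
        System.out.println(text.equals("\u20AC"));
    }
}
```

The reference only exists in the serialized document; once
parsed, the application just holds the Unicode character.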

   Content c = new Content();
   c.addPara("&#xA3; &#xA9; &#xAE;");

Here's what's wrong.

You are adding literal text to the content of the document.
You are not adding the string of an English pound symbol,
a space, a copyright symbol, a space, and a registered
trademark symbol. (In fact, this may be an indicator of
another problem but I'll get to that in a minute.) What you
are adding is literally the string "&#xA3; &#xA9; &#xAE;".

When adding text programmatically to an XML document, the
text is NOT parsed as if it is XML content. It is added
as-is to the content.
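
To make the difference concrete, here is a minimal sketch
using the plain JAXP DOM and Transformer classes from the
JDK rather than the libraries in your code. The class name
LiteralText and the helper serialize are made up for the
example, and the exact serializer output may differ between
implementations.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class LiteralText {
    // Hypothetical helper: put the given text into a <para> element
    // and serialize the resulting document to a string.
    public static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element para = doc.createElement("para");
        // The text is stored as-is; it is never parsed as XML.
        para.appendChild(doc.createTextNode(text));
        doc.appendChild(para);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Six literal characters; the ampersand must be escaped on output.
        System.out.println(serialize("&#xA3;"));
        // One Unicode character (the pound sign); no ampersand involved.
        System.out.println(serialize("\u00A3"));
    }
}
```

The first call writes the ampersand as "&amp;" because it is
ordinary text; the second writes the pound sign itself.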

     Marshaller m = new Marshaller(fw);
     m.setEncoding("iso-8859-1");

This may be another problem.

In your previous code, you were trying to append certain
characters to the content. Even though you're specifying
your output encoding as ISO Latin 1, do not try to append
ISO Latin 1-encoded characters to the text. You may not
get what you want.

When dealing with an XML document programmatically, you
need to work with only Unicode characters. (Note: this
only applies to Unicode-based systems, which covers MOST
but not all XML parsers and tools -- e.g. expat sends byte
sequences which retain the original encoding.)

It just so happens that the ISO Latin 1 characters for
the characters you were appending match directly to the
Unicode characters. But don't confuse this with the
encoding of these characters! The &#xAE; character is
encoded as the byte 0xAE in ISO Latin 1 but would be
encoded differently in a Unicode encoding such as UTF-8.
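
A quick way to see this for yourself is to look at the
bytes. This is only a sketch using the standard
java.nio.charset classes; the class name EncodingBytes and
the helper hex are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: format bytes as space-separated hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String registered = "\u00AE"; // the registered trademark sign

        // Same character, different byte sequences per encoding.
        System.out.println(hex(registered.getBytes(StandardCharsets.ISO_8859_1))); // AE
        System.out.println(hex(registered.getBytes(StandardCharsets.UTF_8)));      // C2 AE
    }
}
```

The character is the same in both cases; only the bytes
chosen by the output encoding differ.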

<?xml version="1.0" encoding="iso-8859-1"?>
<factsheet>
 <content>
   <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
 </content>
</factsheet>

Given your code, this is what I would expect. If you
really want to append the characters in question, you
should do something like the following:

    Content c = new Content();
    c.addPara("\u00A3 \u00A9 \u00AE");

   Document document = DocumentHelper.createDocument();
   Element root = document.addElement("root");
   Element test = root.addElement("test").addText("&#xA3;,&#xAE;");

The same comments apply here as well.

Does that clear things up? or did I just make it more
confusing?


