Dear Erskine,
I watched this discussion thread with interest, as we had a similar problem. The problem you describe seems to be a browser/encoding problem rather than a Java/Xerces one. I found the following article enlightening: http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
We changed our forms to use multipart/form-data as the content type, and we use the MultipartRequest class from the O'Reilly servlet library, com.oreilly.servlet (http://servlets.com/cos/javadoc/com/oreilly/servlet/MultipartRequest.html), to extract the relevant information. We are using "simple" servlets and a custom framework, but I don't believe it interferes with Struts.
Hope that helps, Volker.
Williams, Erskine BGI SF wrote:
Andy, Thanks very much - that clarifies things considerably.
My problem is now coming into more focus, so forgive me if the discussion
starts to veer away from strictly xerces. I'm dealing with a servlet
application (it's a struts application, to be specific), and I'm finding
that when I post a character string to struts, the euro symbol is being
encoded using extended ascii in hex notation: '\x0080'. When I try to write
this string to xml, I get the dreaded '?' symbol.
What I need is a way to make sure the request parameters get encoded using Unicode and thus can be written into xml correctly according to your discussion below. Do you have any ideas on how to control the encoding of the post data? I've tried putting "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>" into the head of my web form, but still no dice.
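One possible reading of the '\x0080' value, offered as an assumption rather than a diagnosis: the browser sent the euro sign as the single windows-1252 byte 0x80, and the server decoded that byte as ISO-8859-1, which yields the control character U+0080 instead of U+20AC. The sketch below uses only the JDK; the class and method names are hypothetical, and it does not touch the servlet or Struts APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Struts, Xerces, or the servlet API.
final class EuroDecodingSketch {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /**
     * Re-interprets a string whose bytes were (by assumption) decoded as
     * ISO-8859-1 although the sender meant windows-1252.
     */
    static String redecode(String misdecoded) {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes, so this
        // recovers the bytes that were originally posted.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, WINDOWS_1252);
    }

    public static void main(String[] args) {
        byte[] posted = { (byte) 0x80 }; // euro sign in windows-1252

        String asLatin1 = new String(posted, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(posted, WINDOWS_1252);

        System.out.printf("decoded as ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("decoded as windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
        System.out.printf("after redecode():        U+%04X%n", (int) redecode(asLatin1).charAt(0));
    }
}
```

If the assumption holds, the first line reports U+0080 and the other two report U+20AC. Re-decoding after the fact is only a workaround; getting the form and the server to agree on one charset is the more robust fix.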
-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, August 04, 2003 1:07 PM
To: [EMAIL PROTECTED]
Subject: Re: xerces always escapes ampersands
Williams, Erskine BGI SF wrote:
I'm finding that xerces is always escaping ampersands, even when they are a part of a character reference. For example, if I want to define a text element like so: <someText>&#x20AC;</someText>, (where "&#x20AC;" is the hexadecimal entity reference for the euro "EUR" sign) when xerces writes this out to a file, I invariably get: "<someText>&amp;#x20AC;</someText>" Xerces is always escaping ampersands into the entity ref "&amp;"
Xerces is not escaping the character incorrectly -- it is working correctly. Your code is inserting literal ampersands into the text which MUST be escaped when output.
Perhaps my confusion arises out of poor understanding of xml, but I should think that xerces would only escape ampersands that aren't a part of a valid entity reference, i.e., if an ampersand is immediately followed by a pound (#) sign, it should leave it alone. Is there a more reliable way to reference extended ascii characters in xml, so that they will pass through xerces unmolested?
That is not correct. A character entity (e.g. "&#32;" or "&#x20;" -- a single space character specified in decimal and hexadecimal, respectively), is read by the parser and converted to the equivalent Unicode character (assuming that it is a legal XML character).
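To illustrate the point about parsing, here is a minimal sketch that uses the JAXP DOM parser bundled with the JDK (which may or may not be the same Xerces build under discussion). The class and method names are hypothetical. It parses a small document containing a character reference and returns the text the application actually sees.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
final class CharRefParseSketch {
    /** Parses the XML string and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = textOf("<someText>&#x20AC;</someText>");
        // The parser has already replaced the reference with one character.
        System.out.println("length = " + text.length());
        System.out.println("is U+20AC: " + text.equals("\u20AC"));
    }
}
```

The character reference exists only in the serialized document; once parsed, the DOM holds the single character U+20AC.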
Content c = new Content();
c.addPara("&#xA3; &#xA9; &#xAE;");
Here's what's wrong.
You are adding literal text to the content of the document. You are not adding the string of an English pound symbol, a space, a copyright symbol, a space, and a registered trademark symbol. (In fact, this may be an indicator of another problem but I'll get to that in a minute.) What you are adding is literally the string "&#xA3; &#xA9; &#xAE;".
When adding text programmatically to an XML document, the text is NOT parsed as if it is XML content. It is added as-is to the content.
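A minimal sketch of this behaviour, using the DOM and Transformer APIs that ship with the JDK rather than the Content and Marshaller classes from the quoted code (the class and method names here are hypothetical). It adds a string to a text node as-is and serializes the result.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class for illustration only.
final class LiteralTextSketch {
    /** Builds <para>text</para> programmatically and returns the serialized XML. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element para = doc.createElement("para");
            // The string is stored verbatim; it is not parsed as XML markup.
            para.appendChild(doc.createTextNode(text));
            doc.appendChild(para);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        // Literal ampersand, '#', 'x', 'A', '3', ';' as six separate characters.
        System.out.println(serialize("&#xA3;"));
        // One character, U+00A3, written with a Java Unicode escape.
        System.out.println(serialize("\u00A3"));
    }
}
```

In the first call the serializer has to escape the literal ampersand, so the output contains "&amp;#xA3;". In the second call the text node holds the pound sign itself, so there is no ampersand to escape.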
Marshaller m = new Marshaller(fw);
m.setEncoding("iso-8859-1");
This may be another problem.
In your previous code, you were trying to append certain characters to the content. Even though you're specifying your output encoding as ISO Latin 1, do not try to append ISO Latin 1-encoded characters to the text. You may not get what you want.
When dealing with an XML document programmatically, you need to work with only Unicode characters. (Note: this only applies to Unicode-based systems, which covers MOST but not all XML parsers and tools -- e.g. expat sends byte sequences that retain the original encoding.)
It just so happens that the ISO Latin 1 code values for the characters you were appending are the same as their Unicode code points. But don't confuse this with the encoding of these characters! The ® character is encoded as the single byte 0xAE in ISO Latin 1 but would be encoded differently in a Unicode encoding such as UTF-8 (where it is the two bytes 0xC2 0xAE).
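The difference between a character and its encoded bytes can be checked with a few lines of plain Java. This is only an illustrative sketch with hypothetical class and method names; it encodes the same one-character string in two charsets and prints the resulting bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class RegisteredSignBytes {
    /** Encodes the string in the given charset and formats the bytes as hex. */
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String registered = "\u00AE"; // the registered trademark sign
        System.out.println("ISO-8859-1: " + hex(registered, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + hex(registered, StandardCharsets.UTF_8));
    }
}
```

The string is the same Unicode character in both cases; only the byte sequence chosen at output time differs (AE versus C2 AE).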
<?xml version="1.0" encoding="iso-8859-1"?>
<factsheet>
  <content>
    <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
  </content>
</factsheet>
Given your code, this is what I would expect. If you really want to append the characters in question, you should do something like the following:
Content c = new Content();
c.addPara("\u00A3 \u00A9 \u00AE");
Document document = DocumentHelper.createDocument();
Element root = document.addElement("root");
Element test = root.addElement("test").addText("£,®");
The same comments apply here as well.
Does that clear things up? Or did I just make it more confusing?
