Re: Illegal characters, can xmlbeans be forgiving?

Dennis Sosnoski Thu, 29 Dec 2005 21:24:30 -0800

The XML recommendation says (4.3.3):

"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."

Fatal errors are supposed to end processing. Since this doesn't seem to be enforced by XMLBeans (or more likely, by the parser), you should report this as an error.
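
To make the difference between "forgiving" and "fatal" concrete, here is a minimal sketch using only the JDK's `java.nio.charset` classes. It is not XMLBeans or parser code, and the class and method names are made up for illustration. It contrasts a lenient decode, which silently substitutes U+FFFD for illegal octet sequences, with a strict decode that reports them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not XMLBeans code.
final class LenientVsStrict {

    // Lenient: malformed input is silently replaced with U+FFFD.
    static String lenientUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: returns false when an illegal octet sequence is found.
    static boolean isLegalUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes for "àáâ" (0xE0 0xE1 0xE2) are not legal UTF-8.
        byte[] latin1 = "\u00e0\u00e1\u00e2".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("strict accepts: " + isLegalUtf8(latin1));
        System.out.println("lenient result contains U+FFFD: "
            + (lenientUtf8(latin1).indexOf('\uFFFD') >= 0));
    }
}
```

A processor that follows the recommendation would behave like the strict variant and stop at the first illegal sequence.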

I think it'd be a much more serious problem if XMLBeans failed to process a document written as UTF-8 or UTF-16 without an encoding declaration, or a document written as ISO-8859-1 with an encoding declaration. You might want to test those variations.


 - Dennis

maarten wrote:

I have noticed that xmlbeans 2.0 doesn't care whether the encoding declaration in the xml document matches the byte-encoding that is actually used.
It seems to be more forgiving than I would like it to be.

For example:

// Needs java.io.ByteArrayInputStream and the schema-generated VapDocument class.
public static void test(String charsetDocument, String charsetBytes) throws Exception {
    System.out.print("doc: " + charsetDocument + ", bytes: " + charsetBytes + " => ");

    // Declare one encoding in the XML declaration...
    String xml =
        "<?xml version=\"1.0\" encoding=\"" + charsetDocument + "\"?>\n" +
        "<vap xmlns=\"http://www.eurid.eu/2005/vap\" >" +
        " <command>\n" +
        " <login>\n" +
        " <id>àáâäãā</id>\n" +
        " <password>àáâäãā</password>\n" +
        " </login> \n" +
        " </command>\n" +
        "</vap>";
    // ...but encode the bytes with a possibly different one.
    byte[] bytes = xml.getBytes(charsetBytes);
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    try {
        VapDocument document = VapDocument.Factory.parse(in);
        if (document.validate()) {
            System.out.println("valid, encoding = " + document.documentProperties().getEncoding());
            return;
        }
    } catch (Exception e) {
        System.out.println(e.getClass().getName());
        return;
    }
}

public static void main(String[] args) throws Exception {
    test("UTF-8", "UTF-8");
    test("UTF-8", "UTF-16");
    test("ISO-8859-1", "UTF-8");
    test("ISO-8859-1", "UTF-16");
    test("anything", "ISO-8859-1");
    test("anything", "UTF-8");
    test("anything", "UTF-16");
}

gives the following output:

doc: UTF-8, bytes: UTF-8 => valid, encoding = UTF-8
doc: UTF-8, bytes: UTF-16 => valid, encoding = UTF-8
doc: ISO-8859-1, bytes: UTF-8 => valid, encoding = ISO-8859-1
doc: ISO-8859-1, bytes: UTF-16 => valid, encoding = ISO-8859-1
doc: anything, bytes: ISO-8859-1 => java.io.UnsupportedEncodingException
doc: anything, bytes: UTF-8 => java.io.UnsupportedEncodingException
doc: anything, bytes: UTF-16 => valid, encoding = anything


Anything I can do about this?

Maarten


Dennis Sosnoski wrote:

Do your XML documents specify the encoding in the XML declaration? If not, there's no way to distinguish between UTF-8 and ISO-8859-X without the multiple parses - and the multiple parse approach doesn't even come close to guaranteeing that you've ended up with the correct encoding (since the different flavors of ISO-8859-X reuse the same byte values for different characters). If the documents *do* give the encoding in the XML declaration, XMLBeans should be reading it and interpreting the document correctly.
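
For checking whether a document gives its encoding in the XML declaration, something like the following could be used before handing the bytes to a parser. This is only a sketch under stated assumptions: the class name and the 200-byte window are invented here, and it only handles ASCII-compatible encodings (a UTF-16 document will simply yield no result). It is not how XMLBeans or any particular parser does its detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes an ASCII-compatible encoding and no byte order mark.
final class DeclaredEncoding {

    // Matches the encoding pseudo-attribute of an XML declaration at the very start.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    static Optional<String> of(byte[] document) {
        // ISO-8859-1 maps every byte to one char, so decoding the head never fails.
        int n = Math.min(document.length, 200);
        String head = new String(document, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] withDecl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        byte[] withoutDecl = "<a/>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(of(withDecl));     // declared name, if any
        System.out.println(of(withoutDecl));  // empty: no declaration to read
    }
}
```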
- Dennis

Christophe Bouhier (MC/ECM) wrote:
Hi Lawrence,
I am not sure how to detect the XML charsets, besides just looping through the list of supported encodings and trying to parse successfully. This is not elegant but it worked for me. Thanks for your help.
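
The looping approach could look roughly like the sketch below. It uses a strict JDK decoder instead of a full XML parse, which is an assumption made to keep the example self-contained; the class and method names are hypothetical. The order of the candidates matters: ISO-8859-1 accepts every possible byte sequence, so it always "succeeds" and has to come last.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; a guess at the encoding, not a guarantee.
final class EncodingGuess {

    // Returns the first candidate under which the bytes decode without error.
    static Optional<Charset> firstThatDecodes(byte[] bytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(bytes));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // Not legal in this encoding; try the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Charset> order = Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstThatDecodes(latin1, order));
        System.out.println(firstThatDecodes(utf8, order));
    }
}
```

A successful decode only shows that the bytes are legal in that encoding, not that the encoding is the one the author used.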
Cheers . Christophe
-----Original Message-----
From: Lawrence Jones [mailto:[EMAIL PROTECTED]
Sent: 17 December 2005 0:59
To: [email protected]
Subject: RE: Illegal characters, can xmlbeans be forgiving?
Have a look at the code in:

$XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java

and the code that calls it in
$XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java around line 1760 onwards
EncodingMap.java contains all the supported encodings in the static initializer at line 70.
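
As a quick side check, the JVM itself can be asked whether it knows a given charset name. Note the assumption: this sketch queries the JDK's own `java.nio.charset.Charset` registry, not the XMLBeans `EncodingMap` table, so the two lists need not agree. The class name is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper; asks the JVM, not XMLBeans' EncodingMap.
final class JvmCharsets {

    // True if the running JVM can encode/decode the named charset.
    static boolean isKnown(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that no charset name may have.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "anything"}) {
            System.out.println(name + " => " + isKnown(name));
        }
    }
}
```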
Cheers,

Lawrence
-----Original Message-----
From: Christophe Bouhier (MC/ECM) [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 15, 2005 7:25 PM
To: '[email protected]'
Subject: RE: Illegal characters, can xmlbeans be forgiving?

Thanks! That helps. I checked the API doc for setCharacterEncoding but couldn't find the supported encoding types. In other words, which encodings are allowed in the function setCharacterEncoding("encoding")?
Cheers / Christophe
-----Original Message-----
From: Lawrence Jones [mailto:[EMAIL PROTECTED]
Sent: 16 December 2005 2:11
To: [email protected]
Subject: RE: Illegal characters, can xmlbeans be forgiving?

Hi Christophe

It's very unlikely that the characters are the problem - all Unicode characters are allowed in XML - see e.g. http://www.xml.com/axml/testaxml.htm (section 2.2) and hence in XmlBeans.
What is more likely is that the characters are not encoded (as bytes) in the way XmlBeans expects. By default XmlBeans assumes UTF-8 encoding. Yours are probably ISO8859_1 or some such thing. If you want to play around with character encoding have a look at XmlOptions.setCharacterEncoding().
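
To illustrate how the same bytes turn into different characters depending on the encoding the reader assumes, here is a small JDK-only sketch. The pairing of UTF-8 and windows-1252 is an assumption picked for illustration, not a diagnosis of the feed in question, and the class name is hypothetical. The sketch also assumes the JVM ships the windows-1252 charset, which mainstream JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of bytes written in one encoding and read in another.
final class MojibakeDemo {

    // Encodes text with the actual charset, then decodes it with the assumed one.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+201C (left curly quote) and U+2019 (right single quote), written as
        // escapes so the example does not depend on the source file's encoding.
        String original = "\u201cMike\u2019s";
        String garbled = misread(original, StandardCharsets.UTF_8, cp1252);
        System.out.println(garbled.length() + " chars after misreading "
            + original.length() + " chars");
    }
}
```

Each curly quote is three bytes in UTF-8, and a single-byte reader turns each of those bytes into its own character, which is why one quote becomes three odd-looking characters.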
Cheers,

Lawrence
-----Original Message-----
From: Christophe Bouhier (MC/ECM)
[mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 14, 2005 6:04 PM
To: '[email protected]'
Subject: Illegal characters, can xmlbeans be forgiving?

Hi,

My application parses XML from many different sources. (It's an RSS reader/Podcast receiver.)
Before I switched to XMLBeans I was using an XML parser called nanoXML which didn't mind some illegal characters, especially when wrapped in CDATA.
Now XMLBeans stumbles over the illegal chars below: (â€œ) (throws exception).

....
<description><![CDATA[
Miljenko â€œMikeâ€? Grgich first gained international
recognition at
the celebrated â€œParis Tastingâ€? of 1976. They had
chosen Mikeâ€™s
1973 Chateau Montelena Chardonnay as the finest white wine
in the world.
Today, Mike oversees daily operations at his winery
Grgich Hills.
His aim, year after year, is to improve the quality of their[...]]]></description> ......
Is there any way I can set an option to ignore illegal chars and go on?
For me this could be a deal-breaker. I unfortunately can't expect all XML out on the web to be "nice and tidy".

Thanks for the help!
Cheers / Christophe
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]