Re: Illegal characters, can xmlbeans be forgiving?
I have noticed that xmlbeans 2.0 doesn't care whether the encoding declaration in the XML document matches the byte encoding that is actually used. It seems to be more forgiving than I would like it to be. For example:

    // needs: import java.io.ByteArrayInputStream;
    // VapDocument is the XMLBeans class generated from the vap schema.

    public static void test(String charsetDocument, String charsetBytes) throws Exception {
        System.out.print("doc: " + charsetDocument + ", bytes: " + charsetBytes + " = ");
        // Declare one encoding in the document ...
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetDocument + "\"?>\n" +
            "<vap xmlns=\"http://www.eurid.eu/2005/vap\">" +
            "<command>\n" +
            "<login>\n" +
            "<id>àáâäãā</id>\n" +
            "<password>àáâäãā</password>\n" +
            "</login> \n" +
            "</command>\n" +
            "</vap>";
        // ... and serialize the bytes with a possibly different one.
        byte[] bytes = xml.getBytes(charsetBytes);
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        try {
            VapDocument document = VapDocument.Factory.parse(in);
            if (document.validate()) {
                System.out.println("valid, encoding = " + document.documentProperties().getEncoding());
                return;
            }
        } catch (Exception e) {
            System.out.println(e.getClass().getName());
            return;
        }
    }

    public static void main(String[] args) throws Exception {
        test("UTF-8", "UTF-8");
        test("UTF-8", "UTF-16");
        test("ISO-8859-1", "UTF-8");
        test("ISO-8859-1", "UTF-16");
        test("anything", "ISO-8859-1");
        test("anything", "UTF-8");
        test("anything", "UTF-16");
    }

gives the following output:

    doc: UTF-8, bytes: UTF-8 = valid, encoding = UTF-8
    doc: UTF-8, bytes: UTF-16 = valid, encoding = UTF-8
    doc: ISO-8859-1, bytes: UTF-8 = valid, encoding = ISO-8859-1
    doc: ISO-8859-1, bytes: UTF-16 = valid, encoding = ISO-8859-1
    doc: anything, bytes: ISO-8859-1 = java.io.UnsupportedEncodingException
    doc: anything, bytes: UTF-8 = java.io.UnsupportedEncodingException
    doc: anything, bytes: UTF-16 = valid, encoding = anything

Anything I can do about this?

Maarten

Dennis Sosnoski wrote:

Do your XML documents specify the encoding in the XML declaration? If not, there's no way to distinguish between UTF-8 and ISO-8859-X without the multiple parses - and the multiple parse approach doesn't even come close to guaranteeing that you've ended up with the correct encoding (since the different flavors of ISO-8859-X reuse the same byte values for different characters). If the documents *do* give the encoding in the XML declaration, XMLBeans should be reading it and interpreting the document correctly.

- Dennis

Christophe Bouhier (MC/ECM) wrote:

Hi Lawrence,

I am not sure how to detect the XML charsets, besides just looping through the list of supported encodings and trying to parse successfully. This is not elegant but it worked for me. Thanks for your help.

Cheers, Christophe

-----Original Message-----
From: Lawrence Jones [mailto:[EMAIL PROTECTED]]
Sent: 17 December 2005 0:59
To: user@xmlbeans.apache.org
Subject: RE: Illegal characters, can xmlbeans be forgiving?

Have a look at the code in:

    $XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java

and the code that calls it in

    $XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java

around line 1760 onwards. EncodingMap.java contains all the supported encodings in the static initializer at line 70.

Cheers, Lawrence

-----Original Message-----
From: Christophe Bouhier (MC/ECM) [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 15, 2005 7:25 PM
To: 'user@xmlbeans.apache.org'
Subject: RE: Illegal characters, can xmlbeans be forgiving?

Thanks! That helps. I checked the API doc for setCharacterEncoding but couldn't find the supported encoding types. In other words, which encodings are allowed in the function setCharacterEncoding(encoding)?

Cheers / Christophe

-----Original Message-----
From: Lawrence Jones [mailto:[EMAIL PROTECTED]]
Sent: 16 December 2005 2:11
To: user@xmlbeans.apache.org
Subject: RE: Illegal characters, can xmlbeans be forgiving?

Hi Christophe

It's very unlikely that the characters are the problem - all Unicode characters are allowed in XML - see e.g. http://www.xml.com/axml/testaxml.htm (section 2.2) and hence in XmlBeans. What is more likely is that the characters are not encoded (as bytes) in the way XmlBeans expects. By default XmlBeans assumes UTF-8 encoding. Yours are probably ISO8859_1 or some such thing. If you want to play around with character encoding have a look at XmlOptions.setCharacterEncoding().

Cheers, Lawrence

-----Original Message-----
From: Christophe Bouhier (MC/ECM) [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 14, 2005 6:04 PM
To: 'user@xmlbeans.apache.org'
Subject: Illegal characters, can xmlbeans be forgiving?

Hi,

My application parses XML from many different sources. (It's an RSS reader/Podcast receiver.) Before I switched to XMLBeans I was using an XML parser called nanoXML which didn't mind some illegal characters, especially when wrapped in CDATA. Now XMLBeans stumbles over the illegal chars below:(“) (Throws exception).

<description><![CDATA[ Miljenko “Mikeâ€? Grgich first gained international recognition at the celebrated “Paris Tastingâ€? of 1976. They had
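Regarding the test output at the top of this message: several rows report a valid parse although the declared encoding and the byte encoding differ, and all of the UTF-16 rows are among them. Java's "UTF-16" encoder writes a byte order mark, so one check that can be made independently of XMLBeans is whether the bytes start with a BOM that contradicts the declaration. Whether XMLBeans itself gives a BOM precedence over the declaration is not confirmed anywhere in this thread; the following is only a sketch using the JDK, and the names BomCheck, sniffBom and bomContradicts are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class BomCheck {
    /** Returns the charset name implied by a leading byte order mark, or null if there is none.
     *  Note: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE by this simple check. */
    static String sniffBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /** True when a BOM is present and disagrees with the declared encoding name. */
    static boolean bomContradicts(byte[] bytes, String declared) {
        String bom = sniffBom(bytes);
        if (bom == null) {
            return false; // no BOM, nothing to compare against
        }
        String d = declared.toUpperCase(Locale.ROOT);
        if (bom.startsWith("UTF-16")) {
            // A plain "UTF-16" declaration is compatible with either byte order.
            return !d.startsWith("UTF-16");
        }
        return !d.equals(bom);
    }

    public static void main(String[] args) {
        byte[] utf16 = "<a/>".getBytes(StandardCharsets.UTF_16);
        System.out.println(sniffBom(utf16));
        System.out.println(bomContradicts(utf16, "UTF-8"));
    }
}
```

A caller could reject or log the document when bomContradicts returns true before handing the stream to the parser. The check says nothing about documents without a BOM.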
Re: Illegal characters, can xmlbeans be forgiving?
Do your XML documents specify the encoding in the XML declaration? If not, there's no way to distinguish between UTF-8 and ISO-8859-X without the multiple parses - and the multiple parse approach doesn't even come close to guaranteeing that you've ended up with the correct encoding (since the different flavors of ISO-8859-X reuse the same byte values for different characters). If the documents *do* give the encoding in the XML declaration, XMLBeans should be reading it and interpreting the document correctly.

- Dennis
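To make the question about the XML declaration concrete, here is a small sketch that pulls the encoding name out of the declaration before any real parsing. It assumes an ASCII-compatible byte encoding with no byte order mark, which is why it decodes the first bytes as ISO-8859-1; it is not how XMLBeans does it, and DeclaredEncoding is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredEncoding {
    // Matches <?xml version="..." encoding="NAME" at the very start of the document.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml\\s+version\\s*=\\s*[\"'][^\"']*[\"']\\s+encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding name, or null when the document does not declare one. */
    static String of(byte[] bytes) {
        // Only the head of the document is needed; ISO-8859-1 maps every byte to one char,
        // so the ASCII-only declaration survives regardless of the real encoding.
        int n = Math.min(bytes.length, 200);
        String head = new String(bytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(of(doc));
    }
}
```

A null result corresponds to the case described above, where the bytes alone cannot reliably tell UTF-8 from the ISO-8859-X family.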
RE: Illegal characters, can xmlbeans be forgiving?
Hi Lawrence,

I am not sure how to detect the XML charsets, besides just looping through the list of supported encodings and trying to parse successfully. This is not elegant but it worked for me. Thanks for your help.

Cheers, Christophe
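The looping approach described in this message can be sketched without XMLBeans by using a strict JDK decoder in place of a full parse. This is an assumption-laden illustration, not Christophe's actual code: CharsetGuess, decodesCleanly and firstDecodable are hypothetical names, and the candidate order is up to the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class CharsetGuess {
    /** True when every byte sequence is legal in the given charset (no silent replacement). */
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the first candidate that decodes without error, or null if none does. */
    static Charset firstDecodable(byte[] bytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            if (decodesCleanly(bytes, cs)) {
                return cs;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two accented letters encoded as ISO-8859-1 are not valid UTF-8.
        byte[] latin1 = "\u00E0\u00E1".getBytes(StandardCharsets.ISO_8859_1);
        List<Charset> order = Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(firstDecodable(latin1, order));
    }
}
```

Strict charsets such as UTF-8 belong first in the list. ISO-8859-1 accepts every byte sequence, so a match there proves nothing about the real encoding.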
RE: Illegal characters, can xmlbeans be forgiving?
Have a look at the code in:

    $XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java

and the code that calls it in

    $XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java

around line 1760 onwards. EncodingMap.java contains all the supported encodings in the static initializer at line 70.

Cheers, Lawrence
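Separately from the list in EncodingMap.java, an encoding name also has to be one the JVM itself knows before bytes can be decoded with it. The sketch below checks only that JVM side, using the JDK's Charset class; it does not reproduce or consult the XMLBeans table, and CharsetNames and knownToJvm are hypothetical names.

```java
import java.nio.charset.Charset;

final class CharsetNames {
    /** True when the running JVM can resolve the name to a charset. */
    static boolean knownToJvm(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for a null name or a syntactically illegal one
            // (IllegalCharsetNameException is a subclass).
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ISO8859_1", "anything"}) {
            System.out.println(name + " -> " + knownToJvm(name));
        }
    }
}
```

Charset.availableCharsets() returns the full map of names the JVM supports, should a complete listing be needed.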
RE: Illegal characters, can xmlbeans be forgiving?
Hi Christophe

It's very unlikely that the characters are the problem - all Unicode characters are allowed in XML - see e.g. http://www.xml.com/axml/testaxml.htm (section 2.2) and hence in XmlBeans. What is more likely is that the characters are not encoded (as bytes) in the way XmlBeans expects. By default XmlBeans assumes UTF-8 encoding. Yours are probably ISO8859_1 or some such thing. If you want to play around with character encoding have a look at XmlOptions.setCharacterEncoding().

Cheers, Lawrence

-----Original Message-----
From: Christophe Bouhier (MC/ECM) [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 14, 2005 6:04 PM
To: 'user@xmlbeans.apache.org'
Subject: Illegal characters, can xmlbeans be forgiving?

Hi,

My application parses XML from many different sources. (It's an RSS reader/Podcast receiver.) Before I switched to XMLBeans I was using an XML parser called nanoXML which didn't mind some illegal characters, especially when wrapped in CDATA. Now XMLBeans stumbles over the illegal chars below:(“) (Throws exception).

<description><![CDATA[ Miljenko “Mike� Grgich first gained international recognition at the celebrated “Paris Tasting� of 1976. They had chosen Mike’s 1973 Chateau Montelena Chardonnay as the finest white wine in the world. Today, Mike oversees daily operations at his winery Grgich Hills. His aim, year after year, is to improve the quality of their [...]]]></description>

Is there any way I can set an option to ignore illegal chars and go on? For me this could be a deal-breaker. I unfortunately can't expect all XML out on the web to be nice and tidy.

Thanks for the help!

Cheers / Christophe
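The three-character sequence at the start of “Mike in the sample is what the UTF-8 bytes of a left curly quote (U+201C) look like when they are decoded as windows-1252. The sketch below reproduces that with the JDK alone. Where the mis-decoding happened (in the feed, in the reader, or in the mail client) cannot be determined from this thread, and MojibakeDemo and misdecode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // U+201C is the left curly quote; its UTF-8 bytes are E2 80 9C.
        String garbled = misdecode("\u201C", StandardCharsets.UTF_8, cp1252);

        // Read as windows-1252, those bytes are a-circumflex, euro sign, oe ligature.
        System.out.println(garbled.equals("\u00E2\u20AC\u0153"));

        // The damage is reversible as long as no byte was lost along the way.
        String repaired = misdecode(garbled, cp1252, StandardCharsets.UTF_8);
        System.out.println(repaired.equals("\u201C"));
    }
}
```

The right curly quote is a different story: one of its UTF-8 bytes (9D) has no windows-1252 mapping, which would be consistent with the question mark or replacement character that follows "Mike" in the sample.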
RE: Illegal characters, can xmlbeans be forgiving?
Thanks! That helps. I checked the API doc for setCharacterEncoding but couldn't find the supported encoding types. In other words, which encodings are allowed in the function setCharacterEncoding(encoding)?

Cheers / Christophe