Hi all,

concerning the question of whether XmlBeans should enforce the "environmental" rules about encoding, and consequently about erroneous byte sequences, I have a slightly different viewpoint.

I agree that, given an encoding indication, the parser should detect and reject erroneous sequences. However, I don't see that such an indication strictly has to be present: think of internal parameter files, configurations, or user setup data that are strictly internal to the application using them. There I see no reason to enforce such rules.

I would therefore opt for XmlBeans options that switch the rigorous enforcement of such rules on or off.

What do you think?

Dieter
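To make the idea of such a switch concrete, here is a minimal sketch using only the JDK's java.nio.charset classes. It is not XmlBeans code: the class name, the decode helper and its strict flag are hypothetical and only illustrate the two behaviours (reject illegal byte sequences, or substitute U+FFFD and continue).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingStrictnessSketch {

    // Hypothetical helper, not an XmlBeans API.
    // strict == true : illegal byte sequences raise CharacterCodingException.
    // strict == false: illegal byte sequences are replaced (U+FFFD) and decoding continues.
    static String decode(byte[] bytes, Charset charset, boolean strict)
            throws CharacterCodingException {
        CodingErrorAction action =
                strict ? CodingErrorAction.REPORT : CodingErrorAction.REPLACE;
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // The bytes 0xE0 0xE1 (ISO-8859-1 for two accented letters) are not legal UTF-8.
        byte[] latin1 = "\u00e0\u00e1".getBytes(StandardCharsets.ISO_8859_1);

        // Lenient mode: decoding succeeds, the bad bytes become replacement characters.
        String lenient = decode(latin1, StandardCharsets.UTF_8, false);
        System.out.println("lenient result has " + lenient.length() + " char(s)");

        // Strict mode: the same input is rejected.
        try {
            decode(latin1, StandardCharsets.UTF_8, true);
        } catch (CharacterCodingException e) {
            System.out.println("strict mode rejected input: " + e.getClass().getName());
        }
    }
}
```

Whether XmlBeans or its underlying parser could expose such a switch is a separate question. The sketch only shows that the JDK decoder already supports both modes.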
________________________________

From: Dennis Sosnoski [mailto:[EMAIL PROTECTED]
Sent: Thu 29.12.2005 09:44
To: [email protected]
Subject: Re: Illegal characters, can xmlbeans be forgiving?

The XML recommendation says (4.3.3):

"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."

Fatal errors are supposed to end processing. Since this doesn't seem to be enforced by XMLBeans (or more likely, by the parser), you should report this as an error.

I think it'd be a much more serious problem if XMLBeans failed to process a document written as UTF-8 or UTF-16 without an encoding declaration, or a document written as ISO-8859-1 with an encoding declaration. You might want to test those variations.

  - Dennis

maarten wrote:

> I have noticed that xmlbeans 2.0 doesn't care whether the encoding
> declaration in the xml document matches the byte-encoding that is
> actually used. It seems to be more forgiving than I would like it to be.
>
> For example:
>
>     public static void test (String charsetDocument, String charsetBytes)
>             throws Exception {
>         System.out.print ("doc: " + charsetDocument + ", bytes: " +
>             charsetBytes + " => ");
>         String xml =
>             "<?xml version=\"1.0\" encoding=\"" + charsetDocument + "\"?>\n" +
>             "<vap xmlns=\"http://www.eurid.eu/2005/vap\" >" +
>             " <command>\n" +
>             " <login>\n" +
>             " <id>àáâäãa</id>\n" +
>             " <password>àáâäãa</password>\n" +
>             " </login> \n" +
>             " </command>\n" +
>             "</vap>";
>         byte[] bytes = new byte[0];
>         bytes = xml.getBytes(charsetBytes);
>         ByteArrayInputStream in = new ByteArrayInputStream(bytes);
>         try {
>             VapDocument document = VapDocument.Factory.parse(in);
>             if (document.validate()) {
>                 System.out.println("valid, encoding = " +
>                     document.documentProperties().getEncoding());
>                 return;
>             }
>         } catch(Exception e) {
>             System.out.println(e.getClass().getName());
>             return;
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         test ("UTF-8", "UTF-8");
>         test ("UTF-8", "UTF-16");
>         test ("ISO-8859-1", "UTF-8");
>         test ("ISO-8859-1", "UTF-16");
>         test ("anything", "ISO-8859-1");
>         test ("anything", "UTF-8");
>         test ("anything", "UTF-16");
>     }
>
> gives the following output:
>
>     doc: UTF-8, bytes: UTF-8 => valid, encoding = UTF-8
>     doc: UTF-8, bytes: UTF-16 => valid, encoding = UTF-8
>     doc: ISO-8859-1, bytes: UTF-8 => valid, encoding = ISO-8859-1
>     doc: ISO-8859-1, bytes: UTF-16 => valid, encoding = ISO-8859-1
>     doc: anything, bytes: ISO-8859-1 => java.io.UnsupportedEncodingException
>     doc: anything, bytes: UTF-8 => java.io.UnsupportedEncodingException
>     doc: anything, bytes: UTF-16 => valid, encoding = anything
>
> Anything I can do about this ?
>
> Maarten
>
>
> Dennis Sosnoski wrote:
>
>> Do your XML documents specify the encoding in the XML declaration?
>> If not, there's no way to distinguish between UTF-8 and ISO-8859-X
>> without the multiple parses - and the multiple parse approach doesn't
>> even come close to guaranteeing that you've ended up with the correct
>> encoding (since the different flavors of ISO-8859-X reuse the same
>> byte values for different characters). If the documents *do* give the
>> encoding in the XML declaration, XMLBeans should be reading it and
>> interpreting the document correctly.
>>
>> - Dennis
>>
>> Christophe Bouhier (MC/ECM) wrote:
>>
>>> Hi Lawrence,
>>> I am not sure how to detect the XML charsets, besides just looping
>>> through the list of supported encodings and trying to parse
>>> successfully. This is not elegant but it worked for me. Thanks for
>>> your help.
>>> Cheers . Christophe
>>>
>>>> -----Original Message-----
>>>> From: Lawrence Jones [mailto:[EMAIL PROTECTED]
>>>> Sent: 17 December 2005 0:59
>>>> To: [email protected]
>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>>
>>>> Have a look at the code in:
>>>>
>>>> $XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java
>>>>
>>>> and the code that calls it in
>>>>
>>>> $XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java
>>>> around line 1760 onwards
>>>>
>>>> EncodingMap.java contains all the supported encodings in the static
>>>> initializer at line 70.
>>>>
>>>> Cheers,
>>>>
>>>> Lawrence
>>>>
>>>>> -----Original Message-----
>>>>> From: Christophe Bouhier (MC/ECM)
>>>>> [mailto:[EMAIL PROTECTED]
>>>>> Sent: Thursday, December 15, 2005 7:25 PM
>>>>> To: '[email protected]'
>>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>>>
>>>>> Thanks! That helps. I checked the API doc for setCharacterEncoding but
>>>>> couldn't find the supported encoding types. In other words, which
>>>>> encodings are allowed in the function
>>>>> setCharacterEncoding("encoding")?
>>>>>
>>>>> Cheers / Christophe
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lawrence Jones [mailto:[EMAIL PROTECTED]
>>>>>> Sent: 16 December 2005 2:11
>>>>>> To: [email protected]
>>>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>>>>
>>>>>> Hi Christophe
>>>>>>
>>>>>> It's very unlikely that the characters are the problem - all Unicode
>>>>>> characters are allowed in XML - see e.g.
>>>>>> http://www.xml.com/axml/testaxml.htm (section 2.2) and hence in
>>>>>> XmlBeans.
>>>>>>
>>>>>> What is more likely is that the characters are not encoded (as
>>>>>> bytes) in the way XmlBeans expects. By default XmlBeans assumes
>>>>>> UTF-8 encoding. Yours are probably ISO8859_1 or some such thing. If
>>>>>> you want to play around with character encoding have a look at
>>>>>> XmlOptions.setCharacterEncoding().
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Lawrence
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christophe Bouhier (MC/ECM)
>>>>>>> [mailto:[EMAIL PROTECTED]
>>>>>>> Sent: Wednesday, December 14, 2005 6:04 PM
>>>>>>> To: '[email protected]'
>>>>>>> Subject: Illegal characters, can xmlbeans be forgiving?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My application parses XML from many different sources. (It's an RSS
>>>>>>> reader/Podcast receiver).
>>>>>>> Before I switched to XMLBeans I was using an xml parser called NanoXML
>>>>>>> which didn't mind some illegal characters, especially when wrapped in
>>>>>>> CDATA.
>>>>>>> Now XMLBeans stumbles over the illegal chars below:(âEURoe) (Throws
>>>>>>> exception).
>>>>>>>
>>>>>>> ....
>>>>>>> <description><![CDATA[
>>>>>>> Miljenko âEURoeMikeâEUR? Grgich first gained international recognition at
>>>>>>> the celebrated âEURoeParis TastingâEUR? of 1976.
>>>>>>> They had chosen MikeâEUR(tm)s
>>>>>>> 1973 Chateau Montelena Chardonnay as the finest white wine in the world.
>>>>>>> Today, Mike oversees daily operations at his winery Grgich Hills.
>>>>>>> His aim, year after year, is to improve the quality of their
>>>>>>> [...]]]></description> ......
>>>>>>>
>>>>>>> Is there any way I can set an option to ignore illegal chars and go on?
>>>>>>> For me this could be a deal-breaker. I unfortunately can't expect all
>>>>>>> XML out on the web to be "nice and tidy".
>>>>>>>
>>>>>>> Thanks for the help!
>>>>>>> Cheers / Christophe

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

