Hi all,
 
Concerning the question of whether XmlBeans should enforce the "environmental"
rules about encoding, and consequently about erroneous byte sequences, I have a
slightly different viewpoint. I agree that, given an encoding indication, the
parser should detect and reject erroneous sequences. However, I don't see a
strict need for such a declaration to be present: think of parameter files for
configuration or user setup that are strictly internal to the application using
them. Here I see no reason to enforce such rules. I would therefore opt for
XmlBeans options that switch the rigorous enforcement of such rules on or
off.
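
To illustrate the kind of switch meant here, below is a minimal sketch at the
byte-decoding level using only the JDK. It is not XmlBeans code: the class
DecodingModes and its decode() helper are hypothetical names, and the
"strict" flag merely stands in for whatever option XmlBeans might offer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodingModes {

    /**
     * Hypothetical helper (not part of XmlBeans): decodes bytes either
     * strictly (illegal sequences raise an exception) or leniently
     * (illegal sequences become U+FFFD replacement characters).
     */
    public static String decode(byte[] bytes, Charset charset, boolean strict)
            throws CharacterCodingException {
        CodingErrorAction action =
                strict ? CodingErrorAction.REPORT : CodingErrorAction.REPLACE;
        return charset.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // ISO-8859-1 bytes for accented text; they are not legal UTF-8.
        byte[] latin1 = "id=\u00e0\u00e1".getBytes(StandardCharsets.ISO_8859_1);

        try {
            decode(latin1, StandardCharsets.UTF_8, true);
            System.out.println("strict: accepted");
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected (" + e + ")");
        }

        // Lenient mode does not throw for malformed input; it substitutes.
        System.out.println("lenient: "
                + decode(latin1, StandardCharsets.UTF_8, false));
    }
}
```

The lenient mode keeps processing but silently loses the original characters,
which is the trade-off such an option would expose.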
 
What do you think about this?
 
Dieter

________________________________

From: Dennis Sosnoski [mailto:[EMAIL PROTECTED]
Sent: Thu 29.12.2005 09:44
To: [email protected]
Subject: Re: Illegal characters, can xmlbeans be forgiving?



The XML recommendation says (4.3.3):

"It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains octet sequences that
are not legal in that encoding. It is also a fatal error if an XML
entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16."

Fatal errors are supposed to end processing. Since this doesn't seem to
be enforced by XMLBeans (or more likely, by the parser), you should
report this as an error.

I think it'd be a much more serious problem if XMLBeans fails to process
a document written as UTF-8 or UTF-16 without an encoding declaration,
or a document written as ISO-8859-1 with an encoding declaration. You
might want to test those variations.
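
As a rough sketch of how those variations could be checked, the following
uses the JDK's built-in DOM parser rather than XMLBeans, so it only shows
what a conforming parser is expected to do with each input. The class name
EncodingRuleCheck and the helper parses() are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class EncodingRuleCheck {

    /** Hypothetical helper: true if the JDK's default parser accepts the bytes. */
    public static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String body = "<id>\u00e0\u00e1\u00e2</id>";
        String latin1Decl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        // UTF-8 without an encoding declaration: should be accepted.
        System.out.println("UTF-8, no declaration: "
                + parses(body.getBytes(StandardCharsets.UTF_8)));

        // UTF-16 without a declaration: Java's "UTF-16" charset writes a
        // byte order mark, which lets the parser detect the encoding.
        System.out.println("UTF-16, no declaration: "
                + parses(body.getBytes(StandardCharsets.UTF_16)));

        // ISO-8859-1 with a matching declaration: should be accepted.
        System.out.println("ISO-8859-1, declared: "
                + parses((latin1Decl + body).getBytes(StandardCharsets.ISO_8859_1)));

        // ISO-8859-1 bytes without a declaration are not legal UTF-8,
        // so section 4.3.3 calls for a fatal error here.
        System.out.println("ISO-8859-1, no declaration: "
                + parses(body.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The same four inputs could then be fed to XMLBeans to see where its behavior
differs.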

  - Dennis

maarten wrote:

> I have noticed that xmlbeans 2.0 doesn't care whether the encoding
> declaration
> in the xml document matches the byte-encoding that is actually used.
> It seems to be more forgiving than I would like it to be.
>
> For example:
>
> // Assumes java.io.ByteArrayInputStream is imported and that VapDocument
> // is the XMLBeans type generated from the vap schema.
> public static void test(String charsetDocument, String charsetBytes)
>         throws Exception {
>     System.out.print("doc: " + charsetDocument + ", bytes: " +
>             charsetBytes + " => ");
>     // The declared encoding and the actual byte encoding are chosen
>     // independently so that mismatches can be tested.
>     String xml =
>         "<?xml version=\"1.0\" encoding=\"" + charsetDocument + "\"?>\n" +
>         "<vap xmlns=\"http://www.eurid.eu/2005/vap\" >" +
>         " <command>\n" +
>         " <login>\n" +
>         " <id>àáâäãa</id>\n" +
>         " <password>àáâäãa</password>\n" +
>         " </login> \n" +
>         " </command>\n" +
>         "</vap>";
>     byte[] bytes = xml.getBytes(charsetBytes);
>     ByteArrayInputStream in = new ByteArrayInputStream(bytes);
>     try {
>         VapDocument document = VapDocument.Factory.parse(in);
>         if (document.validate()) {
>             System.out.println("valid, encoding = " +
>                     document.documentProperties().getEncoding());
>         } else {
>             // Terminate the output line when validation fails.
>             System.out.println("invalid");
>         }
>     } catch (Exception e) {
>         System.out.println(e.getClass().getName());
>     }
> }
>
> public static void main(String[] args) throws Exception {
> test ("UTF-8", "UTF-8");
> test ("UTF-8", "UTF-16");
> test ("ISO-8859-1", "UTF-8");
> test ("ISO-8859-1", "UTF-16");
> test ("anything", "ISO-8859-1");
> test ("anything", "UTF-8");
> test ("anything", "UTF-16");
> }
>
> gives the following output:
>
> doc: UTF-8, bytes: UTF-8 => valid, encoding = UTF-8
> doc: UTF-8, bytes: UTF-16 => valid, encoding = UTF-8
> doc: ISO-8859-1, bytes: UTF-8 => valid, encoding = ISO-8859-1
> doc: ISO-8859-1, bytes: UTF-16 => valid, encoding = ISO-8859-1
> doc: anything, bytes: ISO-8859-1 => java.io.UnsupportedEncodingException
> doc: anything, bytes: UTF-8 => java.io.UnsupportedEncodingException
> doc: anything, bytes: UTF-16 => valid, encoding = anything
>
>
> Anything I can do about this ?
>
> Maarten
>
>
> Dennis Sosnoski wrote:
>
>> Do your XML documents specify the encoding in the XML declaration? If
>> not, there's no way to distinguish between UTF-8 and ISO-8859-X
>> without the multiple parses - and the multiple parse approach doesn't
>> even come close to guaranteeing that you've ended up with the correct
>> encoding (since the different flavors of ISO-8859-X reuse the same
>> byte values for different characters). If the documents *do* give the
>> encoding in the XML declaration, XMLBeans should be reading it and
>> interpreting the document correctly.
>>
>> - Dennis
>>
>> Christophe Bouhier (MC/ECM) wrote:
>>
>>> Hi Lawrence,
>>> I am not sure how to detect the XML charsets, besides just looping
>>> through the list of supported encodings and trying to parse
>>> successfully. This is not elegant but it worked for me. Thanks for
>>> your help.
>>> Cheers . Christophe
>>>
>>>
>>>> -----Original Message-----
>>>> From: Lawrence Jones [mailto:[EMAIL PROTECTED]
>>>> Sent: 17 December 2005 0:59
>>>> To: [email protected]
>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>> Have a look at the code in:
>>>>
>>>> $XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java
>>>>
>>>> and the code that calls it in
>>>>
>>>> $XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java
>>>> around line 1760 onwards
>>>>
>>>> EncodingMap.java contains all the supported encodings in the static
>>>> initializer at line 70.
>>>>
>>>> Cheers,
>>>>
>>>> Lawrence
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Christophe Bouhier (MC/ECM)
>>>>> [mailto:[EMAIL PROTECTED]
>>>>> Sent: Thursday, December 15, 2005 7:25 PM
>>>>> To: '[email protected]'
>>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>>>
>>>>> Thanks! That helps. I checked the API doc for setCharacterEncoding but
>>>>> couldn't find the supported encoding types. In other words, which
>>>>> encodings are allowed in the function
>>>>> setCharacterEncoding("encoding")?
>>>>>
>>>>> Cheers / Christophe
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lawrence Jones [mailto:[EMAIL PROTECTED]
>>>>>> Sent: 16 December 2005 2:11
>>>>>> To: [email protected]
>>>>>> Subject: RE: Illegal characters, can xmlbeans be forgiving?
>>>>>>
>>>>>> Hi Christophe
>>>>>>
>>>>>> It's very unlikely that the characters are the problem - all Unicode
>>>>>> characters are allowed in XML - see e.g.
>>>>>> http://www.xml.com/axml/testaxml.htm (section 2.2) and hence in
>>>>>> XmlBeans.
>>>>>>
>>>>>> What is more likely is that the characters are not encoded (as
>>>>>> bytes) in the way XmlBeans expects. By default XmlBeans assumes
>>>>>> UTF-8 encoding. Yours are probably ISO8859_1 or some such thing. If
>>>>>> you want to play around with character encoding have a look at
>>>>>> XmlOptions.setCharacterEncoding().
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Lawrence
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christophe Bouhier (MC/ECM)
>>>>>>> [mailto:[EMAIL PROTECTED]
>>>>>>> Sent: Wednesday, December 14, 2005 6:04 PM
>>>>>>> To: '[email protected]'
>>>>>>> Subject: Illegal characters, can xmlbeans be forgiving?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My application parses XML from many different sources. (It's an RSS
>>>>>>> reader/Podcast receiver.)
>>>>>>> Before I switched to XMLBeans I was using an XML parser called NanoXML,
>>>>>>> which didn't mind some illegal characters, especially when wrapped in
>>>>>>> CDATA.
>>>>>>> Now XMLBeans stumbles over the illegal chars below (âEURoe) and throws
>>>>>>> an exception.
>>>>>>>
>>>>>>> ....
>>>>>>> <description><![CDATA[
>>>>>>> Miljenko âEURoeMikeâEUR? Grgich first gained international recognition at
>>>>>>> the celebrated âEURoeParis TastingâEUR? of 1976. They had chosen MikeâEUR(tm)s
>>>>>>> 1973 Chateau Montelena Chardonnay as the finest white wine in the world.
>>>>>>> Today, Mike oversees daily operations at his winery Grgich Hills.
>>>>>>> His aim, year after year, is to improve the quality of their
>>>>>>> [...]]]></description> ......
>>>>>>>
>>>>>>> Is there any way I can set an option to ignore illegal chars and go on?
>>>>>>> For me this could be a deal-breaker. I unfortunately can't expect all
>>>>>>> XML out on the web to be "nice and tidy".
>>>>>>>
>>>>>>> Thanks for the help!
>>>>>>> Cheers / Christophe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>>>>

