Re: Illegal characters, can xmlbeans be forgiving?

2005-12-28 Thread maarten
I have noticed that xmlbeans 2.0 doesn't care whether the encoding declaration
in the xml document matches the byte-encoding that is actually used.
It seems to be more forgiving than I would like it to be.

For example:

import java.io.ByteArrayInputStream;
// VapDocument is the XMLBeans class generated from the vap schema.

public static void test(String charsetDocument, String charsetBytes)
        throws Exception {
    System.out.print("doc: " + charsetDocument + ", bytes: " + charsetBytes + " = ");

    // Declare one encoding in the XML declaration ...
    String xml =
        "<?xml version=\"1.0\" encoding=\"" + charsetDocument + "\"?>\n" +
        "<vap xmlns=\"http://www.eurid.eu/2005/vap\"> " +
        " <command>\n" +
        "  <login>\n" +
        "   <id>àáâäãā</id>\n" +
        "   <password>àáâäãā</password>\n" +
        "  </login> \n" +
        " </command>\n" +
        "</vap>";
    // ... but serialize the text with a possibly different one.
    byte[] bytes = xml.getBytes(charsetBytes);
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    try {
        VapDocument document = VapDocument.Factory.parse(in);
        if (document.validate()) {
            System.out.println("valid, encoding = " +
                document.documentProperties().getEncoding());
        } else {
            // Terminate the output line for documents that fail validation.
            System.out.println("invalid");
        }
    } catch (Exception e) {
        System.out.println(e.getClass().getName());
    }
}

public static void main(String[] args) throws Exception {
    test("UTF-8", "UTF-8");
    test("UTF-8", "UTF-16");
    test("ISO-8859-1", "UTF-8");
    test("ISO-8859-1", "UTF-16");
    test("anything", "ISO-8859-1");
    test("anything", "UTF-8");
    test("anything", "UTF-16");
}

gives the following output:

doc: UTF-8, bytes: UTF-8 = valid, encoding = UTF-8
doc: UTF-8, bytes: UTF-16 = valid, encoding = UTF-8
doc: ISO-8859-1, bytes: UTF-8 = valid, encoding = ISO-8859-1
doc: ISO-8859-1, bytes: UTF-16 = valid, encoding = ISO-8859-1
doc: anything, bytes: ISO-8859-1 = java.io.UnsupportedEncodingException
doc: anything, bytes: UTF-8 = java.io.UnsupportedEncodingException
doc: anything, bytes: UTF-16 = valid, encoding = anything


Is there anything I can do about this?

Maarten
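
A minimal sketch of a pre-parse check for the mismatch described above, using
only the JDK (no XMLBeans calls). It sniffs the first bytes for a UTF-16 BOM or
a UTF-16 encoded "<", reads the encoding pseudo-attribute from the XML
declaration, and reports whether the two agree. The names EncodingCheck,
sniffFamily, declaredEncoding and consistent are made up for illustration.
UTF-32 and EBCDIC are ignored, and the check cannot tell UTF-8 from ISO-8859-1,
because both look ASCII-compatible at this level.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XMLBeans.
final class EncodingCheck {
    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern DECL = Pattern.compile(
        "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Guess the encoding family from the first two bytes. */
    static String sniffFamily(byte[] b) {
        if (b.length >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // BOM
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // BOM
            if (b0 == 0x00 && b1 == 0x3C) return "UTF-16BE"; // "<" without BOM
            if (b0 == 0x3C && b1 == 0x00) return "UTF-16LE"; // "<" without BOM
        }
        return "ASCII-compatible";
    }

    /** Return the declared encoding name, or null if there is no declaration. */
    static String declaredEncoding(byte[] b) {
        String family = sniffFamily(b);
        // ISO-8859-1 maps every byte to one char, so it is safe for reading
        // the ASCII-only declaration of any ASCII-compatible encoding.
        Charset headCharset = family.startsWith("UTF-16")
            ? Charset.forName(family)
            : StandardCharsets.ISO_8859_1;
        int n = Math.min(b.length, 256);
        String head = new String(b, 0, n, headCharset);
        Matcher m = DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** True when no encoding is declared or it agrees with the sniffed family. */
    static boolean consistent(byte[] b) {
        String declared = declaredEncoding(b);
        if (declared == null) {
            return true; // nothing to contradict
        }
        boolean bytesUtf16 = sniffFamily(b).startsWith("UTF-16");
        boolean declUtf16 =
            declared.toUpperCase(Locale.ROOT).startsWith("UTF-16");
        return bytesUtf16 == declUtf16;
    }
}
```

Calling EncodingCheck.consistent(bytes) before VapDocument.Factory.parse(in)
would let the "doc: UTF-8, bytes: UTF-16" case be rejected up front instead of
being reported as valid.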


Re: Illegal characters, can xmlbeans be forgiving?

2005-12-27 Thread Dennis Sosnoski
Do your XML documents specify the encoding in the XML declaration? If 
not, there's no way to distinguish between UTF-8 and ISO-8859-X without 
the multiple parses - and the multiple parse approach doesn't even come 
close to guaranteeing that you've ended up with the correct encoding 
(since the different flavors of ISO-8859-X reuse the same byte values 
for different characters). If the documents *do* give the encoding in 
the XML declaration, XMLBeans should be reading it and interpreting the 
document correctly.


 - Dennis
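
To make the point about ISO-8859-X concrete, here is a small JDK-only sketch.
It decodes one and the same byte under two flavors and gets two different
characters, with no error in either case. The class and method names are
invented for illustration, and it assumes the JVM ships the ISO-8859-5 charset
(only ISO-8859-1 is guaranteed on every Java platform).

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of XMLBeans.
final class SameByteDifferentChar {
    /** Decode a single byte value under the named charset. */
    static String decode(int byteValue, String charsetName) {
        byte[] one = { (byte) byteValue };
        return new String(one, Charset.forName(charsetName));
    }
}
```

decode(0xE0, "ISO-8859-1") yields LATIN SMALL LETTER A WITH GRAVE (U+00E0),
while decode(0xE0, "ISO-8859-5") yields CYRILLIC SMALL LETTER ER (U+0440).
Both succeed, so a parse that merely "works" says nothing about which flavor
was intended.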


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Illegal characters, can xmlbeans be forgiving?

2005-12-22 Thread Christophe Bouhier (MC/ECM)
Hi Lawrence,

I am not sure how to detect the XML charsets, besides just looping through the
list of supported encodings and trying to parse successfully. This is not
elegant, but it worked for me. Thanks for your help.

Cheers . Christophe 
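
The looping approach could look roughly like the sketch below. Note the swap:
instead of running a full XMLBeans parse per candidate, it uses the JDK's
strict charset decoder and returns the first candidate that decodes the bytes
without error. TrialDecode and firstThatDecodes are hypothetical names, and the
order of the candidate list is an assumption the caller has to supply. Because
ISO-8859-1 accepts every byte sequence, it only makes sense as the last entry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;

// Hypothetical helper, not part of XMLBeans.
final class TrialDecode {
    /**
     * Return the first candidate charset name that decodes the bytes
     * without a malformed-input or unmappable-character error, or null.
     */
    static String firstThatDecodes(byte[] bytes, List<String> candidates) {
        for (String name : candidates) {
            try {
                Charset.forName(name)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
                return name;
            } catch (CharacterCodingException | IllegalArgumentException e) {
                // Bytes are invalid in this charset, or the JVM does not
                // know the charset name: move on to the next candidate.
            }
        }
        return null;
    }
}
```

A successful decode only shows that the bytes are legal in that charset, not
that the charset is the right one, so the result is a guess.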




RE: Illegal characters, can xmlbeans be forgiving?

2005-12-16 Thread Lawrence Jones
Have a look at the code in:

$XMLBEANS/src/common/org/apache/xmlbeans/impl/common/EncodingMap.java

and the code that calls it in

$XMLBEANS/src/store/org/apache/xmlbeans/impl/store/Saver.java
(around line 1760 onwards)

EncodingMap.java contains all the supported encodings in the static initializer
at line 70.

Cheers,

Lawrence
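
As a quick check that does not need the XMLBeans sources, the JVM itself can be
asked whether it knows an encoding name and what its canonical name is. This is
a sketch against java.nio.charset only. It does not read EncodingMap, and the
set of names the JVM accepts need not match the set XMLBeans maps. CharsetProbe
and its methods are invented names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of XMLBeans.
final class CharsetProbe {
    /** True if the running JVM has a charset registered under this name. */
    static boolean knownToJvm(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name is not even syntactically legal
        }
    }

    /** Canonical JVM name for an alias, or null if the name is unknown. */
    static String canonicalName(String name) {
        return knownToJvm(name) ? Charset.forName(name).name() : null;
    }
}
```

For example, canonicalName("ISO8859_1") resolves the Java-style alias to
"ISO-8859-1", and knownToJvm("anything") is false.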




RE: Illegal characters, can xmlbeans be forgiving?

2005-12-15 Thread Lawrence Jones
Hi Christophe

It's very unlikely that the characters are the problem - all Unicode characters 
are allowed in XML - see e.g. http://www.xml.com/axml/testaxml.htm (section 
2.2) and hence in XmlBeans.

What is more likely is that the characters are not encoded (as bytes) in the 
way XmlBeans expects. By default XmlBeans assumes UTF-8 encoding. Yours are 
probably ISO8859_1 or some such thing. If you want to play around with 
character encoding have a look at XmlOptions.setCharacterEncoding().

Cheers,

Lawrence
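
A small JDK-only illustration of the point that the characters are fine and
only their byte encoding differs: the same curly quote turns into different
bytes depending on the charset used to write it. BytesOfQuotes and hex are
hypothetical names, and windows-1252 is only a guess at what such a feed might
use (the JVM must also provide that charset).

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of XMLBeans.
final class BytesOfQuotes {
    /** Encode the text with the named charset and render the bytes as hex. */
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

hex("\u201C", "windows-1252") gives "93", whereas hex("\u201C", "UTF-8") gives
"E2 80 9C". A lone 0x93 byte is not valid UTF-8, which is one way a parser that
assumes UTF-8 can end up rejecting an otherwise legal character.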

 -----Original Message-----
 From: Christophe Bouhier (MC/ECM) [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, December 14, 2005 6:04 PM
 To: 'user@xmlbeans.apache.org'
 Subject: Illegal characters, can xmlbeans be forgiving?
 
 Hi,
 
 My application parses XML from many different sources. (It's an RSS
 reader/Podcast receiver.)
 Before I switched to XMLBeans I was using an XML parser called NanoXML,
 which didn't mind some illegal characters, especially when wrapped in CDATA.
 Now XMLBeans stumbles over the illegal chars below: (“) (throws an
 exception).
 
 
 <description><![CDATA[
   Miljenko “Mike” Grgich first gained international recognition at
 the celebrated “Paris Tasting” of 1976.  They had chosen Mike’s 1973
 Chateau Montelena Chardonnay as the finest white wine in the world.
   Today, Mike oversees daily operations at his winery  Grgich Hills.
 His aim, year after year, is to improve the quality of their
 [...]]]></description>
 ..
 
 Is there any way I can set an option to ignore illegal chars and go on? For
 me this could be a deal-breaker. I unfortunately can't expect all XML out
 on the web to be nice and tidy.
 
 Thanks for the help!
 Cheers / Christophe
 



RE: Illegal characters, can xmlbeans be forgiving?

2005-12-15 Thread Christophe Bouhier (MC/ECM)
Thanks! That helps. I checked the API doc for setCharacterEncoding but couldn't
find the supported encoding types. In other words, which encodings are allowed
in the function setCharacterEncoding(encoding)?

Cheers / Christophe
