I wanted to know whether anyone is working on this problem. I also wanted to point out that Xerces currently ignores the encoding set via [1], e.g. when a DOMInput has a byte stream set: the encoding found in the XML declaration takes precedence right now.
Is it OK if we change the IANA-to-Java encoding mapping for UTF-16 as per [2], or should we continue with "Unicode"?
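To make the question concrete, here is a minimal sketch of the proposed mapping. It is not the actual Xerces table: the class name Utf16MappingSketch, the map IANA_TO_JAVA and the helpers decode and withBom are hypothetical names made up for illustration. Only InputStreamReader and the "UTF-16" name listed in [2] are real. It assumes a current JDK (generics, try-with-resources) rather than 1.4, and it only shows that a reader opened with the name "UTF-16" picks the byte order from the byte order mark.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class Utf16MappingSketch {
    // Hypothetical stand-in for the parser's IANA-to-Java name table.
    static final Map<String, String> IANA_TO_JAVA = new HashMap<>();
    static {
        // Proposed entry; the table discussed in this thread uses "Unicode" here.
        IANA_TO_JAVA.put("UTF-16", "UTF-16");
    }

    // Looks up the Java name for an IANA name and decodes the bytes with it.
    static String decode(byte[] bytes, String ianaName) {
        String javaName = IANA_TO_JAVA.get(ianaName);
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), javaName)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes text as UTF-16 in the given byte order, preceded by a byte order mark.
    static byte[] withBom(String text, boolean bigEndian) {
        byte[] body = text.getBytes(bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) (bigEndian ? 0xFE : 0xFF);
        out[1] = (byte) (bigEndian ? 0xFF : 0xFE);
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root/>";
        // Both byte orders should decode back to the same text.
        System.out.println(decode(withBom(xml, true), "UTF-16").equals(xml));
        System.out.println(decode(withBom(xml, false), "UTF-16").equals(xml));
    }
}
```

Whether "Unicode" behaves the same way on the JVMs mentioned in the forwarded message is exactly what is in doubt, so the sketch deliberately uses only the name from [2].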
Thanks, Venu
[1] http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-DOMInput-encoding
[2] http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
--- Begin Message ---
Hi Benson,

The difference between setting or not setting an encoding on the InputSource is that the parser only performs autodetection of the document's encoding when no external encoding information has been specified. So when you set the encoding on the InputSource, autodetection isn't performed.

For a UTF-16 document, detecting the encoding also determines the byte order: whether it's UTF-16BE or UTF-16LE. When you specify UTF-16 as the encoding for the InputSource, this byte order detection should occur in the InputStreamReader. I suspect this is where the problem is occurring.

We have a table that translates IANA encoding names to Java encoding names. The encoding "UTF-16" maps to the Java encoding name "Unicode". This Java encoding name is passed to the constructor of the InputStreamReader used by the parser.

On Mon, 29 Sep 2003, Benson Cheng wrote:

> Hi Sander,
>
> Thanks for looking into this problem. I think it's something to do
> with the way Xerces creates the Reader object when the
> InputSource.setEncoding() method gets called, because if I don't call
> the InputSource.setEncoding() method, then everything works fine.
> Also, if I create the InputStreamReader with the encoding outside
> Xerces, and then create the InputSource with that Reader, then
> everything works fine as well (see attached java file). So this kind
> of makes me think it's a Xerces problem, but I could be wrong.
>
> thanks,
> Benson.
>
> -----Original Message-----
> From: Sander Bos [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 29, 2003 2:44 AM
> To: [EMAIL PROTECTED]
> Subject: RE: UTF-16 encoding problem
>
> Dear Benson,
>
> I am not sure if I am doing something wrong, or it's a JVM or Xerces
> problem. I am getting a "java.lang.InternalError" while parsing a
> UTF-16 XML if I am using InputSource.setEncoding("UTF-16"). I
> attached my sample file and a simple SAX parser class. I know I don't
> have to call the setEncoding() function, the parser will detect it
> itself, but it shouldn't be a problem even if I set it.
>
> BTW, this problem happens with Xerces 2.4.0 and 2.5.0 with JVM
> 1.4.0_01 and 1.3.1.
>
> I don't have an answer for you, apart from that I could reproduce
> your problem (also with JDK 1.4.1_02), that I don't think you are
> doing anything wrong, but that I do not know what goes wrong. I found
> it interesting that you could get an internal error so easily with so
> many different JDKs, so I looked at it a bit but could not figure it
> out.
>
> For others that may be interested: since the bug came from
> InputStreamReader.read, I made a small test where I tried to set up a
> stream just like Xerces, so
>
>     InputStream is = new FileInputStream(fname);
>     // Copied from XMLEntityManager
>     RewindableInputStream ris = new RewindableInputStream(is);
>
>     InputStreamReader reader = new InputStreamReader(ris, "UTF-16");
>     char cbuf[] = new char[1024];
>     while ((reader.read(cbuf, 11, 11)) != -1) {
>     }
>
> but for the different values of the two '11's I tried it for, I could
> not cause the same crash. I don't think the rewindable stream is
> reset anywhere for UTF-16.
> (I did find it kind of weird that an XML11EntityScanner is used (see
> stack trace), where the document is of version 1.0, but maybe that is
> the default?)
>
> Kind regards,
>
> --Sander.
>
> Here is the stack trace:
>
> D:\work\source\xml>java SimpleSaxParser test.xml UTF-16
> afile=D:\work\source\xml\test.xml, encoding=UTF-16
> Exception in thread "main" java.lang.InternalError: Converter malfunction (Unicode) -- please submit a bug report via http://java.sun.com/cgi-bin/bugreport.cgi
>         at sun.nio.cs.StreamDecoder$ConverterSD.malfunction(StreamDecoder.java:232)
>         at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:248)
>         at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:294)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:179)
>         at java.io.InputStreamReader.read(InputStreamReader.java:167)
>         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>         at org.apache.xerces.impl.XML11EntityScanner.skipString(Unknown Source)
>         at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
>         at SimpleSaxParser.parse(SimpleSaxParser.java:25)
>         at SimpleSaxParser.main(SimpleSaxParser.java:46)
>
> thanks,
> Benson.

--
--------------------
Michael Glavassevich
[EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--- End Message ---
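The workaround Benson describes in the message above (building the InputStreamReader outside the parser and handing that Reader to the InputSource) can be sketched roughly as follows. This is not his attached file, which is not reproduced here. The class name ReaderWorkaroundSketch and the method rootElementName are hypothetical, an in-memory byte array stands in for test.xml, and the sketch uses whatever JAXP parser the JDK provides rather than a specific Xerces release, so it illustrates the approach without reproducing the reported InternalError.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ReaderWorkaroundSketch {
    // Parses UTF-16 bytes through a Reader created by the caller and
    // returns the qualified name of the first element seen.
    static String rootElementName(byte[] utf16Bytes) {
        final String[] root = new String[1];
        try {
            // The application, not the parser, decodes the bytes.
            Reader reader = new InputStreamReader(new ByteArrayInputStream(utf16Bytes), "UTF-16");
            // No setEncoding() call: the parser receives characters, not bytes.
            InputSource source = new InputSource(reader);
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root>hello</root>";
        // StandardCharsets.UTF_16 encodes big-endian with a leading byte order mark.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16);
        System.out.println(rootElementName(bytes));
    }
}
```

Because the InputSource carries a character stream, the parser never has to map an IANA name to a Java encoding name, which is why this path avoids the table lookup under discussion.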
