The following issue has been updated:

    Updater: Daniel McLean (mailto:[EMAIL PROTECTED])
       Date: Mon, 4 Oct 2004 11:56 PM
    Comment:
Here is the testcase demonstrating the problems described.
    Changes:
             Attachment changed to MemParseEncoding.tar.gz
    ---------------------------------------------------------------------
For a full history of the issue, see:

  http://issues.apache.org/jira/browse/XERCESC-1284?page=history

---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESC-1284

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESC-1284
    Summary: Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse 
failure
       Type: Bug

     Status: Unassigned
   Priority: Major

    Project: Xerces-C++
   Versions:
             2.6.0

   Assignee: 
   Reporter: Daniel McLean

    Created: Mon, 4 Oct 2004 11:55 PM
    Updated: Mon, 4 Oct 2004 11:56 PM
Environment: Fedora Core 1, x86 PC, gcc.  Similar failures also seen on Solaris 9 
with the Forte compiler.

Description:
Setting the encoding as "UTF-16" using the InputSource.setEncoding() method seems to 
create problems during parsing.

If I have a UTF-16BE document with a BOM, it parses successfully when no encoding 
is explicitly set or when the encoding is set to "UTF-16BE".  When the encoding is 
set to "UTF-16", a fatal error occurs with:
   Fatal Error at (file test, line 1, char 1): Invalid document structure

Some investigation: Having looked through the Xerces source and done some testing, it 
appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder is used when a match 
is detected against the known encoding string.  When "UTF-16" is set, no known 
encoding is detected and the document is probed for an encoding, resulting in the 
XMLUTF16Transcoder being used.  In the latter case, when XMLScanner::scanProlog() is 
called, it ends up reading the BOM and choking because it doesn't look like a piece of 
prolog.  I'm guessing that either the transcoder should have removed the BOM, the BOM 
should be detected and ignored, or the BOM should have been trimmed off beforehand.
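
To illustrate the "detect and ignore" option, here is a minimal sketch of spotting a UTF-16 BOM at the start of a byte buffer and working out how many bytes to skip before the prolog.  It uses only the standard library; the names Utf16Bom, detectUtf16Bom and bomLength are hypothetical and are not part of Xerces-C++, and this is not a claim about how the parser should implement the fix.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper types and functions for illustration only;
// none of these names exist in Xerces-C++.
enum class Utf16Bom { None, BigEndian, LittleEndian };

// Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
// Caveat: FF FE followed by 00 00 is also the UTF-32LE BOM; this sketch
// assumes the caller already knows the data is some form of UTF-16.
Utf16Bom detectUtf16Bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Bom::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Bom::LittleEndian;
    }
    return Utf16Bom::None;
}

// Number of leading bytes to skip so that scanning starts at the first
// real character of the document rather than at the BOM.
std::size_t bomLength(const unsigned char* data, std::size_t size) {
    return detectUtf16Bom(data, size) == Utf16Bom::None ? 0 : 2;
}
```

A caller could advance the buffer pointer by bomLength() before handing the bytes to a parser, and use the detected byte order to choose between "UTF-16BE" and "UTF-16LE" when the caller-supplied encoding is just "UTF-16".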

I've attached a test case derived from the MemParse sample.  It parses four 
different UTF-16 documents (BE with BOM, BE without BOM, LE with BOM, LE without BOM) 
using four different encoding approaches (no encoding set, "UTF-16", "UTF-16BE", 
"UTF-16LE").  I realise UTF-16 XML entities should have a BOM, but in my case I want 
to know what happens if a client of my software feeds in a UTF-16 document without 
a BOM.
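
For anyone without the attachment to hand, the four input documents can be generated in memory along these lines.  This is only a sketch under assumptions: encodeAsciiAsUtf16, TestDocument and makeTestDocuments are hypothetical names, the helper handles ASCII-only text, and the attached test case may build its inputs differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: encode ASCII-only text as UTF-16 bytes in the
// requested byte order, optionally prefixed with the matching BOM.
std::vector<unsigned char> encodeAsciiAsUtf16(const std::string& text,
                                              bool bigEndian,
                                              bool withBom) {
    std::vector<unsigned char> out;
    out.reserve(text.size() * 2 + 2);
    if (withBom) {
        // U+FEFF serialised as FE FF (big endian) or FF FE (little endian).
        out.push_back(bigEndian ? 0xFE : 0xFF);
        out.push_back(bigEndian ? 0xFF : 0xFE);
    }
    for (unsigned char c : text) {
        assert(c < 0x80);  // sketch only covers ASCII input
        if (bigEndian) {
            out.push_back(0x00);
            out.push_back(c);
        } else {
            out.push_back(c);
            out.push_back(0x00);
        }
    }
    return out;
}

// One labelled input document for the test matrix.
struct TestDocument {
    std::string label;
    std::vector<unsigned char> bytes;
};

// Build the four variants described above from one XML string.
std::vector<TestDocument> makeTestDocuments(const std::string& xml) {
    std::vector<TestDocument> docs;
    docs.push_back({"UTF-16BE with BOM",    encodeAsciiAsUtf16(xml, true,  true)});
    docs.push_back({"UTF-16BE without BOM", encodeAsciiAsUtf16(xml, true,  false)});
    docs.push_back({"UTF-16LE with BOM",    encodeAsciiAsUtf16(xml, false, true)});
    docs.push_back({"UTF-16LE without BOM", encodeAsciiAsUtf16(xml, false, false)});
    return docs;
}
```

Each buffer would then be parsed once per encoding setting (none, "UTF-16", "UTF-16BE", "UTF-16LE"), giving the sixteen combinations summarised below.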

A summary of parsing success and failure on Linux:

FILE: UTF-16BE with BOM
ENCODING: : Succeeded.
ENCODING: UTF-16: Fatal error.
ENCODING: UTF-16BE: Succeeded.
ENCODING: UTF-16LE: Fatal error.
--------------------------------
FILE: UTF-16BE without BOM
ENCODING: : Fatal error. (due to guess of UTF-8)
ENCODING: UTF-16: Succeeded.
ENCODING: UTF-16BE: Succeeded.
ENCODING: UTF-16LE: Fatal error.
--------------------------------
FILE: UTF-16LE with BOM
ENCODING: : Succeeded.
ENCODING: UTF-16: Fatal error.
ENCODING: UTF-16BE: Fatal error.
ENCODING: UTF-16LE: Succeeded.
--------------------------------
FILE: UTF-16LE without BOM
ENCODING: : Fatal error. (due to guess of UTF-8)
ENCODING: UTF-16: Succeeded.
ENCODING: UTF-16BE: Fatal error.
ENCODING: UTF-16LE: Succeeded.
--------------------------------

Maybe there is a good reason for Xerces' current behaviour, but it
escapes me.  I note that parsing succeeds with an encoding of "UTF-16"
only when the BOM is absent, which supports my assertion above.

