Hi Ha,

There's no doubt that the attached file is not well-formed.  The 0xb7
characters are definitely not properly encoded in UTF-8: they'd need to be
encoded as 0xc2 0xb7 for the encoding to be proper (if I've done the
conversion correctly).  If no encoding declaration is specified, an XML
parser is required to treat a document as UTF-8 (unless it can determine
that it's actually UTF-16).
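
As a way to double-check that conversion, here is a minimal C++ sketch of the
two-byte UTF-8 encoding rule, plus a check of which bytes may legally start a
sequence.  The helper names encodeUtf8Small and isValidUtf8LeadByte are made
up for illustration; they are not part of Xerces-C.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Xerces-C function): encode a code point
// below U+0800 as UTF-8, which takes one or two bytes.
std::string encodeUtf8Small(std::uint32_t cp)
{
    assert(cp < 0x800);  // larger code points need 3 or 4 bytes
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // 0xxxxxxx
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));  // 10xxxxxx
    }
    return out;
}

// Hypothetical helper: true if the byte may begin a UTF-8 sequence.
// 0x80..0xBF are continuation bytes, 0xC0 and 0xC1 could only start
// overlong forms, and 0xF5..0xFF are never used.
bool isValidUtf8LeadByte(unsigned char b)
{
    return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}
```

Encoding U+00B7 this way gives the two bytes 0xc2 0xb7, while a bare 0xb7 is a
continuation byte and so cannot start a sequence on its own.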

Note that all is well if you specify the document's encoding to be

      encoding="ISO-8859-1"

which is, I suspect, the actual encoding.  I was not able to reproduce the
behaviour you describe when the document is declared to be UTF-8:  the
parser still produced an error for me in this case.  If you continue to
observe this, please attach a test case declared to be UTF-8 that works.

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




"Huynh, Ha" <[EMAIL PROTECTED]com>
08/07/2003 06:12 PM
Please respond to xerces-c-dev

To:       "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Subject:  UTFDataFormatException (bitwise AND error in XMLUTF8Transcoder)




I am getting a UTFDataFormatException when using the following xml doc
(attached).  It appears to be complaining about the "bullet" character.
Note that the xml doc contains a hidden character (LATIN A with circumflex)
right before the bullet.

If I add the encoding="UTF-8" there is no UTFDataFormatException.
However, without specifying any encoding I get the error shown below.  When
I trace through the code it looks like the default encoding for xerces 2.3
is UTF-8.  The UTFDataFormatException is thrown in XMLUTF8Transcoder.cpp,
line 222:

        if((gUTFByteIndicatorTest[trailingBytes] & *srcPtr) !=
            gUTFByteIndicator[trailingBytes]) { throw error here }

I checked the values and
gUTFByteIndicatorTest[trailingBytes] = 0
*srcPtr = 183
gUTFByteIndicator[trailingBytes] = 0

So we should not enter this branch.  However, the expression evaluates to:
gUTFByteIndicatorTest[trailingBytes] & *srcPtr = 128  //This should be 0.
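
One possible way to reconcile these numbers: a bitwise AND with 0 can never
produce 128, so a result of 128 implies the mask actually has its high bit
set.  The following is a minimal sketch of that arithmetic, assuming the mask
for a 1-byte sequence is 0x80 and the expected pattern is 0x00.  Those values
and the names kOneByteTest, kOneByteIndicator and passesOneByteCheck are
assumptions for illustration, not taken from the Xerces source.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for a 1-byte sequence (zero trailing bytes): the mask
// selects the high bit, and a valid single byte must have it clear.
constexpr std::uint8_t kOneByteTest      = 0x80;
constexpr std::uint8_t kOneByteIndicator = 0x00;

// Mirrors the shape of the check quoted above: the byte passes only if
// the masked bits equal the expected pattern.
constexpr bool passesOneByteCheck(std::uint8_t b)
{
    return (kOneByteTest & b) == kOneByteIndicator;
}

// 183 is 0xB7 (binary 1011 0111); masking with 0x80 leaves 0x80 = 128,
// which differs from 0, so the check fails and the error is raised.
static_assert((kOneByteTest & 183) == 128, "0x80 & 0xB7 is 0x80");
static_assert(!passesOneByteCheck(183), "0xB7 is not a valid single byte");
static_assert(passesOneByteCheck(0x41), "ASCII 'A' is a valid single byte");
```

Under that assumption the observed 128 is the expected result for a byte of
183, and the exception reflects the input data rather than a miscomputed AND.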

Another observation: if I use the xml doc without specifying an encoding AND
move the bullet character and the hidden character to another element of the
xml, this exception does not occur.  Not sure what's going on.

Fatal Error at file C:\temp\SAXSchemaParser\Debug/personal.xml, line 1, char 22
  Message: An exception occurred! Type:UTFDataFormatException,
  Message:invalid byte 1 (╖) of a 1-byte sequence.

I am running xerces 2.3 compiled with MSVS 7.0.
Any ideas?



#### personal.xml has been removed from this note on August 07 2003 by
Neil Graham
