There were cases where XML files started with a BOM and the encoding specified
was utf-8. In those situations the parser was unable to recognize the files.

This change causes the UTF-8 BOM to be completely ignored for any ASCII-family
encoding.
Andy H had a valid question, though: should the BOM override the XML encoding
declaration, should the declaration override the BOM, or should it be an
error if they conflict?

Right now the encoding declaration overrides the BOM.
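
To make the idea concrete, here is a minimal standalone sketch of the BOM check.
It is not the actual XMLReader.cpp code: the helper name utf8BomLength and its
signature are made up for illustration, and it assumes the caller already has
the raw bytes and a count of how many are available.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of Xerces-C): returns how many leading bytes
// to skip if the buffer begins with the UTF-8 BOM (0xEF 0xBB 0xBF), else 0.
std::size_t utf8BomLength(const unsigned char* buf, std::size_t avail)
{
    static const unsigned char kUtf8Bom[] = { 0xEF, 0xBB, 0xBF };
    const std::size_t kBomLen = sizeof(kUtf8Bom);

    // Only compare when at least a full BOM's worth of bytes is available.
    if (avail >= kBomLen && std::memcmp(buf, kUtf8Bom, kBomLen) == 0)
        return kBomLen;

    return 0;
}
```

A caller would advance its raw-buffer index by the returned count before
looking for the XML declaration, so the BOM never reaches the character
buffer and whatever encoding= says is still what gets used.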

Arundhati

Dean Roddey wrote:

> What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
> form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
> are no components of it larger than a byte.
>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> [EMAIL PROTECTED]
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Monday, July 31, 2000 12:00 PM
> Subject: cvs commit: xml-xerces/c/src/internal XMLReader.cpp
>
> > aruna1      00/07/31 12:00:50
> >
> >   Modified:    c/src/internal XMLReader.cpp
> >   Log:
> >   Fixed BOM in UTF-8 files
> >
> >   Revision  Changes    Path
> >   1.20      +15 -2     xml-xerces/c/src/internal/XMLReader.cpp
> >
> >   Index: XMLReader.cpp
> >   ===================================================================
> >   RCS file: /home/cvs/xml-xerces/c/src/internal/XMLReader.cpp,v
> >   retrieving revision 1.19
> >   retrieving revision 1.20
> >   diff -u -r1.19 -r1.20
> >   --- XMLReader.cpp 2000/07/25 22:33:05 1.19
> >   +++ XMLReader.cpp 2000/07/31 19:00:48 1.20
> >   @@ -55,7 +55,7 @@
> >     */
> >
> >    /*
> >   - * $Id: XMLReader.cpp,v 1.19 2000/07/25 22:33:05 aruna1 Exp $
> >   + * $Id: XMLReader.cpp,v 1.20 2000/07/31 19:00:48 aruna1 Exp $
> >     */
> >
> >
> >    // ---------------------------------------------------------------------------
> >   @@ -1331,11 +1331,24 @@
> >                break;
> >            }
> >
> >   -        case XMLRecognizer::US_ASCII :
> >            case XMLRecognizer::UTF_8 :
> >            {
> >   +            // If there's a utf-8 BOM  (0xEF 0xBB 0xBF), skip past it.
> >   +            //   Don't move to char buf - no one wants to see it.
> >   +            //   Note: this causes any encoding= declaration to override
> >   +            //         the BOM's attempt to say that the encoding is utf-8.
> >   +
> >                // Look at the raw buffer as short chars
> >                const char* asChars = (const char*)fRawByteBuf;
> >   +
> >   +            if (fRawBytesAvail > XMLRecognizer::fgUTF8BOMLen &&
> >   +                XMLString::compareNString(  asChars
> >   +                                            , XMLRecognizer::fgUTF8BOM
> >   +                                            , XMLRecognizer::fgUTF8BOMLen) == 0)
> >   +            {
> >   +                fRawBufIndex += XMLRecognizer::fgUTF8BOMLen;
> >   +                asChars      += XMLRecognizer::fgUTF8BOMLen;
> >   +            }
> >
> >                //
> >                //  First check that there are enough bytes to even see the
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >

--


Arundhati Bhowmick
IBM -- XML Technology Group (Silicon Valley)

