DO NOT REPLY [Bug 3914] New: - Encoding declaration ignored when doc saved with UTF-8 BOM

bugzilla Mon, 01 Oct 2001 17:29:17 -0700

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3914>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3914

Encoding declaration ignored when doc saved with UTF-8 BOM

           Summary: Encoding declaration ignored when doc saved with UTF-8
                    BOM
           Product: Xerces-J
           Version: CVS extract
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Major
          Priority: Other
         Component: Core
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


Bug
The encoding declaration is not recognized when the document is saved as UTF-8 with a BOM.

Example.

<?xml version="1.0" encoding="Anything-you-like-without-a-space"?>

passes validation when saved as UTF-8 With BOM.  The encoding declaration is 
ignored.
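
To reproduce this without relying on a particular editor, the test document can be built byte by byte. The following is only a sketch: the class name BomSample, the helper withBom and the <root/> element are assumptions made for illustration, since the report shows only the declaration line.

```java
import java.nio.charset.StandardCharsets;

final class BomSample {
    // The three-byte UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns the BOM followed by the given XML text encoded as UTF-8.
    static byte[] withBom(String xml) {
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        // The root element is an assumption; the report only gives the declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"Anything-you-like-without-a-space\"?>\n"
                + "<root/>\n";
        byte[] doc = withBom(xml);
        System.out.printf("first bytes: %02X %02X %02X %c%n", doc[0], doc[1], doc[2], (char) doc[3]);
    }
}
```

Feeding these bytes to the parser should exercise the code path described below.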


Reason and Solution 

First Issue

The reason lies mainly within org.apache.xerces.readers.UTF8Recognizer and how 
it uses org.apache.xerces.utils.ChunkyByteArray.

When a check for the BOM is done

        if (seeBOM) {
            // it will have the same content anyway.
            data.read(fUTF8BOM, 0, 3);

the data.read call advances the fOffset variable of the ChunkyByteArray by 3, 
past the BOM, as it should.

But in the later check for the characters "<?xml", we look directly at 
the data[][] of the ChunkyByteArray.  Here the fOffset variable is NOT taken 
into account.  This means that if a BOM is present, even though we moved the 
pointer on by 3, data.byteAt(0) will ALWAYS look at the first byte in the 
array, not the byte at fOffset.

The solution was to create a BOMOffset variable, initialized to 0. If 
(seeBOM == true) then BOMOffset = 3;

The checks are then done on data.byteAt(BOMOffset + 0) etc.
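
The difference between the relative read and the absolute byteAt lookup can be shown with java.nio.ByteBuffer, used here purely as a stand-in for ChunkyByteArray (whose internals are not reproduced). In this sketch the relative get advances the position, like data.read, while get(int) ignores the position, like byteAt. The class and method names are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class AbsoluteVersusRelative {
    // Returns 3 if the buffer starts with the UTF-8 BOM (EF BB BF), else 0.
    static int bomOffset(ByteBuffer data) {
        boolean seeBOM = data.limit() >= 3
                && data.get(0) == (byte) 0xEF
                && data.get(1) == (byte) 0xBB
                && data.get(2) == (byte) 0xBF;
        return seeBOM ? 3 : 0;
    }

    // True if "<?xml" starts at the given absolute index.
    static boolean startsWithXmlDecl(ByteBuffer data, int start) {
        byte[] expected = "<?xml".getBytes(StandardCharsets.US_ASCII);
        if (data.limit() - start < expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (data.get(start + i) != expected[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l', ' '};
        ByteBuffer data = ByteBuffer.wrap(bytes);
        int offset = bomOffset(data);
        data.get(new byte[3]);  // relative read: advances the position past the BOM
        System.out.println("position after relative read: " + data.position());
        // Absolute lookups ignore the position, so index 0 is still the BOM.
        System.out.println("declaration found at 0: " + startsWithXmlDecl(data, 0));
        System.out.println("declaration found at BOM offset: " + startsWithXmlDecl(data, offset));
    }
}
```

Only the lookup that adds the BOM offset finds the declaration, which is the effect the BOMOffset variable is meant to achieve.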


Second Issue

If you get past this stage of the code you run into more problems.

We create an XMLEntityHandler.EntityReader and pass the data stream into it.  
Here, the constructor creates a variable fCurrentOffset with an initial value 
of 0.  There is no way to set this initial value to anything other than 0. 
Then we run into a similar problem to the one above, where the correct offset 
is not taken into account.

What should happen, in my view, is that a method should be added to the 
ChunkyByteArray to return the value of fOffset (the current offset of the byte 
array).  Then, in the constructor of the XMLEntityHandler.EntityReader, we set 
fCurrentOffset to this value.  Then all our problems are solved.
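
A minimal sketch of that proposal follows. ByteSource and DeclReader are hypothetical stand-ins for ChunkyByteArray and the declaration reader, and getOffset is an assumed name for the accessor; none of these exist in Xerces under these names.

```java
final class OffsetHandOff {
    // Hypothetical stand-in for ChunkyByteArray: a byte array with a read offset.
    static final class ByteSource {
        private final byte[] bytes;
        private int offset;  // plays the role of fOffset

        ByteSource(byte[] bytes) { this.bytes = bytes; }

        // Relative skip, like reading n bytes and discarding them.
        void skip(int n) { offset = Math.min(bytes.length, offset + n); }

        // Absolute access that ignores the read offset, as byteAt is described above.
        byte byteAt(int index) { return bytes[index]; }

        // The accessor proposed above (the name is an assumption).
        int getOffset() { return offset; }
    }

    // Hypothetical stand-in for the declaration reader.
    static final class DeclReader {
        private final ByteSource data;
        private int currentOffset;  // plays the role of fCurrentOffset

        DeclReader(ByteSource data) {
            this.data = data;
            // Start where the source already is, instead of at 0.
            this.currentOffset = data.getOffset();
        }

        boolean lookingAtByte(byte expected, boolean skipPast) {
            if (data.byteAt(currentOffset) != expected) return false;
            if (skipPast) currentOffset++;
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l', ' '};
        ByteSource data = new ByteSource(bytes);
        data.skip(3);  // the recognizer has already consumed the BOM
        DeclReader reader = new DeclReader(data);
        System.out.println("reader sees '<': " + reader.lookingAtByte((byte) '<', true));
    }
}
```

With the offset handed over in the constructor, the reader needs no BOM-specific code of its own.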

However, I created a workaround in the UTF8Recognizer to increase this offset 
to its correct value with the following code:

 if (declReader.lookingAtChar((char)fUTF8BOM[0],true))
        if(declReader.lookingAtChar((char)fUTF8BOM[1],true))
                if(declReader.lookingAtChar((char)fUTF8BOM[2],true)){}


Unfortunately, the lookingAtChar method of the XMLEntityHandler.EntityReader 
takes (char, boolean) as arguments, even though within the method we are 
comparing the first argument with a byte.  We therefore have to cast the 
fUTF8BOM bytes to chars, and then also cast fData.byteAt(fCurrentOffset) to 
char inside the lookingAtChar method of the XMLEntityHandler.EntityReader.
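
The reason both casts are needed is that Java bytes are signed. The sketch below (class and method names are illustrative only) compares a BOM byte with a char in three ways: directly, with the byte cast to char as in the workaround, and with the byte masked to its unsigned value.

```java
final class ByteCharComparison {
    // Direct comparison: both operands are promoted to int, and the byte is sign-extended.
    static boolean naiveEquals(byte b, char ch) {
        return b == ch;
    }

    // Comparison after casting the byte to char, as in the workaround.
    static boolean castEquals(byte b, char ch) {
        return (char) b == ch;
    }

    // Comparison against the unsigned value of the byte.
    static boolean unsignedEquals(byte b, char ch) {
        return (b & 0xFF) == ch;
    }

    public static void main(String[] args) {
        byte bom0 = (byte) 0xEF;      // -17 as a signed byte
        char asChar = (char) bom0;    // 0xFFEF after sign extension and narrowing
        System.out.println("naive:    " + naiveEquals(bom0, asChar));
        System.out.println("cast:     " + castEquals(bom0, asChar));
        System.out.println("unsigned: " + unsignedEquals(bom0, (char) 0xEF));
        System.out.println("ascii:    " + naiveEquals((byte) '<', '<'));
    }
}
```

For ASCII bytes all three comparisons agree, so the mismatch only shows up for bytes of 0x80 and above, such as the BOM. Masking with 0xff, as lookingAtSpace already does in the listing below, would be an alternative to casting on both sides.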

Third Issue

Then, when we think all is safe, the method data.rewind() is called before 
returning the readers.  The intention here is to return the pointer on the 
array back to the start of the array.  Unfortunately, this just undoes the 
above code of data.read(fUTF8BOM, 0, 3), which moved the pointer past the BOM.  
After data.rewind(), the pointer is BEFORE the BOM once again.

The data.rewind() is not needed at all, as there is no method call in the class 
which unduly moves the fOffset past its correct position. Removing it 
eliminates the above issue.
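
The effect of the rewind can be illustrated with java.io.ByteArrayInputStream, again only as a stand-in for ChunkyByteArray: its reset() returns to position 0 when no mark has been set, which is assumed here to mirror what data.rewind() does according to the description above. The class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;

final class RewindAfterBom {
    // Consumes the three BOM bytes, optionally rewinds, and returns the next
    // byte a downstream reader would see (0..255, or -1 at end of stream).
    static int firstByteSeenByReader(byte[] bytes, boolean rewind) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        in.read(new byte[3], 0, 3);  // skip past the BOM
        if (rewind) {
            in.reset();  // no mark was set, so this goes back to position 0
        }
        return in.read();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l', ' '};
        System.out.printf("with rewind:    0x%02X%n", firstByteSeenByReader(bytes, true));
        System.out.printf("without rewind: 0x%02X%n", firstByteSeenByReader(bytes, false));
    }
}
```

With the rewind, the reader is handed the first BOM byte again. Without it, the reader starts at the '<' of the declaration.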

Below is the modified UTF8Recognizer and XMLDeclReader for your review.

==========================

package org.apache.xerces.readers;

import org.apache.xerces.framework.XMLErrorReporter;
import org.apache.xerces.utils.ChunkyByteArray;
import org.apache.xerces.utils.QName;
import org.apache.xerces.utils.StringPool;

import java.io.InputStreamReader;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

/**
 *
 * @version
 */
final class UTF8Recognizer extends XMLDeclRecognizer {
    private byte[] fUTF8BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
    //
    //
    //
    public XMLEntityHandler.EntityReader recognize(XMLEntityReaderFactory readerFactory,
                                                   XMLEntityHandler entityHandler,
                                                   XMLErrorReporter errorReporter,
                                                   boolean sendCharDataAsCharArray,
                                                   StringPool stringPool,
                                                   ChunkyByteArray data,
                                                   boolean xmlDecl,
                                                   boolean allowJavaEncodingName) throws Exception {
        XMLEntityHandler.EntityReader reader = null;

        //check to see if there is a UTF8 BOM, if see one, skip past it.
        boolean seeBOM = false;
        int BOMOffset = 0;

        byte bom0 = data.byteAt(0);
        if (bom0 == fUTF8BOM[0]) {
            byte bom1 = data.byteAt(1);
            if (bom1 == fUTF8BOM[1]) {
                byte bom2 = data.byteAt(2);
                if (bom2 == fUTF8BOM[2]) {
                    seeBOM = true;
                }
            }
        }
        if (seeBOM) {
            // it will have the same content anyway.
            data.read(fUTF8BOM, 0, 3);
            BOMOffset = 3;
        }

        byte b0 = data.byteAt(BOMOffset + 0);
        boolean debug = false;

        if (b0 == '<') {
            int b1 = data.byteAt(BOMOffset + 1);
            if (b1 == '?') {
                if (data.byteAt(BOMOffset + 2) == 'x' && data.byteAt(BOMOffset + 3) == 'm' && data.byteAt(BOMOffset + 4) == 'l') {
                    int b5 = data.byteAt(BOMOffset + 5);
                    if (b5 == 0x20 || b5 == 0x09 || b5 == 0x0a || b5 == 0x0d) {
                        XMLEntityHandler.EntityReader declReader = new XMLDeclReader(entityHandler, errorReporter, sendCharDataAsCharArray, data, stringPool);

                        //Need to skip past BOM if Present. Method moves the
                        //fCurrentOffset pointer on the array past the BOM.
                        //A better way to do this is to expose the fOffset
                        //variable in the ChunkyByteArray. Then set the fCurrentOffset
                        //variable of the XMLEntityHandler.EntityReader to this
                        //value.  At the moment, this variable is set
                        //to 0, so if BOM is present, encoding cannot be
                        //detected. The below 3 lines is just a workaround.

                        if (declReader.lookingAtChar((char)fUTF8BOM[0],true))
                          if(declReader.lookingAtChar((char)fUTF8BOM[1],true))
                            if(declReader.lookingAtChar((char)fUTF8BOM[2],true)) {}

                        // Finished workaround

                        int encoding = prescanXMLDeclOrTextDecl(declReader, xmlDecl);
                        if (encoding != -1) {
                            String encname = stringPool.orphanString(encoding);
                            String enc = encname.toUpperCase();
                            if ("ISO-10646-UCS-2".equals(enc)) throw new UnsupportedEncodingException(encname);
                            if ("ISO-10646-UCS-4".equals(enc)) throw new UnsupportedEncodingException(encname);
                            if ("UTF-16".equals(enc)) throw new UnsupportedEncodingException(encname);

                            String javaencname = MIME2Java.convert(enc);
                            if (null == javaencname) {
                                // Not supported
                                if (allowJavaEncodingName) {
                                    javaencname = encname;
                                } else {
                                    throw new UnsupportedEncodingException(encname);
                                }
                            }
                            try {
                                //data.rewind();
                                if ("UTF-8".equalsIgnoreCase(javaencname) || "UTF8".equalsIgnoreCase(javaencname)) {
                                    reader = readerFactory.createUTF8Reader(entityHandler, errorReporter, sendCharDataAsCharArray, data, stringPool);
                                } else {
                                    reader = readerFactory.createCharReader(entityHandler, errorReporter, sendCharDataAsCharArray,
                                                                            new InputStreamReader(data, javaencname), stringPool);
                                }
                            } catch (UnsupportedEncodingException e) {
                                throw new UnsupportedEncodingException(encname);
                            } catch (Exception e) {
                                if( debug == true )
                                   e.printStackTrace();            // Internal Error
                            }
                        } else {
                            //data.rewind();
                            reader = readerFactory.createUTF8Reader(entityHandler, errorReporter, sendCharDataAsCharArray, data, stringPool);
                        }
                    }
                }
            }
        }
        return reader;
    }

    final class XMLDeclReader extends XMLEntityReader {
        //
        //
        //
        private StringPool fStringPool = null;
        private ChunkyByteArray fData = null;
        //
        //
        //
        XMLDeclReader(XMLEntityHandler entityHandler, XMLErrorReporter errorReporter, boolean sendCharDataAsCharArray, ChunkyByteArray data, StringPool stringPool) {
            super(entityHandler, errorReporter, sendCharDataAsCharArray);
            fStringPool = stringPool;
            fData = data;
        }
        //
        // These methods are used to parse XMLDecl/TextDecl.
        //
        public boolean lookingAtChar(char ch, boolean skipPastChar) throws IOException {
            if ((char)fData.byteAt(fCurrentOffset) != ch)
                return false;
            if (skipPastChar)
                fCurrentOffset++;
            return true;
        }
        public boolean lookingAtSpace(boolean skipPastChar) throws IOException {
            int ch = fData.byteAt(fCurrentOffset) & 0xff;
            if (ch != 0x20 && ch != 0x09 && ch != 0x0A && ch != 0x0D)
                return false;
            if (skipPastChar)
                fCurrentOffset++;
            return true;
        }
        public void skipPastSpaces() throws IOException {
            while (true) {
                int ch = fData.byteAt(fCurrentOffset) & 0xff;
                if (ch != 0x20 && ch != 0x09 && ch != 0x0A && ch != 0x0D)
                    return;
                fCurrentOffset++;
            }
        }
        public boolean skippedString(char[] s) throws IOException {
            int offset = fCurrentOffset;
            for (int i = 0; i < s.length; i++) {
                if (fData.byteAt(offset) != s[i])
                    return false;
                offset++;
            }
            fCurrentOffset = offset;
            return true;
        }
        public int scanStringLiteral() throws Exception {
            boolean single;
            if (!(single = lookingAtChar('\'', true)) && !lookingAtChar('\"', true)) {
                return XMLEntityHandler.STRINGLIT_RESULT_QUOTE_REQUIRED;
            }
            int offset = fCurrentOffset;
            char qchar = single ? '\'' : '\"';
            while (true) {
                byte b = fData.byteAt(fCurrentOffset);
                if (b == qchar)
                    break;
                if (b == -1)
                    return XMLEntityHandler.STRINGLIT_RESULT_QUOTE_REQUIRED;
                fCurrentOffset++;
            }
            int length = fCurrentOffset - offset;
            StringBuffer str = new StringBuffer(length);
            for (int i = 0; i < length; i++) {
                str.append((char)fData.byteAt(offset + i));
            }
            int stringIndex = fStringPool.addString(str.toString());
            fCurrentOffset++; // move past qchar
            return stringIndex;
        }
        //
        // The rest of the methods in XMLReader are not used for parsing XMLDecl/TextDecl.
        //
        public void append(XMLEntityHandler.CharBuffer charBuffer, int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public int addString(int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public int addSymbol(int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public void skipToChar(char ch) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void skipPastName(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void skipPastNmtoken(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public boolean lookingAtValidChar(boolean skipPastChar) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanInvalidChar() throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanCharRef(boolean hex) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanAttValue(char qchar, boolean asSymbol) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanEntityValue(int qchar, boolean createString) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public boolean scanExpectedName(char fastcheck, StringPool.CharArrayRange expectedName) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void scanQName(char fastcheck, QName qname) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanName(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanContent(QName element) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
    }
}

