DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3914>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3914

Encoding declaration ignored when doc saved with UTF-8 BOM

           Summary: Encoding declaration ignored when doc saved with UTF-8 BOM
           Product: Xerces-J
           Version: CVS extract
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Major
          Priority: Other
         Component: Core
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


Bug

The encoding declaration is not recognized when the document is saved as UTF-8 with a BOM.

Example:

    <?xml version="1.0" encoding="Anything-you-like-without-a-space"?>

passes validation when saved as UTF-8 with BOM. The encoding declaration is ignored.

Reason and Solution

First Issue

The reason lies mainly within org.apache.xerces.readers.UTF8Recognizer and how it uses utils.ChunkyByteArray. When the check for the BOM is done,

    if (seeBOM) {
        // it will have the same content anyway.
        data.read(fUTF8BOM, 0, 3);

the data.read call moves the fOffset variable of the ChunkyByteArray forward 3 places, past the BOM, as it should. But in the subsequent check for the characters "<?xml", we look directly at the data[][] of the ChunkyByteArray, and there the fOffset variable is NOT taken into account. This means that if a BOM is present, even though we moved the pointer forward 3 places, data.byteAt(0) is ALWAYS going to look at the first byte in the array, not at the byte at position 0 + fOffset.

The solution was to create a BOMOffset variable, initialized to 0. If seeBOM is true, then BOMOffset = 3; the checks are then done on data.byteAt(BOMOffset + 0) etc.

Second Issue

If you get past this stage of the code, you run into more problems. We create an XMLEntityHandler.EntityReader into which we pass the data stream. Its constructor creates a variable fCurrentOffset with an initial value of 0, and there is no way to set this initial value to anything other than 0. We then run into a problem similar to the one above: the correct offset is not taken into account.
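Both issues so far come down to mixing a consuming read (which advances a cursor) with absolute indexing (which ignores it). The following is only a minimal sketch of that mismatch, not the Xerces code: CursorBuffer and BomOffsetDemo are hypothetical stand-ins, and the real ChunkyByteArray stores its data in chunks rather than in a single array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a buffer with a read cursor; NOT the real ChunkyByteArray. */
final class CursorBuffer {
    private final byte[] bytes;
    private int offset = 0; // read cursor, playing the role of fOffset in the report

    CursorBuffer(byte[] bytes) { this.bytes = bytes; }

    int length() { return bytes.length; }

    /** Consumes up to len bytes into dest and advances the cursor. */
    int read(byte[] dest, int start, int len) {
        int n = Math.min(len, bytes.length - offset);
        System.arraycopy(bytes, offset, dest, start, n);
        offset += n;
        return n;
    }

    /** Absolute lookup: the cursor is deliberately ignored, as described in the report. */
    byte byteAt(int index) { return bytes[index]; }
}

final class BomOffsetDemo {
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns the given text encoded as UTF-8 with a BOM in front. */
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    static boolean startsWithBom(CursorBuffer data) {
        return data.length() >= 3
                && data.byteAt(0) == UTF8_BOM[0]
                && data.byteAt(1) == UTF8_BOM[1]
                && data.byteAt(2) == UTF8_BOM[2];
    }

    /** Checks for "<?xml" starting at the absolute index base. */
    static boolean looksLikeXmlDecl(CursorBuffer data, int base) {
        return data.length() >= base + 5
                && data.byteAt(base) == '<'
                && data.byteAt(base + 1) == '?'
                && data.byteAt(base + 2) == 'x'
                && data.byteAt(base + 3) == 'm'
                && data.byteAt(base + 4) == 'l';
    }

    public static void main(String[] args) {
        CursorBuffer data = new CursorBuffer(withBom("<?xml version=\"1.0\"?>"));
        int bomOffset = 0;
        if (startsWithBom(data)) {
            data.read(new byte[3], 0, 3); // advances the cursor only
            bomOffset = 3;                // remember the shift for absolute lookups
        }
        System.out.println(looksLikeXmlDecl(data, 0));         // false: index 0 is still the BOM
        System.out.println(looksLikeXmlDecl(data, bomOffset)); // true
    }
}
```

The sketch keeps the BOM in the buffer and carries the shift explicitly, which mirrors the BOMOffset approach described above.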
What should happen, in my view, is that a method should be added to ChunkyByteArray to return the value of fOffset (the current offset into the byte array). Then, in the constructor of the XMLEntityHandler.EntityReader, we set fCurrentOffset to this value, and all our problems are solved. However, I created a workaround in the UTF8Recognizer to advance this offset to its correct value, with the following code:

    if (declReader.lookingAtChar((char)fUTF8BOM[0],true))
        if(declReader.lookingAtChar((char)fUTF8BOM[1],true))
            if(declReader.lookingAtChar((char)fUTF8BOM[2],true)){}

Unfortunately, the lookingAtChar method of the XMLEntityHandler.EntityReader takes (char, boolean) as arguments, EVEN THOUGH within the method the first argument is compared with a byte. We therefore have to cast the fUTF8BOM bytes to chars, and then also cast fData.byteAt(fCurrentOffset) to char inside the lookingAtChar method of the XMLEntityHandler.EntityReader.

Third Issue

Then, when we think all is safe, the method data.rewind() is called before returning the readers. The intention here is to return the pointer to the start of the array. Unfortunately, this just undoes the data.read(fUTF8BOM, 0, 3) call above, which moved the pointer past the BOM: after data.rewind(), the pointer is BEFORE the BOM once again. The data.rewind() call is not needed at all, as no method call in the class unduly moves fOffset past its correct position. Removing it eliminates the above issue.

Below is the modified UTF8Recognizer and XMLDeclReader for your review.
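Before the full listing, here is a rough sketch of the accessor-based alternative proposed under the Second Issue, together with the rewind effect from the Third Issue. Everything in it is hypothetical: OffsetBuffer, DeclCursor, skip() and getOffset() are illustrative names only and are not part of the Xerces API.

```java
/** Hypothetical buffer exposing its read cursor; NOT the real ChunkyByteArray. */
final class OffsetBuffer {
    private final byte[] bytes;
    private int offset = 0; // plays the role of fOffset

    OffsetBuffer(byte[] bytes) { this.bytes = bytes; }

    /** Advances the cursor by n bytes, for example past a BOM. */
    void skip(int n) { offset = Math.min(bytes.length, offset + n); }

    /** Resets the cursor to the very start, i.e. before any BOM. */
    void rewind() { offset = 0; }

    /** The accessor the report suggests adding (name assumed here). */
    int getOffset() { return offset; }

    /** Absolute lookup as an unsigned value, or -1 past the end. */
    int byteAt(int index) { return index < bytes.length ? bytes[index] & 0xFF : -1; }
}

/** Hypothetical declaration reader that starts at the buffer's current offset. */
final class DeclCursor {
    private final OffsetBuffer data;
    private int currentOffset; // plays the role of fCurrentOffset

    DeclCursor(OffsetBuffer data) {
        this.data = data;
        this.currentOffset = data.getOffset(); // instead of always starting at 0
    }

    boolean lookingAtChar(char ch, boolean skipPastChar) {
        if (data.byteAt(currentOffset) != ch) return false;
        if (skipPastChar) currentOffset++;
        return true;
    }
}

final class OffsetDemo {
    public static void main(String[] args) {
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l', ' '};
        OffsetBuffer data = new OffsetBuffer(doc);
        data.skip(3); // move past the BOM

        DeclCursor afterBom = new DeclCursor(data);
        System.out.println(afterBom.lookingAtChar('<', true));    // true

        data.rewind(); // undoes the skip: a new reader starts on the BOM again
        DeclCursor afterRewind = new DeclCursor(data);
        System.out.println(afterRewind.lookingAtChar('<', true)); // false
    }
}
```

Because byteAt in this sketch returns an unsigned int, the comparison with a char needs no cast on either side. That is a design choice of the sketch, not something the report proposes.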
==========================

package org.apache.xerces.readers;

import org.apache.xerces.framework.XMLErrorReporter;
import org.apache.xerces.utils.ChunkyByteArray;
import org.apache.xerces.utils.QName;
import org.apache.xerces.utils.StringPool;

import java.io.InputStreamReader;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

/**
 *
 * @version
 */
final class UTF8Recognizer extends XMLDeclRecognizer {
    private byte[] fUTF8BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};

    //
    //
    //
    public XMLEntityHandler.EntityReader recognize(XMLEntityReaderFactory readerFactory,
                                                   XMLEntityHandler entityHandler,
                                                   XMLErrorReporter errorReporter,
                                                   boolean sendCharDataAsCharArray,
                                                   StringPool stringPool,
                                                   ChunkyByteArray data,
                                                   boolean xmlDecl,
                                                   boolean allowJavaEncodingName) throws Exception {
        XMLEntityHandler.EntityReader reader = null;

        //check to see if there is a UTF8 BOM, if see one, skip past it.
        boolean seeBOM = false;
        int BOMOffset = 0;
        byte bom0 = data.byteAt(0);
        if (bom0 == fUTF8BOM[0]) {
            byte bom1 = data.byteAt(1);
            if (bom1 == fUTF8BOM[1]) {
                byte bom2 = data.byteAt(2);
                if (bom2 == fUTF8BOM[2]) {
                    seeBOM = true;
                }
            }
        }
        if (seeBOM) {
            // it will have the same content anyway.
            data.read(fUTF8BOM, 0, 3);
            BOMOffset = 3;
        }

        byte b0 = data.byteAt(BOMOffset + 0);
        boolean debug = false;

        if (b0 == '<') {
            int b1 = data.byteAt(BOMOffset + 1);
            if (b1 == '?') {
                if (data.byteAt(BOMOffset + 2) == 'x' &&
                    data.byteAt(BOMOffset + 3) == 'm' &&
                    data.byteAt(BOMOffset + 4) == 'l') {
                    int b5 = data.byteAt(BOMOffset + 5);
                    if (b5 == 0x20 || b5 == 0x09 || b5 == 0x0a || b5 == 0x0d) {
                        XMLEntityHandler.EntityReader declReader =
                            new XMLDeclReader(entityHandler, errorReporter,
                                              sendCharDataAsCharArray, data, stringPool);

                        //Need to skip past BOM if Present. Method moves the fCurrentOffset pointer on the array past the BOM.
                        //A better way to do this is to expose the fOffset variable in the ChunkyByteArray. Then set the fCurrentOffset
                        //variable of the XMLEntityHandler.EntityReader to this value. At the moment, this variable is set
                        //to 0, so if BOM is present, encoding cannot be detected. The below 3 lines is just a workaround.
                        if (declReader.lookingAtChar((char)fUTF8BOM[0],true))
                            if(declReader.lookingAtChar((char)fUTF8BOM[1],true))
                                if(declReader.lookingAtChar((char)fUTF8BOM[2],true)) {}
                        // Finished workaround

                        int encoding = prescanXMLDeclOrTextDecl(declReader, xmlDecl);
                        if (encoding != -1) {
                            String encname = stringPool.orphanString(encoding);
                            String enc = encname.toUpperCase();
                            if ("ISO-10646-UCS-2".equals(enc)) throw new UnsupportedEncodingException(encname);
                            if ("ISO-10646-UCS-4".equals(enc)) throw new UnsupportedEncodingException(encname);
                            if ("UTF-16".equals(enc)) throw new UnsupportedEncodingException(encname);

                            String javaencname = MIME2Java.convert(enc);
                            if (null == javaencname) {
                                // Not supported
                                if (allowJavaEncodingName) {
                                    javaencname = encname;
                                } else {
                                    throw new UnsupportedEncodingException (encname);
                                }
                            }
                            try {
                                //data.rewind();
                                if ("UTF-8".equalsIgnoreCase(javaencname) || "UTF8".equalsIgnoreCase(javaencname)) {
                                    reader = readerFactory.createUTF8Reader (entityHandler, errorReporter,
                                                                             sendCharDataAsCharArray, data, stringPool);
                                } else {
                                    reader = readerFactory.createCharReader (entityHandler, errorReporter,
                                                                             sendCharDataAsCharArray,
                                                                             new InputStreamReader(data, javaencname),
                                                                             stringPool);
                                }
                            } catch (UnsupportedEncodingException e) {
                                throw new UnsupportedEncodingException(encname);
                            } catch (Exception e) {
                                if( debug == true )
                                    e.printStackTrace(); // Internal Error
                            }
                        } else {
                            //data.rewind();
                            reader = readerFactory.createUTF8Reader (entityHandler, errorReporter,
                                                                     sendCharDataAsCharArray, data, stringPool);
                        }
                    }
                }
            }
        }
        return reader;
    }

    final class XMLDeclReader extends XMLEntityReader {
        //
        //
        //
        private StringPool fStringPool = null;
        private ChunkyByteArray fData = null;

        //
        //
        //
        XMLDeclReader(XMLEntityHandler entityHandler, XMLErrorReporter errorReporter,
                      boolean sendCharDataAsCharArray, ChunkyByteArray data, StringPool stringPool) {
            super(entityHandler, errorReporter, sendCharDataAsCharArray);
            fStringPool = stringPool;
            fData = data;
        }

        //
        // These methods are used to parse XMLDecl/TextDecl.
        //
        public boolean lookingAtChar(char ch, boolean skipPastChar) throws IOException {
            if ((char)fData.byteAt(fCurrentOffset) != ch)
                return false;
            if (skipPastChar)
                fCurrentOffset++;
            return true;
        }

        public boolean lookingAtSpace(boolean skipPastChar) throws IOException {
            int ch = fData.byteAt(fCurrentOffset) & 0xff;
            if (ch != 0x20 && ch != 0x09 && ch != 0x0A && ch != 0x0D)
                return false;
            if (skipPastChar)
                fCurrentOffset++;
            return true;
        }

        public void skipPastSpaces() throws IOException {
            while (true) {
                int ch = fData.byteAt(fCurrentOffset) & 0xff;
                if (ch != 0x20 && ch != 0x09 && ch != 0x0A && ch != 0x0D)
                    return;
                fCurrentOffset++;
            }
        }

        public boolean skippedString(char[] s) throws IOException {
            int offset = fCurrentOffset;
            for (int i = 0; i < s.length; i++) {
                if (fData.byteAt(offset) != s[i])
                    return false;
                offset++;
            }
            fCurrentOffset = offset;
            return true;
        }

        public int scanStringLiteral() throws Exception {
            boolean single;
            if (!(single = lookingAtChar('\'', true)) && !lookingAtChar('\"', true)) {
                return XMLEntityHandler.STRINGLIT_RESULT_QUOTE_REQUIRED;
            }
            int offset = fCurrentOffset;
            char qchar = single ? '\'' : '\"';
            while (true) {
                byte b = fData.byteAt(fCurrentOffset);
                if (b == qchar)
                    break;
                if (b == -1)
                    return XMLEntityHandler.STRINGLIT_RESULT_QUOTE_REQUIRED;
                fCurrentOffset++;
            }
            int length = fCurrentOffset - offset;
            StringBuffer str = new StringBuffer(length);
            for (int i = 0; i < length; i++) {
                str.append((char)fData.byteAt(offset + i));
            }
            int stringIndex = fStringPool.addString(str.toString());
            fCurrentOffset++; // move past qchar
            return stringIndex;
        }

        //
        // The rest of the methods in XMLReader are not used for parsing XMLDecl/TextDecl.
        //
        public void append(XMLEntityHandler.CharBuffer charBuffer, int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public int addString(int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public int addSymbol(int offset, int length) {
            throw new RuntimeException("RDR002 cannot happen");
        }
        public void skipToChar(char ch) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void skipPastName(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void skipPastNmtoken(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public boolean lookingAtValidChar(boolean skipPastChar) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanInvalidChar() throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanCharRef(boolean hex) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanAttValue(char qchar, boolean asSymbol) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanEntityValue(int qchar, boolean createString) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public boolean scanExpectedName(char fastcheck, StringPool.CharArrayRange expectedName) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public void scanQName(char fastcheck, QName qname) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanName(char fastcheck) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
        public int scanContent(QName element) throws IOException {
            throw new IOException("RDR002 cannot happen");
        }
    }
}
