[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #11 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 2010-02-13 16:34:17 UTC ---
This seems to be a bug in gcj or libgcj. See my email to the Java GCC mailing list: http://gcc.gnu.org/ml/java/2010-02/msg0.html

--- Comment #12 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 2010-02-15 10:51:28 UTC ---
In the meantime, Platonides (or anyone with SVN write access), could you please apply the patch from my comment #1 (https://bugzilla.wikimedia.org/show_bug.cgi?id=22137#c1)? Without it, it is impossible to know at which line a SAX parsing error occurs.

--
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching all bug changes.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #9 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 2010-02-13 11:11:27 UTC ---
Created an attachment (id=7115) --> (https://bugzilla.wikimedia.org/attachment.cgi?id=7115)
Much simpler Java code that demonstrates the error

Compile with:
gcj -o test --main=Test Test.java

Run it with the demo XML code saved as test.xml.
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #10 from Platonides platoni...@gmail.com 2010-02-13 16:30:32 UTC ---
Sun JDK / OpenJDK is not affected.
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #3 from Bawolff bawolff...@gmail.com 2010-02-12 19:32:37 UTC ---
Created an attachment (id=7114) --> (https://bugzilla.wikimedia.org/attachment.cgi?id=7114)
Much simpler test case that demonstrates the error

This is a Unicode issue. If you remove the
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Bawolff bawolff...@gmail.com changed:

What    |Removed    |Added
CC      |           |bawolff...@gmail.com

--- Comment #4 from Bawolff bawolff...@gmail.com 2010-02-12 19:33:19 UTC ---
Bugzilla screwed up my comment:

This is a Unicode issue. If you remove the
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #5 from Bawolff bawolff...@gmail.com 2010-02-12 19:34:53 UTC ---
OK, apparently Bugzilla suffers from the same issue as mwdumper ;)

This is a Unicode issue. If you remove the [Unicode character removed from comment, lest Bugzilla hate me] (U+1D59F, MATHEMATICAL BOLD FRAKTUR SMALL Z; the article claims it is U+1D537, which is MATHEMATICAL FRAKTUR SMALL Z, but that is not the character that is actually in the text), everything works fine.

Since it is not choking on more ordinary Unicode characters, I imagine it has something to do with that character being a 4-byte character in UTF-8. It also appears that this interacts with other content in the file, as the character does not cause the error by itself. Specifically, entity references seem to be what causes it to die after encountering the Unicode character. I think the parser misreads that character in such a way that the text following it is treated as the start of a new tag name, but & (aka U+0026) cannot start a tag name. Newline characters may also have something to do with it, as removing the newline between the Unicode character and the & changes the error message.

Changing the summary to more adequately reflect what I think the problem is. Attaching a simpler test case.

Note also that if you replace the Unicode character with its character reference (&#x1D59F;), everything works fine.
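The observation in comment #5 (literal character versus character reference) can be checked with a small sketch along the following lines. This is not the attached test case: the class name, the helper methods and the tiny XML document are made up for illustration, and the code assumes a reasonably modern JDK. On a JVM that is not affected by the bug, both variants should yield the same character data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SupplementaryCharCheck {
    // U+1D59F as a Java string (stored as a surrogate pair in UTF-16).
    static final String FRAKTUR_Z = new String(Character.toChars(0x1D59F));

    // Wraps the content in a tiny, made-up XML document: the character,
    // then a newline, then an entity reference, as described in comment #5.
    static String document(String content) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<text>" + content + "\n&amp; more</text>";
    }

    // Parses the document from UTF-8 bytes and returns its character data.
    static String parse(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String literal = parse(document(FRAKTUR_Z));     // raw 4-byte UTF-8 sequence
        String escaped = parse(document("&#x1D59F;"));   // character reference
        System.out.println("same text: " + literal.equals(escaped));
    }
}
```

If the parser mishandles the 4-byte sequence, the first `parse` call is the one expected to throw, while the character-reference variant should still go through.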
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #6 from Platonides platoni...@gmail.com 2010-02-12 22:58:48 UTC ---
Java internally uses UTF-16:

"The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes. Hence a charset in the Java platform defines a mapping between sequences of sixteen-bit values in UTF-16 and sequences of bytes."

http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html

The file contains U+01D59F in UTF-8, thus F0 9D 96 9F. In binary 10011101 10010110 1001
I don't see why it is reading a U+26 (100110).

PS: Maybe Bugzilla is using MySQL with utf-8 instead of binary? MySQL's Unicode support currently only covers the BMP.
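The byte arithmetic in comment #6 can be reproduced with a few lines of Java. This is only a sketch: the class and method names are invented for illustration, and it assumes a JDK that provides `java.nio.charset.StandardCharsets`. It decodes the four bytes F0 9D 96 9F as UTF-8 and prints the resulting code point together with the two UTF-16 code units (the surrogate pair) that Java stores internally.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // The four bytes quoted in comment #6.
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};

    // Decodes the bytes as UTF-8 into a Java (UTF-16) string.
    static String decode() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        // One code point outside the BMP ...
        System.out.printf("code point: U+%X%n", s.codePointAt(0));
        // ... represented by two 16-bit code units (a surrogate pair).
        System.out.printf("UTF-16 units: %04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```

Neither the code point nor the surrogates contain 0x26, which fits the later conclusion in the thread that the U+26 comes from the entity reference following the character, not from the character itself.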
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #7 from Bawolff bawolff...@gmail.com 2010-02-12 23:41:45 UTC ---
> Java internally uses UTF-16

Yes it does, but I think the file is interpreted as UTF-8; otherwise it wouldn't be able to make sense of it at all, as UTF-8 and UTF-16 look fairly different for your average English text (I'm under the impression that UTF-16 is not compatible with ASCII, so nothing would work at all if it were using UTF-16).

> I don't see why it is reading a U+26 (100110).

The entity references that come after the problematic Unicode character are where the U+26 (&) comes from. It is not considered a valid (tag) name start character in XML. The question is why Java, after failing to interpret the fancy Unicode character, would think that the document was starting a new tag.

If you interpret F0 9D 96 9F as UTF-16, you get:

U+F09D: No name (Private Use Area)
隟 U+969F: Han ideograph (CJK Unified Ideographs)

which theoretically shouldn't cause any problems (of course the rest of the file wouldn't make sense, and there is no guarantee that that is where the code unit boundaries would fall).

I'm thinking this is a bug in the underlying Java libraries, as opposed to mwdumper.
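The UTF-16 reading mentioned in comment #7 can be sketched as follows. The class and method names are hypothetical, and big-endian byte order is an assumption here (the comment does not say which order was meant, but big-endian is the one that yields U+F09D and U+969F).

```java
import java.nio.charset.StandardCharsets;

public class Utf16Misread {
    // The same four bytes, this time (wrongly) treated as UTF-16.
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};

    // Assumption: big-endian order, so each byte pair forms one code unit.
    static String misread() {
        return new String(BYTES, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String s = misread();
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

Both resulting code units are ordinary BMP characters rather than surrogates, so a misreading of this kind would not by itself explain an error about U+26.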
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #8 from Platonides platoni...@gmail.com 2010-02-12 23:48:54 UTC ---
(In reply to comment #7)
> > Java internally uses UTF-16
>
> Yes it does, but I think the file is interpreted as UTF-8; otherwise it
> wouldn't be able to make sense of it at all, as UTF-8 and UTF-16 look fairly
> different for your average English text (I'm under the impression that UTF-16
> is not compatible with ASCII, so nothing would work at all if it were using
> UTF-16).

Right. But it could be overflowing the 16 bits, or some other failure.

> > I don't see why it is reading a U+26 (100110).
>
> The entity references that come after the problematic Unicode character are
> where the U+26 (&) comes from.

Interesting. Saving from Firefox produced a literal & in the output.

> I'm thinking this is a bug in the underlying Java libraries, as opposed to
> mwdumper

I also think so.
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Daniel Kinzler brightb...@gmail.com changed:

What    |Removed    |Added
CC      |           |brightb...@gmail.com
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Platonides platoni...@gmail.com changed:

What    |Removed    |Added
CC      |           |platoni...@gmail.com
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Nemo_bis federicol...@tiscali.it changed:

What    |Removed    |Added
CC      |           |federicol...@tiscali.it
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #1 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 2010-01-18 10:59:50 UTC ---
Here is a diff that adds line and column information to the exception message:

===
--- src/org/mediawiki/importer/XmlDumpReader.java (revision 61197)
+++ src/org/mediawiki/importer/XmlDumpReader.java (working copy)
@@ -36,6 +36,7 @@
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
+import org.xml.sax.SAXParseException;
 import org.xml.sax.Attributes;
 import org.xml.sax.SAXException;
@@ -82,15 +83,17 @@
  */
 public void readDump() throws IOException {
     try {
-        SAXParserFactory factory = SAXParserFactory.newInstance();
-        SAXParser parser = factory.newSAXParser();
+        SAXParserFactory factory = SAXParserFactory.newInstance();
+        SAXParser parser = factory.newSAXParser();
         parser.parse(input, this);
     } catch (ParserConfigurationException e) {
         throw (IOException)new IOException(e.getMessage()).initCause(e);
+    } catch (SAXParseException e) {
+        throw (IOException)new IOException(e.getMessage() + " (line: " + e.getLineNumber() + " column: " + e.getColumnNumber() + ")").initCause(e);
     } catch (SAXException e) {
-        throw (IOException)new IOException(e.getMessage()).initCause(e);
-    }
+        throw (IOException)new IOException(e.getMessage()).initCause(e);
+    }
     writer.close();
 }
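The idea behind the diff can be tried outside mwdumper with a self-contained sketch like the one below. It is not the mwdumper code: the class name, the static `readDump(InputStream)` signature and the deliberately malformed sample input are assumptions made for illustration. It only shows `SAXParseException` being caught before the more general `SAXException` so that its position can be added to the message.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class LineInfoDemo {
    // Parses the input and rethrows parse errors as IOException,
    // appending line and column when the parser provides them.
    static void readDump(InputStream input) throws IOException {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input, new DefaultHandler());
        } catch (ParserConfigurationException e) {
            throw new IOException(e.getMessage(), e);
        } catch (SAXParseException e) {
            // Must come before SAXException, which is its superclass.
            throw new IOException(e.getMessage() + " (line: " + e.getLineNumber()
                    + " column: " + e.getColumnNumber() + ")", e);
        } catch (SAXException e) {
            throw new IOException(e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Made-up malformed input: a bare '&' is not allowed in character data.
        String xml = "<page>\n<title>A & B</title>\n</page>";
        try {
            readDump(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The exact message text and the reported column depend on the XML parser in use, so they will differ between libgcj and other JVMs.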
[Bug 22137] mwdumper dies with not a name start character: U+26 error
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #2 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 2010-01-18 11:37:36 UTC ---
Created an attachment (id=6965) --> (https://bugzilla.wikimedia.org/attachment.cgi?id=6965)
Problematic part of the XML dump

I have extracted the problematic part of the dump; see the attachment.

$ mwdumper --format=sql:1.5 sample.xml.bz2 | lzma -c -d sample.sql.lzma
Exception in thread "main" java.io.IOException: not a name start character: U+26 (line: 82 column: 1)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: U+26
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: U+26
   at gnu.xml.stream.XMLParser.error(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
   at gnu.xml.stream.XMLParser.next(libgcj.so.81)
   at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   ...4 more