[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #11 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 
2010-02-13 16:34:17 UTC ---
This seems to be a bug in gcj or libgcj. See my email to the GCC Java mailing list:
http://gcc.gnu.org/ml/java/2010-02/msg0.html

--- Comment #12 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 
2010-02-15 10:51:28 UTC ---
In the meantime, Platonides (or anyone with SVN write access), could you please
apply the patch from my comment #1
(https://bugzilla.wikimedia.org/show_bug.cgi?id=22137#c1)?

Without it, it is impossible to know at which line a SAX parsing error
occurs.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching all bug changes.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #9 from Kelson [Emmanuel Engelhart] emman...@engelhart.org 
2010-02-13 11:11:27 UTC ---
Created an attachment (id=7115)
 -- (https://bugzilla.wikimedia.org/attachment.cgi?id=7115)
Much simpler Java code that demonstrates the error

Compile with:
gcj -o test --main=Test Test.java

Run it with the demo XML saved as test.xml.



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #10 from Platonides platoni...@gmail.com 2010-02-13 16:30:32 UTC 
---
Sun JDK / OpenJDK is not affected.



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #3 from Bawolff bawolff...@gmail.com 2010-02-12 19:32:37 UTC ---
Created an attachment (id=7114)
 -- (https://bugzilla.wikimedia.org/attachment.cgi?id=7114)
Much simpler test case that demonstrates the error

This is a unicode issue. If you remove the



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Bawolff bawolff...@gmail.com changed:

   What|Removed |Added

 CC||bawolff...@gmail.com

--- Comment #4 from Bawolff bawolff...@gmail.com 2010-02-12 19:33:19 UTC ---
Bugzilla screwed up my comment:

This is a unicode issue. If you remove the



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #5 from Bawolff bawolff...@gmail.com 2010-02-12 19:34:53 UTC ---
OK, apparently Bugzilla suffers from the same issue as mwdumper ;)

This is a Unicode issue. If you remove the Unicode character (left out of this
comment, lest Bugzilla hate me: U+1D59F, MATHEMATICAL BOLD FRAKTUR SMALL Z; the
article claims it is U+1D537, MATHEMATICAL FRAKTUR SMALL Z, but that's not the
character actually in the text), everything works fine. Since it's not choking
on more ordinary Unicode characters, I imagine it has something to do with that
character taking 4 bytes in UTF-8.

It also appears that this interacts with other content in the file, as the
character doesn't cause the error by itself.

Specifically, entity references seem to be what causes it to die after
encountering the Unicode character. I think it interprets that & character as
falling outside the character data (hence starting a new tag, but & (aka
U+0026) cannot start a tag name). Newline characters may also have something to
do with it, as removing the newline between the Unicode character and the &
changes the error message.

Changing summary to more adequately reflect what I think the problem is.

Attaching simpler test case.

Note also that if you replace the Unicode character with its numeric character
reference (&#x1D59F;), everything works fine.
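
The two variants described above (the literal character versus its character reference, each followed by a newline and an entity reference) can be compared with a small SAX harness. This is only a sketch, not the attached test case: the class name `SupplementaryCharTest`, the helper `parseText`, and the element name `t` are made up for illustration. On a parser that handles supplementary characters correctly, both documents should yield the same character data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SupplementaryCharTest {
    // Hypothetical helper: parses a UTF-8 encoded document with the default
    // JAXP SAX parser and returns all character data concatenated.
    static String parseText(String xml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            // Wrap so that callers see any parse failure as an unchecked error.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // U+1D59F lies outside the BMP, so it is built from its code point.
        String z = new String(Character.toChars(0x1D59F));
        String literal = "<t>" + z + "\n&amp;</t>";
        String escaped = "<t>&#x1D59F;\n&amp;</t>";
        // A conforming parser should report identical text for both forms.
        System.out.println(parseText(literal).equals(parseText(escaped)));
    }
}
```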



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #6 from Platonides platoni...@gmail.com 2010-02-12 22:58:48 UTC 
---
Java internally uses UTF-16

The native coded character set of the Java programming language is that of the
first seventeen planes of the Unicode version 3.0 character set; that is, it
consists in the basic multilingual plane (BMP) of Unicode version 1 plus the
next sixteen planes of Unicode version 3. This is because the language's
internal representation of characters uses the UTF-16 encoding, which encodes
the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode
the other planes. Hence a charset in the Java platform defines a mapping
between sequences of sixteen-bit values in UTF-16 and sequences of bytes.
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html

The file contains U+1D59F in UTF-8, thus F0 9D 96 9F. In binary:
11110000 10011101 10010110 10011111
I don't see why it is reading a U+26 (100110).


PS: Maybe Bugzilla is using MySQL with utf8 instead of binary? MySQL's Unicode
support currently covers only the BMP.
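
The byte values above can be checked directly in Java. The following is a minimal sketch (the class name `Utf8Bytes` and its helpers are illustrative, not part of mwdumper) that encodes U+1D59F as UTF-8 and also shows that the character occupies two UTF-16 code units, i.e. a surrogate pair, inside a Java `String`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats each byte as eight binary digits, separated by spaces.
    static String binary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String z = new String(Character.toChars(0x1D59F));
        byte[] utf8 = z.getBytes(StandardCharsets.UTF_8);
        System.out.println(z.length());   // 2 (a surrogate pair in UTF-16)
        System.out.println(hex(utf8));    // F0 9D 96 9F
        System.out.println(binary(utf8)); // 11110000 10011101 10010110 10011111
    }
}
```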



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #7 from Bawolff bawolff...@gmail.com 2010-02-12 23:41:45 UTC ---
> Java internally uses UTF-16

Yes, it does, but I think the file is interpreted as UTF-8; otherwise it
wouldn't be able to make sense of it at all, as UTF-8 and UTF-16 look fairly
different for your average English text (I'm under the impression that UTF-16
is not compatible with ASCII, so nothing would work at all if it were using
UTF-16).


> I don't see why it is reading a U+26 (100110).

The entity references that come after the problematic Unicode character are
where the U+26 (&) comes from. It's not considered a valid (tag) name start
character in XML. The question is why Java, after failing to interpret the
fancy Unicode character, would think that the document was starting a new tag.
If you interpret F0 9D 96 9F as UTF-16, you get:
   U+F09D:   No name (Private Use Area)
隟   U+969F:   Han ideograph   (CJK Unified Ideographs)
which theoretically shouldn't cause any problems (of course, the rest of the
file wouldn't make sense, and there are no guarantees that that is where the
word boundaries would fall).

I'm thinking this is a bug in the underlying Java libraries, as opposed to
mwdumper
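
The reinterpretation above can be reproduced with a short sketch. It assumes big-endian byte order (the comment does not say which order was meant), and the class name `MisreadAsUtf16` and helper `codePoints` are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class MisreadAsUtf16 {
    // Lists the code points of a string as space-separated U+XXXX values.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};
        // Decoded as UTF-8, the four bytes form a single code point.
        System.out.println(codePoints(new String(raw, StandardCharsets.UTF_8)));    // U+1D59F
        // Decoded as big-endian UTF-16 (an assumption), they form two BMP characters.
        System.out.println(codePoints(new String(raw, StandardCharsets.UTF_16BE))); // U+F09D U+969F
    }
}
```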



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #8 from Platonides platoni...@gmail.com 2010-02-12 23:48:54 UTC 
---
(In reply to comment #7)
> Java internally uses UTF-16
> yes it does, but i think the file is interpreted as utf-8, otherwise it
> wouldn't be able to make sense of it at all, as utf-8 and utf-16 look fairly
> different for your average english text (I'm under the impression that utf-16
> is not compatible with ASCII thus nothing would work at all if it was using
> utf-16).

Right. But it could be overflowing 16 bits, or failing in some other way.


> I don't see why it is reading a U+26 (100110).
>
> The entity references that come after the problematic unicode character is
> where the U+26 (&) comes from.
Interesting. Saving from Firefox produced a literal & in the output.

> I'm thinking this is a bug with the underlying java libraries, as opposed to
> mwdumper
I also think so.



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Daniel Kinzler brightb...@gmail.com changed:

   What|Removed |Added

 CC||brightb...@gmail.com



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-02-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Platonides platoni...@gmail.com changed:

   What|Removed |Added

 CC||platoni...@gmail.com



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-01-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

Nemo_bis federicol...@tiscali.it changed:

   What|Removed |Added

 CC||federicol...@tiscali.it



[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-01-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137





--- Comment #1 from Kelson [Emmanuel Engelhart] emman...@engelhart.org  
2010-01-18 10:59:50 UTC ---
Here is a diff adding line and column information to the exception
message:

===
--- src/org/mediawiki/importer/XmlDumpReader.java   (revision 61197)
+++ src/org/mediawiki/importer/XmlDumpReader.java   (working copy)
@@ -36,6 +36,7 @@
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
+import org.xml.sax.SAXParseException;

 import org.xml.sax.Attributes;
 import org.xml.sax.SAXException;
@@ -82,15 +83,17 @@
      */
     public void readDump() throws IOException {
         try {
-            SAXParserFactory factory = SAXParserFactory.newInstance();
-            SAXParser parser = factory.newSAXParser();
+            SAXParserFactory factory = SAXParserFactory.newInstance();
+            SAXParser parser = factory.newSAXParser();

             parser.parse(input, this);
         } catch (ParserConfigurationException e) {
             throw (IOException)new IOException(e.getMessage()).initCause(e);
+        } catch (SAXParseException e) {
+            throw (IOException)new IOException(e.getMessage() + " (line: " + e.getLineNumber() + " column: " + e.getColumnNumber() + ")").initCause(e);
         } catch (SAXException e) {
-            throw (IOException)new IOException(e.getMessage()).initCause(e);
-        }
+            throw (IOException)new IOException(e.getMessage()).initCause(e);
+        }
         writer.close();
     }
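
The idea behind the diff can also be tried outside mwdumper. The following standalone sketch is not the mwdumper code: the class name `ParseErrorLocation` and the method `describeError` are made up for illustration. It catches `SAXParseException` before the more general `SAXException` and appends the line and column to the message in the same format as the patch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseErrorLocation {
    // Hypothetical helper: returns null if the text parses cleanly, otherwise
    // the parser's message, with line and column appended when available.
    static String describeError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Must come before SAXException, which is its superclass.
            return e.getMessage() + " (line: " + e.getLineNumber()
                + " column: " + e.getColumnNumber() + ")";
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The unclosed <b> makes the document ill-formed.
        System.out.println(describeError("<a>\n<b>\n</a>"));
    }
}
```

The exact message text and the reported position depend on the parser implementation in use.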




[Bug 22137] mwdumper dies with not a name start character: U+26 error

2010-01-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137





--- Comment #2 from Kelson [Emmanuel Engelhart] emman...@engelhart.org  
2010-01-18 11:37:36 UTC ---
Created an attachment (id=6965)
 -- (https://bugzilla.wikimedia.org/attachment.cgi?id=6965)
Problematic part of the XML dump

I have extracted the problematic part of the dump; see the attachment.

$ mwdumper --format=sql:1.5 sample.xml.bz2 | lzma -c -d > sample.sql.lzma
Exception in thread "main" java.io.IOException: not a name start character: U+26 (line: 82 column: 1)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   at org.mediawiki.dumper.Dumper.main(mwdumper)
Caused by: org.xml.sax.SAXParseException: not a name start character: U+26
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
   ...1 more
Caused by: javax.xml.stream.XMLStreamException: not a name start character: U+26
   at gnu.xml.stream.XMLParser.error(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
   at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
   at gnu.xml.stream.XMLParser.next(libgcj.so.81)
   at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
   at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
   ...4 more

