https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #7 from Bawolff <bawolff...@gmail.com> 2010-02-12 23:41:45 UTC ---
>Java internally uses UTF-16
Yes, it does, but I think the file is being interpreted as UTF-8. Otherwise it
wouldn't be able to make sense of it at all, since UTF-8 and UTF-16 look quite
different even for ordinary English text (as far as I know, UTF-16 is not
compatible with ASCII, so nothing would work at all if the file were being read
as UTF-16).
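
To illustrate the point about ASCII compatibility, here is a minimal sketch
(the class name AsciiCompat, the helper hex, and the sample string "<page>" are
my own, purely for illustration). It prints the bytes of the same short ASCII
string in US-ASCII, UTF-8 and UTF-16 (big-endian, no BOM):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of mwdumper.
class AsciiCompat {
    // Render a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "<page>"; // sample ASCII-only input (an assumption)
        System.out.println("US-ASCII: " + hex(text.getBytes(StandardCharsets.US_ASCII)));
        System.out.println("UTF-8:    " + hex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The US-ASCII and UTF-8 lines come out identical, while the UTF-16 line has a 00
byte in front of every character, so a parser reading an ASCII/UTF-8 file as
UTF-16 should fail on the very first tag rather than partway through.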


>I don't see why it is reading a U+26 (100110).

The U+0026 (&) comes from the entity references that follow the problematic
Unicode character. It's not considered a valid (tag) start character in XML.
The question is why Java, after failing to interpret the fancy Unicode
character, would think that the document was starting a new tag. If you
interpret the bytes F0 9D 96 9F as UTF-16 (big-endian), you get:
       U+F09D:   No name (Private Use Area)
    隟   U+969F:   Han ideograph   (CJK Unified Ideographs)
which theoretically shouldn't cause any problems. (Of course, the rest of the
file wouldn't make sense, and there is no guarantee that that is where the word
boundaries would fall.)
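
The two readings of those four bytes can be checked with a short sketch like
the one below (the class name DecodeDemo and the helper codePoints are my own
names, and this is only a standalone check, not how mwdumper or the XML parser
actually reads the file). It decodes F0 9D 96 9F once as UTF-8 and once as
big-endian UTF-16 and lists the resulting code points:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of mwdumper.
class DecodeDemo {
    // List the code points of a string as "U+XXXX" tokens separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 2 for supplementary characters
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};

        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        String asUtf16 = new String(raw, StandardCharsets.UTF_16BE);

        System.out.println("UTF-8:    " + codePoints(asUtf8));   // U+1D59F
        System.out.println("UTF-16BE: " + codePoints(asUtf16));  // U+F09D U+969F
        System.out.println("chars in UTF-8 result: " + asUtf8.length()); // 2
    }
}
```

Read as UTF-8, the bytes are a single character outside the Basic Multilingual
Plane (U+1D59F), which Java stores internally as a surrogate pair, hence the
length of 2. Whether that is connected to the parser error is not something
this sketch shows.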

I think this is a bug in the underlying Java libraries rather than in
mwdumper.

