https://bugzilla.wikimedia.org/show_bug.cgi?id=22137
--- Comment #7 from Bawolff <[email protected]> 2010-02-12 23:41:45 UTC ---

> Java internally uses UTF-16

Yes, it does, but I think the file is interpreted as UTF-8. Otherwise it wouldn't be able to make sense of the file at all, since UTF-8 and UTF-16 look quite different even for ordinary English text. (I'm under the impression that UTF-16 is not compatible with ASCII, so nothing would work at all if it were using UTF-16.)

> I don't see why it is reading a U+26 (100110).

The U+26 (&) comes from the entity references that follow the problematic Unicode character. It's not considered a valid (tag) start character in XML. The question is why Java, after failing to interpret the fancy Unicode character, would think that the document was starting a new tag.

If you interpret F0 9D 96 9F as UTF-16, you get:

U+F09D: No name (Private Use Area)
隟 U+969F: Han ideograph (CJK Unified Ideographs)

which theoretically shouldn't cause any problems. (Of course, the rest of the file wouldn't make sense, and there's no guarantee that the character boundaries would fall there.)

I'm thinking this is a bug in the underlying Java libraries, as opposed to mwdumper.
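
To make the comparison above concrete, here is a minimal Java sketch that decodes the same four bytes (F0 9D 96 9F) both ways. The class name DecodeDemo and its helper methods are made up for illustration and are not part of mwdumper; the UTF-16 case assumes big-endian byte order, which matches how the bytes are paired in the comment.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from mwdumper.
final class DecodeDemo {
    // The byte sequence discussed in the comment.
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};

    // Decode the bytes with the given charset and list the resulting
    // code points in U+XXXX notation, separated by spaces.
    static String codePoints(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder sb = new StringBuilder();
        decoded.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Number of UTF-16 code units (Java chars) the decoded string occupies.
    static int charCount(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // Read as UTF-8: a single supplementary-plane code point.
        System.out.println("UTF-8:    " + codePoints(BYTES, StandardCharsets.UTF_8));
        // Read as UTF-16 (big-endian assumed): two unrelated BMP code points.
        System.out.println("UTF-16BE: " + codePoints(BYTES, StandardCharsets.UTF_16BE));
        // Java chars used by the UTF-8 result (a surrogate pair).
        System.out.println("chars:    " + charCount(BYTES, StandardCharsets.UTF_8));
    }
}
```

Read as UTF-8, the bytes give the single code point U+1D59F; read as UTF-16BE, they give U+F09D and U+969F as described above. Because U+1D59F lies outside the Basic Multilingual Plane, Java stores it as two chars (a surrogate pair), which is the kind of character that code assuming one char per character tends to mishandle. This sketch does not reproduce the bug itself.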
