https://bugzilla.wikimedia.org/show_bug.cgi?id=16798
Summary: JSON encoding errors for characters outside the BMP
Product: MediaWiki
Version: 1.14-svn
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: Normal
Component: API
AssignedTo: [email protected]
ReportedBy: [email protected]
CC: [email protected], [email protected]
Consider the following query:
http://localhost/w/api.php?action=query&format=xml&action=expandtemplates&text=%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz
The text parameter contains 6 characters: U+fffd, U+10000, U+f000f, U+007a,
U+007a, and U+007a.
In JSON encoding, they should be \ufffd\ud800\udc00\udb80\udc0fzzz (U+10000 and
U+f000f lie outside the BMP and must be encoded as surrogate pairs).
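For reference, the expected escapes can be derived by hand. The following is a minimal Python sketch (not MediaWiki code; the function name json_escape is made up for illustration) that escapes every non-ASCII character and splits code points above U+ffff into a high and a low surrogate:

```python
def json_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX, using surrogate pairs above U+FFFF.

    Illustrative only: quotes, backslashes and control characters are not handled.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # plain ASCII is passed through
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)          # BMP character: one escape
        else:
            v = cp - 0x10000                    # 20-bit offset into the supplementary planes
            high = 0xD800 | (v >> 10)           # top 10 bits -> high surrogate
            low = 0xDC00 | (v & 0x3FF)          # bottom 10 bits -> low surrogate
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


# The six characters from the query above.
sample = "\ufffd\U00010000\U000f000fzzz"
escaped = json_escape(sample)
print(escaped)  # \ufffd\ud800\udc00\udb80\udc0fzzz
```

Note that each half of a surrogate pair carries its own \u prefix.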
If I change the format to jsonfm, the first three characters are instead encoded
as \ufffd\ud800dc00\udb80dc0fzzz: the \u prefix is missing before each low
surrogate, so the output cannot be decoded correctly. This should be relatively
simple to fix, I think.
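To illustrate why the jsonfm output is not decodable, here is a small Python sketch using the standard json module as a stand-in for any client-side decoder (other decoders may reject the input outright rather than accept it):

```python
import json

# Correct form: both halves of the pair carry a \u prefix.
good = json.loads('"\\ud800\\udc00"')

# Form produced by jsonfm: the second \u is missing.
bad = json.loads('"\\ud800dc00"')

# good is the single character U+10000.
# bad is a lone high surrogate followed by the literal text "dc00".
print(len(good), len(bad))
print(bad[1:])
```

With Python's decoder the malformed form comes back as five code units instead of one character, so the original U+10000 is lost.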
If I change the format to json, it's even worse: the first two characters are
output correctly as \ufffd\ud800\udc00, but that's it! The rest of the string is
dropped. Apparently PHP's built-in json_encode silently mishandles anything over
U+1ffff: U+20000-U+3ffff, U+80000-U+bffff, and U+100000-U+10ffff seem to be
incorrectly encoded as U+10000-U+1ffff, while U+40000-U+7ffff and
U+c0000-U+fffff seem to cause the silent truncation described above. The only
fix I can think of is to detect whether these characters are present and use the
fallback code instead.
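The detection step could look roughly like the sketch below. It is written in Python purely for illustration (the actual fix would be in PHP), the name needs_fallback is hypothetical, and the U+1ffff threshold is taken from the behaviour observed above rather than from any documentation:

```python
def needs_fallback(text: str, limit: int = 0x1FFFF) -> bool:
    """Return True if any character lies above `limit`.

    The default limit reflects the observation that only code points
    over U+1ffff appear to be mishandled; that threshold is an assumption.
    """
    return any(ord(ch) > limit for ch in text)


# U+fffd and U+10000 are at or below the limit; U+f000f is not.
print(needs_fallback("\ufffd\U00010000zzz"))               # False
print(needs_fallback("\ufffd\U00010000\U000f000fzzz"))     # True
```

Passing limit=0xFFFF instead would route every non-BMP string to the fallback encoder, which is more conservative if the exact failure ranges turn out to differ.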
I'll see about posting a patch later on.