https://bugzilla.wikimedia.org/show_bug.cgi?id=27849
--- Comment #11 from Brion Vibber <[email protected]> 2011-05-05 17:28:14 UTC ---

There are essentially two layers of work here, which our input validation merges into a single step:

1) invalid UTF-8 sequences must be found and replaced with valid placeholder characters
2) valid UTF-8 sequences are normalized to form C (e.g., replacing 'e' followed by 'combining acute accent' with the precombined character 'e with acute')

The invalid UTF-8 sequences found in part 1) **cannot be represented as strings in JSON or XML output**, because the JSON and XML formats are based on Unicode text. Even if you wanted them, you can't just output them directly, nor can you use any escaping method to represent the original bad sequences. Outputting the original bogus UTF-8 into the document would make it unreadable, breaking the API.

Most likely, only 2) is of real interest: "\u03a5\u0308" is a perfectly valid Unicode string, and can be shipped around either with the JSON string escapes as above or as literals in any Unicode encoding for any JSON or XML document. We can perfectly well expect clients to send that string, and we should be able to represent it in output.

That we normalize strings into NFC for most internal purposes should generally be an implementation detail of our data formats and of how we do title comparisons, so it's reasonable for clients that input a given non-NFC string to expect to see the same thing on the other side when we report how we normalized the title string.

We should really run only UTF-8 sequence validation at the $wgRequest boundary, and do things like the NFC conversion (to avoid extra combining characters) at processing and comparison boundaries such as Title normalization.
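
To make the two layers concrete, here is a minimal sketch in Python (not MediaWiki's actual PHP code). The function names validate_utf8 and normalize_nfc are hypothetical, and using U+FFFD as the placeholder character is an assumption; the point is only that validation and NFC conversion can be separate steps applied at different boundaries.

```python
import unicodedata


def validate_utf8(raw: bytes) -> str:
    """Layer 1 (hypothetical helper): decode input bytes, replacing each
    invalid UTF-8 sequence with the placeholder character U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def normalize_nfc(text: str) -> str:
    """Layer 2 (hypothetical helper): compose already-valid text into
    Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Invalid input: 0xFF can never appear in well-formed UTF-8.
validated = validate_utf8(b"caf\xff")

# Valid but non-NFC input: UPSILON followed by COMBINING DIAERESIS.
decomposed = "\u03a5\u0308"

# Validation alone leaves the valid string untouched...
via_boundary = validate_utf8(decomposed.encode("utf-8"))

# ...and NFC is applied later, e.g. when comparing titles.
composed = normalize_nfc(via_boundary)
```

In this sketch the request boundary would call only validate_utf8, so a client's non-NFC string survives intact until something like title comparison calls normalize_nfc.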
So in short: don't worry about representing invalid UTF-8 byte sequences: either use a 'before' value that's been validated as UTF-8, or let the API output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on all output).
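
As a rough illustration of that last point, the following Python sketch builds a JSON report of a title normalization without running NFC over the whole output. The function normalization_report and the "from"/"to" keys are assumptions made for the example, not a statement of the real API's output format.

```python
import json
import unicodedata


def normalization_report(before: str) -> str:
    """Hypothetical helper: report how a validated title was normalized.

    'before' is assumed to be valid Unicode text already. Only the 'to'
    value is NFC-normalized; the serialized output as a whole is not.
    """
    after = unicodedata.normalize("NFC", before)
    # json.dumps escapes non-ASCII characters as \uXXXX by default, so the
    # non-NFC 'before' string is preserved code point for code point.
    return json.dumps({"from": before, "to": after})


report = normalization_report("\u03a5\u0308")
decoded = json.loads(report)
```

A client decoding this report sees exactly the string it sent under "from", next to the composed form under "to".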
