https://bugzilla.wikimedia.org/show_bug.cgi?id=27849

--- Comment #11 from Brion Vibber <[email protected]> 2011-05-05 17:28:14 UTC 
---
There are essentially two layers of work here, which our input validation
merges into a single step:

1) invalid UTF-8 sequences must be found and replaced with valid placeholder
characters

2) valid UTF-8 sequences are normalized to form C (e.g., replacing 'e' followed
by 'combining acute accent' with the precomposed character 'e with acute')
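
As a rough illustration of the two layers, here is a minimal Python sketch. It
is not the actual MediaWiki code; the function names are made up for the
example, and U+FFFD is assumed as the placeholder character.

```python
import unicodedata


def validate_utf8(raw: bytes) -> str:
    """Layer 1: decode, replacing each invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def normalize_nfc(text: str) -> str:
    """Layer 2: normalize already-valid text to form C."""
    return unicodedata.normalize("NFC", text)


def validate_and_normalize(raw: bytes) -> str:
    """Both layers merged into a single step (hypothetical helper)."""
    return normalize_nfc(validate_utf8(raw))


decomposed = "e\u0301".encode("utf-8")  # 'e' + combining acute accent
broken = b"caf\xff"                     # 0xFF never occurs in valid UTF-8

print(ascii(validate_utf8(decomposed)))           # layer 1 leaves it alone
print(ascii(validate_and_normalize(decomposed)))  # layer 2 composes it
print(ascii(validate_utf8(broken)))               # bad byte becomes U+FFFD
```

Keeping the two functions separate makes it possible to run layer 1 alone
where layer 2 is not wanted.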

The invalid UTF-8 sequences found in part 1) **cannot be represented as strings
in JSON or XML output**, because JSON and XML formats are based on Unicode
text. Even if you wanted them, you can't just output them directly, nor can you
use any escaping method to represent the original bad sequences.

Outputting the original bogus UTF-8 into the document would cause it to be
unreadable, breaking the API.
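
A small Python sketch of why the bad bytes can't survive into JSON (the sample
byte strings are invented for the example): a strict decode rejects them, and
once they are replaced with a placeholder the original bytes are gone, since
JSON's \u escapes name Unicode code points, not raw bytes.

```python
import json

bad = b"abc\xff"  # 0xFF is not valid anywhere in UTF-8

# A strict decoder refuses the input outright.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement yields valid text, but different bad inputs collapse
# to the same placeholder, so the originals cannot be recovered.
a = b"abc\xff".decode("utf-8", errors="replace")
b = b"abc\xfe".decode("utf-8", errors="replace")

# The JSON document can only carry the placeholder code point.
doc = json.dumps({"title": a})
print(strict_ok, a == b, doc)
```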


Most likely, only 2) is of real interest: "\u03a5\u0308" is a perfectly valid
Unicode string, and can be shipped around either with the JSON string escapes
as above or as literals in any Unicode encoding for any JSON or XML document.
We can perfectly well expect clients to send that string, and we should be able
to represent it in output.
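
For illustration, a short Python check (not MediaWiki code) that this string
round-trips through JSON unchanged, both with string escapes and as literal
UTF-8, and that it only changes if NFC is applied to it:

```python
import json
import unicodedata

# UPSILON followed by COMBINING DIAERESIS: valid Unicode, but not NFC.
s = json.loads('"\\u03a5\\u0308"')

escaped = json.dumps(s)                      # ASCII-only, uses \u escapes
literal = json.dumps(s, ensure_ascii=False)  # literal characters

# Both encodings give back exactly the same two code points.
from_escaped = json.loads(escaped)
from_literal = json.loads(literal.encode("utf-8").decode("utf-8"))

# NFC would fold the pair into one precomposed code point (U+03AB).
nfc = unicodedata.normalize("NFC", s)
print(len(s), len(nfc), from_escaped == s, from_literal == s)
```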

That we normalize strings into NFC for most internal purposes should generally
be an implementation detail of our data formats and how we do title
comparisons, so it's reasonable to expect clients that input a given non-NFC
string to see the same thing on the other side when we report how we normalized
the title string.


We should run only UTF-8 sequence validation at the $wgRequest boundary; things
like the NFC conversion to avoid extra combining characters should really happen
at processing and comparison boundaries like Title normalization.
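
A sketch of that split in Python, under some loud assumptions: the function
names are hypothetical, real Title normalization does more than NFC, and the
from/to pair is only meant to show the kind of report a client could get back.

```python
import unicodedata


def read_request_param(raw: bytes) -> str:
    """Request boundary: UTF-8 validation only, no NFC."""
    return raw.decode("utf-8", errors="replace")


def normalize_title(text: str) -> str:
    """Processing boundary: NFC applied here (simplified stand-in)."""
    return unicodedata.normalize("NFC", text)


def titles_equal(a: str, b: str) -> bool:
    """Comparison boundary: compare the normalized forms."""
    return normalize_title(a) == normalize_title(b)


def report_normalization(text: str):
    """Tell the client what changed, keeping its input untouched in 'from'."""
    to = normalize_title(text)
    return {"from": text, "to": to} if to != text else None


param = read_request_param("\u03a5\u0308".encode("utf-8"))
report = report_normalization(param)
print(ascii(report))
```

Because the request layer leaves valid input alone, the 'from' value is exactly
what the client sent, and only the 'to' value shows the NFC form.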


So in short: don't worry about representing invalid UTF-8 byte sequences.
Either use a 'before' value that's been validated as UTF-8, or let the API
output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on
all output).

