https://bugzilla.wikimedia.org/show_bug.cgi?id=27849
Roan Kattouw <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected],
                   |                            |[email protected]

--- Comment #6 from Roan Kattouw <[email protected]> 2011-05-05 16:07:27 UTC ---
I took a stab at this bug this afternoon, but ran into an issue that I think
makes it impossible to solve.

I managed to delay Unicode normalization of the titles parameter until
ApiPageSet::processTitlesArray(), and got ?action=query&titles=Ϋ&format=jsonfm
to output a 'normalized' object. However, all data in the API result data
structure is Unicode-normalized before being output, so you get stuff like:

    "normalized": [
        {
            "from": "\u03ab",
            "to": "\u03ab"
        }
    ],

where the "from" entry was originally "\u03a5\u0308" (the value specified in
the query string) but got normalized prior to being output. This means "from"
and "to" will always be equal (apart from underscores-to-spaces and the other
existing normalizations), so the output is useless.

I could armor the "from" value to protect it from Unicode normalization (I've
written code for that before; I threw it out but I should be able to reproduce
it quickly), but that would allow the injection of arbitrary non-normalized
data into the result. Such data may be invalid UTF-8, which would break e.g.
XML parsers.

Is there a way I can do this only for the cases where we want it?

  * Is "\u03a5\u0308" a string that is valid UTF-8/Unicode but is nevertheless
    changed by Language::normalize()?
  * Is this true for all cases where we want this feature?
  * Is it possible to detect this somehow?

CC Brion because he probably knows more about this subject than I do.
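
To make the questions in the comment concrete, here is a minimal sketch in
Python (not MediaWiki code). It assumes the normalization in question is NFC,
which is consistent with the from/to values quoted in the comment. The helper
names is_valid_utf8() and changed_by_nfc() are hypothetical and exist only for
this illustration. It shows that "\u03a5\u0308" is valid UTF-8 that NFC
nevertheless rewrites to "\u03ab", and that this case can be told apart from
input that is simply invalid.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def changed_by_nfc(raw: bytes) -> bool:
    """Return True only if raw is valid UTF-8 AND NFC would alter it.

    Invalid byte sequences return False, so they would never be
    exempted from normalization under this (hypothetical) check.
    """
    if not is_valid_utf8(raw):
        return False
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text) != text


# U+03A5 (capital upsilon) followed by U+0308 (combining diaeresis)
decomposed = "\u03a5\u0308".encode("utf-8")
# U+03AB (capital upsilon with dialytika), the precomposed form
precomposed = "\u03ab".encode("utf-8")

print(is_valid_utf8(decomposed))     # the decomposed form is well-formed UTF-8
print(changed_by_nfc(decomposed))    # ...but NFC composes it into U+03AB
print(changed_by_nfc(precomposed))   # already in NFC, left untouched
print(changed_by_nfc(b"\xff\xfe"))   # invalid UTF-8 is rejected outright
```

Under these assumptions, a "from" value would only need protecting from
normalization when a check like changed_by_nfc() is true for it, which by
construction excludes invalid UTF-8. Whether Language::normalize() makes other
changes beyond NFC composition is not covered by this sketch.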
