[Bug 27849] API: add normalized info also for unicode normalization of titles

bugzilla-daemon Thu, 05 May 2011 09:07:33 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=27849


Roan Kattouw <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected],
                   |                            |[email protected]

--- Comment #6 from Roan Kattouw <[email protected]> 2011-05-05 16:07:27 
UTC ---
I took a stab at this this afternoon, but ran into an issue that I think makes
this impossible to solve. I managed to delay Unicode normalization of the
titles parameter until ApiPageSet::processTitlesArray(), and got
?action=query&titles=Ϋ&format=jsonfm to output a 'normalized' object. However,
all data in the API result data structure is Unicode-normalized before being
output, so you get stuff like: 

        "normalized": [
            {
                "from": "\u03ab",
                "to": "\u03ab"
            }
        ],

where the "from" entry was originally "\u03a5\u0308" (the value specified in
the query string) but got normalized prior to being output. This means from and
to will always be equal (apart from underscores-to-spaces and the other existing
normalizations), so the 'normalized' output is useless.
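
To make the collapse concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. It assumes the output-time normalization behaves like standard NFC; the names raw_title, normalized_title and escape are hypothetical and only serve the illustration.

```python
import unicodedata

# Decomposed form as given in the query string:
# GREEK CAPITAL LETTER UPSILON followed by COMBINING DIAERESIS.
raw_title = "\u03a5\u0308"

# Assumption: the normalization applied to the result data acts like NFC,
# which composes the pair into the single code point U+03AB.
normalized_title = unicodedata.normalize("NFC", raw_title)


def escape(s):
    """Render each code point as a JSON-style \\uXXXX escape (BMP only)."""
    return "".join(f"\\u{ord(c):04x}" for c in s)


# Once "from" has passed through the same normalization as "to",
# the two strings are indistinguishable.
print(escape(raw_title), "->", escape(normalized_title))
print(unicodedata.normalize("NFC", raw_title) == normalized_title)
```

The two code points become one, which is why a "from" value that is normalized on output can no longer differ from "to".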

I could armor the from value to protect it from Unicode normalization (I've
written code for that before; I threw it out but I should be able to reproduce
it quickly), but that would allow the injection of arbitrary non-normalized
data into the result, which may be invalid UTF-8, which would break e.g. XML
parsers.

Is there a way I can do this only for cases where we want this? Is
"\u03a5\u0308" a string that is valid UTF-8/Unicode but is nevertheless changed
by Language::normalize()? Is this true for all cases where we want this
feature? Is it possible to detect this somehow? CC Brion because he probably
knows more about this subject than I do.
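
One possible way to detect the interesting case, sketched in Python under the assumption that Language::normalize() acts like NFC on well-formed input (not verified against the MediaWiki source). The function classify_title and its return labels are hypothetical. The idea is to check strict UTF-8 validity first, and only then ask whether normalization changes the string.

```python
import unicodedata


def classify_title(raw: bytes) -> str:
    """Classify a raw title as 'invalid', 'valid-but-denormalized' or 'normalized'.

    Assumption: normalization is equivalent to NFC for well-formed input.
    """
    try:
        # Strict decoding rejects malformed sequences, overlong forms
        # and encoded surrogates.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "invalid"
    if unicodedata.normalize("NFC", text) != text:
        # Well-formed, but normalization would rewrite it. Only this
        # case would be a candidate for an armored "from" value.
        return "valid-but-denormalized"
    return "normalized"


print(classify_title("\u03a5\u0308".encode("utf-8")))
print(classify_title("\u03ab".encode("utf-8")))
print(classify_title(b"\xff\xfe"))
```

Under that assumption, armoring could be restricted to the valid-but-denormalized case, so invalid byte sequences would still be cleaned up before reaching the output.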

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
