https://bugzilla.wikimedia.org/show_bug.cgi?id=36839

--- Comment #10 from Brad Jorsch 2012-05-15 16:40:04 UTC ---
I think I might have figured this out.

In a post on enwiki from May 11,[1] we are told that Roan changed the "PCRE
recursion limit" from the default 100k to 1k. I assume this is referring to
PHP's "pcre.recursion_limit" setting,[2] which indeed has a default of 100000.

One thing the recursion limit affects is how often regexes with subexpressions
like "(x)+" can match. It seems that each match by "+" there uses up 2 of the
recursion limit; with a value of 1024, it can match at most 511 times. If it
would match 512 times, preg_match will return false instead. You can test this
easily enough if you have a recent-enough command-line PHP:

  php -r 'ini_set("pcre.recursion_limit", 1024);
    var_dump(preg_match("/(x)+/", str_repeat("x", 511)));'
  php -r 'ini_set("pcre.recursion_limit", 1024);
    var_dump(preg_match("/(x)+/", str_repeat("x", 512)));'

The first will succeed, while the second will fail. But if you bump the 1024 to
1026, the second will start working.
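
To find the threshold on a particular install rather than trying lengths by
hand, something like the following sketch may help. The function name
probe_recursion_limit and the cap of 2000 are made up for illustration, and
turning off "pcre.jit" is an assumption about newer PHP builds, where the JIT
may not honour this limit. The numbers it reports depend on the PHP and PCRE
versions in use, so they may not be 511/512 everywhere.

```php
<?php
// Hypothetical helper: finds the longest run of "x" that /(x)+/ still
// matches under the given pcre.recursion_limit, scanning up to $cap.
function probe_recursion_limit(int $limit, int $cap = 2000): array
{
    // Assumption: on builds with the PCRE JIT, the limit may be ignored
    // unless the JIT is switched off. On builds without it this is a no-op.
    ini_set('pcre.jit', '0');
    $previous = ini_set('pcre.recursion_limit', (string) $limit);

    $longest = 0;
    $error = PREG_NO_ERROR;
    for ($n = 1; $n <= $cap; $n++) {
        // Without errors this pattern always matches, so anything other
        // than 1 means preg_match gave up (it returns false in that case).
        if (preg_match('/(x)+/', str_repeat('x', $n)) !== 1) {
            $error = preg_last_error();
            break;
        }
        $longest = $n;
    }

    // Restore the previous limit so later regexes are unaffected.
    if ($previous !== false) {
        ini_set('pcre.recursion_limit', $previous);
    }
    return ['longest' => $longest, 'error' => $error];
}

$r = probe_recursion_limit(1024);
printf("longest match: %d, preg_last_error: %d\n", $r['longest'], $r['error']);
```

If the limit is what stops the match, the reported error code should equal
PREG_RECURSION_LIMIT_ERROR; if the scan reaches the cap, it stays at
PREG_NO_ERROR.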

So what seems to be going on is this: The API uses the methods in WebRequest to
get the parameters from the client, all of which seem to come down to
getGPCVal. For any parameter that exists in $_GET (even if overridden by
$_POST), getGPCVal passes the value through Language::checkTitleEncoding to
make sure it's valid UTF-8. And due to the low recursion limit, the regex in
Language::checkTitleEncoding that tries to check whether the value is valid
UTF-8 will now think it is ''not'' valid if the value is more than 511
characters long, so it will treat it as the fallback 8-bit encoding
(windows-1252 for most languages), which gives the familiar "è" mojibake.
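
The effect of that fallback can be reproduced in isolation. This is only a
sketch of the "validate as UTF-8, else assume windows-1252" pattern, not the
actual code in Language::checkTitleEncoding; the function name
to_utf8_with_fallback and the injected $isUtf8 callback are hypothetical, and
it assumes the iconv extension is available. The second call simulates a
validity check that wrongly says "not UTF-8", as would happen when preg_match
returns false.

```php
<?php
// Hypothetical sketch: keep the value if it validates as UTF-8, otherwise
// treat its bytes as windows-1252 and re-encode them as UTF-8.
// $isUtf8 stands in for the regex check so the failure can be simulated
// without depending on a particular PCRE build.
function to_utf8_with_fallback(string $value, callable $isUtf8): string
{
    if ($isUtf8($value)) {
        return $value;
    }
    return iconv('WINDOWS-1252', 'UTF-8', $value);
}

$value = "\xC3\xA8"; // "è" encoded as UTF-8

// A working check (the /u modifier rejects invalid UTF-8) leaves it alone.
$ok = to_utf8_with_fallback($value, function (string $s): bool {
    return preg_match('//u', $s) === 1;
});

// A check that wrongly reports "invalid" triggers the double encoding.
$broken = to_utf8_with_fallback($value, function (string $s): bool {
    return false;
});

var_dump($ok === $value);    // bool(true)
var_dump(bin2hex($broken));  // string(8) "c383c2a8", the bytes of "Ã¨"
```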

If I'm right, the fix for this bug would be to revert Roan's change to the
"pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in
some other way), or at least turn it up to something more reasonable than 1024.
I'd expect this is causing problems in other areas of the code, too.

 [1]: https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Database_reports&diff=491927371&oldid=491919743
 [2]: http://us.php.net/manual/en/pcre.configuration.php#ini.pcre.recursion-limit
