https://bugzilla.wikimedia.org/show_bug.cgi?id=22761
Mark Nelson <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #6 from Mark Nelson <[email protected]> ---
Is this still happening? My understanding is that it was a PHP behavior (possibly a "bug", depending on your interpretation) that was changed in more recent versions.

My understanding of the problem: PHP passes its regexes to PCRE, and PCRE has traditionally defined the character class \w as just [A-Za-z0-9_], even when matching a UTF-8 string. The word boundary assertion \b is defined in terms of \w, so it behaves the same way. If you wanted the Unicode notion of a "letter character", you had to use the Unicode property classes instead, such as \pL. The /u switch in PHP only told PCRE that it was matching a Unicode string; it did not change the definition of the legacy character classes.

However, PCRE added an option, PCRE_UCP, in mid-2010. When it is set, the traditional character classes become aliases for the roughly equivalent Unicode property classes. That should produce something closer to the expected behavior, at least in our case. From what I can find, newer versions of PHP enable this option whenever /u is specified, starting with 5.3.4, which came out in late 2010. I assume Wikimedia must be using a sufficiently new version by now?
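To make the difference concrete, here is a small sketch of the two behaviors. It is an analogy in Python rather than PHP/PCRE itself: I am assuming that Python's re.ASCII flag is a fair stand-in for the legacy ASCII-only \w and \b, and that Python's default Unicode matching is a fair stand-in for what PCRE_UCP enables. The sample string is made up for illustration.

```python
import re

# Sample text with non-ASCII letters (U+00EF and U+00E9), written as
# escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9"

# ASCII-only classes: \w is [A-Za-z0-9_] and \b is derived from it.
# Assumed to mirror the legacy PCRE behaviour described in the comment.
legacy_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

# Unicode-aware classes: \w covers Unicode letters and digits.
# Assumed to mirror what PCRE_UCP enables under /u.
unicode_words = re.findall(r"\b\w+\b", text)

print(legacy_words)   # words are split at the accented letters
print(unicode_words)  # whole words are matched
```

With the ASCII-only definition, each accented letter counts as a non-word character, so \b fires on either side of it and the words are cut into fragments. That is the same kind of symptom this bug describes.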
