[Bug 22761] Abuse filter regex \b considers unicode characters as word boundaries (probably missing /u flag)

bugzilla-daemon Thu, 31 Jan 2013 14:54:10 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=22761


Mark Nelson <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #6 from Mark Nelson <[email protected]> ---
Is this still happening? My understanding is that it was a PHP behavior
(possibly a "bug" depending on your interpretation) that was changed in more
recent versions.

My understanding of the problem:

PHP passes its regexes to PCRE, and historically PCRE defines the
"traditional" character class \w as just [A-Za-z0-9_], even when matching a
UTF-8 string. The word boundary assertion \b is defined in terms of \w, so it
behaves the same way: a non-ASCII letter counts as a non-word character. If
you want the Unicode notion of "letter character", there are separate Unicode
property classes, such as \pL. The /u switch in PHP just told PCRE that it was
matching a UTF-8 string; it did not change the definition of the legacy
character classes.
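
As a rough illustration of the ASCII-only behavior described above, here is a
minimal sketch in Python rather than PHP. Python's re module is not PCRE, but
its re.ASCII flag restricts \w to [a-zA-Z0-9_], which is assumed here to be a
close enough analogue of the legacy PCRE definition. The sample word and the
variable names are made up for the example.

```python
import re

# re.ASCII limits \w to [a-zA-Z0-9_], similar to the legacy PCRE definition.
# This is an analogue only; it is not the PHP/PCRE code path itself.
ascii_word = re.compile(r"\bna\b", re.ASCII)

text = "na\u00efve"  # "naive" spelled with U+00EF, a non-ASCII letter

# In ASCII mode U+00EF is a non-word character, so \b sees a boundary
# between "a" and the following letter and matches inside the word.
ascii_match = ascii_word.search(text)
print(ascii_match is not None)
```

The pattern finds "na" as if it were a standalone word, which is the kind of
false boundary the bug title describes.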

However, PCRE added an option, PCRE_UCP, in mid-2010, which, if set, turns the
traditional character classes into aliases for the roughly equivalent Unicode
property classes. That should produce something closer to the expected
behavior, at least in our case. From what I can find, PHP enables this PCRE
option whenever /u is specified, starting with PHP 5.3.4, which came out in
late 2010. I assume Wikimedia must be using a sufficiently new version by now?
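
To make the difference concrete, here is a small Python sketch contrasting
the two modes. It is an analogy, not the PHP code path: Python's re.ASCII flag
stands in for PCRE without PCRE_UCP, and Python's default Unicode matching on
str stands in for PCRE with PCRE_UCP set. The sample word and variable names
are hypothetical.

```python
import re

text = "na\u00efve"  # contains U+00EF, a non-ASCII letter

# Stand-in for the legacy behavior: \w is [a-zA-Z0-9_] only.
legacy = re.compile(r"\bna\b", re.ASCII)

# Stand-in for the PCRE_UCP behavior: \w follows Unicode properties
# (this is Python's default when matching str patterns).
unicode_aware = re.compile(r"\bna\b")

legacy_hit = legacy.search(text) is not None
unicode_hit = unicode_aware.search(text) is not None

print("legacy \\b matches inside the word:", legacy_hit)
print("Unicode-aware \\b matches inside the word:", unicode_hit)
```

With Unicode-aware classes, the non-ASCII letter is a word character, so no
boundary exists after "na" and the pattern no longer matches mid-word.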

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
You are watching all bug changes.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
