https://bugzilla.wikimedia.org/show_bug.cgi?id=22761
Summary: Abuse filter appears to mishandle unicode
Product: MediaWiki extensions
Version: any
Platform: All
OS/Version: All
Status: NEW
Severity: major
Priority: Normal
Component: AbuseFilter
AssignedTo: [email protected]
ReportedBy: [email protected]
CC: [email protected]
While analyzing a false positive, I've been trying to track down why my regex
debugger reports that a regex doesn't match, yet the same regex does match in
the abuse filter. I eventually found what looks like a good lead on the issue.
Details of the incorrect match are here:
http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784
The é (which seems to be encoded in UTF-8) looks to be mishandled when the text
is tested against the regex. The regex engine apparently treats it as a
non-word character, so it sees a word boundary right after it and the match
succeeds (specifically, "\brence\b" matches "conférence").
Hopefully there's a way to correct this in the extension, and it isn't a
problem in PHP itself.
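
For illustration only, the "\brence\b" match described above can be reproduced
outside of AbuseFilter. The sketch below uses Python's re module rather than
the PHP regex engine the extension actually runs on, so it is an analogy and
not a diagnosis: it assumes the engine is looking at the raw UTF-8 bytes
instead of decoded characters. The variable names are made up for the example.

```python
import re

text = "conférence"
pattern = r"\brence\b"

# Unicode-aware matching: "é" counts as a word character, so there is
# no word boundary between "é" and "r", and the pattern does not match.
unicode_match = re.search(pattern, text)

# Byte-oriented matching on the UTF-8 encoding: "é" becomes the two bytes
# 0xC3 0xA9. Neither is an ASCII word character, so \b matches right after
# them and the pattern matches "rence".
byte_match = re.search(pattern.encode("ascii"), text.encode("utf-8"))

print("unicode-aware match:", unicode_match is not None)
print("byte-oriented match:", byte_match is not None)
```

If the filter behaves like the byte-oriented case, that would be consistent
with the false positive in the log entry above, and it would point to how the
regex is invoked rather than to the pattern itself.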
Please let me know if you need any additional information.
-- Shirik @ enwiki