[Bug 22761] New: Abuse filter appears to mishandle unicode

bugzilla-daemon Sun, 07 Mar 2010 19:29:08 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=22761


           Summary: Abuse filter appears to mishandle unicode
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: Normal
         Component: AbuseFilter
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]


While analyzing a false positive, I've been trying to track down why my regex
debugger says a regex doesn't match when the abuse filter says it does.
Eventually I found what appears to be a good lead on the issue.

Details of the incorrect match are here:
http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784

What seems to be happening is that the é (which appears to be encoded in UTF-8)
is mishandled when the text is tested against the regex. The regex engine
apparently does not treat it as a word character, so it sees a word boundary
right after it and the match succeeds (specifically, "\brence\b" matches
"conférence").

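The suspected mechanism can be reproduced outside the abuse filter. The
following is a minimal sketch in Python rather than PHP, so it is only an
analogy: it assumes the filter ends up matching the raw UTF-8 bytes without
Unicode awareness. The function names are illustrative and are not part of
AbuseFilter.

```python
import re

PATTERN = r"\brence\b"
TEXT = "conférence"


def matches_as_bytes(pattern: str, text: str) -> bool:
    """Match on raw UTF-8 bytes (assumed to mimic the faulty behavior).

    In UTF-8, é is the two bytes 0xC3 0xA9. Neither is an ASCII word
    character, so \\b sees a word boundary just before "rence".
    """
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


def matches_as_text(pattern: str, text: str) -> bool:
    """Match on decoded text, where é counts as a word character."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    print("byte-oriented match:", matches_as_bytes(PATTERN, TEXT))
    print("Unicode-aware match:", matches_as_text(PATTERN, TEXT))
```

If the same difference shows up in the filter, the fix would likely be to make
the match Unicode-aware rather than to change the filter's regex.
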
Hopefully there's a way to correct this, and it isn't a problem in the heart of
PHP itself.

Please let me know if you need any additional information.

-- Shirik @ enwiki

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
