https://bugzilla.wikimedia.org/show_bug.cgi?id=22761

           Summary: Abuse filter appears to mishandle unicode
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: Normal
         Component: AbuseFilter
        AssignedTo: agarr...@wikimedia.org
        ReportedBy: delbu...@my.erau.edu
                CC: wikibugs-l@lists.wikimedia.org


While analyzing a false positive, I've been trying to work out why my regex
debugger reports that a regex does not match while the abuse filter reports
that it does. I eventually found what appears to be a good lead on the issue.

Details of the incorrect match are here:
http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784

It appears that the é (which seems to be encoded in UTF-8) is mishandled when
the text is tested against the regex. The regex engine seems to treat it as a
non-word character, which creates a word boundary just before "rence", so the
match succeeds (specifically, "\brence\b" matches "conférence").
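
To illustrate the suspected mechanism, here is a minimal sketch in Python (not
PHP, and not AbuseFilter's actual code). It assumes the engine is matching
against raw UTF-8 bytes with ASCII-only word characters. The function names
are hypothetical and exist only for this example.

```python
import re

PATTERN = r"\brence\b"


def matches_bytewise(text: str) -> bool:
    """Match against raw UTF-8 bytes, where only ASCII counts as a word character."""
    data = text.encode("utf-8")  # "é" becomes the two bytes 0xC3 0xA9
    return re.search(PATTERN.encode("ascii"), data) is not None


def matches_unicode(text: str) -> bool:
    """Match against decoded text, where "é" counts as a word character."""
    return re.search(PATTERN, text) is not None


# Byte-wise, 0xA9 is a non-word byte, so \b matches right before "rence".
print(matches_bytewise("conférence"))  # True
# Unicode-aware, "é" is a letter, so there is no boundary inside the word.
print(matches_unicode("conférence"))   # False
```

If the same thing is happening here, the regex would need to be evaluated in a
Unicode-aware mode. Whether that is possible within AbuseFilter is something I
have not verified.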

Hopefully there's a way to correct this in the extension, and it is not a
problem in the heart of PHP itself.

Please let me know if you need any additional information.

-- Shirik @ enwiki
