Bug ID: 63242
           Summary: ccnorm revamp: add a more sensible interface for
                    normalised comparison
           Product: MediaWiki extensions
           Version: master
          Hardware: All
                OS: All
            Status: NEW
          Keywords: i18n
          Severity: enhancement
          Priority: Unprioritized
         Component: AntiSpoof
        Depends on: 63217
       Web browser: ---
   Mobile Platform: ---

As discussed on bug 27987, the current practice of running ccnorm on things and
then comparing them to the alleged canonical form of a string is not viable.

The first problem is that users are often not comparing normalised strings to
normalised strings; apples-and-oranges comparisons have unpredictable results.
See bug 27987 comment 22 and bug 27987 comment 24.

Tim proposed something like:

(Tim Starling from bug 27987 comment 20)
> Well, how about
> added_lines cclike "testing|vandalizing"
> Where the regex would be tokenized and reassembled, with alphabetic parts
> normalised with equivset?

That's OK, but I think a more sensible syntax would be something like:

cclike(added_lines, testing) || cclike(added_lines, vandalizing)

That is, a single function should take two strings and tell us if, once
canonicalised in whatever manner the code wants, they are the same thing, AKA
if they are confusable.
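
To make the idea concrete, here is a minimal sketch in Python of such a
two-argument comparison. It is only an illustration under stated assumptions:
the CONFUSABLES table is a tiny hypothetical stand-in (it is not AntiSpoof's
equivset nor the Unicode confusables data), and skeleton() and this cclike()
are illustrative names, not existing AbuseFilter or AntiSpoof functions.

```python
import unicodedata

# Hypothetical, deliberately tiny table mapping characters to a shared
# prototype. NOT AntiSpoof's equivset; a real implementation would use a
# maintained data set.
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "i": "l",       # "i", "l" and "1" all collapse to the same prototype
    "3": "e",
    "4": "a",
    "5": "s",
    "@": "a",
    "$": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
}


def skeleton(text: str) -> str:
    """Reduce a string to an internal form used only for comparison."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def cclike(left: str, right: str) -> bool:
    """Return True if both strings reduce to the same internal form.

    Both arguments go through the same reduction, so the caller never has
    to know (or guess) what the canonical form looks like.
    """
    return skeleton(left) == skeleton(right)


print(cclike("T3ST1NG", "testing"))
print(cclike("v\u0430ndalizing", "vandalizing"))
print(cclike("testing", "resting"))
```

Note that in this sketch skeleton("T3ST1NG") is not the literal string
"testing", so comparing a normalised string against a hand-written pattern
would fail here; only comparing two normalised strings is reliable, which is
why the function takes both strings.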

This is nothing special: it's the approach followed by the standard API to ICU
data; see uspoof_areConfusable, which I found from the documents mentioned in
bug 63217. I was pointed to UTS #36 and UTS #39 by Nikerabbit; they were just
drafts when AntiSpoof was created. Now we have better tools.

I'm marking this as blocked on bug 63217 because such a function seems trivial
to implement with the ICU API. I'll comment there more in general.

Wikibugs-l mailing list
