https://bugzilla.wikimedia.org/show_bug.cgi?id=63933

            Bug ID: 63933
           Summary: Cohort Validation is not parsing UTF-8 usernames
                    correctly, results in overreporting invalid users
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Wikimetrics
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected],
                    [email protected], [email protected],
                    [email protected]
       Web browser: ---
   Mobile Platform: ---

The parse_user function capitalizes the first character of a user name to
match the MediaWiki user name format, but it does so on the encoded byte
string, assuming 1 byte per character. This breaks for user names whose first
character takes up two or more bytes in UTF-8.

Sample:
Current code:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip()
>>> s = a.strip().encode('utf-8')
>>> first = s[0]
>>> print first
� -> this is only the first byte of the two-byte 'è', i.e. 'half' a character


Correct sequence:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip()
>>> first = s[0].upper().encode('utf-8')
>>> print first
È
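
The correct sequence above could be wrapped in a small helper along these
lines. This is only a sketch: capitalize_user_name is a hypothetical name, not
the actual parse_user implementation, and it assumes the input is either a
UTF-8 byte string or already-decoded text.

```python
def capitalize_user_name(raw):
    """Hypothetical helper (not the real parse_user): strip the name and
    upper-case its first character, working on characters, not bytes."""
    # Decode first so that indexing picks a whole character.
    # Assumption: byte input is UTF-8 encoded.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    text = text.strip()
    if not text:
        return text
    # Upper-case only the first character; leave the rest untouched.
    return text[0].upper() + text[1:]
```

The result is decoded text; encoding back to UTF-8 would happen only at the
boundary where bytes are required.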


We likely need to review all the code that does string comparisons on
user_names. A dedicated type for user names that encapsulates the encoding
handling is perhaps the best approach.
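
A minimal sketch of what such a type might look like follows. The class name
UserName, its methods, and the choice to normalize on construction are all
assumptions made for illustration; nothing like this exists in the code yet.

```python
class UserName(object):
    """Hypothetical wrapper type: holds a user name as decoded text so that
    comparisons never operate on raw bytes."""

    def __init__(self, raw):
        # Assumption: byte input is UTF-8 encoded.
        text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
        text = text.strip()
        # Capitalize the first character (a whole character, not a byte).
        self.text = text[:1].upper() + text[1:]

    def __eq__(self, other):
        if not isinstance(other, UserName):
            return NotImplemented
        return self.text == other.text

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.text)

    def to_bytes(self):
        # Encode only at the boundary where bytes are required.
        return self.text.encode('utf-8')
```

With a type like this, two spellings of the same name compare equal whether
they arrive as bytes or as text, and callers never index into encoded bytes.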
