https://bugzilla.wikimedia.org/show_bug.cgi?id=63933
Bug ID: 63933
Summary: Cohort Validation does not parse UTF-8 usernames correctly,
resulting in over-reporting of invalid users
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: Wikimetrics
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected], [email protected],
[email protected]
Web browser: ---
Mobile Platform: ---
The parse_user function, which formats strings into the MediaWiki user name
format, capitalizes the first character assuming 1 byte per character. This
breaks for user names whose first character takes up two bytes in UTF-8.
Sample:
Current code:
>>> a = "èMarianne.ramsès ".decode('utf-8')
>>> s = a.strip().encode('utf-8')  # back to a byte string
>>> first = s[0]  # first byte, not first character
>>> print first
� -> this is only 'half' a character (the first of the two bytes of 'è')
Correct sequence:
>>> a = "èMarianne.ramsès ".decode('utf-8')
>>> s = a.strip()
>>> first = s[0].upper().encode('utf-8')
>>> print first
È
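The same idea as a standalone function, as a minimal sketch only. It assumes
Python 3 (where str is already text and bytes must be decoded explicitly), and
capitalize_first is a hypothetical helper name, not the actual parse_user code:

```python
def capitalize_first(raw_name, encoding="utf-8"):
    """Hypothetical helper, not the actual parse_user implementation."""
    # Decode first so that indexing works on characters, not on bytes.
    if isinstance(raw_name, bytes):
        raw_name = raw_name.decode(encoding)
    text = raw_name.strip()
    if not text:
        return text
    # Upper-case only the first character and leave the rest untouched.
    return text[0].upper() + text[1:]


# b"\xc3\xa8" is the two-byte UTF-8 encoding of 'è'.
name = capitalize_first(b"\xc3\xa8Marianne.rams\xc3\xa8s ")
print(name)
```

Encoding back to bytes, if needed at all, would then happen only at the point
where the value leaves the program (database, network, file).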
We likely need to review all the code that compares user names. Having our
own type for user names that encapsulates the encoding handling is probably
the best approach.
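One possible shape for such a type, as a rough sketch under assumptions: it
targets Python 3, the class name UserName and its methods are hypothetical, and
the only normalization applied is the strip-and-capitalize step described in
this bug (the real rules may need more):

```python
class UserName(object):
    """Hypothetical wrapper; name and normalization rules are assumptions."""

    def __init__(self, raw, encoding="utf-8"):
        # Accept bytes or text, but always store decoded text internally.
        if isinstance(raw, bytes):
            raw = raw.decode(encoding)
        text = raw.strip()
        # Slicing with [:1] keeps this safe for empty input.
        self.text = text[:1].upper() + text[1:]

    def __eq__(self, other):
        if not isinstance(other, UserName):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        # Keep hashing consistent with __eq__ so instances work in sets.
        return hash(self.text)

    def encode(self, encoding="utf-8"):
        # Encode only when the value has to leave the program.
        return self.text.encode(encoding)


from_bytes = UserName(b"\xc3\xa8Marianne ")
from_text = UserName("\u00c8Marianne")
print(from_bytes == from_text)
```

With comparisons going through one type, a byte string and a text string for
the same user would compare equal, and no caller would index into raw bytes.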