gerritbot added a comment.
Change 640929 **merged** by jenkins-bot:
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title
https://gerrit.wikimedia.org/r/640929
TASK DETAIL
https://phabricator.wikimedia.org/T197642
EMAIL PREFERENCES
gerritbot added a comment.
Change 640929 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title
https://gerrit.wikimedia.org/r/640929
Xqt added a comment.
The problem is that chr(133) counts as whitespace in the Unicode data used by
unicodedata, so str.strip() removes it.
>>> '\x85'.isspace()
True
>>> '\x85'.strip()
''
How does MW handle this?
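A minimal sketch of one way to avoid the problem, assuming only ordinary spaces and underscores should be trimmed from a title. The helper name `strip_title` and the set of trimmed characters are hypothetical and are not taken from the merged change:

```python
def strip_title(title: str) -> str:
    """Trim only spaces and underscores, keeping other Unicode whitespace."""
    # Hypothetical helper: passing an explicit character set to str.strip()
    # stops it from removing every character for which isspace() is True.
    return title.strip(' _')


# The NEL character (U+0085) survives the restricted strip ...
print(repr(strip_title(' \x85 ')))
# ... while a bare strip() removes it together with the spaces.
print(repr(' \x85 '.strip()))
```

Whether spaces and underscores are the right set to trim depends on how MediaWiki normalizes titles, which this sketch does not try to settle.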
JJMC89 added a comment.
In T197642#4298742, @Dvorapa wrote:
Please try https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/395154/, I think I fixed also this error there.
I still get the exception. (Python 3.6.3)
>>> import pywikibot
>>> pywikibot.Page(pywikibot.Site('en', 'wikipedia'),
Xqt added a comment.
u'\x85' is a control character and does not look like a valid title. You can link to the redirect neither from the redirect target nor from a special page. The only reachable view is the edit page [1]. I wonder why MediaWiki accepts this as a page title; very strange!
[1]
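For reference, Python's classification of the character can be checked with the standard `unicodedata` module. `str.isspace()` is documented to return True for characters whose general category is Zs or whose bidirectional class is WS, B, or S. This sketch only reports how Python sees the character and says nothing about MediaWiki's title rules:

```python
import unicodedata

ch = '\x85'  # U+0085, NEXT LINE (NEL)

# General category: control characters are reported as 'Cc'.
print(unicodedata.category(ch))
# Bidirectional class: 'B' (paragraph separator) is what makes isspace() True.
print(unicodedata.bidirectional(ch))
print(ch.isspace())
```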
Dvorapa added a comment.
Please try https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/395154/, I think I fixed also this error there.
zhuyifei1999 added a comment.
Scratch that
>>> u'\x85'.strip()
u''
zhuyifei1999 added a comment.
>>> '\x85'.strip()
'\x85'
JJMC89 added a comment.
Changing the loop to the one below shows that the first problematic pageid is 2868, the page whose title is the character \x85.
>>> for each_article in cat.articles(namespaces=(0)):
...     try:
...         print(each_article.title(withNamespace=True), each_article.pageid)
...     except