New submission from Steve B <steven.byr...@gmail.com>:

Here is an example involving the unicode character MIDDLE DOT · : The line

    ab·cd = 7

is valid Python 3 code and is happily accepted by the CPython interpreter. However, tokenize.py does not like it. It says that the middle-dot is an error token. Here is an example you can run to see that:

    import tokenize
    from io import BytesIO
    test = 'ab·cd = 7'.encode('utf-8')
    x = tokenize.tokenize(BytesIO(test).readline)
    for i in x:
        print(i)

For reference, the official definition of identifiers is:
https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers

and details about MIDDLE DOT are at
https://www.unicode.org/Public/10.0.0/ucd/PropList.txt

MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.

----------
components: Library (Lib), Unicode
messages: 313168
nosy: ezio.melotti, steve, vstinner
priority: normal
severity: normal
status: open
title: tokenize.py parses unicode identifiers incorrectly
type: behavior
versions: Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32987>
_______________________________________
Python-bugs-list mailing list
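
For anyone reproducing the report, here is a minimal sketch that puts the two behaviours side by side: what the interpreter accepts and what tokenize.py emits for the same source. The variable names (MIDDLE_DOT, name, source, namespace, error_tokens) are illustrative choices, not part of the original report. The tokenize output may differ between Python versions, so the script only collects and prints any ERRORTOKEN entries rather than assuming a particular result.

```python
import io
import token
import tokenize
import unicodedata

MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT, written as an escape to avoid look-alikes
name = "ab" + MIDDLE_DOT + "cd"
source = name + " = 7\n"

# Ask the interpreter whether the string is a valid identifier.
is_identifier = name.isidentifier()

# Compile and execute the assignment, then read the value back by name.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
value = namespace[name]

# Tokenize the same source with tokenize.py and keep any error tokens.
tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))
error_tokens = [t for t in tokens if t.type == token.ERRORTOKEN]

print(unicodedata.name(MIDDLE_DOT), unicodedata.category(MIDDLE_DOT))
print("isidentifier:", is_identifier, "value:", value)
print("ERRORTOKEN strings from tokenize:", [t.string for t in error_tokens])
```

If the last line prints a non-empty list while the assignment above it succeeded, tokenize.py and the interpreter disagree about the identifier, which is the mismatch described in this report.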