[issue32987] tokenize.py parses unicode identifiers incorrectly

Steve B Fri, 02 Mar 2018 15:33:06 -0800

New submission from Steve B <steven.byr...@gmail.com>:

Here is an example involving the Unicode character MIDDLE DOT (·, U+00B7). The line


ab·cd = 7

is valid Python 3 code and is accepted by the CPython interpreter.
However, tokenize.py rejects it: it reports the middle dot as an error
token. Here is an example you can run to see that:

    import tokenize
    from io import BytesIO

    # tokenize.tokenize() expects a readline callable that returns bytes.
    test = 'ab·cd = 7'.encode('utf-8')

    for token in tokenize.tokenize(BytesIO(test).readline):
        print(token)
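
To put the two behaviours side by side, a minimal sketch along the following
lines compares what the interpreter accepts with what tokenize reports. The
helper name token_types is hypothetical (it is not part of any API), and the
token list it returns depends on the Python version, so no expected output is
shown for that part.

```python
import tokenize
from io import BytesIO

source = 'ab·cd = 7'


def token_types(src):
    """Return (token type name, token string) pairs as reported by tokenize.

    Hypothetical helper, written only for this comparison.
    """
    readline = BytesIO(src.encode('utf-8')).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]


# The interpreter's view: the name is a valid identifier and the line runs.
print('ab·cd'.isidentifier())
namespace = {}
exec(compile(source, '<test>', 'exec'), namespace)
print(namespace['ab·cd'])

# tokenize's view: one (type, string) pair per token.
for pair in token_types(source):
    print(pair)
```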

For reference, the official definition of identifiers is: 

https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers

and details about MIDDLE DOT are at

https://www.unicode.org/Public/10.0.0/ucd/PropList.txt

MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is 
behaving correctly (i.e. consistent with the documented spec), while 
tokenize.py is wrong.
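
The property can be checked locally with a short sketch like the one below.
It assumes only the standard library: the unicodedata module does not expose
Other_ID_Continue directly, so the sketch shows the general category instead
(Po, i.e. punctuation rather than a letter or digit) and uses
str.isidentifier() to show that the character is accepted as an identifier
continuation but not as an identifier start.

```python
import unicodedata

ch = '\u00b7'  # MIDDLE DOT

print(unicodedata.name(ch))       # MIDDLE DOT
print(unicodedata.category(ch))   # Po (punctuation, other)

# Allowed after the first character of an identifier ...
print(('a' + ch).isidentifier())  # True
# ... but not as the first character.
print(ch.isidentifier())          # False
```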

----------
components: Library (Lib), Unicode
messages: 313168
nosy: ezio.melotti, steve, vstinner
priority: normal
severity: normal
status: open
title: tokenize.py parses unicode identifiers incorrectly
type: behavior
versions: Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32987>
_______________________________________