[issue17125] tokenizer.tokenize passes a bytes object to str.startswith

2013-02-04 Thread Tyler Crompton

New submission from Tyler Crompton:

Line 402 of lib/python3.3/tokenize.py contains the following:

if first.startswith(BOM_UTF8):

BOM_UTF8 is a bytes object. str.startswith does not accept bytes objects. I was 
able to use tokenize.tokenize only after making the following changes:

Change line 402 to the following:

if first.startswith(BOM_UTF8.decode()):

Add these two lines at line 374:

    except AttributeError:
        line_string = line

Change line 485 to the following:

    try:
        line = line.decode(encoding)
    except AttributeError:
        pass

I do not know if these changes are correct, as I have not fully tested the 
module after making them, but it started working for me. This is the meat of 
my invocation of tokenize.tokenize:

import tokenize

with open('example.py') as file:  # opening a file encoded as UTF-8
    for token in tokenize.tokenize(file.readline):
        print(token)

I am not suggesting that these changes are correct, but I do believe that the 
current implementation is incorrect. I am also unsure as to what other versions 
of Python are affected by this.
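The type mismatch reported above can be reproduced in isolation (a minimal 
sketch; the literal line content is only illustrative):

```python
# str.startswith() rejects a bytes prefix outright in Python 3,
# which is the error hit inside tokenize when it is fed str lines.
from codecs import BOM_UTF8  # BOM_UTF8 is bytes: b'\xef\xbb\xbf'

first = "import tokenize\n"  # a str line, as produced by a text-mode file
try:
    first.startswith(BOM_UTF8)
except TypeError as exc:
    print(exc)  # startswith first arg must be str or a tuple of str, not bytes
```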

--
components: Library (Lib)
messages: 181349
nosy: Tyler.Crompton
priority: normal
severity: normal
status: open
title: tokenizer.tokenize passes a bytes object to str.startswith
type: behavior
versions: Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue17125
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17125] tokenizer.tokenize passes a bytes object to str.startswith

2013-02-04 Thread R. David Murray

R. David Murray added the comment:

The docs could certainly be more explicit: currently they state that tokenize 
is *detecting* the encoding of the file, which *implies* but does not make 
explicit that the input must be binary, not text.
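A minimal sketch of that binary-input usage (io.BytesIO stands in here for a 
real file opened with mode 'rb'):

```python
# tokenize.tokenize expects the readline of a *binary* stream so that it
# can detect the source encoding itself; the first token it yields is
# the detected encoding.
import io
import tokenize

source = b"x = 1\n"  # illustrative source; a real file would be open('example.py', 'rb')
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
print(tokens[0].string)  # prints the detected encoding: utf-8
```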

The doc problem will get fixed as part of the fix to issue 12486, so I'm 
closing this as a duplicate.  If you want to help out with a patch review and 
doc patch suggestions on that issue, that would be great.

--
nosy: +r.david.murray
resolution:  -> duplicate
stage:  -> committed/rejected
status: open -> closed
superseder:  -> tokenize module should have a unicode API
