[issue12675] tokenize module happily tokenizes code with syntax errors

2021-10-21 Thread Irit Katriel


Irit Katriel added the comment:

Reproduced on 3.11.

--
nosy: +iritkatriel
versions: +Python 3.11 -Python 2.7, Python 3.3, Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12675] tokenize module happily tokenizes code with syntax errors

2014-02-08 Thread Terry J. Reedy

Changes by Terry J. Reedy tjre...@udel.edu:


--
versions: +Python 3.4 -Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12675
___



[issue12675] tokenize module happily tokenizes code with syntax errors

2011-10-22 Thread Meador Inge

Changes by Meador Inge mead...@gmail.com:


--
nosy: +meador.inge




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-10-22 Thread Florent Xicluna

Changes by Florent Xicluna florent.xicl...@gmail.com:


--
nosy: +flox




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-05 Thread Daniel Urban

Changes by Daniel Urban urban.dani...@gmail.com:


--
nosy: +durban




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-05 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

I have not used tokenize, but if it is *not* intended to exactly reproduce the 
internal tokenizer behavior, the claim that it is should be amended.

--
nosy: +terry.reedy




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-05 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Terry: agreed. Does anyone actually use this module? Does anyone know what the 
design goals are for tokenize? If someone can tell me, I'll do my best to make 
it meet them.

Meanwhile, here's another bug. Each character of trailing whitespace is 
tokenized as an ERRORTOKEN.

Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import tokenize, untokenize
>>> from io import BytesIO
>>> list(tokenize(BytesIO('1 '.encode('utf8')).readline))
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 '),
 TokenInfo(type=54 (ERRORTOKEN), string=' ', start=(1, 1), end=(1, 2), line='1 '),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
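
For anyone who wants to rerun this on another interpreter, a script form of the session may be handier. This is only a sketch: the helper name token_names is made up for illustration, and whether the trailing space still comes out as an ERRORTOKEN is likely to depend on the Python version, so no output is shown.

```python
import io
import tokenize


def token_names(source):
    """Return (token name, token string) pairs for a source string."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]


# One token per line makes a stray ERRORTOKEN easy to spot.
for name, string in token_names("1 "):
    print(f"{name:10} {string!r}")
```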

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-05 Thread Sandro Tosi

Changes by Sandro Tosi sandro.t...@gmail.com:


--
nosy: +sandro.tosi




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

I'm having a look to see if I can make tokenize.py better match the real 
tokenizer, but I need some feedback on a couple of design decisions. 

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after 
the error. This is what tokenize.py currently does in the two cases where it 
detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is 
what the user sees when the parser is invoked from pythonrun.c.

Since the documentation for tokenize.py says, "It is designed to match the 
working of the Python tokenizer exactly", I think that implementing option (2) 
is best here. (This will mean changing the behaviour of tokenize.py in the two 
cases where it currently detects an error, so that it stops tokenizing.)
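
To make option (2) concrete, here is a rough sketch of the behaviour as a wrapper around an existing token stream. It is not a patch to tokenize.py; the name stop_at_first_error is hypothetical, and the input is a hand-built stream rather than real tokenizer output, so that the result does not depend on the interpreter version.

```python
import tokenize


def stop_at_first_error(tokens):
    """Yield tokens up to and including the first ERRORTOKEN, then stop."""
    for tok in tokens:
        yield tok
        if tok.type == tokenize.ERRORTOKEN:
            return


# Hand-built stream: a number, an error token, and an end marker.
stream = [
    tokenize.TokenInfo(tokenize.NUMBER, "1", (1, 0), (1, 1), "1 "),
    tokenize.TokenInfo(tokenize.ERRORTOKEN, " ", (1, 1), (1, 2), "1 "),
    tokenize.TokenInfo(tokenize.ENDMARKER, "", (2, 0), (2, 0), ""),
]
kept = list(stop_at_first_error(stream))
print([tokenize.tok_name[tok.type] for tok in kept])
# prints ['NUMBER', 'ERRORTOKEN']
```

Everything after the error token, including the ENDMARKER, is dropped, which is what distinguishes this from the resynchronizing behaviour of option (1).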

Second, how to record the cause of the error? The real tokenizer records the 
cause of the error in the 'done' field of the 'tok_state' structure, but 
tokenize.py loses this information. I propose to add fields to the TokenInfo 
structure (which is a namedtuple) to record this information. The real 
tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, 
E_DEDENT etc), and pythonrun.c converts these to English-language error 
messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of 
information will be useful, so I propose to add two fields: "error" (containing 
a string like "TOODEEP") and "errormessage" (containing the English-language 
error message).

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Having looked at some of the consumers of the tokenize module, I don't think my 
proposed solutions will work.

The resynchronization behaviour of tokenize.py seems to be important for 
consumers that use it to transform arbitrary Python source code (like 
2to3.py). These consumers rely on the "roundtrip" property that 
X == untokenize(tokenize(X)). So solution (1) is necessary for the handling of 
tokenization errors.
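
A small sketch of the roundtrip check those consumers depend on, using generate_tokens on a str. The helper names roundtrip and token_pairs are made up for illustration. The first comparison is the exact-text property described above; the second is the weaker property that the rebuilt text tokenizes to the same (type, string) pairs. No output is shown because exact spacing is not guaranteed in every case.

```python
import io
import tokenize


def roundtrip(source):
    """Tokenize a source string and rebuild text from the full tokens."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    return tokenize.untokenize(tokens)


def token_pairs(source):
    """Return the (type, string) pair of every token in the source."""
    return [(tok.type, tok.string)
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)]


source = "x = 1\ny = x + 2\n"
rebuilt = roundtrip(source)
print(rebuilt == source)                             # exact-text roundtrip
print(token_pairs(rebuilt) == token_pairs(source))   # token-level roundtrip
```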

Also, the fact that TokenInfo is a 5-tuple is relied on in some places (e.g. 
lib2to3/patcomp.py line 38), so it can't be extended. And there are consumers 
(though none in the standard library) that rely on type=ERRORTOKEN being the 
way to detect errors in a tokenization stream. So I can't overload that field 
of the structure.

Any good ideas for how to record the cause of error without breaking backwards 
compatibility?

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Ah ... TokenInfo is a *subclass* of namedtuple, so I can add extra properties 
to it without breaking consumers that expect it to be a 5-tuple.
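
Roughly what that could look like, as a standalone sketch rather than a change to tokenize.TokenInfo itself. The class name TokenWithError and the constructor failed are hypothetical; error and errormessage are the two fields proposed earlier in this thread.

```python
import collections
import token


class TokenWithError(collections.namedtuple(
        "TokenWithError", "type string start end line")):
    """Hypothetical 5-tuple token that also carries error details."""

    error = None         # e.g. "TOODEEP"
    errormessage = None  # e.g. "too many levels of indentation"

    @classmethod
    def failed(cls, fields, error, errormessage):
        """Build a token from five fields and attach the error details."""
        tok = cls(*fields)
        # The subclass defines no __slots__, so instances accept attributes.
        tok.error = error
        tok.errormessage = errormessage
        return tok


tok = TokenWithError.failed(
    (token.ERRORTOKEN, "", (1, 0), (1, 0), ""),
    "TOODEEP", "too many levels of indentation")
type_, string, start, end, line = tok  # still unpacks as a 5-tuple
print(len(tok), tok.error, tok.errormessage)
# prints: 5 TOODEEP too many levels of indentation
```

One caveat with this approach: the extra attributes live outside the tuple, so anything that rebuilds the tuple (for example _replace) will drop them.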

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Gareth Rees

New submission from Gareth Rees g...@garethrees.org:

The tokenize module is happy to tokenize Python source code that the real 
tokenizer would reject. Pretty much any instance where tokenizer.c returns 
ERRORTOKEN will illustrate this feature. Here are some examples:

Python 3.3.0a0 (default:2d69900c0820, Aug  1 2011, 13:46:51)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import generate_tokens
>>> from io import StringIO
>>> def tokens(s):
...     """Return a string showing the tokens in the string s."""
...     return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
...
>>> # Bad exponent
>>> print(tokens('1if 2else 3'))
1|if|2|else|3|
>>> 1if 2else 3
  File "<stdin>", line 1
    1if 2else 3
     ^
SyntaxError: invalid token
>>> # Bad hexadecimal constant.
>>> print(tokens('0xfg'))
0xf|g|
>>> 0xfg
  File "<stdin>", line 1
    0xfg
       ^
SyntaxError: invalid syntax
>>> # Missing newline after continuation character.
>>> print(tokens('\\pass'))
\|pass|
>>> \pass
  File "<stdin>", line 1
    \pass
    ^
SyntaxError: unexpected character after line continuation character

It is surprising that the tokenize module does not yield the same tokens as 
Python itself, but as this limitation only affects incorrect Python code, 
perhaps it just needs a mention in the tokenize documentation. Something along 
the lines of, "The tokenize module generates the same tokens as Python's own 
tokenizer if it is given correct Python code. However, it may incorrectly 
tokenize Python code containing syntax errors that the real tokenizer would 
reject."
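
The comparison in the session above can also be run as a script, which is convenient for checking other interpreter versions. This is a sketch only: the helper names compiler_accepts and tokenize_strings are made up for illustration, and the results differ between Python versions (some of these inputs may be accepted, warned about, or rejected with an exception by tokenize itself), so no output is shown.

```python
import io
import tokenize


def compiler_accepts(source):
    """Return True if compile() accepts the source, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


def tokenize_strings(source):
    """Return the token strings, or the exception that tokenize raised."""
    try:
        return [tok.string for tok in
                tokenize.generate_tokens(io.StringIO(source).readline)]
    except (tokenize.TokenError, SyntaxError) as exc:
        return exc


# The three examples from this report.
for source in ["1if 2else 3", "0xfg", "\\pass"]:
    print(repr(source), compiler_accepts(source), tokenize_strings(source))
```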

--
components: Library (Lib)
messages: 141503
nosy: Gareth.Rees
priority: normal
severity: normal
status: open
title: tokenize module happily tokenizes code with syntax errors
type: behavior
versions: Python 3.3




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I'm not familiar with the parser internals (I'm nosying someone who is), but I 
suspect what you are seeing at the command line is the errors being caught at a 
stage later than the tokenizer.

--
nosy: +benjamin.peterson, r.david.murray




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

These errors are generated directly by the tokenizer. In tokenizer.c, the 
tokenizer generates ERRORTOKEN when it encounters something it can't tokenize. 
This causes parsetok() in parsetok.c to stop tokenizing and return an error.

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

This should probably be fixed (patches welcome). However, note that even with 
valid Python code, the tokens are not the same.

--




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
versions: +Python 2.7, Python 3.2




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
stage:  -> test needed




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Eric Snow

Changes by Eric Snow ericsnowcurren...@gmail.com:


--
nosy: +ericsnow




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Vlad Riscutia

Vlad Riscutia riscutiav...@gmail.com added the comment:

How come the tokenize module is not based on the actual C tokenizer? Wouldn't 
that make more sense (and prevent this kind of issue)?

--
nosy: +vladris




[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

tokenize has useful features that the builtin tokenizer does not possess, such 
as the NL token.
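
For readers who have not met the NL token: tokenize emits NEWLINE at the end of a logical line and NL for line breaks that do not end one, such as blank lines and comment-only lines. A minimal sketch that lists the token names for a short snippet (the sample source is made up for illustration):

```python
import io
import tokenize

# A blank line and a comment-only line sit between two statements.
source = "x = 1\n\n# a comment\ny = 2\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(source).readline)]
print(names)
```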

--
