[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-02-19 Thread Guido van Rossum


Change by Guido van Rossum :


--
nosy:  -gvanrossum

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-29 Thread Guido van Rossum


Guido van Rossum  added the comment:

Do you still need my opinion? I can't say that I have one ATM.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-23 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

The question in the issue for tokenizer and virtual newlines is mimicking the C 
tokenizer versus consistency between and within token 5-tuples.

As a tokenizer user, I would lean toward consistency but it is definitely not 
my call.

https://github.com/python/cpython/pull/24260/files has several tokenize.c 
warnings like this:
'=': conversion from '__int64' to 'int', possible loss of data

--
nosy: +gvanrossum, lys.nikolaou, pablogsal, terry.reedy
type:  -> behavior
versions:  -Python 3.6, Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-20 Thread Brian Romanowski


Brian Romanowski  added the comment:

I took a look at Parser/tokenizer.c.  From what I can tell, the tokenizer does 
fake a newline character when the input buffer does not end with actual newline 
characters and that the returned NEWLINE token has an effective length of 1 
because of this faked newline character.  That is, tok->cur - tok->start == 1 
when the NEWLINE token is returned.

If this part of the C tokenizer is meant to be modeled exactly by the Python 
tokenize module, then the current code is correct.  If there is some wiggle 
room because tok->start and tok->cur are converted into line numbers and column 
offsets, then maybe it's acceptable to change them?  If not, then the current 
documentation is misleading because the newline_token_2.end[1] element from my 
original example is not "...  column where the token ends in the source".  
There is no such column.

I'm not sure whether the C tokenizer exposes anything like 
newline_token_2.string, directly.  If so, does it hold the faked newline 
character or does it hold the empty string like the current tokenize module 
does?

I'm also not sure whether the C tokenizer exposes anything like 
newline_token_2.line.  If it does, I'd be surprised if the faked newline would 
cause this to somehow become the empty string instead of the actual line 
content.  So I'm guessing that current tokenize module's behavior here is still 
a real bug?  If not, then this is another case that might benefit from some 
edge-case documentation.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-20 Thread Brian Romanowski


Brian Romanowski  added the comment:

Shoot, just realized that consistency isn't the important thing here, the most 
important thing is that the tokenizer module exactly matches the output of the 
Python tokenizer.  It's possible that my changes violate that constraint, I'll 
take a look at that.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-19 Thread Brian Romanowski


Change by Brian Romanowski :


--
keywords: +patch
pull_requests: +23085
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/24260

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline

2021-01-19 Thread Brian Romanowski


New submission from Brian Romanowski :

The tokenize module's tokenizer functions output incorrect (or at least 
misleading) information when the content being tokenized does not end in a line 
ending character.  This is related to the fix for issue<33899>  which added the 
NEWLINE tokens for this case but did not fill out the whole token tuple 
correctly.

The bug can be seen by running a version of the test in 
Lib/test/test_tokenize.py:

import io, tokenize

newline_token_1 = 
list(tokenize.tokenize(io.BytesIO("x\n".encode('utf-8')).readline))[-2]
newline_token_2 = 
list(tokenize.tokenize(io.BytesIO("x".encode('utf-8')).readline))[-2]

print(newline_token_1)
print(newline_token_2)

# Prints:
# TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
# TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')  # 
bad "end" and "line"!

Notice that "len(newline_token_2.string) == 0" but "newline_token_2.end[1] - 
newline_token_2.start[1] == 1".  Seems more consistent if the 
newline_token_2.end == (1, 1).

Also, newline_token_2.line should hold the physical line rather than the empty 
string.  This would make it consistent with newline_token_1.line.

I'll add a PR shortly with a change so the output from the two cases is:

TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='x')

If this looks reasonable, I can backport it for the other branches.  Thanks!

--
components: Library (Lib)
messages: 385313
nosy: romanows
priority: normal
severity: normal
status: open
title: tokenize reports incorrect end col offset and line string when input 
ends without explicit newline
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com