[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-11-18 Thread Jason R. Coombs

Changes by Jason R. Coombs:


--
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20387
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-11-18 Thread Martin Panter

Martin Panter added the comment:

It seems the problem with tabs in the indentation is fixed. Can we close this?

For the problem with tabs between tokens in a line, I think that would be 
better handled in a separate report. But I suspect it is at worst a 
documentation problem: you can’t guarantee that you can reconstruct the exact 
source from just the tokenized tokens.
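A minimal sketch of that weaker guarantee, using only the 2-tuple (type, string) form of the token stream (the sample string here is illustrative, not from the thread):

```python
import io
from tokenize import tokenize, untokenize

# Tabs and extra spaces *between* tokens are not recorded in the
# (type, string) pairs, so untokenize cannot restore them exactly.
source = b"x\t=\t1  +  2\n"
tokens = [tok[:2] for tok in tokenize(io.BytesIO(source).readline)]

rebuilt = untokenize(tokens)  # bytes; intra-line spacing differs from source
assert rebuilt != source

# The guarantee that does hold: re-tokenizing yields the same token stream.
retokens = [tok[:2] for tok in tokenize(io.BytesIO(rebuilt).readline)]
assert retokens == tokens
```

This is the "documentation problem" framing in code: the contract is on the token stream, not on the raw bytes.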

--
nosy: +martin.panter




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-28 Thread Roundup Robot

Roundup Robot added the comment:

New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
Issue #20387: Add test capturing failure to roundtrip indented code in tokenize 
module.
https://hg.python.org/cpython/rev/b784c842a63c

New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
Issue #20387: Correct test to properly capture expectation.
https://hg.python.org/cpython/rev/49323e5f6391

New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
Issue #20387: Restore retention of indentation during untokenize.
https://hg.python.org/cpython/rev/ff47efeeed48

New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
Issue #20387: Update Misc/NEWS
https://hg.python.org/cpython/rev/4856ae883041

New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge patch and test
https://hg.python.org/cpython/rev/330e28b28334

New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge
https://hg.python.org/cpython/rev/9ce5c1f371f7

New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
Issue #20387: Merge test and patch from 3.4.4
https://hg.python.org/cpython/rev/98380a6e037c

New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default':
Issue #20387: Merge with 3.5
https://hg.python.org/cpython/rev/f2f5d1c928eb

--
nosy: +python-dev




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-28 Thread Jason R. Coombs

Jason R. Coombs added the comment:

Benjamin, any objections to a backport of this patch?

--
nosy: +benjamin.peterson




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-28 Thread Jason R. Coombs

Jason R. Coombs added the comment:

Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as 
that was where I encountered it originally.

--
versions: +Python 3.5 -Python 3.3




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-28 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
Issue #20387: Backport test from Python 3.4
https://hg.python.org/cpython/rev/524a0e755797

New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7':
Issue #20387: Backport fix from Python 3.4
https://hg.python.org/cpython/rev/cb9df1ae287b

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-28 Thread Jason R. Coombs

Jason R. Coombs added the comment:

For the sake of expediency, I've gone ahead and backported and pushed the fix 
to 2.7. Please back out the changes if appropriate.

--
resolution:  -> fixed




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-26 Thread Dingyuan Wang

Dingyuan Wang added the comment:

I mean that the patch only restores tabs in indentation; the reports above 
should be corrected in that light.
Tabs between tokens and similar cases can't be restored exactly given only the 
token stream. This won't affect the syntax. I wonder whether that part is also 
a bug or a won't-fix in the tokenize/untokenize roundtrip, because fixing it 
would involve adding whitespace information to the token stream.
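That remaining limitation can be sketched in a few lines (behavior as of Python versions carrying this fix; the sample string is illustrative):

```python
import io
from tokenize import tokenize, untokenize

# A tab *between* tokens is recorded only as a column offset, so even the
# full 5-tuple round trip fills the gap with spaces instead of a tab.
source = b"x =\t1\n"
tokens = list(tokenize(io.BytesIO(source).readline))
rebuilt = untokenize(tokens)

assert rebuilt == b"x = 1\n"           # the tab became an equivalent space
compile(rebuilt, "<rebuilt>", "exec")  # still syntactically valid code
```

The output is syntactically equivalent, which is all that can be promised without carrying whitespace in the token stream.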

(I can't pull down the whole hg repo, so the patch is manually created by just 
diffing two versions.)

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-25 Thread Jason R. Coombs

Jason R. Coombs added the comment:

@gumblex, I've applied your updated patch (though I couldn't figure out why it 
wouldn't apply mechanically; I had to paste it). I also corrected the test. 
Thanks for the advice on that.

I don't understand the second half of your message. Are you stating caveats for 
our edification? Do you have outstanding concerns? Can you confirm your 
original report is corrected by the latest tip in the patch repository?

I believe what remains to be done is to add a NEWS entry and merge the changes 
with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that 
could be done also.

--
stage: needs patch -> patch review




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-20 Thread Dingyuan Wang

Dingyuan Wang added the comment:

Sorry for the inconvenience. I failed to find this old bug.

I think there is another problem. The docs of `untokenize` say: "The iterable 
must return sequences with **at least** two elements, the token type and the 
token string. Any additional sequence elements are ignored." So if I feed in, 
say, a 3-tuple, untokenize should accept it as tok[:2].

The attached patch should have addressed the problems above. 

When trying to make a patch, a tokenize bug was found. Consider the newly 
attached tab.py: the tabs between comments and code, and the tabs between 
expressions, are lost, so when untokenizing, position information is used to 
produce equivalent spaces instead of tabs.

Despite the tokenization problem, the patch can produce syntactically correct 
code as accurately as it can.

PEP 8 recommends spaces for indentation, but the use of tabs should not be 
ignored.

new tab.py (in Python string):

'#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\ndef foo():\n\t\n\tTests tabs in tokenization\n\t\tfoo\n\t\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'

--
keywords: +patch
nosy: +gumblex
Added file: http://bugs.python.org/file39748/tokenize.patch




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-20 Thread Jason R. Coombs

Jason R. Coombs added the comment:

@gumblex: This is a good start. It certainly provides a candidate 
implementation.

First, can I suggest that you remove the changes pertinent to the "at least 
two elements" behavior and address that in a separate ticket/discussion?

Second, any patch will necessarily need to include a test that fails before 
the patch and passes after it.

Third, and least important, the proposed implementation adds a lot of branching 
and cognitive complexity to an already complicated function. I'd like to work 
on the implementation to see if we can make it easier to comprehend. That said, 
practicality beats purity, so if it passes the test, then it's surely 
acceptable.

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-20 Thread Jason R. Coombs

Jason R. Coombs added the comment:

I've committed the patch, without the change for "at least two elements", as 
https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2.

The patch corrects the new test, demonstrating its effectiveness, but yields 
two new test failures (apparent regressions):

$ ./python.exe Lib/test/test_tokenize.py 
**
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
    roundtrip("try: import somemodule\n"
              "except ImportError: # comment\n"
              "print('Can not import' # comment2\n)"
              "else:   print('Loaded')\n")
Exception raised:
    Traceback (most recent call last):
      File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
        compileflags, 1), test.globs)
      File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
        roundtrip("try: import somemodule\n"
      File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
        bytes_from5 = untokenize(tokens5)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
        out = ut.untokenize(iterable)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
        self.add_whitespace(start)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (4,1) precedes previous end (4,4)
**
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
    for testfile in testfiles:
        if not roundtrip(open(testfile, 'rb')):
            print("Roundtrip failed for file %s" % testfile)
            break
    else: True
Exception raised:
    Traceback (most recent call last):
      File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
        compileflags, 1), test.globs)
      File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
        if not roundtrip(open(testfile, 'rb')):
      File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
        bytes_from5 = untokenize(tokens5)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
        out = ut.untokenize(iterable)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
        self.add_whitespace(start)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (73,8) precedes previous end (73,12)
**
1 items had failures:
   2 of  74 in test.test_tokenize.__test__.doctests
***Test Failed*** 2 failures.
Traceback (most recent call last):
  File "Lib/test/test_tokenize.py", line 1261, in <module>
    test_main()
  File "Lib/test/test_tokenize.py", line 1252, in test_main
    support.run_doctest(test_tokenize, True)
  File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
    raise TestFailed("%d of %d doctests failed" % (f, t))
test.support.TestFailed: 2 of 79 doctests failed

@gumblex, can you review the failures and address them?

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2015-06-19 Thread Terry J. Reedy

Terry J. Reedy added the comment:

#24447, closed as duplicate, has another example.

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-02-02 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I read the manual more carefully and noticed that the guarantee is that 
tokenizing the result of untokenize matches the input to untokenize: "The 
result is guaranteed to tokenize back to match the input so that the 
conversion is lossless and round-trips are assured."

I ran the test in 3.4 and noticed that the exactness and direction of the 
matching do not matter, because only the 2nd \t is replaced by a space, 
producing invalid code that raises IndentationError.

So this is definitely a bug and I withdraw the close suggestion.
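The invalid output described here can be reproduced directly (a reconstruction for illustration; the exact bytes the buggy untokenize produced may differ):

```python
# The round trip replaced the second body line's tab with a space, so the
# two body lines no longer share an indent prefix and the code is invalid.
broken = "if False:\n\tx=3\n y=3\n"
try:
    compile(broken, "<untokenized>", "exec")
    caught = None
except IndentationError as exc:  # on Python 3 this is TabError, a subclass
    caught = exc

assert caught is not None
```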

I believe the test should be that both lines of the body begin with the same 
indent.

from tokenize import tokenize, untokenize
import io

def test_tab_indent(self):
    code = b"if False:\n\tx=3\n\ty=3\n"
    codelines = untokenize(tokenize(io.BytesIO(code).readline)).split(b"\n")
    self.assertEqual(codelines[1], codelines[2])

Of course, the test of tokenize and untokenize should be separated.

--
stage:  -> needs patch
versions: +Python 3.3




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-02-02 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I think the problem is with untokenize.

s = b"if False:\n\tx=3\n\ty=3\n"
t = tokenize(io.BytesIO(s).readline)
for i in t: print(i)

produces a token stream that seems correct.

TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The problem with untokenize and indents is this: in the old untokenize duples 
function, now called 'compat', INDENT strings were added to a list and popped 
by the corresponding DEDENT. While compat has the minor problems of returning 
a string instead of bytes (which is actually as I think it should be) and of 
adding extraneous spaces within and at the end of lines, it correctly handles 
tabs in your example and in this one:

s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
t = tokenize(io.BytesIO(s).readline)
print(untokenize(i[:2] for i in t).encode())

b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'

When tokenize was changed to producing 5-tuples, untokenize was changed to use 
the start and end coordinates, but all special processing of indents was cut 
in favor of .add_whitespace(). So this issue is a regression due to inadequate 
testing.

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-02-02 Thread Terry J. Reedy

Terry J. Reedy added the comment:

The untokenize docstring has a stronger guarantee, and in the direction you 
were claiming: "Round-trip invariant for full input: Untokenized source will 
match input source exactly." For this to be true, the indent strings must be 
saved and not replaced by spaces.

--




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-02-01 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Whitespace equivalence is explicitly disclaimed: the guarantee applies only to 
the token type and token string, as the spacing between tokens (column 
positions) may change. The assert is not a valid test. I think you should 
close this. (Note that there are several issues for bugs where untokenize 
results in semantically different text.)

--
nosy: +terry.reedy




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-01-24 Thread Jason R. Coombs

New submission from Jason R. Coombs:

Consider this simple unit test:

def test_tokenize():
    input = "if False:\n\tx=3\n\ty=3\n"
    g = list(generate_tokens(io.StringIO(input).readline))
    assert untokenize(g) == input

According to the docs, untokenize guarantees the output equals the input to 
tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

As this test fails on Python 2 and Python 3 (use io.BytesIO on Python 2), I 
believe the test illustrates a violation of this guarantee. The second line in 
the if block gets its tab replaced by a space.

As the input is valid Python syntax, this behavior is particularly surprising.
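The same check against the Python 3 bytes-based API can be written as follows (a sketch; on interpreters that include the fix eventually applied for this issue, the assertion holds):

```python
import io
from tokenize import tokenize, untokenize

source = b"if False:\n\tx=3\n\ty=3\n"
tokens = list(tokenize(io.BytesIO(source).readline))

# With tab indentation retained by untokenize, the round trip is exact.
assert untokenize(tokens) == source
```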

--
messages: 209137
nosy: jason.coombs
priority: normal
severity: normal
status: open
title: tokenize/untokenize roundtrip fails with tabs
versions: Python 2.7, Python 3.4




[issue20387] tokenize/untokenize roundtrip fails with tabs

2014-01-24 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever
