[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-14 Thread Eric V. Smith


Eric V. Smith  added the comment:

I agree with Mark: the string is being correctly interpreted by the AST parser, 
per Python's tokenizer rules.

You might want to look at lib2to3, which I think is also used by black. It's 
also possible that mypy or another static analyzer would be using some library 
you can leverage.
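
As a rough illustration of working at the token level rather than the AST level, here is a minimal sketch using the standard library's `tokenize` module (swapped in here for lib2to3, which is not shown). The helper name `raw_string_tokens` is made up for this example. Each STRING token carries the literal exactly as written in the file, quotes and escape sequences included.

```python
import io
import tokenize


def raw_string_tokens(source):
    """Return the exact source text of every string literal in `source`.

    Hypothetical helper for illustration; not part of any library.
    """
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.STRING
    ]


# Contents of file.py from the report: the middle "\n" is a backslash plus "n".
source = '"""\nHello \\n blah.\n"""\n'

raw = raw_string_tokens(source)[0]
print(repr(raw))
```

The returned text still includes the surrounding triple quotes, so a documentation tool would have to strip those itself.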

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed
type:  -> behavior

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-14 Thread Mark Dickinson


Mark Dickinson  added the comment:

The AST _does_ correctly represent the Python string object in the source, 
though. After:

>>> s = """
... Hello \n world
... """

we have a Python object `s` of type `str`, which contains exactly three 
newlines, zero "n" characters, and zero backslashes. So:

>>> s == '\nHello \n world\n'
True


If the AST Str node value were '\nHello \\n world\n' as you suggest, that 
would represent a different string to `s`: one containing two newline 
characters, one "n" and one backslash.

If you need to operate directly on the source as text, then the AST 
representation probably isn't what you want.
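
To make the character counts concrete, here is a small sketch that parses the same source text and counts the characters in the resulting string. It uses `ast.literal_eval` on the node only to avoid depending on whether the node is `ast.Str` or `ast.Constant`; the variable names are illustrative.

```python
import ast

# Source text as it sits in the file: the middle "\n" is a backslash plus "n".
source = '"""\nHello \\n world\n"""\n'

# Evaluate the string literal node to get the Python str object it denotes.
node = ast.parse(source).body[0].value
parsed = ast.literal_eval(node)

# The string with a literal backslash and "n" kept in the value instead.
alternative = '\nHello \\n world\n'

# Newlines, backslashes, "n" characters in the parsed value.
print(parsed.count("\n"), parsed.count("\\"), parsed.count("n"))  # 3 0 0
# The same counts for the alternative value.
print(alternative.count("\n"), alternative.count("\\"), alternative.count("n"))  # 2 1 1
```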

--
nosy: +mark.dickinson




[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-14 Thread Amber Brown


Amber Brown  added the comment:

There's a difference between round-tripping back to the source text and 
correctly representing the text in the source, though.

Since I'm using this module to perform static analysis of a Python module to 
retrieve class/function definitions and their docstrings to create API 
documentation, the string being the same as what it is in the file is important 
to me.

--




[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-14 Thread Eric V. Smith


Eric V. Smith  added the comment:

The existing behavior is what I'd expect.

Using python3:

>>> import ast
>>> s = open('file.py', 'rb').read()
>>> s
b'"""\nHello \\n blah.\n"""\n'
>>> ast.dump(ast.parse(s))
"Module(body=[Expr(value=Str(s='\\nHello \\n blah.\\n'))])"
>>> eval(s)
'\nHello \n blah.\n'

As always with the AST, some information is lost. It's not designed to be able 
to round-trip back to the source text.
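
For what it's worth, the node positions can be used to go back to the original text when the source is still at hand. The sketch below assumes Python 3.8 or later, where `ast.get_source_segment` and the `end_lineno`/`end_col_offset` attributes exist, and assumes the source is available as `str` rather than bytes.

```python
import ast

# Same contents as file.py above, as text rather than bytes.
source = '"""\nHello \\n blah.\n"""\n'

tree = ast.parse(source)
node = tree.body[0].value

# The evaluated string: escape sequences have already been processed.
evaluated = ast.literal_eval(node)

# The exact text of the literal as written, recovered from node positions.
segment = ast.get_source_segment(source, node)

print(repr(evaluated))
print(repr(segment))
```

The segment keeps the backslash and the surrounding quotes, while the evaluated value does not.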

--
nosy: +eric.smith




[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-13 Thread Matthias Bussonnier

Matthias Bussonnier  added the comment:

I believe this one happens even before the AST, in the tokenizer. The AST is 
also doing some normalisation of identifiers, though: for example, “ε” (U+03B5 
Greek Small Letter Epsilon) and “ϵ” (U+03F5 Greek Lunate Epsilon Symbol) get 
normalized to the same identifier, which is problematic because they look 
different but end up being the same identifier.

I'd be interested in an opt-in flag to not do this normalisation (I have a 
prototype with this for the identifier normalisation in ast, but I have not 
looked at the tokenizer), which might be useful for some linting tools.
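
A quick sketch of the identifier normalisation described above, assuming the normal form involved is NFKC (as PEP 3131 specifies for identifiers). The variable names are illustrative only.

```python
import ast
import unicodedata

small_epsilon = "\u03b5"   # GREEK SMALL LETTER EPSILON
lunate_epsilon = "\u03f5"  # GREEK LUNATE EPSILON SYMBOL

# NFKC maps the lunate symbol onto the ordinary small epsilon.
normalized = unicodedata.normalize("NFKC", lunate_epsilon)

# Parse an assignment whose target is written with the lunate symbol.
name_node = ast.parse(lunate_epsilon + " = 1").body[0].targets[0]

print(normalized == small_epsilon)
print(name_node.id == small_epsilon)
```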

--
nosy: +mbussonn




[issue36911] ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")

2019-05-13 Thread Amber Brown


New submission from Amber Brown :

reproducing case:

file.py:

```
"""
Hello \n blah.
"""
```

And then in a REPL (2.7 or 3+):

```
>>> import ast
>>> f = ast.parse(open("file.py", 'rb').read())
>>> f
<_ast.Module object at 0x7f609d0a4d68>
>>> f.body[0]
<_ast.Expr object at 0x7f609d0a4e10>
>>> f.body[0].value
<_ast.Str object at 0x7f609d02b780>
>>> f.body[0].value.s
'\nHello \n blah.\n'
>>> repr(f.body[0].value.s)
"'\\nHello \\n blah.\\n'"
```

Expected behaviour:
```
>>> repr(f.body[0].value.s)
"'\\nHello \\\\n blah.\\n'"
```

--
components: Library (Lib)
messages: 342417
nosy: hawkowl
priority: normal
severity: normal
status: open
title: ast.parse outputs ast.Strs which do not differentiate between the ASCII 
codepoint 12 (literal new line) and the ASCII codepoints 134 and 156 ("\n")
versions: Python 2.7, Python 3.8
