Re: [Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

2018-05-18 Thread Greg Ewing
Eric V. Smith wrote:
> I assume the intent is to not throw away any information in the lexer, and give the parser full access to the original string. But that's just a guess.

More likely it's because the lexer is fairly dumb and can basically just recognise regular expressions.

-- Greg

Re: [Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

2018-05-17 Thread Guido van Rossum
To answer Larry's question, there's an overwhelming number of different options -- bytes/unicode, raw/cooked, and (in Py2) `from __future__ import unicode_literals`. So it's easier to do the actual semantic conversion in a later stage -- then the lexer only has to worry about hopping over backslash
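A rough sketch of what that two-stage split looks like from the stdlib, with ast.literal_eval standing in for the later compilation stage (this is only an illustration, not the code CPython itself runs):

    import ast

    # The tokenizer hands the literal through verbatim; how the backslash-u
    # is interpreted depends on the string prefix, and that decision is made
    # at a later stage.
    print(repr(ast.literal_eval('"\\u1234"')))    # 'ሴ'  (cooked: escape decoded to U+1234)
    print(repr(ast.literal_eval('r"\\u1234"')))   # '\\u1234'  (raw: backslash kept verbatim)

The same deferral covers bytes literals and (in Py2) the unicode_literals future import, which is why baking all of those cases into the lexer would be awkward.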

Re: [Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

2018-05-17 Thread Eric V. Smith
On 5/17/2018 3:01 PM, Larry Hastings wrote:
> I fed this into tokenize.tokenize():
>
>     b''' x = "\u1234" '''
>
> I was a bit surprised to see \U in the output. Particularly because the output (t.string) was a *string* and not *bytes*.

For those (like me) who have no idea how to use tokenize
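A minimal sketch of one way to drive tokenize over that snippet (io.BytesIO is used here only to get the bytes readline that tokenize.tokenize() expects; the doubled backslash keeps a literal \u1234 escape in the tokenized source):

    import io
    import tokenize

    source = b'x = "\\u1234"\n'

    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        # tok.string is a str even though the input was bytes: the first
        # token emitted is ENCODING, and that encoding is used to decode
        # the source lines before tokens are produced.
        print(tokenize.tok_name[tok.type], repr(tok.string))

The STRING token comes out as '"\\u1234"', with the escape sequence still spelled out rather than converted to the character U+1234.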

[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

2018-05-17 Thread Larry Hastings
I fed this into tokenize.tokenize():

    b''' x = "\u1234" '''

I was a bit surprised to see \U in the output. Particularly because the output (t.string) was a *string* and not *bytes*. It turns out, Python's tokenizer ignores escape sequences. All it does is ignore the next character
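A small follow-on sketch (stdlib tokenize only) of that observation: because the tokenizer merely hops over the character after a backslash, a raw literal and a cooked literal produce the same token text apart from the prefix, and neither has its escape interpreted:

    import io
    import tokenize

    def string_tokens(source: bytes):
        # Collect the text of every STRING token in the given source bytes.
        return [tok.string
                for tok in tokenize.tokenize(io.BytesIO(source).readline)
                if tok.type == tokenize.STRING]

    print(string_tokens(b'x = "\\u1234"\n'))     # ['"\\u1234"']
    print(string_tokens(b'x = r"\\u1234"\n'))    # ['r"\\u1234"']

    # In both cases the backslash-u survives untouched; whether it is later
    # treated as an escape sequence or as two literal characters is decided
    # after tokenization.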