Eric V. Smith wrote:
I assume the intent is to not throw away any information in the lexer,
and give the parser full access to the original string. But that's just
a guess.
More likely it's because the lexer is fairly dumb and can
basically just recognise regular expressions.
--
Greg
To answer Larry's question, there's an overwhelming number of different
options -- bytes/unicode, raw/cooked, and (in Py2) `from __future__ import
unicode_literals`. So it's easier to do the actual semantic conversion in a
later stage -- then the lexer only has to worry about hopping over
backslashes.
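(To illustrate the point with a quick sketch -- this is an added example, not code
from the thread: the STRING token carries the raw source text, and only a later
stage, such as ast.literal_eval() here, or the compiler itself in CPython, turns
it into a value.)

    import ast
    import tokenize
    from io import BytesIO

    # Two spellings of the "same" literal: only the prefix differs in the
    # token stream; the tokenizer never looks inside the quotes.
    source = b'x = "\\u00e9"\ny = r"\\u00e9"\n'
    for t in tokenize.tokenize(BytesIO(source).readline):
        if t.type == tokenize.STRING:
            # t.string is the raw source text; literal_eval does the
            # cooked/raw conversion as a separate, later step.
            print(repr(t.string), '->', repr(ast.literal_eval(t.string)))
    # '"\\u00e9"'  -> 'é'         (escape decoded by the later stage)
    # 'r"\\u00e9"' -> '\\u00e9'   (raw string: backslash preserved)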
On 5/17/2018 3:01 PM, Larry Hastings wrote:
I fed this into tokenize.tokenize():
b''' x = "\u1234" '''
I was a bit surprised to see \u in the output. Particularly because
the output (t.string) was a *string* and not *bytes*.
For those (like me) who have no idea how to use tokenize:
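(Roughly the following incantation -- a reconstruction for illustration, not the
exact code from the original mail; tokenize.tokenize() wants a readline callable
that returns bytes:)

    import tokenize
    from io import BytesIO

    # Same source as above; the backslash is doubled here only so this
    # snippet itself is free of "invalid escape sequence" warnings -- the
    # resulting bytes are identical.
    source = b''' x = "\\u1234" '''
    for t in tokenize.tokenize(BytesIO(source).readline):
        print(tokenize.tok_name[t.type], repr(t.string))
    # ...
    # STRING '"\\u1234"'   <- the escape sequence comes through untouched
    # ...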
It turns out, Python's tokenizer ignores escape sequences. All it does
is skip over the character after a backslash, so that an escaped quote
doesn't end the string.
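(For instance -- again an added quick check rather than anything from the
thread -- an escaped quote stays inside a single STRING token, and the escapes
themselves come through undecoded:)

    import tokenize
    from io import BytesIO

    source = b'''s = "say \\"hi\\" to \\u1234"'''
    tokens = list(tokenize.tokenize(BytesIO(source).readline))
    strings = [t for t in tokens if t.type == tokenize.STRING]

    # The backslashes only matter for finding the closing quote: the
    # escaped quotes in the middle don't end the token.
    print(len(strings))              # 1  -- a single STRING token
    print(repr(strings[0].string))   # '"say \\"hi\\" to \\u1234"' -- escapes intact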