Re: [Python-Dev] What does a double coding cookie mean?

2016-03-20 Thread Guido van Rossum
On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka  wrote:
> On 17.03.16 16:55, Guido van Rossum wrote:
>>
>> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka wrote:
>>>>
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>>
>>> Likely. However the interface of tokenize.detect_encoding() is not very
>>> simple.
>>
>>
>> I just found that out yesterday. You have to give it a readline()
>> function, which is cumbersome if all you have is a (byte) string and
>> you don't want to split it on lines just yet. And the readline()
>> function raises SyntaxError when the encoding isn't right. I wish
>> there were a lower-level helper that just took a line and told you
>> what the encoding in it was, if any. Then the rest of the logic can be
>> handled by the caller (including the logic of trying up to two lines).
>
>
> The simplest way to detect encoding of bytes string:
>
> lines = data.splitlines()
> encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

This will raise SyntaxError if the encoding is unknown. That needs to
be caught in mypy's case and then it needs to get the line number from
the exception. I tried this and it was too painful, so now I've just
changed the regex that mypy uses to use non-eager matching
(https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9fe5).
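
For reference, a minimal sketch of the catch-and-report approach described above, assuming the whole source is already in memory as bytes. The helper name `detect_or_report` is made up for illustration; only `tokenize.detect_encoding()` and `io.BytesIO` are real stdlib calls. The wording of the SyntaxError message is an implementation detail, so the sketch only passes it through.

```python
import io
import tokenize


def detect_or_report(data):
    """Return (encoding, None) on success or (None, message) on failure.

    Hypothetical helper, not part of mypy or the stdlib.
    """
    try:
        # detect_encoding() reads at most two lines through the readline callable.
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError as exc:
        # Raised, for example, for an unknown codec or a BOM/cookie conflict.
        return None, str(exc)
    return encoding, None
```

A caller that cannot afford an exception can then branch on the second element instead of wrapping every call site in try/except.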

> If you don't want to split all data on lines, the most efficient way in
> Python 3.5 is:
>
> encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
>
> In Python 3.5 io.BytesIO(data) has constant complexity.

Ditto with the SyntaxError though.

> In older versions for detecting encoding without copying data or splitting
> all data on lines you should write line iterator. For example:
>
> def iterlines(data):
>     start = 0
>     while True:
>         end = data.find(b'\n', start) + 1
>         if not end:
>             break
>         yield data[start:end]
>         start = end
>     yield data[start:]
>
> encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]
>
> or
>
> it = (m.group() for m in re.finditer(b'.*\n?', data))
> encoding = tokenize.detect_encoding(it.__next__)[0]
>
> I don't know what approach is more efficient.

Having my own regex was simpler. :-(

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-20 Thread Ethan Furman

On 03/17/2016 04:54 PM, Glenn Linderman wrote:

On 3/16/2016 12:59 AM, Serhiy Storchaka wrote:



Actually "must match the regular expression" is not correct, because
re.match() implies anchoring at the start. I have proposed more
correct regular expression in other branch of this thread.


"match" doesn't imply anchoring at the start.  "re.match()" does (and as
a result is very confusing to newbies to Python re, that have used other
regexp systems).


It still confuses me from time to time.  :(
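
A small illustration of the difference, using the unanchored pattern quoted from PEP 263 earlier in the thread. The sample line is made up.

```python
import re

# Pattern as quoted from PEP 263 in this thread.
cookie = re.compile(r"coding[:=]\s*([-\w.]+)")
line = "# -*- coding: utf-8 -*-"

# re.match() only tries position 0, where the line has "#", so it fails.
print(cookie.match(line))            # None

# re.search() scans forward and finds the cookie.
print(cookie.search(line).group(1))  # utf-8
```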

--
~Ethan~


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-20 Thread Nick Coghlan
On 20 March 2016 at 07:46, Glenn Linderman  wrote:
> Diagnosing ambiguous conditions, even including my example above, might be
> useful... for a few files... is it worth the effort? What % of .py sources
> have coding specifications? What % of those have two?

And there's a decent argument for leaving detecting such cases to
linters rather than the tokeniser.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-20 Thread M.-A. Lemburg
On 17.03.2016 15:02, Serhiy Storchaka wrote:
> On 17.03.16 15:14, M.-A. Lemburg wrote:
>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> I'd prefer a separate utility for this somewhere, since
>> tokenize.detect_encoding() is not available in Python 2.
>>
>> I've attached an example implementation with tests, which works
>> in Python 2.7 and 3.
> 
> Sorry, but this code doesn't match the behaviour of Python interpreter,
> nor other tools. I suggest to backport tokenize.detect_encoding() (but
> be aware that the default encoding in Python 2 is ASCII, not UTF-8).

Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
for the note.

What other aspects are different than what Python implements ?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Mar 17 2016)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...   http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...   http://zope.egenix.com/

2016-03-07: Released eGenix pyOpenSSL 0.13.14 ... http://egenix.com/go89
2016-02-19: Released eGenix PyRun 2.1.2 ...   http://egenix.com/go88

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
  http://www.malemburg.com/



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 17.03.16 15:14, M.-A. Lemburg wrote:

On 17.03.2016 01:29, Guido van Rossum wrote:

Should we recommend that everyone use tokenize.detect_encoding()?


I'd prefer a separate utility for this somewhere, since
tokenize.detect_encoding() is not available in Python 2.

I've attached an example implementation with tests, which works
in Python 2.7 and 3.


Sorry, but this code doesn't match the behaviour of the Python interpreter,
nor of other tools. I suggest backporting tokenize.detect_encoding() (but
be aware that the default encoding in Python 2 is ASCII, not UTF-8).





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Glenn Linderman

On 3/16/2016 12:59 AM, Serhiy Storchaka wrote:

On 16.03.16 09:46, Glenn Linderman wrote:

On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:

On 16.03.16 08:34, Glenn Linderman wrote:

 From the PEP 263:


More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)". The first group of this
expression is then interpreted as encoding name. If the encoding
is unknown to Python, an error is raised during compilation. There
must not be any Python statement on the line that contains the
encoding declaration.


Clearly the regular expression would only match the first of multiple
cookies on the same line, so the first one should always win... but
there should only be one, from the first PEP quote "a magic comment".


"The first group of this expression" means the first regular
expression group. Only the part between parenthesis "([-\w.]+)" is
interpreted as encoding name, not all expression.


Sure.  But there is no mention anywhere in the PEP of more than one
being legal: just more than one position for it, EITHER line 1 or line
2. So while the regular expression mentioned is not anchored, to allow
variation in syntax between emacs and vim, "must match the regular
expression" doesn't imply "several times", and when searching for a
regular expression that might not be anchored, one typically expects to
find the first.


Actually "must match the regular expression" is not correct, because 
re.match() implies anchoring at the start. I have proposed more 
correct regular expression in other branch of this thread.


"match" doesn't imply anchoring at the start.  "re.match()" does (and as 
a result is very confusing to newbies to Python re, that have used other 
regexp systems).


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Glenn Linderman

On 3/19/2016 2:37 PM, Serhiy Storchaka wrote:

On 19.03.16 19:36, Glenn Linderman wrote:

On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:

On 16.03.16 08:03, Serhiy Storchaka wrote:
I just tested with Emacs, and it looks that when specify different
codings on two different lines, the first coding wins, but when
specify different codings on the same line, the last coding wins.

Therefore current CPython behavior can be correct, and the regular
expression in PEP 263 should be changed to use greedy repetition.


Just because emacs works that way (and even though I'm an emacs user),
that doesn't mean CPython should act like emacs.


Yes. But current CPython works that way. The behavior of Emacs is the 
argument that maybe this is not a bug.


If CPython properly handles the following line as having only one proper 
coding declaration (utf-8), then I might reluctantly agree that the 
behavior of Emacs might be a relevant argument.  Otherwise, vehemently 
not relevant.


  # -*- coding: utf-8 -*- this file does not use coding: latin-1





(4) there is no benefit to specifying the coding twice on a line, it
only adds confusion, whether in CPython, emacs, or vim.
(4a) Here's an untested line that emacs would interpret as utf-8, and
CPython with the greedy regular expression would interpret as latin-1,
because emacs looks only between the -*- pair, and CPython ignores that.
   # -*- coding: utf-8 -*- this file does not use coding: latin-1


Since Emacs allows to specify the coding twice on a line, and this can 
be ambiguous, and CPython already detects some ambiguous situations 
(UTF-8 BOM and non-UTF-8 coding cookie), it may be worth to add a 
check that the coding is specified only once on a line.


Diagnosing ambiguous conditions, even including my example above, might 
be useful... for a few files... is it worth the effort? What % of .py 
sources have coding specifications? What % of those have two?


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 19.03.16 19:36, Glenn Linderman wrote:

On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:

On 16.03.16 08:03, Serhiy Storchaka wrote:
I just tested with Emacs, and it looks that when specify different
codings on two different lines, the first coding wins, but when
specify different codings on the same line, the last coding wins.

Therefore current CPython behavior can be correct, and the regular
expression in PEP 263 should be changed to use greedy repetition.


Just because emacs works that way (and even though I'm an emacs user),
that doesn't mean CPython should act like emacs.


Yes. But current CPython works that way. The behavior of Emacs is an
argument that maybe this is not a bug.



(4) there is no benefit to specifying the coding twice on a line, it
only adds confusion, whether in CPython, emacs, or vim.
(4a) Here's an untested line that emacs would interpret as utf-8, and
CPython with the greedy regular expression would interpret as latin-1,
because emacs looks only between the -*- pair, and CPython ignores that.
   # -*- coding: utf-8 -*- this file does not use coding: latin-1


Since Emacs allows specifying the coding twice on a line, and this can
be ambiguous, and CPython already detects some ambiguous situations
(UTF-8 BOM and non-UTF-8 coding cookie), it may be worth adding a check
that the coding is specified only once on a line.





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread M.-A. Lemburg
On 17.03.2016 18:53, Serhiy Storchaka wrote:
> On 17.03.16 19:23, M.-A. Lemburg wrote:
>> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>>
>>>> I'd prefer a separate utility for this somewhere, since
>>>> tokenize.detect_encoding() is not available in Python 2.
>>>>
>>>> I've attached an example implementation with tests, which works
>>>> in Python 2.7 and 3.
>>>
>>> Sorry, but this code doesn't match the behaviour of Python interpreter,
>>> nor other tools. I suggest to backport tokenize.detect_encoding() (but
>>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>>
>> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
>> for the note.
>>
>> What other aspects are different than what Python implements ?
> 
> 1. If there is a BOM and coding cookie, the source encoding is "utf-8-sig".

Ok, that makes sense (even though it's not mandated by the PEP;
the utf-8-sig codec didn't exist yet).

> 2. If there is a BOM and coding cookie is not 'utf-8', this is an error.

It's an error for Python, but why should a detection function
always raise an error for this case ? It would probably be a good
idea to have an errors parameter to leave this to the user to decide.

Same for unknown encodings.

> 3. If the first line is not blank or comment line, the coding cookie is
> not searched in the second line.

Hmm, the PEP does allow having the coding cookie in the
second line, even if the first line is not a comment. Perhaps
that's not really needed.

> 4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and
> "utf-8" is the same encoding (and all are changed to "utf-8-sig" with BOM).

Well, that's cosmetics :-) The codec system will take care of
this when needed.

> 5. There isn't the limit of 400 bytes. Actually there is a bug with
> handling long lines in current code, but even with this bug the limit is
> larger.

I think it's a reasonable limit, since shebang lines may only be
127 long on at least Linux (and probably several other Unix systems
as well).

But just in case, I made this configurable :-)

> 6. I made a mistake in the regular expression, missed the underscore.

I added it.

> tokenize.detect_encoding() is the closest imitation of the behavior of
> Python interpreter.

Probably, but that doesn't help us on Python 2, right?
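
For comparison, here is a short sketch of what the Python 3 `tokenize.detect_encoding()` does for a few of the cases listed above. The wrapper `detect` and the sample inputs are only for illustration, and the behaviour shown is that of the current stdlib implementation as far as I understand it, not something PEP 263 guarantees.

```python
import codecs
import io
import tokenize


def detect(data):
    """Illustrative wrapper: return the encoding name for a bytes source."""
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


# Points 1 and 2: a BOM turns utf-8 into utf-8-sig and rejects other cookies.
print(detect(codecs.BOM_UTF8 + b"# coding: utf-8\n"))
try:
    detect(codecs.BOM_UTF8 + b"# coding: latin-1\n")
except SyntaxError as exc:
    print("error:", exc)

# Point 3: a statement on line 1 stops the search, so the default is used.
print(detect(b"x = 1\n# coding: latin-1\n"))

# Point 4: spelling variants such as utf_8 are normalized.
print(detect(b"# coding: utf_8\n"))
```

Note that the default here is UTF-8, which is the Python 3 rule; a Python 2 backport would have to return ASCII instead.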

I'll upload the script to github later today or tomorrow to
continue development.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Stephen J. Turnbull
Glenn Linderman writes:
 > On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:

 > > Therefore current CPython behavior can be correct, and the regular 
 > > expression in PEP 263 should be changed to use greedy repetition.
 > 
 > Just because emacs works that way (and even though I'm an emacs user), 
 > that doesn't mean CPython should act like emacs.
 > 
 > (1) CPython should not necessarily act like emacs,

We can't treat Emacs as a spec, because Emacs doesn't follow specs,
doesn't respect standards, and above a certain level of inconvenience
to developers doesn't respect backward compatibility.  There's never
any guarantee that Emacs will do the same thing tomorrow that it does
today, although inertia has mostly the same effect.

In this case, there's a reason why Emacs behaves the way it does,
which is that you can put an arbitrary sequence of variable
assignments in "-*- ... -*-" and they will be executed in order.  So
it makes sense that "last coding wins".  But pragmas are severely
deprecated in Python; cookies got a very special exception.  So that
rationale can't apply to Python.

 > (4) there is no benefit to specifying the coding twice on a line, it 
 > only adds confusion, whether in CPython, emacs, or vim.

Indeed.  I see no point in reading past the first cookie found
(whether a valid codec or not), unless an error would be raised.  That
might be a good idea, but I doubt it's worth the implementation
complexity.


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Glenn Linderman

On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:

On 16.03.16 08:03, Serhiy Storchaka wrote:

On 15.03.16 22:30, Guido van Rossum wrote:

I came across a file that had two different coding cookies -- one on
the first line and one on the second. CPython uses the first, but mypy
happens to use the second. I couldn't find anything in the spec or
docs ruling out the second interpretation. Does anyone have a
suggestion (apart from following CPython)?

Reference: https://github.com/python/mypy/issues/1281


There is similar question. If a file has two different coding cookies on
the same line, what should win? Currently the last cookie wins, in
CPython parser, in the tokenize module, in IDLE, and in number of other
code. I think this is a bug.


I just tested with Emacs, and it looks that when specify different 
codings on two different lines, the first coding wins, but when 
specify different codings on the same line, the last coding wins.


Therefore current CPython behavior can be correct, and the regular 
expression in PEP 263 should be changed to use greedy repetition.


Just because emacs works that way (and even though I'm an emacs user), 
that doesn't mean CPython should act like emacs.


(1) CPython should not necessarily act like emacs, unless the coding 
syntax exactly matches emacs, rather than the generic coding that 
CPython interprets, that matches emacs, vim, and other similar things 
that both emacs and vim would ignore.
(1a) Maybe if a similar test were run on vim with its syntax, and it 
also works the same way, then one might think it is a trend worth 
following, but it is not clear to this non-vim user that vim syntax 
allows more than one coding specification per line.


(2) emacs has no requirement that the coding be placed on the first two 
lines. It specifically looks at the second line only if the first line 
has a “ #! ” or a “ '\" ” (for troff). (according to docs, not 
experimentation)


(3) emacs also allows for Local Variables to be specified at the end of 
the file.  If CPython were really to act like emacs, then it would need 
to allow for that too.


(4) there is no benefit to specifying the coding twice on a line, it 
only adds confusion, whether in CPython, emacs, or vim.
(4a) Here's an untested line that emacs would interpret as utf-8, and 
CPython with the greedy regular expression would interpret as latin-1,
because emacs looks only between the -*- pair, and CPython ignores that.

  # -*- coding: utf-8 -*- this file does not use coding: latin-1
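
To make (4a) concrete, here is a sketch comparing a greedy and a non-greedy prefix on that line. Both patterns are simplified stand-ins written for this example, not the exact expressions used by CPython or Emacs.

```python
import re

line = "# -*- coding: utf-8 -*- this file does not use coding: latin-1"

# Greedy ".*" runs to the end of the line and backtracks to the LAST cookie.
greedy = re.compile(r"^[ \t]*#.*coding[:=][ \t]*([-\w.]+)")
# Non-greedy ".*?" stops at the FIRST cookie.
lazy = re.compile(r"^[ \t]*#.*?coding[:=][ \t]*([-\w.]+)")

print(greedy.match(line).group(1))  # latin-1
print(lazy.match(line).group(1))    # utf-8
```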


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread M.-A. Lemburg
On 17.03.2016 01:29, Guido van Rossum wrote:
> I've updated the PEP. Please review. I decided not to update the
> Unicode howto (the thing is too obscure). Serhiy, you're probably in a
> better position to fix the code looking for cookies to pick the first
> one if there are two on the same line (or do whatever you think should
> be done there).

Thanks, will do.

> Should we recommend that everyone use tokenize.detect_encoding()?

I'd prefer a separate utility for this somewhere, since
tokenize.detect_encoding() is not available in Python 2.

I've attached an example implementation with tests, which works
in Python 2.7 and 3.

> On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum  wrote:
>> On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg  wrote:
>>> The only reason to read up to two lines was to address the use of
>>> the shebang on Unix, not to be able to define two competing
>>> source code encodings :-)
>>
>> I know. I was just surprised that the PEP was sufficiently vague about
>> it that when I found that mypy picked the second if there were two, I
>> couldn't prove to myself that it was violating the PEP. I'd rather
>> clarify the PEP than rely on the reasoning presented earlier here.

I suppose it's a rather rare case, since it's the first time
that I heard about anyone thinking that a possible second line
could be picked - after 15 years :-)

>> I don't like erroring out when there are two different cookies on two
>> lines; I feel that the spirit of the PEP is to read up to two lines
>> until a cookie is found, whichever comes first.
>>
>> I will update the regex in the PEP too (or change the wording to avoid 
>> "match").
>>
>> I'm not sure what to do if there are two cookies on one line. If
>> CPython currently picks the latter we may want to preserve that
>> behavior.
>>
>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> --
>> --Guido van Rossum (python.org/~guido)
> 
> 
> 

-- 
Marc-Andre Lemburg
eGenix.com


#!/usr/bin/python
"""
Utility to detect the source code encoding of a Python file.

Marc-Andre Lemburg, 2016.

Supports Python 2.7 and 3.

"""
import sys
import re
import codecs

# Debug output ?
_debug = True

# PEP 263 RE
PEP263 = re.compile(b'^[ \t]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)',
                    re.MULTILINE)

###

def detect_source_encoding(code, buffer_size=400):

    """ Detect and return the source code encoding of the Python code
        given in code.

        code must be given as bytes.

        The function uses a buffer to determine the first two code lines
        with a default size of 400 bytes/code points.  This can be adjusted
        using the buffer_size parameter.

    """
    # Get the first two lines
    first_two_lines = b'\n'.join(code[:buffer_size].splitlines()[:2])
    # BOMs override any source code encoding comments
    if first_two_lines.startswith(codecs.BOM):
        return 'utf-8'
    # .search() picks the first occurrence
    m = PEP263.search(first_two_lines)
    if m is None:
        return 'ascii'
    return m.group(1).decode('ascii')

# Tests

def _test():

    l = (
        (b"""\
# No encoding
""", 'ascii'),
        (b"""\
# coding: latin-1
""", 'latin-1'),
        (b"""\
#!/usr/bin/python
# coding: utf-8
""", 'utf-8'),
        (b"""\
coding=123
# The above could be detected as source code encoding
""", 'ascii'),
        (b"""\
# coding: latin-1
# coding: utf-8
""", 'latin-1'),
        (b"""\
# No encoding on first line
# No encoding on second line
# coding: utf-8
""", 'ascii'),
        (codecs.BOM + b"""\
# No encoding
""", 'utf-8'),
        (codecs.BOM + b"""\
# BOM and encoding
# coding: latin-1
""", 'utf-8'),
    )
    for code, encoding in l:
        if _debug:
            print ('=' * 72)
            print ('Checking:')
            print ('-' * 72)
            print (code.decode('latin-1'))
            print ('-' * 72)
        detected_encoding = detect_source_encoding(code)
        if _debug:
            print ('detected: %s, expected: %s' %
                   (detected_encoding, encoding))
        assert detected_encoding == encoding

Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 16.03.16 08:03, Serhiy Storchaka wrote:

On 15.03.16 22:30, Guido van Rossum wrote:

I came across a file that had two different coding cookies -- one on
the first line and one on the second. CPython uses the first, but mypy
happens to use the second. I couldn't find anything in the spec or
docs ruling out the second interpretation. Does anyone have a
suggestion (apart from following CPython)?

Reference: https://github.com/python/mypy/issues/1281


There is a similar question. If a file has two different coding cookies on
the same line, which should win? Currently the last cookie wins, in the
CPython parser, in the tokenize module, in IDLE, and in a number of other
places. I think this is a bug.


I just tested with Emacs, and it looks like when you specify different
codings on two different lines, the first coding wins, but when you specify
different codings on the same line, the last coding wins.


Therefore current CPython behavior can be correct, and the regular 
expression in PEP 263 should be changed to use greedy repetition.





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread M.-A. Lemburg
On 17.03.2016 15:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka  wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
> 
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).

I've uploaded the code I posted yesterday, modified to address
some of the issues it had to github:

https://github.com/malemburg/python-snippets/blob/master/detect_source_encoding.py

I'm pretty sure the two-lines read can be optimized away and
put straight into the regular expression used for matching.
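
One way such a single-expression version might look is sketched below. This is not the code from the linked snippet; the pattern and the name `first_cookie` are assumptions made for illustration. An optional, lazily matched first line (blank or comment only) may precede the cookie line, so line 1 is tried before line 2. BOM handling is left out.

```python
import re

# \A anchors at the start of the data. The lazy group "(...)??" makes the
# engine look for the cookie on line 1 first, and only then skip a
# blank-or-comment line 1 and look on line 2.
TWO_LINE_COOKIE = re.compile(
    br"\A(?:[ \t\f]*(?:#[^\n]*)?\n)??"
    br"[ \t\f]*#[^\n]*?coding[:=][ \t]*([-\w.]+)"
)


def first_cookie(data):
    """Return the encoding named in the first two lines of data, or None."""
    m = TWO_LINE_COOKIE.match(data)
    return m.group(1).decode("ascii") if m else None
```

Because `[^\n]` never crosses a line boundary, a cookie on the third line is not reached even though no lines are split off beforehand.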

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 17.03.16 21:11, Guido van Rossum wrote:

I tried this and it was too painful, so now I've just
changed the regex that mypy uses to use non-eager matching
(https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9fe5).


\s* matches newlines.

{0,1}? is the same as ??.
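
Both points are easy to check with a couple of made-up inputs. The patterns below are simplified for the demonstration and are not the exact mypy expression.

```python
import re

# \s* also matches "\n", so the "encoding name" can be picked up from the
# next line; [ \t]* stays on the same line.
text = "# coding:\nlatin_1 = 1\n"
across_lines = re.search(r"coding[:=]\s*([-\w.]+)", text)
same_line = re.search(r"coding[:=][ \t]*([-\w.]+)", text)
print(across_lines.group(1))  # latin_1
print(same_line)              # None

# {0,1}? and ?? are the same lazy "zero or one" quantifier.
for pattern in (r"a{0,1}?b", r"a??b"):
    print(pattern, re.match(pattern, "ab").group(), re.match(pattern, "b").group())
```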




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Terry Reedy

On 3/16/2016 3:14 AM, Serhiy Storchaka wrote:
> On 16.03.16 02:28, Guido van Rossum wrote:
>> I agree that the spirit of the PEP is to stop at the first coding
>> cookie found. Would it be okay if I updated the PEP to clarify this?
>> I'll definitely also update the docs.
>
> Could you please also update the regular expression in PEP 263 to
> "^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)"?
>
> The coding cookie must be in a comment, only the first occurrence in the
> line must be taken into account (there is a bug in CPython here), the
> encoding name must be ASCII, and there must not be any Python statement
> on the line that contains the encoding declaration. [1]
>
> [1] https://bugs.python.org/issue18873

Also, I think there should be one 'official' function somewhere in the
stdlib to get and return the encoding declaration. The patch for the
issue above had to make the same change in four places other than tests,
a violent violation of DRY.


--
Terry Jan Reedy



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Guido van Rossum
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka  wrote:
>> Should we recommend that everyone use tokenize.detect_encoding()?
>
> Likely. However the interface of tokenize.detect_encoding() is not very
> simple.

I just found that out yesterday. You have to give it a readline()
function, which is cumbersome if all you have is a (byte) string and
you don't want to split it on lines just yet. And the readline()
function raises SyntaxError when the encoding isn't right. I wish
there were a lower-level helper that just took a line and told you
what the encoding in it was, if any. Then the rest of the logic can be
handled by the caller (including the logic of trying up to two lines).
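
A minimal sketch of what such a lower-level helper could look like. The names cookie_in_line and find_cookie are hypothetical (nothing like them exists in the stdlib), the cookie expression is the one proposed elsewhere in this thread, and the blank-or-comment test is an assumption modelled on what tokenize does:

```python
import re

# Expression proposed in this thread, compiled for bytes input.
COOKIE_RE = re.compile(rb"^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)")
# Assumption: a line is "blank or comment" if only whitespace precedes '#' or EOL.
BLANK_RE = re.compile(rb"^[ \t\v]*(?:[#\r\n]|$)")


def cookie_in_line(line):
    """Hypothetical helper: return the encoding named on this line, or None."""
    m = COOKIE_RE.match(line)
    return m.group(1).decode("ascii") if m else None


def find_cookie(data):
    """Caller-side logic: look at up to two lines; the first cookie found wins."""
    for lineno, line in enumerate(data.split(b"\n", 2)[:2], start=1):
        name = cookie_in_line(line)
        if name is not None:
            return name, lineno
        if not BLANK_RE.match(line):
            break  # a statement on line 1 means line 2 is not searched
    return None, None
```

Because the helper never raises, the caller decides what to do with an unknown name and already has the line number in hand.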

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Brett Cannon
On Thu, 17 Mar 2016 at 07:56 Guido van Rossum  wrote:

> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka 
> wrote:
> >> Should we recommend that everyone use tokenize.detect_encoding()?
> >
> > Likely. However the interface of tokenize.detect_encoding() is not very
> > simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).
>

Since this is for mypy my guess is you only want to know the encoding, but
if you're simply trying to decode bytes of syntax then
importlib.util.decode_source() will handle that for you.
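
A short usage sketch, assuming Python 3.4 or later where importlib.util.decode_source() is available. The sample source bytes are made up:

```python
import importlib.util

# Latin-1 encoded source: b"\xe9" is "e with acute accent" in that encoding.
source_bytes = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"

# decode_source() honours the coding cookie and returns the decoded text.
text = importlib.util.decode_source(source_bytes)
```

This returns the decoded text only. It does not tell you which encoding was chosen, so it does not help if the encoding name itself is what you need.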


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 17.03.16 21:11, Guido van Rossum wrote:
> This will raise SyntaxError if the encoding is unknown. That needs to
> be caught in mypy's case and then it needs to get the line number from
> the exception.

Good point. The "lineno" and "offset" attributes of SyntaxError are set to
None by tokenize.detect_encoding() and to 0 by the CPython interpreter. They
should be set to useful values.
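
A minimal sketch of catching that SyntaxError around tokenize.detect_encoding(). The wrapper name detect_or_report is hypothetical, and the sketch deliberately does not rely on lineno or offset, since those are the attributes reported above as not being useful:

```python
import io
import tokenize


def detect_or_report(data):
    """Hypothetical wrapper: return (encoding, None) or (None, error message)."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError as exc:
        # Only the message is used; the line number would have to be
        # worked out by the caller.
        return None, exc.msg
    return encoding, None


ok = detect_or_report(b"# coding: utf-8\nx = 1\n")
bad_encoding, bad_message = detect_or_report(b"# coding: no-such-codec\n")
```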





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Glenn Linderman

On 3/16/2016 5:29 PM, Guido van Rossum wrote:
> I've updated the PEP. Please review. I decided not to update the
> Unicode howto (the thing is too obscure). Serhiy, you're probably in a
> better position to fix the code looking for cookies to pick the first
> one if there are two on the same line (or do whatever you think should
> be done there).
>
> Should we recommend that everyone use tokenize.detect_encoding()?
>
> On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum wrote:
>> On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg wrote:
>>> The only reason to read up to two lines was to address the use of
>>> the shebang on Unix, not to be able to define two competing
>>> source code encodings :-)
>>
>> I know. I was just surprised that the PEP was sufficiently vague about
>> it that when I found that mypy picked the second if there were two, I
>> couldn't prove to myself that it was violating the PEP. I'd rather
>> clarify the PEP than rely on the reasoning presented earlier here.

Oh sure.  Updating the PEP is the best way forward. But the reasoning,
although from somewhat vague specifications, seems sound enough to
declare that it meant "find the first cookie in the first two lines".

Which is what you've said in the update, although not quite that
tersely.  It now leaves no room for ambiguous interpretations.

>> I don't like erroring out when there are two different cookies on two
>> lines; I feel that the spirit of the PEP is to read up to two lines
>> until a cookie is found, whichever comes first.

The only reason for an error would be to alert people that had depended
on the bugs, or misinterpretations.

Personally, I think if they haven't converted to UTF-8 by now, they've
got bigger problems than this change.

>> I will update the regex in the PEP too (or change the wording to avoid "match").
>>
>> I'm not sure what to do if there are two cookies on one line. If
>> CPython currently picks the latter we may want to preserve that
>> behavior.
>>
>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> --
>> --Guido van Rossum (python.org/~guido)



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Guido van Rossum
I've updated the PEP. Please review. I decided not to update the
Unicode howto (the thing is too obscure). Serhiy, you're probably in a
better position to fix the code looking for cookies to pick the first
one if there are two on the same line (or do whatever you think should
be done there).

Should we recommend that everyone use tokenize.detect_encoding()?

On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum  wrote:
> On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg  wrote:
>> The only reason to read up to two lines was to address the use of
>> the shebang on Unix, not to be able to define two competing
>> source code encodings :-)
>
> I know. I was just surprised that the PEP was sufficiently vague about
> it that when I found that mypy picked the second if there were two, I
> couldn't prove to myself that it was violating the PEP. I'd rather
> clarify the PEP than rely on the reasoning presented earlier here.
>
> I don't like erroring out when there are two different cookies on two
> lines; I feel that the spirit of the PEP is to read up to two lines
> until a cookie is found, whichever comes first.
>
> I will update the regex in the PEP too (or change the wording to avoid 
> "match").
>
> I'm not sure what to do if there are two cookies on one line. If
> CPython currently picks the latter we may want to preserve that
> behavior.
>
> Should we recommend that everyone use tokenize.detect_encoding()?
>
> --
> --Guido van Rossum (python.org/~guido)



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Stephen J. Turnbull
Guido van Rossum writes:

 > > Should we recommend that everyone use tokenize.detect_encoding()?

+1



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 17.03.16 02:29, Guido van Rossum wrote:
> I've updated the PEP. Please review. I decided not to update the
> Unicode howto (the thing is too obscure). Serhiy, you're probably in a
> better position to fix the code looking for cookies to pick the first
> one if there are two on the same line (or do whatever you think should
> be done there).

http://bugs.python.org/issue26581

> Should we recommend that everyone use tokenize.detect_encoding()?

Likely. However the interface of tokenize.detect_encoding() is not very
simple.





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-19 Thread Serhiy Storchaka

On 17.03.16 19:23, M.-A. Lemburg wrote:
> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>> I'd prefer a separate utility for this somewhere, since
>>> tokenize.detect_encoding() is not available in Python 2.
>>>
>>> I've attached an example implementation with tests, which works
>>> in Python 2.7 and 3.
>>
>> Sorry, but this code doesn't match the behaviour of the Python interpreter,
>> nor other tools. I suggest backporting tokenize.detect_encoding() (but
>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>
> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
> for the note.
>
> What other aspects are different than what Python implements?

1. If there is a BOM and a coding cookie, the source encoding is "utf-8-sig".

2. If there is a BOM and the coding cookie is not 'utf-8', this is an error.

3. If the first line is not a blank or comment line, the coding cookie is
not searched for in the second line.

4. The encoding name should be canonicalized. "UTF8", "utf8", "utf_8" and
"utf-8" are the same encoding (and all are changed to "utf-8-sig" with a BOM).

5. There is no limit of 400 bytes. Actually there is a bug with
handling long lines in the current code, but even with this bug the limit is
larger.

6. I made a mistake in the regular expression: I missed the underscore.

tokenize.detect_encoding() is the closest imitation of the behavior of
the Python interpreter.
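
A minimal sketch that exercises points 1 to 4 against tokenize.detect_encoding(), assuming a current Python 3. The helper name detect and the sample inputs are illustrative. Only the spellings "utf_8" and "Latin-1" are checked for point 4, so the sketch makes no claim about how other spellings come back:

```python
import codecs
import io
import tokenize


def detect(data):
    """Illustrative helper: return only the encoding name."""
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


bom = codecs.BOM_UTF8

# Point 1: BOM plus a utf-8 cookie is reported as "utf-8-sig".
with_bom = detect(bom + b"# coding: utf-8\nx = 1\n")

# Point 2: BOM plus a cookie naming another encoding is an error.
try:
    detect(bom + b"# coding: latin-1\n")
    bom_conflict = False
except SyntaxError:
    bom_conflict = True

# Point 3: a statement on line 1 stops the search, so the default is used.
second_line_skipped = detect(b"import os\n# coding: latin-1\n")

# Point 4: these two spellings come back in a canonical form.
normalized = (detect(b"# coding: utf_8\n"), detect(b"# coding: Latin-1\n"))
```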




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-18 Thread Serhiy Storchaka

On 17.03.16 16:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).

The simplest way to detect the encoding of a bytes string:

    import tokenize

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all the data into lines, the most efficient way in
Python 3.5 is:

    import io

    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5 io.BytesIO(data) has constant complexity.

In older versions, to detect the encoding without copying the data or
splitting all of it into lines, you should write a line iterator. For example:

    def iterlines(data):
        # Yield one line at a time, keeping the trailing b'\n'.
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:  # find() returned -1: no more newlines
                break
            yield data[start:end]
            start = end
        yield data[start:]  # whatever follows the last newline

    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or

    import re

    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-18 Thread Guido van Rossum
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg  wrote:
> The only reason to read up to two lines was to address the use of
> the shebang on Unix, not to be able to define two competing
> source code encodings :-)

I know. I was just surprised that the PEP was sufficiently vague about
it that when I found that mypy picked the second if there were two, I
couldn't prove to myself that it was violating the PEP. I'd rather
clarify the PEP than rely on the reasoning presented earlier here.

I don't like erroring out when there are two different cookies on two
lines; I feel that the spirit of the PEP is to read up to two lines
until a cookie is found, whichever comes first.

I will update the regex in the PEP too (or change the wording to avoid "match").

I'm not sure what to do if there are two cookies on one line. If
CPython currently picks the latter we may want to preserve that
behavior.

Should we recommend that everyone use tokenize.detect_encoding()?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Serhiy Storchaka

On 16.03.16 09:46, Glenn Linderman wrote:
> On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
>> On 16.03.16 08:34, Glenn Linderman wrote:
>>> From the PEP 263:
>>>
>>>     More precisely, the first or second line must match the regular
>>>     expression "coding[:=]\s*([-\w.]+)". The first group of this
>>>     expression is then interpreted as encoding name. If the encoding
>>>     is unknown to Python, an error is raised during compilation. There
>>>     must not be any Python statement on the line that contains the
>>>     encoding declaration.
>>>
>>> Clearly the regular expression would only match the first of multiple
>>> cookies on the same line, so the first one should always win... but
>>> there should only be one, from the first PEP quote "a magic comment".
>>
>> "The first group of this expression" means the first regular
>> expression group. Only the part between the parentheses, "([-\w.]+)", is
>> interpreted as the encoding name, not the whole expression.
>
> Sure.  But there is no mention anywhere in the PEP of more than one
> being legal: just more than one position for it, EITHER line 1 or line
> 2. So while the regular expression mentioned is not anchored, to allow
> variation in syntax between emacs and vim, "must match the regular
> expression" doesn't imply "several times", and when searching for a
> regular expression that might not be anchored, one typically expects to
> find the first.

Actually "must match the regular expression" is not correct, because
re.match() implies anchoring at the start. I have proposed a more correct
regular expression in another branch of this thread.
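
A small sketch of the anchoring point, using the expression quoted from PEP 263 above. The sample comment line and the variable names are illustrative:

```python
import re

PEP_263_RE = r"coding[:=]\s*([-\w.]+)"
line = "# -*- coding: utf-8 -*-"

# re.match() only tries position 0, where the line starts with "#".
anchored = re.match(PEP_263_RE, line)

# re.search() scans forward, which is what the PEP wording really meant.
searched = re.search(PEP_263_RE, line)
```

Here anchored is None, while searched finds the cookie and captures "utf-8" in its first group.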




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread M.-A. Lemburg
On 16.03.2016 01:28, Guido van Rossum wrote:
> I agree that the spirit of the PEP is to stop at the first coding
> cookie found. Would it be okay if I updated the PEP to clarify this?
> I'll definitely also update the docs.

+1

The only reason to read up to two lines was to address the use of
the shebang on Unix, not to be able to define two competing
source code encodings :-)

> On Tue, Mar 15, 2016 at 2:04 PM, Brett Cannon  wrote:
>>
>>
>> On Tue, 15 Mar 2016 at 13:31 Guido van Rossum  wrote:
>>>
>>> I came across a file that had two different coding cookies -- one on
>>> the first line and one on the second. CPython uses the first, but mypy
>>> happens to use the second. I couldn't find anything in the spec or
>>> docs ruling out the second interpretation. Does anyone have a
>>> suggestion (apart from following CPython)?
>>>
>>> Reference: https://github.com/python/mypy/issues/1281
>>
>>
>> I think the spirit of PEP 263 is for the first specified encoding to win as
>> the support of two lines is to support shebangs and not multiple encodings
>> :) . I also think the fact that tokenize.detect_encoding() doesn't
>> automatically read two lines from its input also suggests the intent is
>> "first encoding wins" (and that is the semantics of the function).
> 
> 
> 

-- 
Marc-Andre Lemburg
eGenix.com




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Glenn Linderman

On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
> On 16.03.16 08:34, Glenn Linderman wrote:
>> From the PEP 263:
>>
>>     More precisely, the first or second line must match the regular
>>     expression "coding[:=]\s*([-\w.]+)". The first group of this
>>     expression is then interpreted as encoding name. If the encoding
>>     is unknown to Python, an error is raised during compilation. There
>>     must not be any Python statement on the line that contains the
>>     encoding declaration.
>>
>> Clearly the regular expression would only match the first of multiple
>> cookies on the same line, so the first one should always win... but
>> there should only be one, from the first PEP quote "a magic comment".
>
> "The first group of this expression" means the first regular
> expression group. Only the part between the parentheses, "([-\w.]+)", is
> interpreted as the encoding name, not the whole expression.

Sure.  But there is no mention anywhere in the PEP of more than one
being legal: just more than one position for it, EITHER line 1 or line
2. So while the regular expression mentioned is not anchored, to allow
variation in syntax between emacs and vim, "must match the regular
expression" doesn't imply "several times", and when searching for a
regular expression that might not be anchored, one typically expects to
find the first.


Glenn


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Serhiy Storchaka

On 16.03.16 02:28, Guido van Rossum wrote:
> I agree that the spirit of the PEP is to stop at the first coding
> cookie found. Would it be okay if I updated the PEP to clarify this?
> I'll definitely also update the docs.

Could you please also update the regular expression in PEP 263 to
"^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)"?

The coding cookie must be in a comment, only the first occurrence in the
line must be taken into account (there is a bug in CPython here), the
encoding name must be ASCII, and there must not be any Python statement
on the line that contains the encoding declaration. [1]

[1] https://bugs.python.org/issue18873
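
A minimal sketch of how the proposed expression treats a line with two cookies and a line that carries a statement. The GREEDY variant is not from the thread; it is an assumption added only to contrast first-wins with last-wins, and the sample lines are made up:

```python
import re

# Expression proposed above: anchored, comment-only, lazy before "coding".
PROPOSED = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)")
# Contrast only (hypothetical): the same expression with a greedy ".*".
GREEDY = re.compile(r"^[ \t\v]*#.*coding[:=][ \t]*([-.a-zA-Z0-9]+)")

two_cookies = "# coding: latin-1 coding: utf-8"
first = PROPOSED.match(two_cookies).group(1)   # lazy: first cookie on the line
last = GREEDY.match(two_cookies).group(1)      # greedy: last cookie on the line

# A statement before the comment prevents any match at all.
statement_line = PROPOSED.match("x = 1  # coding: utf-8")
```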



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Serhiy Storchaka

On 16.03.16 08:34, Glenn Linderman wrote:
> From the PEP 263:
>
>     More precisely, the first or second line must match the regular
>     expression "coding[:=]\s*([-\w.]+)". The first group of this
>     expression is then interpreted as encoding name. If the encoding
>     is unknown to Python, an error is raised during compilation. There
>     must not be any Python statement on the line that contains the
>     encoding declaration.
>
> Clearly the regular expression would only match the first of multiple
> cookies on the same line, so the first one should always win... but
> there should only be one, from the first PEP quote "a magic comment".

"The first group of this expression" means the first regular expression
group. Only the part between the parentheses, "([-\w.]+)", is interpreted as
the encoding name, not the whole expression.





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Glenn Linderman

On 3/15/2016 11:07 PM, Chris Angelico wrote:
> On Wed, Mar 16, 2016 at 5:03 PM, Serhiy Storchaka wrote:
>> On 15.03.16 22:30, Guido van Rossum wrote:
>>> I came across a file that had two different coding cookies -- one on
>>> the first line and one on the second. CPython uses the first, but mypy
>>> happens to use the second. I couldn't find anything in the spec or
>>> docs ruling out the second interpretation. Does anyone have a
>>> suggestion (apart from following CPython)?
>>>
>>> Reference: https://github.com/python/mypy/issues/1281
>>
>> There is a similar question. If a file has two different coding cookies on
>> the same line, which should win? Currently the last cookie wins, in the
>> CPython parser, in the tokenize module, in IDLE, and in a number of other
>> places. I think this is a bug.
>
> Why would you ever have two coding cookies in a file? Surely this
> should be either an error, or ill-defined (ie parsers are allowed to
> pick whichever they like, including raising)?
>
> ChrisA

From the PEP 263:

    To define a source code encoding, a magic comment must
    be placed into the source files either as first or second
    line in the file, such as:

So clearly there is only one magic comment. "either" the first or second
line, not both.  Both, therefore, should be an error.

From the PEP 263:

    More precisely, the first or second line must match the regular
    expression "coding[:=]\s*([-\w.]+)". The first group of this
    expression is then interpreted as encoding name. If the encoding
    is unknown to Python, an error is raised during compilation. There
    must not be any Python statement on the line that contains the
    encoding declaration.

Clearly the regular expression would only match the first of multiple
cookies on the same line, so the first one should always win... but
there should only be one, from the first PEP quote "a magic comment".


Glenn


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Jonathan Goble
On Wed, Mar 16, 2016 at 2:07 AM, Chris Angelico  wrote:
> Why would you ever have two coding cookies in a file? Surely this
> should be either an error, or ill-defined (ie parsers are allowed to
> pick whichever they like, including raising)?
>
> ChrisA

+1. If multiple coding cookies are found, and all do not agree, I
would expect an error to be raised. That it apparently does not raise
an error currently is surprising to me.

(If multiple coding cookies are found but do agree, perhaps raising a
warning would be a good idea.)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Chris Angelico
On Wed, Mar 16, 2016 at 5:03 PM, Serhiy Storchaka  wrote:
> On 15.03.16 22:30, Guido van Rossum wrote:
>>
>> I came across a file that had two different coding cookies -- one on
>> the first line and one on the second. CPython uses the first, but mypy
>> happens to use the second. I couldn't find anything in the spec or
>> docs ruling out the second interpretation. Does anyone have a
>> suggestion (apart from following CPython)?
>>
>> Reference: https://github.com/python/mypy/issues/1281
>
>
> There is a similar question. If a file has two different coding cookies on
> the same line, which should win? Currently the last cookie wins, in the
> CPython parser, in the tokenize module, in IDLE, and in a number of other
> places. I think this is a bug.

Why would you ever have two coding cookies in a file? Surely this
should be either an error, or ill-defined (ie parsers are allowed to
pick whichever they like, including raising)?

ChrisA


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-16 Thread Serhiy Storchaka

On 15.03.16 22:30, Guido van Rossum wrote:
> I came across a file that had two different coding cookies -- one on
> the first line and one on the second. CPython uses the first, but mypy
> happens to use the second. I couldn't find anything in the spec or
> docs ruling out the second interpretation. Does anyone have a
> suggestion (apart from following CPython)?
>
> Reference: https://github.com/python/mypy/issues/1281

There is a similar question. If a file has two different coding cookies on
the same line, which should win? Currently the last cookie wins, in the
CPython parser, in the tokenize module, in IDLE, and in a number of other
places. I think this is a bug.





Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread Ben Finney
Guido van Rossum  writes:

> I agree that the spirit of the PEP is to stop at the first coding
> cookie found. Would it be okay if I updated the PEP to clarify this?
> I'll definitely also update the docs.

+1, it never occurred to me that the specification could mean otherwise.
On reflection I can't see a good reason for it to mean otherwise.

-- 
 \ “Alternative explanations are always welcome in science, if |
  `\   they are better and explain more. Alternative explanations that |
_o__) explain nothing are not welcome.” —Victor J. Stenger, 2001-11-05 |
Ben Finney



Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread Guido van Rossum
I agree that the spirit of the PEP is to stop at the first coding
cookie found. Would it be okay if I updated the PEP to clarify this?
I'll definitely also update the docs.

On Tue, Mar 15, 2016 at 2:04 PM, Brett Cannon  wrote:
>
>
> On Tue, 15 Mar 2016 at 13:31 Guido van Rossum  wrote:
>>
>> I came across a file that had two different coding cookies -- one on
>> the first line and one on the second. CPython uses the first, but mypy
>> happens to use the second. I couldn't find anything in the spec or
>> docs ruling out the second interpretation. Does anyone have a
>> suggestion (apart from following CPython)?
>>
>> Reference: https://github.com/python/mypy/issues/1281
>
>
> I think the spirit of PEP 263 is for the first specified encoding to win as
> the support of two lines is to support shebangs and not multiple encodings
> :) . I also think the fact that tokenize.detect_encoding() doesn't
> automatically read two lines from its input also suggests the intent is
> "first encoding wins" (and that is the semantics of the function).



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread MRAB

On 2016-03-15 20:53, MRAB wrote:
> On 2016-03-15 20:30, Guido van Rossum wrote:
>> I came across a file that had two different coding cookies -- one on
>> the first line and one on the second. CPython uses the first, but mypy
>> happens to use the second. I couldn't find anything in the spec or
>> docs ruling out the second interpretation. Does anyone have a
>> suggestion (apart from following CPython)?
>>
>> Reference: https://github.com/python/mypy/issues/1281
>
> I think it should follow CPython.
>
> As I see it, CPython allows it to be on the second line because the
> first line might be needed for the shebang.
>
> If the first two lines both had an encoding, and then you inserted a
> shebang line, the second one would be ignored anyway.

A further thought: is mypy just assuming that the first line contains
the shebang?

If there's only one encoding line, and it's the first line, does mypy
still get it right?




Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread Jon Ribbens
On Tue, Mar 15, 2016 at 01:30:08PM -0700, Guido van Rossum wrote:
> I came across a file that had two different coding cookies -- one on
> the first line and one on the second. CPython uses the first, but mypy
> happens to use the second. I couldn't find anything in the spec or
> docs ruling out the second interpretation. Does anyone have a
> suggestion (apart from following CPython)?
> 
> Reference: https://github.com/python/mypy/issues/1281

If it helps, what 'vim' appears to do is to read the first 'n' lines
in order and then last 'n' lines in reverse order, stopping if the
second stage reaches a line already processed by the first stage.
So with 'modelines=5', the following file:

  /* vim: set ts=1: */
  /* vim: set ts=2: */
  /* vim: set ts=3: */
  /* vim: set ts=4: */
  /* vim: set sw=5 ts=5: */
  /* vim: set ts=6: */
  /* vim: set ts=7: */
  /* vim: set ts=8: */

sets sw=5 and ts=6.

Obviously CPython shouldn't be going through all that palaver!
But it would be a bit more vim-like to use the second line rather than
the first if both lines have the cookie.

Take that as you will - I'm not saying being 'vim-like' is an inherent
virtue ;-)


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread Brett Cannon
On Tue, 15 Mar 2016 at 13:31 Guido van Rossum  wrote:

> I came across a file that had two different coding cookies -- one on
> the first line and one on the second. CPython uses the first, but mypy
> happens to use the second. I couldn't find anything in the spec or
> docs ruling out the second interpretation. Does anyone have a
> suggestion (apart from following CPython)?
>
> Reference: https://github.com/python/mypy/issues/1281


I think the spirit of PEP 263 is for the first specified encoding to win as
the support of two lines is to support shebangs and not multiple encodings
:) . I also think the fact that tokenize.detect_encoding() doesn't
automatically read two lines from its input also suggests the intent is
"first encoding wins" (and that is the semantics of the function).
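
A short sketch of that "first encoding wins" behaviour, assuming Python 3's tokenize module. The two-cookie input is made up to mirror the file described at the start of the thread:

```python
import io
import tokenize

# Two competing cookies on the first two lines.
data = b"# coding: latin-1\n# coding: utf-8\nx = 1\n"

# detect_encoding() returns the encoding plus the raw lines it had to read.
encoding, consumed = tokenize.detect_encoding(io.BytesIO(data).readline)
```

The function stops after the first line here, so the second cookie is never looked at. Note that the name comes back canonicalized as "iso-8859-1" rather than "latin-1".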


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-15 Thread MRAB

On 2016-03-15 20:30, Guido van Rossum wrote:
> I came across a file that had two different coding cookies -- one on
> the first line and one on the second. CPython uses the first, but mypy
> happens to use the second. I couldn't find anything in the spec or
> docs ruling out the second interpretation. Does anyone have a
> suggestion (apart from following CPython)?
>
> Reference: https://github.com/python/mypy/issues/1281

I think it should follow CPython.

As I see it, CPython allows it to be on the second line because the
first line might be needed for the shebang.

If the first two lines both had an encoding, and then you inserted a
shebang line, the second one would be ignored anyway.
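
A minimal sketch of the shebang case, using tokenize.detect_encoding() as a stand-in for CPython's own parser (an assumption; the thread describes it as the closest imitation). The helper name detect and the sample sources are illustrative:

```python
import io
import tokenize


def detect(data):
    """Illustrative helper: return only the encoding name."""
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


# Shebang on line 1, cookie on line 2: the cookie is honoured.
with_shebang = detect(b"#!/usr/bin/env python\n# coding: latin-1\nx = 1\n")

# Shebang inserted above two cookies: the one now on line 3 is out of range.
shebang_two_cookies = detect(
    b"#!/usr/bin/env python\n# coding: latin-1\n# coding: utf-8\nx = 1\n"
)
```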

