Re: [Python-Dev] What does a double coding cookie mean?

2016-03-18 Thread Serhiy Storchaka

On 17.03.16 16:55, Guido van Rossum wrote:

> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).


The simplest way to detect the encoding of a bytes string:

import tokenize
lines = data.splitlines()
encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all the data into lines, the most efficient way in
Python 3.5 is:

import io
encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5 io.BytesIO(data) has constant complexity.

In older versions, to detect the encoding without copying the data or
splitting all of it into lines, you should write a line iterator. For example:


def iterlines(data):
    start = 0
    while True:
        end = data.find(b'\n', start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]

encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or

import re
it = (m.group() for m in re.finditer(b'.*\n?', data))
encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
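
For readers who want to try the variants side by side, the following is a minimal, self-contained sketch. The sample `data` value is a made-up input, not something from the thread; the four calls are the approaches described above, and each should report the same encoding.

```python
import io
import re
import tokenize

# Hypothetical sample input: a shebang line followed by a coding cookie.
data = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('hi')\n"


def iterlines(data):
    """Yield the lines of a bytes object lazily, one slice at a time."""
    start = 0
    while True:
        end = data.find(b"\n", start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]


# detect_encoding() returns (encoding, lines_read); [0] selects the encoding.
via_splitlines = tokenize.detect_encoding(iter(data.splitlines()).__next__)[0]
via_bytesio = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
via_iterlines = tokenize.detect_encoding(iterlines(data).__next__)[0]
via_regex = tokenize.detect_encoding(
    (m.group() for m in re.finditer(b".*\n?", data)).__next__
)[0]

print(via_splitlines, via_bytesio, via_iterlines, via_regex)
```

Note that detect_encoding() normalizes some spellings, so the name it returns may differ from the text of the cookie itself.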


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] What does a double coding cookie mean?

2016-03-18 Thread Guido van Rossum
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg wrote:
> The only reason to read up to two lines was to address the use of
> the shebang on Unix, not to be able to define two competing
> source code encodings :-)

I know. I was just surprised that the PEP was sufficiently vague about
it that when I found that mypy picked the second if there were two, I
couldn't prove to myself that it was violating the PEP. I'd rather
clarify the PEP than rely on the reasoning presented earlier here.

I don't like erroring out when there are two different cookies on two
lines; I feel that the spirit of the PEP is to read up to two lines
until a cookie is found, whichever comes first.

I will update the regex in the PEP too (or change the wording to avoid "match").

I'm not sure what to do if there are two cookies on one line. If
CPython currently picks the latter we may want to preserve that
behavior.

Should we recommend that everyone use tokenize.detect_encoding()?

-- 
--Guido van Rossum (python.org/~guido)
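
The "read up to two lines until a cookie is found, whichever comes first" reading can be sketched with a small per-line helper of the kind wished for earlier in the thread. This is only an illustration under stated assumptions: the names `cookie_in_line` and `source_encoding` are hypothetical, the pattern is modeled on the cookie regex in the tokenize module, and BOM handling and codec validation (which tokenize.detect_encoding() performs) are left out.

```python
import re

# Assumption: pattern modeled on the cookie regex used by the tokenize module.
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")
# A blank or comment-only line; only such a first line lets line two count.
BLANK_RE = re.compile(rb"^[ \t\f]*(?:[#\r\n]|$)")


def cookie_in_line(line):
    """Return the encoding named in one line, or None (hypothetical helper)."""
    match = COOKIE_RE.match(line)
    return match.group(1).decode("ascii") if match else None


def first_two_lines(data):
    """Slice out the first two lines without splitting the rest of the data."""
    end1 = data.find(b"\n") + 1 or len(data)
    end2 = data.find(b"\n", end1) + 1 or len(data)
    return data[:end1], data[end1:end2]


def source_encoding(data, default="utf-8"):
    """Check at most two lines and return the first cookie found."""
    first, second = first_two_lines(data)
    encoding = cookie_in_line(first)
    if encoding is None and BLANK_RE.match(first):
        encoding = cookie_in_line(second)
    return encoding or default


# Two different cookies on two lines: the first one wins, no error is raised.
print(source_encoding(b"# coding: latin-1\n# coding: utf-8\n"))
```

With the non-greedy `.*?`, this sketch also takes the first cookie when two appear on one line; that is a property of this pattern, not a claim about what the CPython tokenizer does.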


Re: [Python-Dev] PEP 515: Underscores in Numeric Literals (revision 3)

2016-03-18 Thread Guido van Rossum
I'm happy to accept this PEP as it stands, assuming the authors are
ready for this news. I recommend also implementing the option from
footnote [11] (extend the number-to-string formatting language to
allow ``_`` as a thousands separator).
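
For illustration, this is roughly what the formatting option looks like in use. The sketch assumes Python 3.6 or later, where the format specification mini-language accepts `_` as a grouping option; the variable names and values are made up for the example.

```python
# Requires Python 3.6 or later.
amount = 10_000_000

decimal_text = format(amount, "_")    # decimal digits grouped by thousands
hex_text = format(0xDEADBEEF, "_x")   # hex digits grouped in fours
float_text = f"{1234567.5:_.1f}"      # grouping applies to the integer part

# The from-string constructors accept the same separators, so this round-trips.
round_trip = int(decimal_text)

print(decimal_text, hex_text, float_text, round_trip)
```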

On Thu, Mar 17, 2016 at 11:19 AM, Brett Cannon wrote:
> Where did this PEP leave off? Anything blocking its acceptance?
>
> On Sat, 13 Feb 2016 at 00:49 Georg Brandl wrote:
>>
>> Hi all,
>>
>> after talking to Guido and Serhiy we present the next revision
>> of this PEP.  It is a compromise that we are all happy with,
>> and a relatively restricted rule that makes additions to PEP 8
>> basically unnecessary.
>>
>> I think the discussion has shown that supporting underscores in
>> the from-string constructors is valuable, therefore this is now
>> added to the specification section.
>>
>> The remaining open question is about the reverse direction: do
>> we want a string formatting modifier that adds underscores as
>> thousands separators?
>>
>> cheers,
>> Georg
>>
>> -
>>
>> PEP: 515
>> Title: Underscores in Numeric Literals
>> Version: $Revision$
>> Last-Modified: $Date$
>> Author: Georg Brandl, Serhiy Storchaka
>> Status: Draft
>> Type: Standards Track
>> Content-Type: text/x-rst
>> Created: 10-Feb-2016
>> Python-Version: 3.6
>> Post-History: 10-Feb-2016, 11-Feb-2016
>>
>> Abstract and Rationale
>> ======================
>>
>> This PEP proposes to extend Python's syntax and number-from-string
>> constructors so that underscores can be used as visual separators for
>> digit grouping purposes in integral, floating-point and complex number
>> literals.
>>
>> This is a common feature of other modern languages, and can aid
>> readability of long literals, or literals whose value should clearly
>> separate into parts, such as bytes or words in hexadecimal notation.
>>
>> Examples::
>>
>>     # grouping decimal numbers by thousands
>>     amount = 10_000_000.0
>>
>>     # grouping hexadecimal addresses by words
>>     addr = 0xDEAD_BEEF
>>
>>     # grouping bits into nibbles in a binary literal
>>     flags = 0b_0011_1111_0100_1110
>>
>>     # same, for string conversions
>>     flags = int('0b_1111_0000', 2)
>>
>>
>> Specification
>> =============
>>
>> The current proposal is to allow one underscore between digits, and
>> after base specifiers in numeric literals.  The underscores have no
>> semantic meaning, and literals are parsed as if the underscores were
>> absent.
>>
>> Literal Grammar
>> ---------------
>>
>> The production list for integer literals would therefore look like
>> this::
>>
>>    integer: decinteger | bininteger | octinteger | hexinteger
>>    decinteger: nonzerodigit (["_"] digit)* | "0" (["_"] "0")*
>>    bininteger: "0" ("b" | "B") (["_"] bindigit)+
>>    octinteger: "0" ("o" | "O") (["_"] octdigit)+
>>    hexinteger: "0" ("x" | "X") (["_"] hexdigit)+
>>    nonzerodigit: "1"..."9"
>>    digit: "0"..."9"
>>    bindigit: "0" | "1"
>>    octdigit: "0"..."7"
>>    hexdigit: digit | "a"..."f" | "A"..."F"
>>
>> For floating-point and complex literals::
>>
>>    floatnumber: pointfloat | exponentfloat
>>    pointfloat: [digitpart] fraction | digitpart "."
>>    exponentfloat: (digitpart | pointfloat) exponent
>>    digitpart: digit (["_"] digit)*
>>    fraction: "." digitpart
>>    exponent: ("e" | "E") ["+" | "-"] digitpart
>>    imagnumber: (floatnumber | digitpart) ("j" | "J")
>>
>> Constructors
>> ------------
>>
>> Following the same rules for placement, underscores will be allowed in
>> the following constructors:
>>
>> - ``int()`` (with any base)
>> - ``float()``
>> - ``complex()``
>> - ``Decimal()``
>>
>>
>> Prior Art
>> =========
>>
>> Those languages that do allow underscore grouping implement a large
>> variety of rules for allowed placement of underscores.  In cases where
>> the language spec contradicts the actual behavior, the actual behavior
>> is listed.  ("single" or "multiple" refer to allowing runs of
>> consecutive underscores.)
>>
>> * Ada: single, only between digits [8]_
>> * C# (open proposal for 7.0): multiple, only between digits [6]_
>> * C++14: single, between digits (different separator chosen) [1]_
>> * D: multiple, anywhere, including trailing [2]_
>> * Java: multiple, only between digits [7]_
>> * Julia: single, only between digits (but not in float exponent parts)
>>   [9]_
>> * Perl 5: multiple, basically anywhere, although docs say it's
>>   restricted to one underscore between digits [3]_
>> * Ruby: single, only between digits (although docs say "anywhere")
>>   [10]_
>> * Rust: multiple, anywhere, except for between exponent "e" and digits
>>   [4]_
>> * Swift: multiple, between digits and trailing (although textual
>>   description says only "between digits") [5]_
>>
>>
>> Alternative Syntax
>> ==================
>>
>> Underscore Placement Rules
>> --------------------------
>>
>> Instead of the relatively strict rule specified above, the use of
>> undersco