Re: [Python-Dev] PEP 461 updates

2014-01-21 Thread Chris Barker
On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:

  so long as numpy.loadtxt is explicitly documented as only working with
  latin-1 encoded files (it currently isn't), there's no problem.

 Actually there is a problem. If it explicitly specified the encoding as
 latin-1 when opening the file then it could document the fact that it
 works for latin-1 encoded files. However it actually uses the system
 default encoding to read the file


which is a really bad default -- oh well. Also, I don't think it was a
choice, at least not a well thought out one, but rather what fell out of
trying to make it just work on py3.

and then converts the strings to
 bytes with the as_bytes function that is hard-coded to use latin-1:
 https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

 So it only works if the system default encoding is latin-1 and the
 file content is white-space and newline compatible with latin-1.
 Regardless of whether the file itself is in utf-8 or latin-1 it will
 only work if the system default encoding is latin-1. I've never used a
 system that had latin-1 as the default encoding (unless you count
 cp1252 as latin-1).


even if it were a common default it would be a bad idea. Fortunately (?),
since it really is broken, we can fix it without being too constrained by
backwards compatibility.


  If it's supposed to work with other encodings (but the entire file is
  still required to use a consistent encoding), then it just needs
  encoding and errors arguments to fit the Python 3 text model (with
  latin-1 documented as the default encoding).

 This is the right solution. Have an encoding argument, document the
 fact that it will use the system default encoding if none is
 specified, and re-encode using the same encoding to fit any dtype='S'
 bytes column. This will then work for any encoding including the ones
 that aren't ASCII-compatible (e.g. utf-16).


Exactly, except I don't think the system encoding as a default is a good
choice. If there is a default, MOST people will use it. And it will work for
a lot of their test code. Then it will break if the code is passed to a
system with a different default encoding, or a file comes from another
source in a different encoding. This is very, very likely -- far
more likely than files consistently being in the system encoding.


  default behaviour, since passing something like
  codecs.getdecoder('utf-8') as a column converter should do the right
  thing.


that seems to work at the moment, actually, if done with care.

That's just getting silly IMO. If the file uses mixed encodings then I
 don't consider it a valid text file and see no reason for loadtxt to
 support reading it.


agreed -- that's just getting crazy -- the only use-case I can imagine is to
clean up a file that got mojibake'd by some other process -- not really the
use case for loadtxt and friends.

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 461 updates

2014-01-19 Thread Oscar Benjamin
On 19 January 2014 06:19, Nick Coghlan ncogh...@gmail.com wrote:

 While I agree it's not relevant to the PEP 460/461 discussions, so
 long as numpy.loadtxt is explicitly documented as only working with
 latin-1 encoded files (it currently isn't), there's no problem.

Actually there is a problem. If it explicitly specified the encoding as
latin-1 when opening the file then it could document the fact that it
works for latin-1 encoded files. However it actually uses the system
default encoding to read the file and then converts the strings to
bytes with the as_bytes function that is hard-coded to use latin-1:
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the
file content is white-space and newline compatible with latin-1.
Regardless of whether the file itself is in utf-8 or latin-1 it will
only work if the system default encoding is latin-1. I've never used a
system that had latin-1 as the default encoding (unless you count
cp1252 as latin-1).

 If it's supposed to work with other encodings (but the entire file is
 still required to use a consistent encoding), then it just needs
 encoding and errors arguments to fit the Python 3 text model (with
 latin-1 documented as the default encoding).

This is the right solution. Have an encoding argument, document the
fact that it will use the system default encoding if none is
specified, and re-encode using the same encoding to fit any dtype='S'
bytes column. This will then work for any encoding including the ones
that aren't ASCII-compatible (e.g. utf-16).
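A minimal sketch of what Oscar proposes (the function name and signature here are illustrative, not numpy's actual API): one encoding parameter governs both the decode on read and the re-encode into a bytes column, so even a non-ASCII-compatible codec like utf-16 round-trips cleanly.

```python
import io

# Illustrative sketch only -- not numpy's real loadtxt signature.
# One `encoding` parameter is used both to read the file and to
# re-encode values destined for a dtype='S' style bytes column.
def load_bytes_column(stream, encoding):
    rows = []
    for line in io.TextIOWrapper(stream, encoding=encoding):
        # re-encode with the *same* codec, so the bytes round-trip
        rows.append(line.strip().encode(encoding))
    return rows

raw = 'spam\n\u20ac\n'.encode('utf-16-le')  # utf-16 is not ASCII-compatible
cells = load_bytes_column(io.BytesIO(raw), encoding='utf-16-le')
assert cells == ['spam'.encode('utf-16-le'), '\u20ac'.encode('utf-16-le')]
```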

Then instead of having a compat module with an as_bytes helper to get
rid of all the unicode strings on Python 3, you can have a compat
module with an open_unicode helper to do the right thing on Python 2.
The as_bytes function is just a way of fighting the Python 3 text
model: "I don't care about mojibake, just do whatever it takes to shut
up the interpreter and its error messages and make sure it works for
ASCII data."
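The failure mode Oscar describes can be reproduced without numpy at all (utf-8 stands in for the system default encoding here, purely for illustration):

```python
# The file's actual bytes: '£ 10' encoded as utf-8.
data = '\u00a3 10'.encode('utf-8')   # b'\xc2\xa3 10'
# Step 1: the file is read with the system default (say, utf-8).
text = data.decode('utf-8')
# Step 2: as_bytes re-encodes with hard-coded latin-1 -- the result no
# longer matches the bytes that were in the file: silent corruption.
assert text.encode('latin-1') == b'\xa3 10'
assert text.encode('latin-1') != data
# And for characters outside latin-1 the failure is value-dependent:
try:
    '\u20ac 10'.encode('latin-1')    # euro sign, not in latin-1
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('expected UnicodeEncodeError')
```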

 If it is intended to
 allow S columns to contain text in arbitrary encodings, then that
 should also be supported by the current API with an adjustment to the
 default behaviour, since passing something like
 codecs.getdecoder('utf-8') as a column converter should do the right
 thing. However, if you're currently decoding S columns with latin-1
 *before* passing the value to the converter, then you'll need to use a
 WSGI style decoding dance instead:

 def fix_encoding(text):
     return text.encode('latin-1').decode('utf-8')  # For example

That's just getting silly IMO. If the file uses mixed encodings then I
don't consider it a valid text file and see no reason for loadtxt to
support reading it.

 That's more wasteful than just passing the raw bytes through for
 decoding, but is the simplest backwards compatible option if you're
 doing latin-1 decoding already.

 If different rows in the *same* column are allowed to have different
 encodings, then that's not a valid use of the operation (since the
 column converter has no access to the rest of the row to determine
 what encoding should be used for the decode operation).

Ditto.


Oscar


Re: [Python-Dev] PEP 461 updates

2014-01-18 Thread Oscar Benjamin
On 17 January 2014 21:37, Chris Barker chris.bar...@noaa.gov wrote:

 For the record, we've got a pretty good thread (not this good, though!) over
 on the numpy list about how to untangle the mess that has resulted from
 porting text-file-parsing code to py3 (and the underlying issue with the 'S'
 data type in numpy...)

 One note from the github issue:
 
  The use of asbytes originates only from the fact that b'%d' % (20,) does
 not work.
 

 So yeah PEP 461! (even if too late for numpy...)

The discussion about numpy.loadtxt and the 'S' dtype is not relevant
to PEP 461.  PEP 461 is about facilitating handling ascii/binary
protocols and file formats. The loadtxt function is for reading text
files. Reading text files is already handled very well in Python 3.
The only caveat is that you need to specify the encoding when you open
the file.

The loadtxt function doesn't specify the encoding when it opens the
file so on Python 3 it gets the system default encoding when reading
from the file. Since the 'S' dtype is for an array of bytes the
loadtxt function has to encode the unicode strings before storing them
in the array. The function has no idea what encoding the user wants so
it just uses latin-1, leading to mojibake if the file content and
encoding are not compatible with latin-1 (e.g. utf-8).

The loadtxt function is a classic example of how *not* to do text and
whoever made it that way probably didn't understand unicode and the
Python 3 text model. If they did understand what they were doing then
they knew that they were implementing a dirty hack.

If you want to draw a relevant lesson from that thread in this one
then the lesson argues against PEP 461: adding back the bytes
formatting methods helps people who refuse to understand text
processing and continue implementing dirty hacks instead of doing it
properly.


Oscar


Re: [Python-Dev] PEP 461 updates

2014-01-18 Thread Nick Coghlan
On 19 January 2014 00:39, Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 If you want to draw a relevant lesson from that thread in this one
 then the lesson argues against PEP 461: adding back the bytes
 formatting methods helps people who refuse to understand text
 processing and continue implementing dirty hacks instead of doing it
 properly.

Yes, that's why it has taken so long to even *consider* bringing
binary interpolation support back - one of our primary concerns in the
early days of Python 3 was developers (including core developers!)
attempting to translate bad habits from Python 2 into Python 3 by
continuing to treat binary data as text. Making interpolation a purely
text domain operation helped strongly in enforcing this distinction,
as it generally required thinking about encoding issues in order to
get things into the text domain (or hitting them with the latin-1
hammer, in which case... *sigh*).

The reason PEP 460/461 came up is that we *do* acknowledge that there
is a legitimate use case for binary interpolation support when dealing
with binary formats that contain ASCII compatible segments. Now that
people have had a few years to get used to the Python 3 text model,
lowering the barrier to migration from Python 2 and better handling
that use case in Python 3 in general has finally tilted the scales in
favour of providing the feature (assuming Guido is happy with PEP 461
after Ethan finishes the Rationale section).

(Tangent)

While I agree it's not relevant to the PEP 460/461 discussions, so
long as numpy.loadtxt is explicitly documented as only working with
latin-1 encoded files (it currently isn't), there's no problem. If
it's supposed to work with other encodings (but the entire file is
still required to use a consistent encoding), then it just needs
encoding and errors arguments to fit the Python 3 text model (with
latin-1 documented as the default encoding). If it is intended to
allow S columns to contain text in arbitrary encodings, then that
should also be supported by the current API with an adjustment to the
default behaviour, since passing something like
codecs.getdecoder('utf-8') as a column converter should do the right
thing. However, if you're currently decoding S columns with latin-1
*before* passing the value to the converter, then you'll need to use a
WSGI style decoding dance instead:

def fix_encoding(text):
    return text.encode('latin-1').decode('utf-8')  # For example

That's more wasteful than just passing the raw bytes through for
decoding, but is the simplest backwards compatible option if you're
doing latin-1 decoding already.
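A quick check that the latin-1/utf-8 "decoding dance" really does recover the text when utf-8 bytes were mis-decoded as latin-1 ('£' is used as the example character):

```python
def fix_encoding(text):
    return text.encode('latin-1').decode('utf-8')  # For example

raw = '\u00a3'.encode('utf-8')     # b'\xc2\xa3' as it sits in a utf-8 file
mojibake = raw.decode('latin-1')   # the latin-1 mis-decode: 'Â£'
assert mojibake == '\u00c2\u00a3'
assert fix_encoding(mojibake) == '\u00a3'
```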

If different rows in the *same* column are allowed to have different
encodings, then that's not a valid use of the operation (since the
column converter has no access to the rest of the row to determine
what encoding should be used for the decode operation).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 461 updates

2014-01-17 Thread Stephen J. Turnbull
Steven D'Aprano writes:
  On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:

   "ASCII compatible" is a technical term in encodings, which means
   bytes in the range 0-127 always have ASCII coded character semantics,
   do what you like with bytes in the range 128-255.[1]
  
  Examples, and counter-examples, may help. Let me see if I have got this 
  right: an ASCII-compatible encoding may be an ASCII-superset like 
  Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars 
  are encoded to the same bytes as ASCII, and non-ASCII chars are not. A 
  counter-example would be UTF-16, or some of the Asian encodings like 
  Big5. Am I right so far?

All correct.
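Those definitions are easy to check interactively (a quick sanity check using the codecs named above):

```python
# ASCII-compatible encodings map ASCII characters to their ASCII bytes:
assert 'T'.encode('latin-1') == b'T'
assert 'T'.encode('utf-8') == b'T'
# utf-16 is the counter-example: even pure ASCII gains extra bytes.
assert 'T'.encode('utf-16-le') == b'T\x00'
```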

  But Nick isn't talking about an encoding, he's talking about a data 
  format. I think that an "ASCII-compatible format" means one where (in at 
  least *some* parts of the data) bytes between 0 and 127 have the same 
  meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII 
  character "T". This doesn't mean that every byte 84 means "T", only that 
  some of them do -- hopefully well-defined sections of the data. Below, 
  you introduce the term "ASCII segments" for these.

Yes, except that I believe Nick, as well as the file-and-wire guys,
strengthen "hopefully well-defined" to just "well-defined".

   specified bytes methods are designed for use *only* on bytes
   that are ASCII segments; use on other data is likely to cause
   hard-to-diagnose corruption.
  
  An example: if you have the byte b'\x63', calling upper() on that will 
  return b'\x43'. That is only meaningful if the byte is intended as the 
  ASCII character "c".

Good example.
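And the flip side: bytes.upper() only touches the ASCII range, so on non-ASCII data it silently does nothing rather than raising (b'\xe9' below is latin-1 'é', an assumption for illustration):

```python
# upper() maps ASCII 'c' (0x63) to 'C' (0x43)...
assert b'\x63'.upper() == b'\x43'
# ...but bytes outside the ASCII range pass through untouched, so the
# "corruption" on non-ASCII data is a silent no-op rather than an error.
assert b'caf\xe9'.upper() == b'CAF\xe9'
```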


Re: [Python-Dev] PEP 461 updates

2014-01-17 Thread Chris Barker
For the record, we've got a pretty good thread (not this good, though!)
over on the numpy list about how to untangle the mess that has resulted
from porting text-file-parsing code to py3 (and the underlying issue with
the 'S' data type in numpy...)

One note from the github issue:

 The use of asbytes originates only from the fact that b'%d' % (20,) does
not work.


So yeah PEP 461! (even if too late for numpy...)

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] PEP 461 updates

2014-01-17 Thread Chris Barker
I hope you didn't mean to take this off-list:
On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer n...@arctrix.com wrote:

 In gmane.comp.python.devel, you wrote:
  For the record, we've got a pretty good thread (not this good, though!)
  over on the numpy list about how to untangle the mess that has resulted



 Not sure about your definition of good. ;-)


well, in the sense of "big", anyway...


  Could you summarize the main points on python-dev?  I'm not feeling up to
 wading through another massive thread but I'm quite interested to hear the
 challenges that numpy deals with.


Well, not much new to it, really. But here's a re-cap:

numpy has had an 'S' dtype for a while, which corresponded to the py2
string type (except for being fixed length). So it could auto-convert
to-from python strings... all was good and happy.

Enter py3: what to do? There is no py2 string type anymore. So it was
decided to have the 'S' dtype correspond to the py3 bytes
type. Apparently there was thought of renaming it, but the 'B' and 'b'
type identifiers were already taken, so 'S' was kept.

However, as we all know in this thread, the py3 bytes type is not the same
thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S'
type for text data -- so that is kind of broken in py3.

However, other folks use the 'S' type for binary data, so like (and rely
on) it being mapped to the py3 bytes type. So we are stuck with that.

Given the nature of numpy, and scientific data, there is talk of having a
one-byte-per-char text type in numpy (there is already a unicode type, but
it uses 4-bytes-per-char, as it's key to the numpy data model that all
objects of a given type are the same size.) This would be analogous to the
current multiple precision options for numbers. It would take up less
memory, and would not be able to hold all values. It's not clear what the
level of support is for this right now -- after all, you can do everything
you need to do with the appropriate calls to encode() and decode(), if a
bit awkward.
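The encode()/decode()-at-the-edges workaround looks like this without numpy (plain lists stand in for an 'S'-dtype array; utf-8 is an arbitrary illustrative choice):

```python
# 'S'-style storage: cells are bytes; decode explicitly at the edges.
cells = [b'spam', b'caf\xc3\xa9']           # utf-8 bytes, as stored
texts = [c.decode('utf-8') for c in cells]  # explicit decode on the way out
assert texts == ['spam', 'caf\u00e9']
# ...and encode again on the way back in:
assert [t.encode('utf-8') for t in texts] == cells
```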

Meanwhile, back at the ranch -- related, but separate issues
have arisen with the functions that parse text files: numpy.loadtxt and
numpy.genfromtxt. These functions were adapted for py3 just enough to get
things to mostly work, but have some serious limitations when doing
anything with unicode -- and in fact do some weird things with plain ascii
text files if you ask them to create unicode objects, and that is a natural
thing to do (and the right thing to do in the Py3 text model) if you do
something like:

arr = loadtxt('a_file_name', dtype=str)

on py3, a str is a py3 unicode string, so you get the numpy 'U' datatype,
but loadtxt wasn't designed to deal with that, so you can get stuff like:

[b'C:\\Users\\Documents\\Project\\mytextfile1.txt'
 b'C:\\Users\\Documents\\Project\\mytextfile2.txt'
 b'C:\\Users\\Documents\\Project\\mytextfile3.txt']

This was (presumably -- I haven't debugged the code) due to conversion from
bytes to unicode... (I'm still confused about the extra slashes.)

And this is ascii text -- it gets worse if there is non-ascii text in there.

Anyway, the truth is, this stuff is hard, but it will get at least a touch
easier with PEP 461.

[though to be truthful, I'm not sure why someone put a comment in the issue
tracker about b'%d' % some_num being an issue ... I'm not sure how it comes up
when we're going from text to numbers, not the other way around...]

-Chris
-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] PEP 461 updates

2014-01-17 Thread Eric V. Smith
On 1/17/2014 4:37 PM, Chris Barker wrote:
 For the record, we've got a pretty good thread (not this good, though!)
 over on the numpy list about how to untangle the mess that has resulted
 from porting text-file-parsing code to py3 (and the underlying issue
 with the 'S' data type in numpy...)
 
 One note from the github issue:
 
  The use of asbytes originates only from the fact that b'%d' % (20,)
 does not work.
 
 
 So yeah PEP 461! (even if too late for numpy...)

Would they use (u'%d' % (20,)).encode('ascii') for that? Just curious
as to what they're planning on doing.
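With hindsight, both spellings Eric contrasts produce the same bytes; the second became available once PEP 461 landed in Python 3.5:

```python
# The pre-PEP-461 workaround Eric suggests:
assert ('%d' % (20,)).encode('ascii') == b'20'
# Direct bytes interpolation, available since Python 3.5 (PEP 461):
assert b'%d' % (20,) == b'20'
```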

Eric.



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Nick Coghlan
On 16 Jan 2014 11:45, Carl Meyer c...@oddbird.net wrote:

 Hi Ethan,

 I haven't chimed into this discussion, but the direction it's headed
 recently seems right to me. Thanks for putting together a PEP. Some
 comments on it:

 On 01/15/2014 05:13 PM, Ethan Furman wrote:
  
  Abstract
  
 
  This PEP proposes adding the % and {} formatting operations from str to
  bytes [1].

 I think the PEP could really use a rationale section summarizing _why_
 these formatting operations are being added to bytes; namely that they
 are useful when working with various ASCIIish-but-not-properly-text
 network protocols and file formats, and in particular when porting code
 dealing with such formats/protocols from Python 2.

 Also I think it would be useful to have a section summarizing the
 primary objections that have been raised, and why those objections have
 been overruled (presuming the PEP is accepted). For instance: the main
 objection, AIUI, has been that the bytes type is for pure bytes-handling
 with no assumptions about encoding, and thus we should not add features
 to it that assume ASCIIness, and that may be attractive nuisances for
 people writing bytes-handling code that should not assume ASCIIness but
 will once they use the feature.

Close, but not quite - the concern was that this was a feature that didn't
*inherently* imply a restriction to ASCII compatible data, but only did so
when the numeric formatting codes were used. This made it a source of value
dependent compatibility errors based on the format string, akin to the kind
of value dependent errors seen when implicitly encoding arbitrary text as
ASCII.

Guido's successful counter was to point out that the parsing of the format
string itself assumes ASCII compatible data, thus placing at least the
mod-formatting operation in the same category as the currently existing
valid-for-sufficiently-ASCII-compatible-data only operations.

Current discussions suggest to me that the argument against implicit
encoding operations that introduce latent data driven defects may still
apply to bytes.format though, so I've reverted to being -1 on that.

Cheers,
Nick.

And the refutation: that the bytes type
 already provides some operations that assume ASCIIness, and these new
 formatting features are no more of an attractive nuisance than those;
 since the syntax of the formatting mini-languages itself
 assumes ASCIIness, there is not likely to be any temptation to use it
 with binary data that cannot be treated as ASCII.

 Although it can be hard to arrive at accurate and agreed-on summaries of
 the discussion, recording such summaries in the PEP is important; it may
 help save our future selves and colleagues from having to revisit all
 these same discussions and megathreads.

  Overriding Principles
  =====================
 
  In order to avoid the problems of auto-conversion and value-generated
  exceptions,
  all object checking will be done via isinstance, not by values contained
  in a
  Unicode representation.  In other words::
 
- duck-typing to allow/reject entry into a byte-stream
- no value generated errors

 This seems self-contradictory; isinstance is type-checking, which is
 the opposite of duck-typing. A duck-typing implementation would not use
 isinstance, it would call / check for the existence of a certain magic
 method instead.

 I think it might also be good to expand (very) slightly on what the
 problems of auto-conversion and value-generated exceptions are; that
 is, that the benefit of Python 3's model is that encoding is explicit,
 not implicit, making it harder to unwittingly write code that works as
 long as all data is ASCII, but fails as soon as someone feeds in
 non-ASCII text data.

 Not everyone who reads this PEP will be steeped in years of discussion
 about the relative merits of the Python 2 vs 3 models; it doesn't hurt
 to spell out a few assumptions.


  Proposed semantics for bytes formatting
  =======================================
 
  %-interpolation
  ---------------
 
  All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
  will be supported, and will work as they do for str, including the
  padding, justification and other related modifiers, except locale.
 
  Example::
 
  >>> b'%4x' % 10
  b'   a'
 
  %c will insert a single byte, either from an int in range(256), or from
  a bytes argument of length 1.
 
  Example:
 
  >>> b'%c' % 48
  b'0'

  >>> b'%c' % b'a'
  b'a'
 
  %s is restricted in what it will accept::
 
- input type supports Py_buffer?
  use it to collect the necessary bytes
 
- input type is something else?
  use its __bytes__ method; if there isn't one, raise an exception [2]
 
  Examples:
 
  >>> b'%s' % b'abc'
  b'abc'

  >>> b'%s' % 3.14
  Traceback (most recent call last):
  ...
  TypeError: 3.14 has no __bytes__ method
 
  >>> b'%s' % 'hello world!'
  Traceback (most recent call last):
  ...

Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Ethan Furman

On 01/16/2014 04:49 AM, Michael Urman wrote:

On Thu, Jan 16, 2014 at 1:52 AM, Ethan Furman et...@stoneleaf.us wrote:

Is this an intended exception to the overriding principle?



Hmm, thanks for spotting that.  Yes, that would be a value error if anything
over 255 is used, both currently in Py2, and for bytes in Py3.  As Carl
suggested, a little more explanation is needed in the PEP.


FYI, note that str/unicode already has another value-dependent
exception with %c. I find the message surprising, as I wasn't aware
Python had a 'char' type:


>>> '%c' % 'a'
'a'
>>> '%c' % 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: %c requires int or char


Python doesn't have a char type, it has str's of length 1... which are usually 
referred to as char's.  ;)
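The str behaviour under discussion, for reference (an int code point or a length-1 string is accepted; anything longer raises):

```python
# %c accepts an int code point or a length-1 str; longer strings raise.
assert '%c' % 97 == 'a'
assert '%c' % 'a' == 'a'
try:
    '%c' % 'abc'
except TypeError:
    pass
else:
    raise AssertionError('expected TypeError')
```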

--
~Ethan~


Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Neil Schemenauer
Carl Meyer c...@oddbird.net wrote:
 I think the PEP could really use a rationale section summarizing _why_
 these formatting operations are being added to bytes

I agree.  My attempt at re-writing the PEP is below.

 In order to avoid the problems of auto-conversion and
 value-generated exceptions, all object checking will be done via
 isinstance, not by values contained in a Unicode representation.
 In other words::
 
   - duck-typing to allow/reject entry into a byte-stream
   - no value generated errors

 This seems self-contradictory; isinstance is type-checking, which is
 the opposite of duck-typing.

Again, I agree.  We should avoid isinstance checks if possible.



Abstract
========

This PEP proposes adding %-interpolation to the bytes object.


Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the clean
separation of byte strings (i.e. the bytes object) from character
strings (i.e. the str object).  The benefit is that character
encodings must be explicitly specified and the risk of corrupting
character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose.  For example, programs that deal
with network protocols often manipulate ASCII encoded strings.  Since
the bytes type does not support string formatting, extra encoding to
and decoding from the str type is required.

For simplicity and convenience it is desirable to introduce formatting
methods to bytes that allow formatting of ASCII-encoded character
data.  This change would blur the clean separation of byte strings and
character strings.  However, it is felt that the practical benefits
outweigh the purity costs.  The implicit assumption of ASCII-encoding
would be limited to formatting methods.

One source of many problems with the Python 2 Unicode implementation is
the implicit coercion of Unicode character strings into byte strings
using the ascii codec.  If the character strings contained only ASCII
characters, all was well.  However, if a string contained a non-ASCII
character then the coercion raised an exception.

The combination of implicit coercion and value dependent failures has
proven to be a recipe for hard to debug errors.  A program may seem to
work correctly when tested (e.g. string input that happened to be ASCII
only) but later would fail, often with a traceback far from the source
of the real error.  The formatting methods for bytes() should avoid this
problem by not implicitly encoding data that might fail based on the
content of the data.

Another desirable feature is to allow arbitrary user classes to be used
as formatting operands.  Generally this is done by introducing a special
method that can be implemented by the new class.


Proposed semantics for bytes formatting
===

Special method __ascii__
------------------------

A new special method, analogous to __format__, is introduced.  This
method takes a single argument, a format specifier.  The return
value is a bytes object.  Objects that have an ASCII only
representation can implement this method to allow them to be used as
format operators.  Objects with natural byte representations should
implement __bytes__ or the Py_buffer API.


%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
will be supported, and will work as they do for str, including the
padding, justification and other related modifiers.  To avoid having to
introduce two special methods, the format specifications will be
translated to equivalent __format__ specifiers and __ascii__ method
of each argument would be called.

Example::

>>> b'%4x' % 10
b'   a'

%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1.

Example:

>>> b'%c' % 48
b'0'

>>> b'%c' % b'a'
b'a'

%s is restricted in what it will accept::

  - input type supports Py_buffer or has __bytes__?
use it to collect the necessary bytes (may contain non-ASCII
characters)

  - input type is something else?
use its __ascii__ method; if there isn't one, raise TypeError

Examples:

>>> b'%s' % b'abc'
b'abc'

>>> b'%s' % 3.14
b'3.14'

>>> b'%4s' % 12
b'  12'

>>> b'%s' % 'hello world!'
Traceback (most recent call last):
...
TypeError: 'hello world' has no __ascii__ method, perhaps you need to 
encode it?

.. note::

   Because the str type does not have a __ascii__ method, attempts to
   directly use 'a string' as a bytes interpolation value will raise an
   exception.  To use 'string' values, they must be encoded or otherwise
   transformed into a bytes sequence::

  'a string'.encode('latin-1')

Unsupported % format codes
^^^^^^^^^^^^^^^^^^^^^^^^^^

%r (which calls __repr__) is not supported


format
------

The format() method will not be implemented at this time but may be
added in a later Python release.  The __ascii__ method is designed
to make adding it later 

Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Terry Reedy

On 1/16/2014 5:11 AM, Nick Coghlan wrote:


Guido's successful counter was to point out that the parsing of the
format string itself assumes ASCII compatible data,


Did you see my explanation, which I wrote in response to one of your 
earlier posts, of why I think the "parsing of the format string itself 
assumes ASCII compatible data" statement is confused and wrong? The 
above seems to say that what I wrote is impossible, but perhaps I 
misunderstand what Guido and you mean. Among my questions are: by 
"data", do you mean interpolated objects or interpolated bytes? And: 
what restriction on 'data' do you intend by 'ASCII compatible'?


--
Terry Jan Reedy



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Guido van Rossum
On Thu, Jan 16, 2014 at 1:18 PM, Terry Reedy tjre...@udel.edu wrote:
 On 1/16/2014 5:11 AM, Nick Coghlan wrote:

 Guido's successful counter was to point out that the parsing of the
 format string itself assumes ASCII compatible data,

 Did you see my explanation, which I wrote in response to one of your earlier
 posts, of why I think the "parsing of the format string itself assumes ASCII
 compatible data" statement is confused and wrong? The above seems to
 say that what I wrote is impossible, but perhaps I misunderstand what Guido
 and you mean. Among my questions are: by "data", do you mean interpolated
 objects or interpolated bytes? And: what restriction on 'data' do you
 intend by 'ASCII compatible'?

Can you move the meta-discussion off-list? I'm getting tired of "did
you understand what I said".

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Terry Reedy

On 1/16/2014 4:59 PM, Guido van Rossum wrote:


I'm getting tired of "did you understand what I said".


I was asking whether I needed to repeat myself, but forget that.
I was also saying that while I understand 'ascii-compatible encoding', I 
do not understand the notion of 'ascii-compatible data' or statements 
based on it.



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Nick Coghlan
On 17 Jan 2014 09:36, Terry Reedy tjre...@udel.edu wrote:

 On 1/16/2014 4:59 PM, Guido van Rossum wrote:

 I'm getting tired of "did you understand what I said".


 I was asking whether I needed to repeat myself, but forget that.
 I was also saying that while I understand 'ascii-compatible encoding', I
do not understand the notion of 'ascii-compatible data' or statements based
on it.

There are plenty of data formats (like SMTP and HTTP) that are constrained
to be ASCII compatible, either globally, or locally in the parts being
manipulated by an application (such as a file header). ASCII incompatible
segments may be present, but in ways that allow the data processing to
handle them correctly. The ASCII assuming methods on bytes objects are
there to help in dealing with that kind of data.

If the binary data is just one large block in a single text encoding, it's
generally easier to just decode it to text, but multipart formats generally
don't allow that.




Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Greg

On 17/01/2014 10:18 a.m., Terry Reedy wrote:

On 1/16/2014 5:11 AM, Nick Coghlan wrote:


Guido's successful counter was to point out that the parsing of the
format string itself assumes ASCII compatible data,


Nick's initial arguments against bytes formatting were very
abstract and philosophical, along the lines that it violated
some pure mental model of text/bytes separation.

Then Guido said something that Nick took to be an equal and
opposite philosophical argument that cancelled out his original
objections, and he withdrew them.

I don't think it matters whether the internal details of that
debate make sense to the rest of us. The main thing is that
a consensus seems to have been reached on bytes formatting
being basically a good thing.

--
Greg



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Ethan Furman

On 01/16/2014 05:32 PM, Greg wrote:


I don't think it matters whether the internal details of that
debate make sense to the rest of us. The main thing is that
a consensus seems to have been reached on bytes formatting
being basically a good thing.


And a good thing, too, on both counts!  :)

A few folks have suggested not implementing .format() on bytes;  I've been resistant, but then I remembered that format 
is also a function.


http://docs.python.org/3/library/functions.html?highlight=ascii#format
==
format(value[, format_spec])

Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec 
will depend on the type of the value argument, however there is a standard formatting syntax that is used by most 
built-in types: Format Specification Mini-Language.


The default format_spec is an empty string which usually gives the same 
effect as calling str(value).

A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the 
instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is 
not found or if either the format_spec or the return value are not strings.

==

Given that, I can relent on .format and just go with .__mod__ .  A low-level 
service for a low-level protocol, what?  ;)

--
~Ethan~


Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Stephen J. Turnbull
Meta enough that I'll take Guido out of the CC.

Nick Coghlan writes:

  There are plenty of data formats (like SMTP and HTTP) that are
  constrained to be ASCII compatible,

"ASCII compatible" is a technical term in encodings, which means
"bytes in the range 0-127 always have ASCII coded character semantics,
do what you like with bytes in the range 128-255."[1]

Worse, it's clearly confusing in this discussion.  Let's stop using
this term to mean

the data format has elements that are defined to contain only
bytes with ASCII coded character semantics

(which is the relevant restriction AFAICS -- I don't know of any
ASCII-compatible formats where the bytes 128-255 are used for any
purpose other than encoding non-ASCII characters).  OTOH, if it *is*
an ASCII-compatible text encoding, the semantics are dubious if the
bytes versions of many of these methods/operations are used.
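[Editor's sketch, not part of the original post: the distinction can be made concrete. Under an ASCII-compatible encoding, pure-ASCII text encodes to identical bytes, which is what lets ASCII-assuming bytes operations work at all; UTF-16 breaks that assumption even for pure-ASCII input.]

```python
text = 'Content-Length: 42'   # a pure-ASCII protocol element

# ASCII-compatible encodings agree byte-for-byte on ASCII characters.
assert text.encode('ascii') == text.encode('latin-1') == text.encode('utf-8')

# UTF-16 is not ASCII compatible: even pure-ASCII text encodes differently.
assert text.encode('utf-16-le') != text.encode('ascii')
```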

A documentation suggestion: It's easy enough to rewrite

  constrained to be ASCII compatible, either globally, or locally in
  the parts being manipulated by an application (such as a file
  header). ASCII incompatible segments may be present, but in ways
  that allow the data processing to handle them correctly.

as 

containing 'well-defined segments constrained to be (strictly)
ASCII-encoded' (aka ASCII segments).

And then you can say 

specified bytes methods are designed for use *only* on bytes
that are ASCII segments; use on other data is likely to cause
hard-to-diagnose corruption.

If there are other use cases for ASCII-compatible data formats as
defined above (not worrying about codecs, because they are a very
small minority of code-to-be-written at this point), I don't know
about them.  Does anyone?  If there are any, I'll be happy to revise.
If not, that seems to be a precise and intelligible statement of the
restrictions that is useful to the practical use cases.  And nothing
stops users who think they know what they're doing from using them in
other contexts (which can be documented if they turn out to be broadly
useful).

Footnotes: 
[1]  "ASCII coded character semantics" is of course mildly ambiguous
due to considerations like EOL conventions.  But you know what I'm
talking about.



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Stephen J. Turnbull
Greg writes:

  I don't think it matters whether the internal details of [the EIBTI
  vs. PBP] debate make sense to the rest of us. The main thing is
  that a consensus seems to have been reached on bytes formatting
  being basically a good thing.

I think some of it matters to the documentation.



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Neil Schemenauer
Greg greg.ew...@canterbury.ac.nz wrote:
 I don't think it matters whether the internal details of that
 debate make sense to the rest of us. The main thing is that
 a consensus seems to have been reached on bytes formatting
 being basically a good thing.

I've been mostly steering clear of the metaphysical and writing
code today. ;-)  An extremely rough patch has been uploaded:

http://bugs.python.org/issue20284

I have a new one almost ready that introduces __ascii__ rather than
overloading __format__.  I like it better, will upload to issue
tracker soon.

Regards,

  Neil



Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Steven D'Aprano
On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
 Meta enough that I'll take Guido out of the CC.
 
 Nick Coghlan writes:
 
   There are plenty of data formats (like SMTP and HTTP) that are
   constrained to be ASCII compatible,
 
  "ASCII compatible" is a technical term in encodings, which means
  "bytes in the range 0-127 always have ASCII coded character semantics,
  do what you like with bytes in the range 128-255."[1]

Examples, and counter-examples, may help. Let me see if I have got this 
right: an ASCII-compatible encoding may be an ASCII-superset like 
Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars 
are encoded to the same bytes as ASCII, and non-ASCII chars are not. A 
counter-example would be UTF-16, or some of the Asian encodings like 
Big5. Am I right so far?

But Nick isn't talking about an encoding, he's talking about a data 
format. I think that an ASCII-compatible format means one where (in at 
least *some* parts of the data) bytes between 0 and 127 have the same 
meaning as in ASCII, e.g. byte 84 is to be interpreted as the ASCII 
character "T". This doesn't mean that every byte 84 means "T", only that 
some of them do -- hopefully well-defined sections of the data. Below, 
you introduce the term "ASCII segments" for these.


 Worse, it's clearly confusing in this discussion.  Let's stop using
 this term to mean
 
 the data format has elements that are defined to contain only
 bytes with ASCII coded character semantics
 
 (which is the relevant restriction AFAICS -- I don't know of any
 ASCII-compatible formats where the bytes 128-255 are used for any
 purpose other than encoding non-ASCII characters).  OTOH, if it *is*
 an ASCII-compatible text encoding, the semantics are dubious if the
 bytes versions of many of these methods/operations are used.
 
 A documentation suggestion: It's easy enough to rewrite
 
   constrained to be ASCII compatible, either globally, or locally in
   the parts being manipulated by an application (such as a file
   header). ASCII incompatible segments may be present, but in ways
   that allow the data processing to handle them correctly.
 
 as 
 
 containing 'well-defined segments constrained to be (strictly)
 ASCII-encoded' (aka ASCII segments).
 
 And then you can say 
 
 specified bytes methods are designed for use *only* on bytes
 that are ASCII segments; use on other data is likely to cause
 hard-to-diagnose corruption.

An example: if you have the byte b'\x63', calling upper() on that will 
return b'\x43'. That is only meaningful if the byte is intended as the 
ASCII character "c".
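[Editor's sketch extending Steven's example: the same call on non-text bytes silently alters data, which is exactly the "hard-to-diagnose corruption" the thread warns about.]

```python
data = b'\x63'                     # 0x63 is ASCII 'c'
assert data.upper() == b'\x43'     # meaningful: 'c' -> 'C'

# The same byte embedded in non-text binary data is altered just the same;
# bytes.upper() only touches ASCII a-z and leaves other bytes alone.
blob = b'\x63\xff\x80'
assert blob.upper() == b'\x43\xff\x80'
```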


 Footnotes: 
 [1]  ASCII coded character semantics is of course mildly ambiguous
 due to considerations like EOL conventions.  But you know what I'm
 talking about.

I think I know what you're talking about, but I don't know for sure unless I 
explain it back to you.


-- 
Steven


Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Nick Coghlan
On 17 January 2014 11:51, Ethan Furman et...@stoneleaf.us wrote:
 On 01/16/2014 05:32 PM, Greg wrote:


 I don't think it matters whether the internal details of that
 debate make sense to the rest of us. The main thing is that
 a consensus seems to have been reached on bytes formatting
 being basically a good thing.


 And a good thing, too, on both counts!  :)

 A few folks have suggested not implementing .format() on bytes;  I've been
 resistant, but then I remembered that format is also a function.

 http://docs.python.org/3/library/functions.html?highlight=ascii#format
 ==
 format(value[, format_spec])

 Convert a value to a “formatted” representation, as controlled by
 format_spec. The interpretation of format_spec will depend on the type of
 the value argument, however there is a standard formatting syntax that is
 used by most built-in types: Format Specification Mini-Language.

 The default format_spec is an empty string which usually gives the same
 effect as calling str(value).

 A call to format(value, format_spec) is translated to
 type(value).__format__(format_spec) which bypasses the instance dictionary
 when searching for the value’s __format__() method. A TypeError exception is
 raised if the method is not found or if either the format_spec or the return
 value are not strings.
 ==

 Given that, I can relent on .format and just go with .__mod__ .  A low-level
 service for a low-level protocol, what?  ;)

Exactly - while I'm a fan of the new extensible formatting system and
strongly prefer it to printf-style formatting for text, it also has a
whole lot of complexity that is hard to translate to the binary
domain, including the format() builtin and __format__ methods.

Since the relevant use cases appear to be already covered adequately
by printf-style formatting, attempting to translate the flexible text
formatting system as well just becomes additional complexity we don't
need.

I like Stephen Turnbull's suggestion of using binary formats with
ASCII segments to distinguish the kind of formats we're talking about
from ASCII compatible text encodings, and I think Python 3.5 will end
up with a suite of solutions that suitably covers all use cases, just
by bringing back printf-style formatting directly to bytes:

* format(), str.format(), str.format_map(): a rich extensible text
formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may
occasionally be used as a text formatting optimisation tool (since the
inflexibility means it will likely always be marginally faster than
the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify
production of data in variable length binary formats that contain
ASCII segments
* the struct module: rich (but not extensible) formatting system for
fixed length binary formats
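[Editor's sketch of the last item in Nick's list; the header layout and format string here are invented for illustration, not taken from any real protocol.]

```python
import struct

# A fixed-length binary header: 4-byte magic, uint16 version, uint32 size.
# '>' selects big-endian standard sizes with no padding.
FMT = '>4sHI'
header = struct.pack(FMT, b'DATA', 1, 512)

assert len(header) == struct.calcsize(FMT) == 10
assert struct.unpack(FMT, header) == (b'DATA', 1, 512)
```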

In Python 2, the binary format with ASCII segments use case was
intermingled with general purpose text formatting on the str type,
which is I think the main reason it has taken us so long to convince
ourselves it is something that is genuinely worth bringing back in a
more limited form in Python 3, rather than just being something we
wanted back because we were used to having it in Python 2.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 461 updates

2014-01-16 Thread Glenn Linderman

On 1/16/2014 9:46 PM, Nick Coghlan wrote:

On 17 January 2014 11:51, Ethan Furman et...@stoneleaf.us wrote:

On 01/16/2014 05:32 PM, Greg wrote:


I don't think it matters whether the internal details of that
debate make sense to the rest of us. The main thing is that
a consensus seems to have been reached on bytes formatting
being basically a good thing.


And a good thing, too, on both counts!  :)

A few folks have suggested not implementing .format() on bytes;  I've been
resistant, but then I remembered that format is also a function.

http://docs.python.org/3/library/functions.html?highlight=ascii#format
==
format(value[, format_spec])

 Convert a value to a “formatted” representation, as controlled by
format_spec. The interpretation of format_spec will depend on the type of
the value argument, however there is a standard formatting syntax that is
used by most built-in types: Format Specification Mini-Language.

 The default format_spec is an empty string which usually gives the same
effect as calling str(value).

 A call to format(value, format_spec) is translated to
type(value).__format__(format_spec) which bypasses the instance dictionary
when searching for the value’s __format__() method. A TypeError exception is
raised if the method is not found or if either the format_spec or the return
value are not strings.
==

Given that, I can relent on .format and just go with .__mod__ .  A low-level
service for a low-level protocol, what?  ;)

Exactly - while I'm a fan of the new extensible formatting system and
strongly prefer it to printf-style formatting for text, it also has a
whole lot of complexity that is hard to translate to the binary
domain, including the format() builtin and __format__ methods.

Since the relevant use cases appear to be already covered adequately
by printf-style formatting, attempting to translate the flexible text
formatting system as well just becomes additional complexity we don't
need.

I like Stephen Turnbull's suggestion of using binary formats with
ASCII segments to distinguish the kind of formats we're talking about
from ASCII compatible text encodings,


I liked that too, and almost said so on his posting, but will say it 
here, instead.



and I think Python 3.5 will end
up with a suite of solutions that suitably covers all use cases, just
by bringing back printf-style formatting directly to bytes:

* format(), str.format(), str.format_map(): a rich extensible text
formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may
occasionally be used as a text formatting optimisation tool (since the
inflexibility means it will likely always be marginally faster than
the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify
production of data in variable length binary formats that contain
ASCII segments
* the struct module: rich (but not extensible) formatting system for
fixed length binary formats


Adding format codes with variable length could enhance the struct module 
to additional uses. C structs, on which it is modeled, often get around 
the difficulty of variable length items by defining one variable length 
item at the end, or by defining offsets in the fixed part, to variable 
length parts that follows. Such a structure cannot presently be created 
by struct alone.
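[Editor's sketch of Glenn's point: struct alone cannot express a variable-length record, but struct for the fixed part plus plain concatenation/slicing covers the common C idiom of a length field followed by a variable-length body. The record layout is invented for illustration.]

```python
import struct

# Fixed 4-byte length prefix, then a variable-length binary body.
body = b'\x00\x01\xfe\xff'
record = struct.pack('>I', len(body)) + body   # struct + concatenation

# Reading it back: unpack the fixed part, then slice out the body.
(length,) = struct.unpack_from('>I', record, 0)
assert record[4:4 + length] == body
```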



In Python 2, the binary format with ASCII segments use case was
intermingled with general purpose text formatting on the str type,
which is I think the main reason it has taken us so long to convince
ourselves it is something that is genuinely worth bringing back in a
more limited form in Python 3, rather than just being something we
wanted back because we were used to having it in Python 2.

Cheers,
Nick.





Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Carl Meyer
Hi Ethan,

I haven't chimed into this discussion, but the direction it's headed
recently seems right to me. Thanks for putting together a PEP. Some
comments on it:

On 01/15/2014 05:13 PM, Ethan Furman wrote:
 
 Abstract
 ========
 
 This PEP proposes adding the % and {} formatting operations from str to
 bytes [1].

I think the PEP could really use a rationale section summarizing _why_
these formatting operations are being added to bytes; namely that they
are useful when working with various ASCIIish-but-not-properly-text
network protocols and file formats, and in particular when porting code
dealing with such formats/protocols from Python 2.

Also I think it would be useful to have a section summarizing the
primary objections that have been raised, and why those objections have
been overruled (presuming the PEP is accepted). For instance: the main
objection, AIUI, has been that the bytes type is for pure bytes-handling
with no assumptions about encoding, and thus we should not add features
to it that assume ASCIIness, and that may be attractive nuisances for
people writing bytes-handling code that should not assume ASCIIness but
will once they use the feature. And the refutation: that the bytes type
already provides some operations that assume ASCIIness, and these new
formatting features are no more of an attractive nuisance than those;
since the syntax of the formatting mini-languages itself
assumes ASCIIness, there is not likely to be any temptation to use it
with binary data that cannot make that assumption.

Although it can be hard to arrive at accurate and agreed-on summaries of
the discussion, recording such summaries in the PEP is important; it may
help save our future selves and colleagues from having to revisit all
these same discussions and megathreads.

 Overriding Principles
 =====================
 
 In order to avoid the problems of auto-conversion and value-generated
 exceptions,
 all object checking will be done via isinstance, not by values contained
 in a
 Unicode representation.  In other words::
 
   - duck-typing to allow/reject entry into a byte-stream
   - no value generated errors

This seems self-contradictory; isinstance is type-checking, which is
the opposite of duck-typing. A duck-typing implementation would not use
isinstance, it would call / check for the existence of a certain magic
method instead.

I think it might also be good to expand (very) slightly on what the
problems of auto-conversion and value-generated exceptions are; that
is, that the benefit of Python 3's model is that encoding is explicit,
not implicit, making it harder to unwittingly write code that works as
long as all data is ASCII, but fails as soon as someone feeds in
non-ASCII text data.

Not everyone who reads this PEP will be steeped in years of discussion
about the relative merits of the Python 2 vs 3 models; it doesn't hurt
to spell out a few assumptions.


 Proposed semantics for bytes formatting
 =======================================
 
 %-interpolation
 ---------------
 
 All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
 will be supported, and will work as they do for str, including the
 padding, justification and other related modifiers, except locale.
 
 Example::
 
     >>> b'%4x' % 10
     b'   a'
 
 %c will insert a single byte, either from an int in range(256), or from
 a bytes argument of length 1.
 
 Example::
 
     >>> b'%c' % 48
     b'0'
 
     >>> b'%c' % b'a'
     b'a'
 
 %s is restricted in what it will accept::
 
   - input type supports Py_buffer?
 use it to collect the necessary bytes
 
   - input type is something else?
 use its __bytes__ method; if there isn't one, raise an exception [2]
 
 Examples::
 
     >>> b'%s' % b'abc'
     b'abc'
 
     >>> b'%s' % 3.14
     Traceback (most recent call last):
     ...
     TypeError: 3.14 has no __bytes__ method
 
     >>> b'%s' % 'hello world!'
     Traceback (most recent call last):
     ...
     TypeError: 'hello world' has no __bytes__ method, perhaps you need
     to encode it?
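[Editor's sketch of the Py_buffer branch described above, using the behaviour that shipped in Python 3.5: %s collects bytes from any object supporting the buffer protocol.]

```python
# bytearray and memoryview expose buffers, so %s accepts them directly.
assert b'%s' % (bytearray(b'abc'),) == b'abc'
assert b'%s' % (memoryview(b'xyz'),) == b'xyz'
```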
 
 .. note::
 
Because the str type does not have a __bytes__ method, attempts to
directly use 'a string' as a bytes interpolation value will raise an
exception.  To use 'string' values, they must be encoded or otherwise
transformed into a bytes sequence::
 
   'a string'.encode('latin-1')
 
 format
 ------
 
 The format mini language codes, where they correspond with the
 %-interpolation codes,
 will be used as-is, with three exceptions::
 
   - !s is not supported, as {} can mean the default for both str and
 bytes, in both
 Py2 and Py3.
   - !b is supported, and new Py3k code can use it to be explicit.
   - no other __format__ method will be called.
 
 Numeric Format Codes
 --------------------
 
 To properly handle int and float subclasses, int(), index(), and float()
 will be called on the
 objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
 
 Unsupported codes
 -----------------
 
 %r 

Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Glenn Linderman

On 1/15/2014 4:13 PM, Ethan Furman wrote:
  - no value generated errors 

...


%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1. 


what does

x = 354
b'%c' % x

produce?  Seems that construct produces a value-dependent error in both 
python 2 & 3 (although it takes a much bigger value to produce the error 
in python 3 with str %; with bytes %, the problem will be reached at 
256, just like python 2).


Is this an intended exception to the overriding principle?
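[Editor's check of the behaviour Glenn describes, against the implementation that eventually shipped in Python 3.5; the mask is one way to stay in range.]

```python
x = 354
try:
    b'%c' % x                        # value-dependent failure past 255
    raised = False
except Exception:                    # exact exception type varies by version
    raised = True
assert raised

assert b'%c' % (x & 0xff,) == b'b'   # 354 & 0xff == 98 == ord('b')
```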



Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Guido van Rossum
Surprisingly, in this case the exception is just what the doctor ordered. :-)

On Wed, Jan 15, 2014 at 6:12 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 On 1/15/2014 4:13 PM, Ethan Furman wrote:

   - no value generated errors

 ...


 %c will insert a single byte, either from an int in range(256), or from
 a bytes argument of length 1.


 what does

 x = 354
 b'%c' % x

 produce?  Seems that construct produces a value-dependent error in both
 python 2 & 3 (although it takes a much bigger value to produce the error in
 python 3 with str %; with bytes %, the problem will be reached at 256,
 just like python 2).

 Is this an intended exception to the overriding principle?






-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Greg Ewing

Glenn Linderman wrote:


x = 354
b'%c' % x

Is this an intended exception to the overriding principle?


I think it's an unavoidable one, unless we want to
introduce an integer in the range 0-255 type. But
that would just push the problem into another place,
since

   b'%c' % byte(x)

would then blow up on byte(x) if x were out of
range.

If you really want to make sure it won't crash, you
can always do

  b'%c' % (x & 0xff)

or whatever your favourite method of mangling out-
of-range ints is.

--
Greg


Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Ethan Furman

On 01/15/2014 05:17 PM, Carl Meyer wrote:


I think the PEP could really use a rationale section


It will have one before it's done.



Also I think it would be useful to have a section summarizing the
primary objections that have been raised, and why those objections have
been overruled


Excellent point.  That section will also be present.



In order to avoid the problems of auto-conversion and value-generated
exceptions,
all object checking will be done via isinstance, not by values contained
in a
Unicode representation.  In other words::

   - duck-typing to allow/reject entry into a byte-stream
   - no value generated errors


This seems self-contradictory; isinstance is type-checking, which is
the opposite of duck-typing.


Good point, I'll reword that.  It will be duck-typing.



I think it might also be good to expand (very) slightly on what the
problems of auto-conversion and value-generated exceptions are


Will do.



.. [2] TypeError, ValueError, or UnicodeEncodeError?


TypeError seems right to me. Definitely not UnicodeEncodeError - refusal
to implicitly encode is not at all the same thing as an encoding error.


That's the direction I'm leaning, too.

Thanks for your comments!

--
~Ethan~


Re: [Python-Dev] PEP 461 updates

2014-01-15 Thread Ethan Furman

On 01/15/2014 06:12 PM, Glenn Linderman wrote:

On 1/15/2014 4:13 PM, Ethan Furman wrote:


  - no value generated errors

...


%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1.


what does

x = 354
b'%c' % x

produce?  Seems that construct produces a value-dependent error in both python 2 
& 3 (although it takes a much bigger
value to produce the error in python 3 with str %; with bytes %, the problem 
will be reached at 256, just like python 2).

Is this an intended exception to the overriding principle?


Hmm, thanks for spotting that.  Yes, that would be a value error if anything over 255 is used, both currently in Py2, 
and for bytes in Py3.  As Carl suggested, a little more explanation is needed in the PEP.


--
~Ethan~