Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Mark Shannon

On 13/01/14 03:47, Guido van Rossum wrote:

On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:

On 01/12/2014 06:16 PM, Ethan Furman wrote:



If you do :

--> b'%s' % 'some text'



Ignore what I previously said.  With no encoding the result would be:

b"'some text'"

So an encoding should definitely be specified.


Yes, but the encoding is no business of %s or %. As far as the
formatting operation cares, if the argument is bytes they will be
copied literally, and if the argument is a str (or anything else) it
will call ascii() on it.


It seems to me that what people want from '%s' is:
Convert to a str then encode as ascii for non-bytes
or copy directly for bytes.

So why not replace '%s' with '%a' for the ascii case and
with '%b' for directly inserting bytes.
That way, the encoding is explicit.
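
A small sketch of how the two suggested codes behave, for reference. It assumes Python 3.5 or later, where %a and %b for bytes were eventually added; when this message was written, bytes had no % operator at all.

```python
# Assumes Python 3.5+; none of this existed for bytes at the time of the thread.

# %a: call ascii() on the argument and insert the ASCII-encoded result.
from_text = b'%a' % 'some text'    # the quotes produced by repr() are kept

# %b: insert bytes directly, with no encoding step.
from_bytes = b'%b' % b'some text'

# %b refuses a str argument instead of guessing an encoding.
try:
    b'%b' % 'some text'
    str_rejected = False
except TypeError:
    str_rejected = True
```

The point of the split is visible in the last case: text only gets in through a code that names what happens to it.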

I think it is vital that the encoding is explicit in all cases where
conversion between bytes and str occurs.

Cheers,
Mark.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread M.-A. Lemburg
On 13.01.2014 07:51, Nick Coghlan wrote:

 [Using a new asciistr type]

 The key thing that the text model change in Python 3 enabled is for us
 to use the type system to *help* with managing the complexity of
 dealing with text encodings. We've got a long way with just the two
 pure types, and no additional types that straddle the binary/text
 boundary the way the Python 2 str type did. Unlike introducing *new*
 ASCII-only operations to the bytes type, adding new types specifically
 for dealing with ASCII compatible formats (especially starting life as
 a third party library) isn't compromising the Python 3 text model,
 it's embracing it and making it work for us (which is why I've been
 suggesting that it be considered since at least 2010). The problem
 with str in Python 2 was that one type was used to represent too
 many things with serious semantic differences.
 
 The ongoing attempts to reintroduce that ambiguity to the core bytes
 type rather than exploring the creation of new types and then filing
 bugs for any interoperability issues those attempts uncover in the
 core types represents one of the worst cases of paradigm lock that I
 have ever seen :P

In theory this sounds nice, but in practice you often run into the issue
that when you pass such a str-subtype to some function that works on str,
the function doesn't return the str-subtype as its result, but instead
a new str object.

As a result, you have to keep track of which operations work
on your str-subtype alone and which convert it back to a str,
making the approach infeasible for all but the most basic
uses.
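
A minimal sketch of the problem described above. The AsciiStr class is hypothetical and exists only for illustration; it is not the asciistr type under discussion.

```python
class AsciiStr(str):
    """Hypothetical str subclass, used only to illustrate the point."""


s = AsciiStr("ethan")

# Inherited str methods and operators build plain str objects,
# so the subclass is silently lost after the first operation.
upper_type = type(s.upper())
concat_type = type(s + AsciiStr("!"))
```

Keeping the subtype would mean overriding every method that returns a string, which is the bookkeeping burden the paragraph above refers to.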

This is why we try to make the basic types as useful as possible
for everyone. It's also the main reason why subtyping 8-bit strings
and Unicode in Python 2 wasn't a popular sport :-)

Leaving aside the discussion about str and bytes, I think PEP 460
has much potential to make life easier for people dealing with binary
data: the formatting codes for the bytes format methods could
be extended to include the struct module features - with the struct
module then turning into a proxy for these new format methods (much
like we did with the string module when string methods were
introduced).
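
For context, a short sketch of what the struct module provides today, which is the kind of feature the paragraph above suggests folding into the bytes formatting codes. The format string and values are illustrative only.

```python
import struct

# Big-endian layout: unsigned 16-bit field followed by a signed 32-bit field.
packet = struct.pack('>Hi', 5, -1)

# The same format string drives decoding.
length, value = struct.unpack('>Hi', packet)
```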


BTW: There's a little known trick in Python 2 which also lets you
disable the string to Unicode coercion: all you have to do is
set the default encoding to undefined (see site.py:setencoding()).
Python 2 will then raise a UnicodeError whenever coercion would trigger.
I added that codec to experiment with this scenario in the early days
of the Unicode integration.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan
On 13 Jan 2014 17:43, Ethan Furman et...@stoneleaf.us wrote:

 On 01/12/2014 10:51 PM, Nick Coghlan wrote:


 I am a strong -1 on the more lenient proposal, as it makes binary
 interpolation in Python 3 an *unsafe operation* for ASCII incompatible
 binary formats.


 No more unsafe than calling .upper() on ASCII incompatible streams.

Right - Guido's proposal is *completely useless* for arbitrary binary data.
You can't trust it.

However, Python 3 has no equivalent binary interpolation feature that *is*
safe for arbitrary binary data, so the lenient version *will* be a bug
magnet if it is the only version of binary interpolation provided.

However, if new formatb and formatb_map methods were included in the
proposal with the current strict PEP 460 semantics, then my objections
would be reduced substantially. In that case, we'd still be providing the
new binary interpolation feature *in addition* to restoring the ASCII
compatible interpolation feature, so the latter would be less of an
attractive nuisance when writing code that needs to handle arbitrary binary
formats and can't assume ASCII compatibility.

With that approach, I'd even support the idea of implicit strict ASCII
encoding of text inputs for the ASCII compatible version.




 The existing binary operations that assume ASCII do so *inherently* -
 they're not input driven, the operation itself assumes ASCII, so if
 you're working with data that may not be ASCII compatible, you simply
 don't use them (these are operations like title(), upper(), lower(),
 the default arguments for split() and strip(), etc).


 How is this different from not using % interpolation when the byte stream
is incompatible?  It isn't.

Because I *want to use* the PEP 460 binary interpolation API, but wouldn't
be able to use Guido's more lenient proposal, as it is a bug magnet in the
presence of arbitrary binary data. Provide both APIs and my objections go
away - ASCII interpolation just becomes another way to translate between
structured and text data, while binary interpolation would be a strictly
binary only operation.
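
To make the ASCII assumption in methods like upper() concrete, here is a small sketch using UTF-16 data. The sample character is my own choice, not one from the thread.

```python
text = '\u0161'                      # LATIN SMALL LETTER S WITH CARON
data = text.encode('utf-16-le')      # two bytes; the low byte is 0x61, ASCII 'a'

# bytes.upper() knows nothing about UTF-16: it maps 0x61 to 0x41.
mangled = data.upper()

# Decoding succeeds, but yields a different character with no error raised.
result = mangled.decode('utf-16-le')
```

Nothing fails here, which is the sense in which such operations are only safe when the data is known to be ASCII compatible.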


 And what do you mean by input driven?  If the LHS is bytes, the result
is bytes, no matter what the input is.  This is not the Py2 world where you
may end up with str or unicode; you always end up with bytes if the LHS is
bytes.

The LHS may or may not be tainted with assumptions about ASCII
compatibility, which means it effectively *is* tainted with such
assumptions, which means code that needs to handle arbitrary binary data
can't use it and is left without a binary interpolation feature.

That's why *adding* formatb to Guido's more lenient proposal resolves my
objections: it provides the binary interpolation feature I want, and
maintains Python 3's clear distinction between the text domain and the
binary domain.

Cheers,
Nick.


 [snip the rest that seems to flow from these misunderstandings]

 --
 ~Ethan~



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 12:46 AM, Mark Shannon wrote:

On 13/01/14 03:47, Guido van Rossum wrote:
On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:

On 01/12/2014 06:16 PM, Ethan Furman wrote:



If you do :

--> b'%s' % 'some text'



Ignore what I previously said.  With no encoding the result would be:

b"'some text'"

So an encoding should definitely be specified.


Yes, but the encoding is no business of %s or %. As far as the
formatting operation cares, if the argument is bytes they will be
copied literally, and if the argument is a str (or anything else) it
will call ascii() on it.


It seems to me that what people want from '%s' is:
Convert to a str then encode as ascii for non-bytes
or copy directly for bytes.


Maybe. But it only takes a small tweak to the parameter to get what they 
want... a tweak that works in both Python 2.7 and Python 
3.whatever-version-gets-this.


Instead of

b"%s" % foo

they must use

b"%s" % foo.encode( explicitEncoding )

which is what they should have been doing in Python 2.7 all along, and 
if they were, they need make no change.


Oh, foo was a Python 2.7 str? Converted to Python 3.x str, by default 
conversion rules? Already in ASCII? No harm.
Oh, foo was a literal? Add b prefix, instead of the .encode("ASCII"), if 
you prefer.
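
A minimal sketch of the parameter tweak described above, assuming Python 3.5 or later (where bytes %-formatting exists). The variable names are illustrative.

```python
foo = 'some text'

# Encoding happens at the call site, so the format operation only sees bytes.
explicit = b'%s' % foo.encode('ascii')

# Non-ASCII text fails loudly at the encode step rather than inside formatting.
try:
    'caf\u00e9'.encode('ascii')
    non_ascii_accepted = True
except UnicodeEncodeError:
    non_ascii_accepted = False
```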



So why not replace '%s' with '%a' for the ascii case and
with '%b' for directly inserting bytes.


Because %a and %b don't exist in Python 2.7?


That way, the encoding is explicit.


The encoding is already explicit.  If it is bytes encoded from str, that 
transformation had an explicit encoding.  If it is "%s" % str(...), then 
there is no encoding, but rather a transformation into an ASCII 
representation of the Unicode code points, using escape sequences. Which 
isn't likely to be what they want, but see the parameter tweak above.



I think it is vital that the encoding is explicit in all cases where
conversion between bytes and str occurs.


Since it is explicit, you have no concerns in this area.


Regarding the concern about implicit use of ASCII by certain bytes 
methods and proposed interpolations, I'm curious how many standard 
encodings exist that do not have an ASCII subset. I can enumerate a 
starting list, but if there are others in actual use, I'm unaware of them.


EBCDIC
UTF-16 BE & LE
UTF-32 BE & LE

Wikipedia: "The vast majority of code pages in current use are supersets 
of ASCII <http://en.wikipedia.org/wiki/ASCII>, a 7-bit code representing 
128 control codes and printable characters."
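
A quick way to check the list above with Python's own codecs. The codec names are the ones shipped in the standard library; cp500 is used here as a representative EBCDIC code page, which is my choice rather than something named in the message.

```python
# Encodings where ASCII text keeps its ASCII byte values.
compatible = {name: 'A'.encode(name)
              for name in ('utf-8', 'latin-1', 'cp1251')}

# Encodings from the list above, where it does not.
incompatible = {name: 'A'.encode(name)
                for name in ('cp500', 'utf-16-le', 'utf-16-be',
                             'utf-32-le', 'utf-32-be')}
```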


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Mark Shannon



On 13/01/14 09:19, Glenn Linderman wrote:

On 1/13/2014 12:46 AM, Mark Shannon wrote:

On 13/01/14 03:47, Guido van Rossum wrote:

On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:

On 01/12/2014 06:16 PM, Ethan Furman wrote:



If you do :

--> b'%s' % 'some text'



Ignore what I previously said.  With no encoding the result would be:

b"'some text'"

So an encoding should definitely be specified.


Yes, but the encoding is no business of %s or %. As far as the
formatting operation cares, if the argument is bytes they will be
copied literally, and if the argument is a str (or anything else) it
will call ascii() on it.


It seems to me that what people want from '%s' is:
Convert to a str then encode as ascii for non-bytes
or copy directly for bytes.


Maybe. But it only takes a small tweak to the parameter to get what they 
want... a tweak that works in both Python 2.7 and Python 
3.whatever-version-gets-this.

Instead of

b"%s" % foo

they must use

b"%s" % foo.encode( explicitEncoding )

which is what they should have been doing in Python 2.7 all along, and if they 
were, they need make no change.

Oh, foo was a Python 2.7 str? Converted to Python 3.x str, by default 
conversion rules? Already in ASCII? No harm.
Oh, foo was a literal? Add b prefix, instead of the .encode("ASCII"), if you 
prefer.


So why not replace '%s' with '%a' for the ascii case and
with '%b' for directly inserting bytes.


Because %a and %b don't exist in Python 2.7?


I thought this was about 3.5, not 2.7 ;)
'%s' can't work in 3.5, as we must differentiate between
strings which need to be encoded and bytes which don't.




That way, the encoding is explicit.


The encoding is already explicit.  If it is bytes encoded from str, that
transformation had an explicit encoding.  If it is "%s" % str(...), then
there is no encoding, but rather a transformation into an ASCII
representation of the Unicode code points, using escape sequences.
Which isn't likely to be what they want, but see the parameter tweak above.


I think it is vital that the encoding is explicit in all cases where
conversion between bytes and str occurs.


Since it is explicit, you have no concerns in this area.


Regarding the concern about implicit use of ASCII by certain bytes methods and 
proposed interpolations, I'm curious how many standard encodings exist that do 
not have an ASCII subset. I can enumerate
a starting list, but if there are others in actual use, I'm unaware of them.

EBCDIC
UTF-16 BE & LE
UTF-32 BE & LE

Wikipedia: "The vast majority of code pages in current use are supersets of
ASCII <http://en.wikipedia.org/wiki/ASCII>, a 7-bit code representing 128
control codes and printable characters."




Re: [Python-Dev] PEP 460 reboot and a bitter fight

2014-01-13 Thread Ethan Furman

On 01/12/2014 11:15 PM, Guido van Rossum wrote:


(It's too late here to write more, but it looks like we are in for a
bitter fight. :-( )


It's already been a bitter fight.

The opponents of %-interpolation (Nick, Antoine, Turnbull, D'Aprano, et al*) all seem to be arguing basically what Nick 
said.


The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing that bytes already has an ASCII bias, already has 
ASCII string methods, that it isn't the same as the Py2 world because if you combine a bytes object with a str object 
outside of interpolation (such as b'hello' + 'world') it doesn't work, that only bytes would ever be returned, etc, etc.
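
The b'hello' + 'world' case mentioned above, as a short sketch of current Python 3 behaviour:

```python
# Mixing bytes and str outside of interpolation is refused outright.
try:
    b'hello' + 'world'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Staying within one type works, and the result keeps that type.
joined = b'hello' + b'world'
```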


With the possible exception of the question I just asked Nick,  I don't think 
we're going to get any new information.

I suppose you're used to not being able to please everybody.  :/

--
~Ethan~

* et al means everyone whose name I couldn't remember, or figure out which camp 
you were in in the wee hours of the night.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-13 Thread Stephen J. Turnbull
Ethan Furman writes:

  The part that you don't seem to acknowledge (sorry if I missed it)
  is that there are str-like methods already on bytes.

I haven't expressed myself well, but I don't much care about that.
It's what Knuth would classify as a seminumerical method.  What I do
care about is that the methods that convert other types to text
(including format) not work for bytes.  That's where I consider text
to start.

   is *exactly* the Python 2 model of text.  But you deny that the
   effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
   2's bytes/character confusion, don't you?
  
  Given that the default (and only) text type in Py3 is str, which is
  unicode, I don't think any confusion will be as severe, but I
  acknowledge that there could be some.

I fear it will be quite severe where I live, in Shift JIS/GB18030
land.  (The two most obnoxious encodings known to man, except perhaps
the syntax of Brainf!ck.)

   *My* definition is not ambiguous at all.  If this particular part
   of the byte stream is defined to contain ASCII-encoded text, then I
   can use the bytes text methods to work with it.
  
   But how is Python supposed to know that?
  
  Python doesn't need to.

... because you know it.  But the ideal of object-oriented programming
(and duck-typing) is that you shouldn't need to; the object should
know how to produce appropriate behavior itself.
 
   But under your definition, you need to make the decision, or
   explicitly code the decision, on the basis of context.
  
  Exactly so.  I even have to do that in Py2.

Even.  This is exactly where PBP and EIBTI part company, I think.
EIBTI thinks it's a bad idea to pass around bytes that are implicitly
some other type, and Python 3 *should be good enough to make that
unnecessary*.  I'm convinced, and Nick is convinced, that we can make
that true for 90% of the cases that it isn't now, if we could just
figure out what's hard about the use cases where Python 3 isn't up to
snuff yet (and figure out which use cases we need to handle to get us
up to 90%!)

PBP doesn't think it's a great idea to pass around bytes that are
implicitly some other type, but didn't mind it (or got used to it) in
Python 2, and so they're not looking at that as a problem that Python
3 can solve.  They're looking at Python 3 as the problem that prevents
them from doing what worked fine in Python 2.  I understand that point
of view, I just think we should be able to do better in Python 3, and
should give it a serious try before giving in.  Remember, "Special
cases aren't special enough to break the rules" comes *before*
"Although practicality beats purity".  Not to forget that "Explicit is
better than implicit" is second[1] on the list. ;-)

After looking at this thread, I feel that (due to misunderstandings on
both sides) purity hasn't really been tried yet.

   If that particular configuration of bytes is because it's
   ASCII-encoded text, then sure.
  
   Once again, you are advocating precisely the Python 2 model of text.
  
  Not exactly, because what I get back is bytes, which cannot
  directly be mixed with unicode (str) as it was in Py2.  I think
  this is a key difference.

You're in good company there; that was Guido's rationale for not
worrying, too.  I agree it's key (and I'm sure Nick will, on
reflection if not already).  But I worry (a lot) that it's not enough.

  This confuses me somewhat.  It's okay to use b'ethan'.upper(),
  which only makes semantic sense as ASCII-encoded text,

Not really OK.  In theory, because it doesn't require serialization/
encoding of a primitive type, it doesn't matter.  In practice, without
powerful formatting, it isn't even a major attraction.  In practice,
with powerful formatting, it adds to the attraction.

Note that regex doesn't require type conversions (matches have methods
to return positions in the target or subsequences of the target, not
values of other types), which is why I (and I suspect Nick for the
same reason) am comfortable with polymorphic regex but not with bytes
formatting.

  (Aside, I'm perfectly comfortable with ASCII-encoded text because
  if you took u'ethan'.encode('ascii') you would get b'ethan'.  If it
  was some other encoding, such as cp1251, I would call that
  particular byte stream cp1251-encoded text.

Even though "ethan" is perfectly good ASCII-encoded text (as well as
the integer 435,744,694,638 on a bigendian machine with 5-byte words),
and you have no way of knowing whether it was user data (CP1251) or a
metadata keyword (ASCII) or the US national debt in 1967 dollars
(integer) when b'ethan' shows up in a trace?
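
The three readings of the same five bytes can be checked directly. This is only a sketch of the ambiguity being described; the variable names are mine.

```python
raw = b'ethan'

# Read as text in two different encodings: both decode to the same string.
as_ascii = raw.decode('ascii')
as_cp1251 = raw.decode('cp1251')

# Read as a big-endian integer occupying one 5-byte word.
as_integer = int.from_bytes(raw, 'big')
```

Nothing in the bytes object records which of these interpretations was intended.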

  And if there were methods that worked directly on a cp1251-encoded
  byte stream I would not have any problem using them on
  cp1251-encoded text.)

I was afraid of that: all of those methods (except the case methods[2])
will work fine on a cp1251-encoded text.  And because they only know
that the string is bytes, the case methods will silently corrupt your
text as 

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan
On 13 Jan 2014 17:14, Donald Stufft don...@stufft.io wrote:


 On Jan 13, 2014, at 1:59 AM, Nick Coghlan ncogh...@gmail.com wrote:

  On 13 January 2014 16:52, Donald Stufft don...@stufft.io wrote:
 
  On Jan 13, 2014, at 12:45 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 
  So then the question is whether to proceed with 3.4, delay this feature to
  3.5, or to delay 3.4 to include this feature, both have been discussed, with
  the justification for the latter being to make 3.4 the ultimate Python 3
  porting target for recalcitrant module authors, sooner than later.
 
 
  I really hope this can make it in 3.4, needing to wait another 2 years or so
  until this is available would be a shame.
 
  Indeed, it would be a shame to have to wait. Fortunately, people don't
  even need to wait until the release of Python 3.4, they can instead
  try to help out with the asciicompat project, which aims to provide
  this functionality in Python 3.3+:
  https://github.com/jeamland/asciicompat
 
  All it takes is to let go of the idea "I wish the Python 3 bytes type
  was more like the Python 2 str type" and instead think "hmm, the
  Python 3 bytes type doesn't seem like a great fit for my use case,
  maybe I need a different type".

 It's almost a fine fit for the use case; afaict the major thing it's missing
 is an easy way to handle this last use case. I don't see how this proposal
 is any different than cases such as int(b"1"). ASCII is already special,
 giving an area that Python 3 has made things worse a better way forward
 isn't compromising the text model, it's recognizing the realities of the
 world.

The difference between this and int() is that there's no structural
ambiguity introduced in the case of int(): the output is always an integer,
regardless of the input type.

Arbitrary binary data and ASCII compatible binary data are *different
things* and the only argument in favour of modelling them with a single
type is because Python 2 did it that way.

The Python 3 text model was built on the notion of "no implicit encoding
and decoding", and Guido's more lenient proposal brings that back by
stealth: the semantics proposed for the integer codes are that they be
essentially equivalent to performing the operation in the text domain and
then encoding with ASCII.
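
The equivalence described above can be written out as a short sketch. It assumes Python 3.5 or later, where numeric codes for bytes were eventually added; the sample values are arbitrary.

```python
value = 12
ratio = 3.14159

# Formatting directly in the binary domain...
via_bytes_int = b'%d' % value
via_bytes_float = b'%5.2f' % ratio

# ...matches formatting in the text domain followed by an ASCII encode.
via_text_int = ('%d' % value).encode('ascii')
via_text_float = ('%5.2f' % ratio).encode('ascii')
```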

However, I'm OK with the idea if there are separate formatb/formatb_map
APIs that allow the encoding support to be bypassed entirely - that way,
using mod-formatting, format or format_map *is* explicit, since the only
reason to use them over formatb/formatb_map would be for the implicit ASCII
encoding support, eliminating the ambiguity.

Regards,
Nick.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Sun, 12 Jan 2014 18:11:47 -0800
Guido van Rossum gu...@python.org wrote:
 On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:
  On 01/12/2014 04:47 PM, Guido van Rossum wrote:
  %s seems the trickiest: I think with a bytes argument it should just
  insert those bytes (and the padding modifiers should work too), and
  for other types it should probably work like %a, so that it works as
  expected for numeric values, and with a string argument it will return
  the ascii()-variant of its repr(). Examples:
 
  b'%s' % 42 == b'42'
  b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
  enclosed in single quotes)
 
  I'm not sure about the quotes.  Would anyone ever actually want those in the
  byte stream?
 
 Perhaps not, but it's a hint that you should probably think about an
 encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
 it as payback time. :-)
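
A rough emulation of the %s rule proposed in the quoted text, to make the examples concrete. The helper name proposed_percent_s is hypothetical, and this models the proposal only; in Python 3.5 and later, b'%s' % 'x' raises TypeError instead.

```python
def proposed_percent_s(arg):
    """Hypothetical helper modelling the proposed %s rule for bytes."""
    if isinstance(arg, bytes):
        # bytes arguments are inserted as-is
        return arg
    # anything else goes through ascii(), i.e. an ASCII-only repr()
    return ascii(arg).encode('ascii')


number = proposed_percent_s(42)
quoted = proposed_percent_s('x')      # includes the quotes from repr()
literal = proposed_percent_s(b'x')
```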

What is the use case for embedding a quoted ASCII-encoded representation
in a byte stream?

Regards

Antoine.




Re: [Python-Dev] Trying to focus the whole bytes/str formatting discussion

2014-01-13 Thread Nick Coghlan
On 13 January 2014 08:46, Brett Cannon br...@python.org wrote:
 I don't know about the rest of you but I feel like the discussion is heading
 off the rails (if it hasn't already jumped the tracks). Let's try to bring
 this back around to something actionable which people can focus their energy
 on as the amount of developer time spent arguing could have led to several
 coded-up solutions.

 I see it as a practicality-beats-purity vs.
 explicit-is-better-than-implicit. The PBP group want bytes.format() (just
 assume I include interpolation support if you want that) to work as close to
 a drop-in replacement for current str.format() use in Python 2 to ease
 porting. The argument is that code looks cleaner and the amount of changes
 in Python 2 code being ported to Python 3 is much smaller.

 The EIBTI group are willing to support PEP 460 but beyond that don't want to
 have in Python itself anything for bytes.format() which takes in a string
 and spits out bytes. It's bytes in-bytes out and not bytes & str in-bytes
 out as the PBP group is after. The EIBTI group are arguing that letting str
 into bytes.format() and then automatically be converted to strict ASCII
 leads to conflating the text/bytes divide as well as being too magical, e.g.
 what if you actually wanted UTF-16 for your number string instead of ASCII;
 the EIBTI group **wants** to force people to make a decision. They are also
 less concerned with making users update Python 2 code to handle this as it
 already needs to be updated for other Python 3 things anyway.

 From where I'm sitting, the EIBTI group and their PEP 460 proposal from
 Antoine (and no longer Victor) are not controversial. Everyone seems to
 agree that PEP 460 **at minimum** is acceptable and should happen for Python
 3.5. The people with the uphill battle and something to prove are those
 arguing for str in-bytes out support in bytes.format(). The added features
 that the PBP group want are the ones being argued over.

 As the onus is on the PBP group to convince the EIBTI group (or Guido), I
 think the PBP group should code up a solution that does what they want and
 put it on PyPI to see what the community thinks. If the PBP group wants to
 convince the EIBTI group that str in-bytes out for bytes.format() is
 critical in getting a key group of users to start using Python 3 then I
 think that needs to be demonstrated through real-world usage by some people.

Note that I am now fine with Guido's more lenient proposal *so long
as* explicitly bytes-only formatb and formatb_map methods are also
included.

That would give us the following situation in 3.5:

Text interpolation: str.__mod__, str.format, str.format_map
ASCII compatible interpolation: bytes.__mod__, bytes.format, bytes.format_map
Arbitrary binary interpolation: bytes.formatb, bytes.formatb_map

Those are all reasonable operations for the language to support
natively, and by providing convenient access to all three, we avoid
the attractive nuisance that would be created by providing *only*
ASCII interpolation without providing strict binary interpolation
(since people would inevitably use the former when they should really
be using the latter, because interpolation is such a convenient
construct), while still addressing the interests of both groups
(people like me and Antoine that like PEP 460 as it stands, as well as
those that favour the ASCII encoding features).
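
To illustrate what a strictly binary interpolation could look like, here is a heavily simplified sketch. The formatb function below is a hypothetical stand-in written for this example, it only understands bare {} fields, and it should not be read as the semantics PEP 460 actually specifies.

```python
def formatb(template, *args):
    """Hypothetical bytes-in, bytes-out interpolation (bare {} fields only)."""
    for arg in args:
        if not isinstance(arg, (bytes, bytearray, memoryview)):
            # No implicit encoding: text and numbers are rejected.
            raise TypeError('formatb() accepts only bytes-like arguments, '
                            'got %s' % type(arg).__name__)
    pieces = template.split(b'{}')
    if len(pieces) != len(args) + 1:
        raise ValueError('number of {} fields does not match arguments')
    out = bytearray(pieces[0])
    for arg, piece in zip(args, pieces[1:]):
        out += bytes(arg)     # copied verbatim, whatever the byte values are
        out += piece
    return bytes(out)


frame = formatb(b'{} {}', b'\xff\xfe', b'abc')
```

Because nothing is ever encoded, such a function makes no assumption about whether the inserted data is ASCII compatible.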

It's only the introduction of ASCII compatible interpolation support
*without* binary interpolation support that I am adamantly opposed to
- that's the kind of attractive nuisance that leads to people
inappropriately using ASCII compatible only APIs and then discovering
that their code breaks when confronted with ASCII incompatible
encodings like UTF-16, ShiftJIS and ISO-2022.

Originally I was opposed to the idea entirely, but then Antoine wrote
the binary only version of PEP 460 and I found it to be a *very*
elegant solution that didn't compromise the Python 3 text model. As
long as this pure API remains available in some form (such as formatb
and formatb_map methods), then I'm OK with the ASCII only version
existing in parallel - at that point, it *is* analogous to all the
other existing bytes methods that assume the use of ASCII compatible
data.

** The caveat **

However, note that there were *two* significant issues that were
raised in the recent broader discussions. PEP 460 only tackles the
more tractable of the two: the fact that Twisted and Mercurial both
consider bytes.__mod__ support a blocker for switching to Python 3.
That's a useful discussion to have, but it's important for people to
realise that the mod-formatting feature is utterly irrelevant to the
concerns Armin Ronacher raised in
http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ that kicked off
this whole recent spate of interest in the topic.

Obviously, I disagree with his conclusions (and personally wish Python
2 Unicode experts would show a little more humility in trying to

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan
On 13 January 2014 17:15, Guido van Rossum gu...@python.org wrote:
 On Sun, Jan 12, 2014 at 10:59 PM, Nick Coghlan ncogh...@gmail.com wrote:
 All it takes is to let go of the idea "I wish the Python 3 bytes type
 was more like the Python 2 str type" and instead think "hmm, the
 Python 3 bytes type doesn't seem like a great fit for my use case,
 maybe I need a different type".

 Maybe you're letting your excitement about asciistr get the better of
 you? IMO we don't need more types. If you can refrain from using
 int(b), b.lower() and b += 'abc' when b isn't ASCII-encoded, why
 couldn't you also refrain from b += b'%s' % 42?

It's the fact I'd feel obliged to refrain from using *any* of the
proposed interpolation methods when dealing with arbitrary binary data
if they include the assumption of ASCII compatibility. The reason
Antoine's updates to PEP 460 earned an immediate +1 from me (even
though I was initially dubious about the PEP in general) is that it
aligns *exactly* with how I usually use the bytes type in Python 3 -
as a pure container of arbitrary binary data, without making
assumptions about whether it is ASCII compatible or not.

While I still occasionally have reservations about it, I think on
balance it's a good thing that the bytes type has as much support for
ASCII compatible data as it does, but my specific concern with your more
lenient proposal is that it takes something that I liked and would use
(the current PEP 460 API) and turns it into something I would have to
avoid because it doesn't correctly support arbitrary binary data.

 I'll suppress the urge to quote verbatim from my first message in this
 thread (about the motivation for bytes) but I'll just recommend you
 re-read it.

 (It's too late here to write more, but it looks like we are in for a
 bitter fight. :-( )

I realised my problem was specifically with providing the ASCII
compatible version *without* providing a pure binary equivalent that
*doesn't* involve making the assumption of ASCII compatibility. This
means that adding formatb and formatb_map methods with the current
semantics of format and format_map from PEP 460 would cover the use
cases I care about, and I can then happily ignore the debates about
what the semantics of the ASCII compatible version will be.

The semantics of binary interpolation could potentially even be
simplified further, since the ASCII assuming versions would be
responsible for handling the 2/3 source compatibility problem.
b"{}".formatb(other) would also provide an alternative to calling the
bytes constructor that doesn't suffer from the
unexpected-int-input-is-handled-as-a-length failure mode.
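
A minimal sketch of the failure mode mentioned above, using only the existing bytes constructor (formatb is only a proposal in this thread, so it is not used here):

```python
# Passing an int to bytes() is treated as a length, not as a value to render.
zero_filled = bytes(3)

# Rendering the number as ASCII digits has to be spelled out explicitly.
ascii_digits = str(3).encode("ascii")

print(zero_filled)
print(ascii_digits)
```

The first call silently produces three NUL bytes, which is the surprise being described; the second is the explicit spelling of "the number 3 as ASCII text".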

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 460 reboot and a bitter fight

2014-01-13 Thread Nick Coghlan
On 13 January 2014 17:59, Ethan Furman et...@stoneleaf.us wrote:
 On 01/12/2014 11:15 PM, Guido van Rossum wrote:


 (It's too late here to write more, but it looks like we are in for a
 bitter fight. :-( )


 It's already been a bitter fight.

 The opponents of %-interpolation (Nick, Antoine, Turnbull, D'Aprano, et al*)
 all seem to be arguing basically what Nick said.

 The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing that
 bytes already has an ASCII bias, already has ASCII string methods, that it
 isn't the same as the Py2 world because if you combine a bytes object with a
 str object outside of interpolation (such as b'hello' + 'world') it doesn't
 work, that only bytes would ever be returned, etc, etc.

 With the possible exception of the question I just asked Nick,  I don't
 think we're going to get any new information.

I figured out tonight that it's only positioning ASCII interpolation
as an *alternative* to adding binary interpolation that I have a
problem with. It isn't, because you lose the structural assurance that
you haven't inadvertently introduced an assumption of ASCII
compatibility when you didn't need to. However, interpolation support
is a convenient enough interface that I can see a version that *only*
supports ASCII compatible interpolation being an attractive nuisance
that becomes a source of hard to detect and fix data corruption bugs
(just like the str type in Python 2).

If we add both, my objections go away: people like me can use the
Python 3 only formatb and formatb_map methods and be confident we
haven't inadvertently introduced any assumptions regarding ASCII
compatibility, while folks that know they're dealing with an ASCII
compatible format can use the ASCII assuming versions that are
designed to be source compatible with Python 2.

If someone incorrectly uses format() or format_map() when they should
be using the pure binary versions, that's a trivial bug fix (adding
the necessary b, and perhaps some explicit encoding calls) rather
than a major restructuring of the code.

If they use mod-formatting, that's a slightly bigger fix, but still
just switching to a different spelling of the formatting operation.

Both use cases (binary only and ASCII compatible) get covered cleanly,
and nobody has to lose out.

Cheers,
Nick.


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Python advanced debug support (update frame code)

2014-01-13 Thread Nick Coghlan
On 13 January 2014 09:08, Fabio Zadrozny fabi...@gmail.com wrote:
 Hi Python-dev.

 I'm playing a bit on the concept on live-coding during a debug session and
 one of the most annoying things is that although I can reload the code for a
 function (using something close to xreload), it seems it's not possible to
 change the code for the current frame (i.e.: I need to get out of the
 function call and then back in to a call to the method from that frame to
 see the changes).

 I gave a look on the frameobject and it seems it would be possible to set
 frame.f_code to another code object -- and set the line number to the start
 of the new object, which would cover the most common situation, which would
 be restarting the current frame -- provided the arguments remain the same
 (which is close to what the java debugger in Eclipse does when it drops the
 current frame -- on Python, provided I'm not in a try..except block I can do
 even better setting the frame.f_lineno, but without being able to change
 the frame f_code it loses a lot of its usefulness).

 So, I'd like to ask for feedback from people with more knowledge on whether
 it'd be actually feasible to change the frame.f_code and possible
 implications on doing that.

Huh, I would have sworn there was already an issue on the tracker
about that, but it appears not (Eric Snow has one about adding a
reference to the running function, but nothing about trying to switch
an executing frame: http://bugs.python.org/issue12857).

Anyway, your main problem isn't the reference to the code object from
the frame: it's the fact that the main eval loop has a reference to
that code object from a C level stack variable, and stores a bunch of
other state directly on the C stack.

I don't see anything *intrinsically* impossible about the idea, it
just wouldn't be easy, since you'd have to come up with a way of
dealing with that C level state that didn't slow down normal
operation.
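
To make the distinction concrete, here is a small sketch (assuming CPython, since sys._getframe is implementation-specific; the function names are made up for illustration). Replacing a function's code object takes effect for later calls, while the code object of a frame that is already executing cannot be assigned from Python code:

```python
import sys

def greet():
    return "old"

def greet_v2():
    return "new"

def try_swap_running_frame():
    """Attempt to replace the code object of the frame that is executing."""
    frame = sys._getframe()  # CPython-specific accessor
    try:
        frame.f_code = greet_v2.__code__
    except AttributeError:
        return False  # f_code is not writable from Python code
    return True

# Replacing the function's code object affects subsequent calls only.
greet.__code__ = greet_v2.__code__
swapped_function = greet()

swapped_frame = try_swap_running_frame()
print(swapped_function, swapped_frame)
```

This only shows the Python-level restriction; the C-level state held by the eval loop is the harder part and is not touched by this sketch.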

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 460 reboot and a bitter fight

2014-01-13 Thread Mark Lawrence

On 13/01/2014 07:59, Ethan Furman wrote:

On 01/12/2014 11:15 PM, Guido van Rossum wrote:





The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing
that bytes already has an ASCII bias, already has ASCII string methods,
that it isn't the same as the Py2 world because if you combine a bytes
object with a str object outside of interpolation (such as b'hello' +
'world') it doesn't work, that only bytes would ever be returned, etc, etc.

--
~Ethan~


"ASCII bias" seems to me an understatement.  From
http://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operations
"Due to the common use of ASCII text as the basis for binary protocols,
bytes and bytearray objects provide almost all methods found on text
strings."  Can you get any clearer than that, or have I been completely
swamped by the massive tsunami that these PEP 460 threads are?


Note that I'm *NOT* taking sides here, I'd just like to see a peaceful 
settlement without any bloodshed :)


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-13 Thread Stephen J. Turnbull
Glenn Linderman writes:

  On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
  Glenn Linderman writes:
  the proposals to embed binary in Unicode by abusing Latin-1
  encoding.

  Those aren't proposals, they are currently feasible
  techniques in Python 3 for *some* use cases. The question is why
  infecting Python 3 with the byte/character confoundance virus is
  preferable to such techniques, especially if their (serious!)
  deficiencies are removed by creating a new type such as
  asciistr.

  smuggled binary (great term borrowed from a different
  subthread) muddies the waters of what you are dealing with.

Not really.  The mud is one or more of the serious deficiencies.  It
can be removed, I believe (and Nick apparently does, too).  asciistr
is one way to try that.

  When the mixture of text and binary is done as encoded text in
  binary, then it is obvious that only limited text processing can be
  performed,

Hardly.  After all, that's how all text processing was done for
decades.  Still is, in some programs, especially C programs.

  And there are no extra, confusing Latin-1 encode/decode operations
  required.

The extra encode/decode operations are mostly (perhaps all) due to
examples that started from bytes and end with bytes.  Of course if you
assume that API and propose to do the operations using Unicode, you'll
get extra decode/encode operations.

  From a higher-level perspective, I think it would be great to have
  a module, perhaps called boundary (let's call it that for now),
  that allow some definition syntax (augmented BNF? augmented ABNF?)
  to explain the format of a binary blob.

We have struct, for one.  I'm not sure why you want more than that.  I
suppose you could go all the way to ASN.1.
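
For readers unfamiliar with struct, a minimal sketch of describing a binary layout with it. The record layout here (a 4-byte ASCII tag, a 16-bit version and a 32-bit length) is a hypothetical example, not a format discussed in this thread:

```python
import struct

# Hypothetical record layout: 4-byte ASCII tag, unsigned 16-bit version,
# unsigned 32-bit payload length, all big-endian with no padding.
RECORD = struct.Struct(">4sHI")

packed = RECORD.pack(b"DATA", 1, 42)
tag, version, length = RECORD.unpack(packed)
print(RECORD.size, tag, version, length)
```

The format string plays the role of the "definition syntax": the ASCII-compatible tag and the pure binary fields are declared side by side, and the tag comes back as bytes rather than str.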



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Barry Warsaw
On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote:

Perhaps not, but it's a hint that you should probably think about an
encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
it as payback time. :-)

Which unfortunately causes no end of headaches, often difficult to debug.

https://wiki.python.org/moin/PortingToPy3k/BilingualQuickRef

(see 'doctests' for one such impact).

-Barry


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Barry Warsaw
On Jan 12, 2014, at 09:45 PM, Glenn Linderman wrote:

Quotes in the stream are a great debug hint, without blowing up.

They're actually terrible for debugging for exactly the same reason as coercion
in Python 2.  It's rarely what you really want, it silently succeeds, and it
means that the user visible error is far removed from the actual bug, both in
code distance and time.  So yes, it tells you Something Went Wrong, but is
actually a hindrance to finding and fixing the problem.

-Barry


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 01:49 AM, Mark Shannon wrote:


'%s' can't work in 3.5, as we must differentiate between
strings which need to be encoded and bytes which don't.


I don't understand this objection:

def __mod__(self, other):
    if isinstance(other, bytes):
        # no encoding necessary
        pass
    elif isinstance(other, str):
        # payback time!
        other = ascii(other)

Where is the problem?

--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote:
 What is the use case for embedding a quoted ASCII-encoded representation
 in a byte stream?

It doesn't crash but produces undesired output (always, not only when
the data is non-ASCII) that gives the developer a hint to think about
encoding to bytes.
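
A rough sketch of the "undesired output" being described, using the built-in ascii(). The helper name ascii_fallback is hypothetical and only imitates the rule under discussion; it is not an existing bytes API:

```python
def ascii_fallback(value):
    """Hypothetical helper: what a str argument would turn into under the
    ascii()-based rule described above."""
    return ascii(value).encode("ascii")

plain = ascii_fallback("some text")
accented = ascii_fallback("caf\xe9")
print(plain)
print(accented)
```

Even for pure ASCII input the result carries the surrounding quotes, so the output is visibly wrong every time rather than only for non-ASCII data.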

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 07:59:10 -0800
Guido van Rossum gu...@python.org wrote:
 On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote:
  What is the use case for embedding a quoted ASCII-encoded representation
  in a byte stream?
 
 It doesn't crash but produces undesired output (always, not only when
 the data is non-ASCII) that gives the developer a hint to think about
 encoding to bytes.

But why is it better to give a hint by producing undesired output (which
may actually go unnoticed for some time and produce issues down the
road), rather than simply by raising TypeError?

By that token we may simply insert an error string (CAUTION: YOU MISS
AN ENCODING HERE), rather than the ascii() representation of the
argument.

Regards

Antoine.


[Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Augie Fackler
(sorry for not piling on any existing threads - I don't subscribe to
python-dev due to lack of time)

Brett Cannon asked me to chime in - I haven't actually read the very long
thread at this point, I'm just providing responses to things Brett
mentioned:

1) What do we need in terms of functionality

Best guess, %s, %d, and %f. I've not done a full audit of the code, but
some limited looking over the grep hits for % in .py files suggests I'm
right, and we could even do without %f (we only use that for 'hg --time'
output, which we could do in unicode).

We also need some way to emit raw bytes (in potentially mixed encodings,
yes I know this is doing it wrong) to stdout/stderr (example: someone
changes a file from latin1 to utf8, and then wants to see the resulting
diff).

2) Would having it as an external library that worked with Python 2 help?

Probably, IF it came with 2.4 support (RHEL support, basically), and we
could bundle it in our source tree. It's been extremely valuable to have
the install only depend on a working C compiler and Python.

3) If this does go in, how long would it take us to port Mercurial to py3?
Would it being in 3.5 hold us up?

I'm honestly not sure. I'm still in the outermost layers of this yak shave:
fixing cyclic imports. I'll know more when I can at least get 'hg version'
to print its own version, because at that point the testsuite failures
might be informative. I'd honestly _rather_ this went into 3.5 *and* got
lots of validation by both us and twisted (the other folks that care?)
before becoming set in stone by a release. Does that make sense?

4) Do we care if it's .format()/%, or could it be in the stdlib?

It'd be really nice to not have to boil the oceans as far as editing
everyplace in the codebase that does % today. If we do have to do that,
it's not going to be much more helpful than something like:

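def maybestr(a):
  # decode bytes arguments so they can take part in str formatting
  if isinstance(a, bytes):
    return a.decode('latin1')
  return a

def sprintf(fmt, *args):
  # format as text, then map every code point back to the same byte value
  return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')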

or similar. That was (roughly) what I was figuring I'd do today without any
formal bytes-string-formatting support.
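
The reason the latin1 detour is lossless can be checked directly. This is a small sketch of the underlying property, not Mercurial code:

```python
# Every byte value 0..255 maps to the code point with the same number in
# Latin-1, so decode/encode is lossless for arbitrary binary data.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")
round_tripped = as_text.encode("latin1")

# str %-formatting then works on the decoded form.
line = ("%s: %d" % (b"caf\xe9".decode("latin1"), 3)).encode("latin1")
print(round_tripped == all_bytes, line)
```

The cost is the extra decode and encode on every formatting call, plus the risk of passing in a real str containing code points above 255, which would fail on the final encode.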


He also mentioned that some are calling for a shortened 3.5 release cycle -
I'd rather not see that happen, for the aforementioned reason of wanting
time to make sure this is Right - it'd be a shame to do the work and rush
it out only to find something missing in an important way.

Feel free to ask further questions - I'll try to respond promptly.

AF

(For those curious: my hg-on-py3 repo isn't published at the moment because
I rebuilt the server it lived on and I forgot to publish it. I'll rectify
that sometime this week, I hope, but it's really totally nonfunctional due
to cyclic imports.)


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-13 Thread Ethan Furman

On 01/13/2014 02:48 AM, Stephen J. Turnbull wrote:

Ethan Furman writes:


The part that you don't seem to acknowledge (sorry if I missed it)
is that there are str-like methods already on bytes.


I haven't expressed myself well, but I don't much care about that.


You don't care that there are str-like methods on bytes?  Whether you do or not, they are there, and they impact how 
people think about bytes and what is (and what should be) allowed.



It's what Knuth would classify as a seminumerical method.


I do not see how that's relevant.  What matters is not how we can manipulate the data (everything is reduced to numbers 
at some point), but what the data represents.


[snip]


*My* definition is not ambiguous at all.  If this particular part
of the byte stream is defined to contain ASCII-encoded text, then I
can use the bytes text methods to work with it.


But how is Python supposed to know that?


Python doesn't need to.


... because you know it.  But the ideal of object-oriented programming
(and duck-typing) is that you shouldn't need to; the object should
know how to produce appropriate behavior itself.


The ideal, sure.  But if you're stuck with using a list to hold data for your higher-order recursive function are you 
going to expect the list data type to know which pops and inserts are allowed and which are not?  Of course not.  And 
you'd probably build a proper class on top of the list so those things could be checked.  Now imagine that the list type 
didn't offer insert and pop, and you had to use slice replacement -- what a pain that would be!


[snip]


But under your definition, you need to make the decision, or
explicitly code the decision, on the basis of context.


Exactly so.  I even have to do that in Py2.


Even.  This is exactly where PBP and EIBTI part company, I think.
EIBTI thinks it's a bad idea to pass around bytes that are implicitly
some other type


bytes are /always/ implicitly some other type.  They are basically raw data.  They are given meaning by how we interpret 
them.


[snip]


Even though ethan is perfectly good ASCII-encoded text (as well as
the integer 435,744,694,638 on a bigendian machine with 5-byte words,
and you have no way of knowing whether it was user data (CP1251) or a
metadata keyword (ASCII) or be the US national debt in 1967 dollars
(integer) when b'ethan' shows up in a trace?


Context is everything.  If b'ethan' shows up in a trace I would have to examine the surrounding code to see how those 
bytes were being used.



And if there were methods that worked directly on a cp1251-encoded
byte stream I would not have any problem using them on
cp1251-encoded text.)


I was afraid of that: all of those methods (except the case methods)
will work fine on a cp1251-encoded text.


Really?  Huh.  They wouldn't work fine with the Spanish alphabet.  I should've 
used that for my example.  :/


And because they only know
that the string is bytes, the case methods will silently corrupt your
text as soon as they get a chance.


Inevitably there are methods that will work even if given the wrong data type, while others will either corrupt or 
blow up if not given exactly what they expect.  You tell me that some ASCII methods will work okay on cp1251 text, and 
others will not.  So I'm not going to use any of them on cp1251 as that is not what they are intended for.



That bothers me, even if it
doesn't bother you.  Purity again, if you like.  (But you'd take a
safe .upper if you got it for free, no?)


Well, there is no such thing as free.  ;)  And there already is a safe .upper -- str.upper.  And if I don't know that my 
bytes are ASCII, but I did know they were text, I wouldn't use ASCII methods, I'd convert to str and work there.
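
A short sketch of the behaviours argued about above, with sample strings chosen purely for illustration. bytes.upper() only touches ASCII letters, so what it does to encoded text depends entirely on the encoding:

```python
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"  # the Russian word "privet"

# cp1251 stores these letters as bytes >= 0x80, so bytes.upper() (ASCII-only)
# leaves them untouched: no corruption, but no uppercasing either.
cp1251_bytes = cyrillic.encode("cp1251")
unchanged = cp1251_bytes.upper() == cp1251_bytes

# In UTF-16 a byte in the ASCII lowercase range can be half of another
# character, so bytes.upper() silently changes the text.
omega = "\u0461".encode("utf-16-le")           # b"\x61\x04"
corrupted = omega.upper().decode("utf-16-le")  # a different letter

# Decoding first and using str.upper() gives the intended result.
safe_upper = cp1251_bytes.decode("cp1251").upper().encode("cp1251")
print(unchanged, ascii(corrupted), safe_upper)
```

The UTF-16 case is the silent-corruption scenario; the last step is the "convert to str and work there" route.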


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 08:09 AM, Antoine Pitrou wrote:

On Mon, 13 Jan 2014 07:59:10 -0800
Guido van Rossum gu...@python.org wrote:

On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote:

What is the use case for embedding a quoted ASCII-encoded representation
in a byte stream?


It doesn't crash but produces undesired output (always, not only when
the data is non-ASCII) that gives the developer a hint to think about
encoding to bytes.


But why is it better to give a hint by producing undesired output (which
may actually go unnoticed for some time and produce issues down the
road), rather than simply by raising TypeError?


You mean crash all the time?  I'd be fine with that for both the str case and the bytes case.  But it's probably too late 
to change the str case, and the bytes case should mirror what str does.




By that token we may simply insert an error string (CAUTION: YOU MISS
AN ENCODING HERE), rather than the ascii() representation of the
argument.


Well, the ascii repr is at least some clue as to where.  A generic message 
would be no clue at all.

--
~Ethan~


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Nick Coghlan
On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
 (sorry for not piling on any existing threads - I don't subscribe to
 python-dev due to lack of time)

 Brett Cannon asked me to chime in - I haven't actually read the very long
 thread at this point, I'm just providing responses to things Brett
 mentioned:

 1) What do we need in terms of functionality

 Best guess, %s, %d, and %f. I've not done a full audit of the code, but some
 limited looking over the grep hits for % in .py files suggests I'm right,
 and we could even do without %f (we only use that for 'hg --time' output,
 which we could do in unicode).

I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+

 We also need some way to emit raw bytes (in potentially mixed encodings, yes
 I know this is doing it wrong) to stdout/stderr (example: someone changes
 a file from latin1 to utf8, and then wants to see the resulting diff).

Writing to sys.stdout.buffer may work for that, or else being able to
change the encoding of an existing stream. For the latter, Victor had
a working patch to _pyio at http://bugs.python.org/issue15216 and
general consensus that the semantics were sensible, but it needs to be
worked up into a full patch that covers the C version as well (I tried
to muster some helpers for that in the leadup to 3.4 feature freeze,
but unfortunately without any luck)
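
A minimal sketch of the sys.stdout.buffer approach. The helper name emit_raw and the sample diff lines are made up for illustration, and an in-memory stream stands in for stdout so the result can be inspected:

```python
import io
import sys

def emit_raw(chunks, stream=None):
    """Write byte chunks unchanged to a binary stream (default: stdout's buffer)."""
    if stream is None:
        stream = sys.stdout.buffer
    for chunk in chunks:
        stream.write(chunk)
    stream.flush()

diff_lines = [
    b"-caf\xe9\n",      # old line, Latin-1 encoded
    b"+caf\xc3\xa9\n",  # new line, UTF-8 encoded
]
captured = io.BytesIO()
emit_raw(diff_lines, captured)
```

Because nothing is decoded, the two lines can carry different encodings without either being mangled; calling emit_raw(diff_lines) with no stream would send the same bytes to the terminal.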

 2) Would having it as an external library that worked with Python 2 help?

 Probably, IF it came with 2.4 support (RHEL support, basically), and we
 could bundle it in our source tree. It's been extremely valuable to have the
 install only depend on a working C compiler and Python.

asciicompat.asciistr is just an alias for str on Python 2.x, so if we
get that working, it may be something you could vendor into Mercurial
for Python 3.3+ support. (There will likely be gaps in what asciistr
can do due to interoperability issues in the core types, but the PEP
393 changes to the internal representation mean it should be able to
get us pretty close)

 3) If this does go in, how long would it take us to port Mercurial to py3?
 Would it being in 3.5 hold us up?

 I'm honestly not sure. I'm still in the outermost layers of this yak shave:
 fixing cyclic imports. I'll know more when I can at least get 'hg version'
 to print its own version, because at that point the testsuite failures might
 be informative. I'd honestly _rather_ this went into 3.5 *and* got lots of
 validation by both us and twisted (the other folks that care?) before
 becoming set in stone by a release. Does that make sense?

Yes, that actually makes a lot of sense to me - there's no point in us
rushing to get this into 3.4 and then you folks discovering in 6
months it doesn't quite work for you, and then having to wait for 3.5
anyway (or, worse, Python 3 being locked into a solution that doesn't
work for you by its own internal backwards compatibility
requirements).


 4) Do we care if it's .format()/%, or could it be in the stdlib?

 It'd be really nice to not have to boil the oceans as far as editing
 everyplace in the codebase that does % today. If we do have to do that, it's
 not going to be much more helpful than something like:

 def maybestr(a):
   if isinstance(a, bytes):
     return a.decode('latin1')
   return a

 def sprintf(fmt, *args):
   return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')

 or similar. That was (roughly) what I was figuring I'd do today without any
 formal bytes-string-formatting support.

Agreed - I think the two solutions that potentially make the most
sense are PEP 460 and an interoperable third party type like asciistr.
They each have different pros and cons, so I'm actually currently a
fan of doing both (if Guido is amenable to my suggestion of providing
both ASCII compatible and binary interpolation).

 He also mentioned that some are calling for a shortened 3.5 release cycle -
 I'd rather not see that happen, for the aforementioned reason of wanting
 time to make sure this is Right - it'd be a shame to do the work and rush it
 out only to find something missing in an important way.

By shortened, we're mostly talking about ensuring 3.5 is published
before the 2.7.9 maintenance release. So early-to-mid 2015 rather than
the more typical late 2015.

 Feel free to ask further questions - I'll try to respond promptly.

Thanks for the contribution! I found it very helpful :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 07:52 AM, Barry Warsaw wrote:

On Jan 12, 2014, at 09:45 PM, Glenn Linderman wrote:


Quotes in the stream are a great debug hint, without blowing up.


They're actually terrible for debugging for exactly the same reason as coercion
in Python 2.  It's rarely what you really want, it silently succeeds, and it
means that the user visible error is far removed from the actual bug, both in
code distance and time.  So yes, it tells you Something Went Wrong, but is
actually a hindrance to finding and fixing the problem.


You mean like this is?

--> '%s' % b'abc'
"b'abc'"

I agree, but we're stuck with it with str, we may as well be stuck with it for 
bytes, too.  :/

--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 01:13 AM, Nick Coghlan wrote:

On 13 Jan 2014 17:43, Ethan Furman wrote:

On 01/12/2014 10:51 PM, Nick Coghlan wrote:


I am a strong -1 on the more lenient proposal, as it makes binary
interpolation in Python 3 an *unsafe operation* for ASCII incompatible
binary formats.


No more unsafe than calling .upper() on ASCII incompatible streams.


Right - Guido's proposal is *completely useless* for arbitrary binary data. You 
can't trust it.


Forgive me for being dense, but I don't understand your objection.  With Guido's proposal, in b'%s' % bytes_data, bytes_data 
is passed through unchanged.  Did you mean something else by binary data?


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan
On 14 January 2014 01:54, Ethan Furman et...@stoneleaf.us wrote:
 On 01/13/2014 01:13 AM, Nick Coghlan wrote:

 On 13 Jan 2014 17:43, Ethan Furman wrote:

 On 01/12/2014 10:51 PM, Nick Coghlan wrote:


 I am a strong -1 on the more lenient proposal, as it makes binary
 interpolation in Python 3 an *unsafe operation* for ASCII incompatible
 binary formats.


 No more unsafe than calling .upper() on ASCII incompatible streams.


 Right - Guido's proposal is *completely useless* for arbitrary binary
 data. You can't trust it.


 Forgive me for being dense, but I don't understand your objection.  With
 Guido's proposal, in b'%s' % bytes_data, bytes_data is passed through
 unchanged.  Did you mean something else by binary data?

I mean it will work, but it will mean you've introduced an implicit
assumption of ASCII compatibility into the structure of your program,
with no straightforward way of removing it (you would have to rewrite
your code to not rely on interpolation). This becomes most obvious
when the formatting string is passed as a variable, rather than being
provided as a literal, or when you don't know the type of the *value*
provided and some types may involve implicit encoding operations (I
don't think Guido proposed that, but others have). That's the kind of
data driven uncertainty I don't like in Python 2, and I find its
categorical elimination to be one of the best features of Python 3 -
there are certain kinds of data manipulation bugs that simply *can't
exist* because the types don't work that way any more.

However, that's also why *adding* formatb/formatb_map to the proposal
(with Antoine's stricter semantics) would resolve my concerns - you
can ensure you don't introduce an implicit assumption of ASCII
compatibility by using those for interpolation rather than the ASCII
compatible __mod__/format/format_map that the bytes type will share
with the str type.
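
To illustrate what "pure binary interpolation" could look like, here is a deliberately simplified sketch. The function strict_interpolate is a hypothetical stand-in written for this example; it does not implement the actual PEP 460 rules and is not an existing or proposed API:

```python
def strict_interpolate(template, *args):
    """Hypothetical stand-in for the proposed formatb: fill each b"{}" in
    template with a bytes-like argument, copied verbatim."""
    parts = template.split(b"{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    result = bytearray(parts[0])
    for arg, tail in zip(args, parts[1:]):
        if not isinstance(arg, (bytes, bytearray, memoryview)):
            raise TypeError("expected a bytes-like object, got %s"
                            % type(arg).__name__)
        result += arg
        result += tail
    return bytes(result)

framed = strict_interpolate(b"<{}|{}>", b"\xff\xfe", bytearray(b"\x00"))

try:
    strict_interpolate(b"{}", 42)
    rejected_int = False
except TypeError:
    rejected_int = True
print(framed, rejected_int)
```

The point of the strictness is structural: nothing is ever encoded implicitly, so no assumption of ASCII compatibility can creep in through an argument of an unexpected type.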

The combination of the two is completely in keeping with the Python 3
text model - we would offer text interpolation, hybrid ASCII
compatible interpolation *and* pure binary interpolation. Offering
only the first two would mean relegating the pure binary domain to a
lower status again, since assuming ASCII compatibility would grant you
access to an interpolation API, so people would be inclined to use it
even when doing so opens the door to data corruption bugs.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 08:36:05 -0800
Ethan Furman et...@stoneleaf.us wrote:
 On 01/13/2014 08:09 AM, Antoine Pitrou wrote:
  On Mon, 13 Jan 2014 07:59:10 -0800
  Guido van Rossum gu...@python.org wrote:
  On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net 
  wrote:
  What is the use case for embedding a quoted ASCII-encoded representation
  in a byte stream?
 
  It doesn't crash but produces undesired output (always, not only when
  the data is non-ASCII) that gives the developer a hint to think about
  encoding to bytes.
 
  But why is it better to give a hint by producing undesired output (which
  may actually go unnoticed for some time and produce issues down the
  road), rather than simply by raising TypeError?
 
 You mean crash all the time?  I'd be fine with that for both the str
 case and the bytes case.  But it's probably too late 
 to change the str case, and the bytes case should mirror what str does.

No, there's a good reason for the str case: it's that every Python
object should have a working __str__ (for debugging, REPL use, etc.).
So bytes has a __str__ too and that's why "%s" % (some_bytes_object)
succeeds.

Conversely, though, str needn't and shouldn't have a __bytes__, so
there's no good reason for b"%s" % (some_str_object) to succeed.

(moreover, I don't think "we did it wrong here" should be a good reason
for doing it wrong there too)
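
The asymmetry can be observed with existing behaviour. A minimal sketch that uses only str formatting and the bytes constructor, since bytes formatting is still under discussion here:

```python
# Every object has a str() form, so text formatting of bytes succeeds,
# yielding the repr rather than the decoded content.
formatted = "%s" % b"abc"

# There is no implicit conversion in the other direction.
try:
    bytes("abc")
    str_to_bytes_ok = True
except TypeError:
    str_to_bytes_ok = False
print(ascii(formatted), str_to_bytes_ok)
```

Going from bytes to str always has a debugging representation to fall back on, while going from str to bytes requires an encoding that the caller has to name.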

Regards

Antoine.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 08:36:05 -0800
Ethan Furman et...@stoneleaf.us wrote:

 On 01/13/2014 08:09 AM, Antoine Pitrou wrote:
  On Mon, 13 Jan 2014 07:59:10 -0800
  Guido van Rossum gu...@python.org wrote:
  On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net 
  wrote:
  What is the use case for embedding a quoted ASCII-encoded representation
  in a byte stream?
 
  It doesn't crash but produces undesired output (always, not only when
  the data is non-ASCII) that gives the developer a hint to think about
  encoding to bytes.
 
  But why is it better to give a hint by producing undesired output (which
  may actually go unnoticed for some time and produce issues down the
  road), rather than simply by raising TypeError?
 
 You mean crash all the time?  I'd be fine with that for both the str case
 and the bytes case.  But it's probably too late to change the str case,
 and the bytes case should mirror what str does.

Let me add something else: str and bytes don't have to be symmetrical.
In Python 2, str and unicode were symmetrical, they allowed exactly the
same operations and were composable.
In Python 3, str and bytes are different beasts; they have different
operations *and* different semantics (for example, bytes interoperates
with bytearray and memoryview, while str doesn't).

So bytes formatting really needn't (and shouldn't, IMO) mirror str
formatting.

(the only reason I used %s in PEP 460 is to allow a migration path
from 2.x bytes-formatting to 3.x bytes-formatting; in a really pure
proposal it would have been called something else)
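
A short sketch of the interoperability mentioned above, assuming Python 3; the names are illustrative:

```python
data = b'ab'

# bytes concatenates with other buffer types; the result is bytes.
joined = data + bytearray(b'cd')
viewed = data + memoryview(b'ef')
print(joined, viewed)                # b'abcd' b'abef'

# bytes and bytearray also compare equal by content.
print(data == bytearray(b'ab'))      # True

# str refuses to mix with any of them.
error = None
try:
    'ab' + bytearray(b'cd')
except TypeError as exc:
    error = exc
print(type(error).__name__)          # TypeError
```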

Regards

Antoine.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 07:49 AM, Barry Warsaw wrote:

On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote:


Perhaps not, but it's a hint that you should probably think about an
encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
it as payback time. :-)


Which unfortunately causes no end of headaches, often difficult to debug.


Is it, in fact, too late to change that behavior?

--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 08:39 AM, Ethan Furman wrote:

On 01/13/2014 07:49 AM, Barry Warsaw wrote:

On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote:


Perhaps not, but it's a hint that you should probably think about an
encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
it as payback time. :-)


Which unfortunately causes no end of headaches, often difficult to debug.


Is it, in fact, too late to change that behavior?


Never mind, Antoine explained it for me.  :)

--
~Ethan~


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler r...@durin42.com wrote:

 On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org wrote:

 On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote:
  On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
  1) What do we need in terms of functionality
 
  Best guess, %s, %d, and %f. I've not done a full audit of the code, but
  some
  limited looking over the grep hits for % in .py files suggests I'm
  right,
  and we could even do without %f (we only use that for 'hg --time'
  output,
  which we could do in unicode).
 
  I think PEP 460 will have you covered there, or hopefully asciistr on
  3.3+

 I'm confused on how PEP 460 would help -- Augie mentioned %d, which it
 excludes.



 Yes - not having %d makes this much much less useful to me.

 For my part, it'd probably be fine if we could do %s (which would handle an
 RHS that was bytes, and only bytes, no handling of str or __bytes__-type
 stuff at all) and %d (with all the usual format modifiers, and would result
 in an ascii-compatible sequence of bytes all the time).

Would it be okay if instead of %s you had to use %b for those
semantics? (%d would still exist)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote:
 On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
 1) What do we need in terms of functionality

 Best guess, %s, %d, and %f. I've not done a full audit of the code, but some
 limited looking over the grep hits for % in .py files suggests I'm right,
 and we could even do without %f (we only use that for 'hg --time' output,
 which we could do in unicode).

 I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+

I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 09:34:39 -0800
Guido van Rossum gu...@python.org wrote:
 On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote:
  On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
  1) What do we need in terms of functionality
 
  Best guess, %s, %d, and %f. I've not done a full audit of the code, but 
  some
  limited looking over the grep hits for % in .py files suggests I'm right,
  and we could even do without %f (we only use that for 'hg --time' output,
  which we could do in unicode).
 
  I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
 
 I'm confused on how PEP 460 would help -- Augie mentioned %d, which it 
 excludes.

Serhiy did a survey of formatting codes in the Mercurial sources:
https://mail.python.org/pipermail/python-dev/2014-January/130969.html

Regards

Antoine.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread R. David Murray
On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou solip...@pitrou.net wrote:
 On Sun, 12 Jan 2014 18:11:47 -0800
 Guido van Rossum gu...@python.org wrote:
  On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:
   On 01/12/2014 04:47 PM, Guido van Rossum wrote:
   %s seems the trickiest: I think with a bytes argument it should just
   insert those bytes (and the padding modifiers should work too), and
   for other types it should probably work like %a, so that it works as
   expected for numeric values, and with a string argument it will return
   the ascii()-variant of its repr(). Examples:
  
   b'%s' % 42 == b'42'
   b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
   enclosed in single quotes)
  
   I'm not sure about the quotes.  Would anyone ever actually want those in 
   the
   byte stream?
  
  Perhaps not, but it's a hint that you should probably think about an
  encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
  it as payback time. :-)
 
 What is the use case for embedding a quoted ASCII-encoded representation
 in a byte stream?

There is no use case in the sense you are asking, just like there is no
real use case for '%s' % b'x' producing "b'x'".  But the real use case
is exactly the same: to let you know your code is screwed up without
actually blowing up with an encoding exception.

For the record, I like Guido's logic and proposal.  I don't understand
Nick's objection, since I don't see the difference between the situation
here where a string gets interpolated into bytes as "'xxx'" and the
corresponding situation where bytes gets interpolated into a string
as "b'xxx'".  Why struggle to keep bytes interpolation pure if string
interpolation isn't?

Guido's proposal makes the language more symmetric, and thus more
consistent and less surprising.  Exactly the hallmarks of Python's design
sense, IMO.  (Big surprise, right? :)

Of course, this point of view *is* based on the idea that when you are
doing interpolation using %/.format, you are in fact primarily concerned
with ASCII compatible byte streams.  This is a Practicality sort of
argument.  It is, after all, by far the most common use case when
doing interpolation[*].

If you wanted to do a purist version of this symmetry, you'd have bytes(x)
calling __bytes__ if it was defined and falling back to calling a
__brepr__ otherwise.

But what would __brepr__ implement?  The variety of format codes in
the struct module argues that there is no one obvious binary
repr for most types.  (Those that have one would implement __bytes__).
And what would be the __brepr__ of an arbitrary 'object'?

Faced with the impracticality of defining __brepr__ usefully in any pure
bytes form, it seems sensible to admit that the most useful __brepr__
is the ascii() encoding of the __repr__.  Which naturally produces "'xxx'"
as the __brepr__ of a string.
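
To make the point about struct concrete, here is a small sketch. The brepr() function is hypothetical (no such builtin or protocol exists); it only illustrates the "ascii() of the repr" fallback described above:

```python
import struct

# One integer, several equally defensible binary representations.
value = 10
little_int = struct.pack('<i', value)        # 4-byte little-endian int
big_short = struct.pack('>h', value)         # 2-byte big-endian short
as_double = struct.pack('<d', float(value))  # 8-byte IEEE 754 double
print(little_int, big_short, len(as_double))


def brepr(obj):
    """Hypothetical fallback: the ASCII-only repr, encoded to bytes."""
    return ascii(obj).encode('ascii')


print(brepr('xxx'))     # b"'xxx'"
print(brepr(42))        # b'42'
print(brepr(object))    # b"<class 'object'>"
```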

This does cause things to get a little un-pretty when you are operating
at the python prompt:

>>> b'%s' % object
b'<class \\\'object\\\'>'

But then again that is most likely really not what you mean to do, so
it becomes a big red flag...just like "b'xxx'" is a small red flag when
you accidentally interpolate unencoded bytes into a string.

--David

PS: When I first read Guido's remark that the result of interpolating a
string should be "'xxx'", I went "Wah?"  I had to reason my way through to
it as above, but to him it was just the natural answer.  Guido isn't
always right, but this kind of automatic language design consistency
is one reason he's the BDFL.

[*] I still think that you mostly want to design your library so that
you are handling the text parts as text and the bytes parts as bytes,
and encoding/gluing them as appropriate at the IO boundary.  But if Guido
says his real code would benefit by being able to interpolate ASCII into
bytes at certain points, I'll believe him.


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Augie Fackler
On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org wrote:

 On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote:
  On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
  1) What do we need in terms of functionality
 
  Best guess, %s, %d, and %f. I've not done a full audit of the code, but
 some
  limited looking over the grep hits for % in .py files suggests I'm
 right,
  and we could even do without %f (we only use that for 'hg --time'
 output,
  which we could do in unicode).
 
  I think PEP 460 will have you covered there, or hopefully asciistr on
 3.3+

 I'm confused on how PEP 460 would help -- Augie mentioned %d, which it
 excludes.



Yes - not having %d makes this much much less useful to me.

For my part, it'd probably be fine if we could do %s (which would handle an
RHS that was bytes, and only bytes, no handling of str or __bytes__-type
stuff at all) and %d (with all the usual format modifiers, and would result
in an ascii-compatible sequence of bytes all the time).
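
For comparison, the semantics asked for here are close to what bytes %-formatting does in Python 3.5 and later, which postdates this thread. A minimal sketch, assuming such an interpreter; the values are made up for illustration:

```python
# %s with a bytes argument copies the bytes; %d produces ASCII digits and
# honours the usual width and padding modifiers.
line = b'%s %05d' % (b'rev', 42)
print(line)                          # b'rev 00042'

padded = b'[%-6s]' % b'ab'
print(padded)                        # b'[ab    ]'

# A str on the right-hand side is rejected rather than encoded implicitly.
error = None
try:
    b'%s' % 'text'
except TypeError as exc:
    error = exc
print(type(error).__name__)          # TypeError
```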


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Yury Selivanov
On January 13, 2014 at 12:45:40 PM, R. David Murray (rdmur...@bitdance.com) 
wrote:
[snip]
 There is no use case in the sense you are asking, just like there  
 is no
 real use case for '%s' % b'x' producing b'x'. But the real use  
 case
 is exactly the same: to let you know your code is screwed up without  
 actually blowing up with a encoding Exception.

Blowing up with an encoding exception is the *only* sane method of
making you aware that something is wrong. It’s much better than
just continuing to produce broken output until it gets noticed.

What’s the point of writing a piece of software that works incorrectly
without crashing?

 For the record, I like Guido's logic and proposal. I don't understand  
 Nick's objection, since I don't see the difference between the  
 situation
 here where a string gets interpolated into bytes as 'xxx' and  
 the
 corresponding situation where bytes gets interpolated into  
 a string
 as b'xxx'. Why struggle to keep bytes interpolation pure if  
 string
 interpolation isn’t?

Isn’t the whole point of this discussion to make python2 people
who want to migrate to python3 happier?  What’s the point for them
of having ported python2 code that produces “Status: b’42’” for
“b’Status: %d’ % 42”? And if you want to call ‘str’ on 42 and then
encode the output in latin-1/ascii, then you’re just turning python3
into python2.

-
Yury


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Augie Fackler
On Mon, Jan 13, 2014 at 12:39 PM, Guido van Rossum gu...@python.org wrote:

 On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler r...@durin42.com wrote:
 
  On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org
 wrote:
 
  On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com
 wrote:
   On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
   1) What do we need in terms of functionality
  
   Best guess, %s, %d, and %f. I've not done a full audit of the code,
 but
   some
   limited looking over the grep hits for % in .py files suggests I'm
   right,
   and we could even do without %f (we only use that for 'hg --time'
   output,
   which we could do in unicode).
  
   I think PEP 460 will have you covered there, or hopefully asciistr on
   3.3+
 
  I'm confused on how PEP 460 would help -- Augie mentioned %d, which it
  excludes.
 
 
 
  Yes - not having %d makes this much much less useful to me.
 
  For my part, it'd probably be fine if we could do %s (which would handle
 an
  RHS that was bytes, and only bytes, no handling of str or __bytes__-type
  stuff at all) and %d (with all the usual format modifiers, and would
 result
  in an ascii-compatible sequence of bytes all the time).

 Would it be okay if instead of %s you had to use %b for those
 semantics? (%d would still exist)



Probably, but it'd be quite painful, since we'd have to do some kind of
.sub() call all over the place to remain compatible with 2.4 and 2.6.

Dropping 2.4 might be possible in the 3.5 timeframe - 2.6 almost certainly
not.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 09:31 AM, Antoine Pitrou wrote:

On Mon, 13 Jan 2014 08:36:05 -0800
Ethan Furman wrote:


You mean crash all the time?  I'd be fine with that for both the str case
and the bytes case.  But it's probably too late to change the str case,
and the bytes case should mirror what str does.


Let me add something else: str and bytes don't have to be symmetrical.
In Python 2, str and unicode were symmetrical, they allowed exactly the
same operations and were composable.
In Python 3, str and bytes are different beasts; they have different
operations *and* different semantics (for example, bytes interoperates
with bytearray and memoryview, while str doesn't).


This makes sense to me.

So I guess I'm fine with either the quoted ascii repr or the always-blowing-up method, leaning towards the
blowing-up method.


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Georg Brandl
On 13.01.2014 18:38, Ethan Furman wrote:
 On 01/13/2014 09:31 AM, Antoine Pitrou wrote:
 On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote:
 
 You mean crash all the time?  I'd be fine with that for both the str
 case and the bytes case.  But it's probably too late to change the str case,
 and the bytes case should mirror what str does.
 
 Let me add something else: str and bytes don't have to be symmetrical. In
 Python 2, str and unicode were symmetrical, they allowed exactly the same
 operations and were composable. In Python 3, str and bytes are different
 beasts; they have different operations *and* different semantics (for
 example, bytes interoperates with bytearray and memoryview, while str
 doesn't).
 
 This makes sense to me.
 
 So I guess I'm fine with either the quoted ascii repr or the always-blowing-up
 method, leaning towards the blowing-up method.

+1.

Georg



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Brett Cannon
On Mon, Jan 13, 2014 at 12:31 PM, Antoine Pitrou solip...@pitrou.net wrote:

 On Mon, 13 Jan 2014 08:36:05 -0800
 Ethan Furman et...@stoneleaf.us wrote:

  On 01/13/2014 08:09 AM, Antoine Pitrou wrote:
   On Mon, 13 Jan 2014 07:59:10 -0800
   Guido van Rossum gu...@python.org wrote:
   On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net
 wrote:
   What is the use case for embedding a quoted ASCII-encoded
 representation
   in a byte stream?
  
   It doesn't crash but produces undesired output (always, not only when
   the data is non-ASCII) that gives the developer a hint to think about
   encoding to bytes.
  
   But why is it better to give a hint by producing undesired output
 (which
   may actually go unnoticed for some time and produce issues down the
   road), rather than simply by raising TypeError?
 
  You mean crash all the time?  I'd be fine with that for both the str case
  and the bytes case.  But it's probably too late to change the str case,
  and the bytes case should mirror what str does.

 Let me add something else: str and bytes don't have to be symmetrical.
 In Python 2, str and unicode were symmetrical, they allowed exactly the
 same operations and were composable.
 In Python 3, str and bytes are different beasts; they have different
 operations *and* different semantics (for example, bytes interoperates
 with bytearray and memoryview, while str doesn't).


This is also why the int type doesn't have a __bytes__ method (ignoring the
case of passing an integer to bytes()): it's universally defined what str(10)
should return, but who knows what you want when you ask for the bytes of 10
(e.g. base-2, ASCII, UTF-16, etc.).
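
The ambiguity is easy to demonstrate. A small sketch, assuming Python 3; each assignment is one plausible answer to "the bytes of 10":

```python
n = 10

zeros = bytes(n)                            # ten NUL bytes, not the number
ascii_digits = str(n).encode('ascii')       # decimal digits as ASCII
utf16_digits = str(n).encode('utf-16-le')   # decimal digits as UTF-16
big_endian = n.to_bytes(2, 'big')           # fixed-width binary integer
base2 = format(n, 'b').encode('ascii')      # base-2 digits as ASCII

for candidate in (zeros, ascii_digits, utf16_digits, big_endian, base2):
    print(candidate)
```

All five are reasonable, and none of them is obviously "the" answer in the way that str(10) is.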



 So bytes formatting really needn't (and shouldn't, IMO) mirror str
 formatting.


I think one of the things about Guido's proposal that bugs me is that it
breaks the mental model of the .format() method from str in terms of how
the mini-language works. For str.format() you have the conversion and the
format spec (e.g. {!r} and {:d}, respectively). You apply the
conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The
format spec semantically gets passed with the object to format() which
calls the object's __format__() method: ``format(number, 'd')``.

Now Guido's suggestion has two parts that affect the mini-language for
.format(). One is that for bytes.format() the default conversion is bytes()
instead of str(), which is fine (probably want to add 'b' as a conversion
value as well to be consistent). But the other bit is that the format spec
goes from semantically meaning ``format(thing, format_spec)`` to
``format(thing, format_spec).encode('ascii', 'strict')`` for at least
numbers. That implicitness bugs me as I have always thought of format specs
just leading to a call to format(). I think I can live with it, though, as
long as it is **consistently** applied across the board for bytes.format():
every use of a format spec leads to calling ``format(thing,
format_spec).encode('ascii', 'strict')`` no matter what type 'thing' is,
and it is clearly documented that this is done to ease porting and
handle the common case.

This even gives people in-place ASCII encoding for strings by always using
'{:s}' with text which they can do when they port their code to run under
both Python 2 and 3. So you should be able to do ``b'Content-Type:
{:s}'.format('image/jpeg')`` and have it give ASCII. If you want more
explicit encoding to latin-1 then you need to do it explicitly and not rely
on the mini-language to do tricks for you.
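
The distinction between conversion and format spec, and the implicit encode step under discussion, can be sketched as follows. format_field_as_bytes() is a hypothetical helper written for illustration; bytes.format() does not exist, so this only models the proposed semantics:

```python
# In str.format(), a conversion calls a builtin and a format spec is
# passed to format().
assert '{!r}'.format('x') == repr('x')
assert '{:d}'.format(42) == format(42, 'd')


def format_field_as_bytes(thing, format_spec):
    """Hypothetical model of one bytes.format() field with a format spec."""
    return format(thing, format_spec).encode('ascii', 'strict')


print(format_field_as_bytes(42, '05d'))          # b'00042'
print(format_field_as_bytes('image/jpeg', 's'))  # b'image/jpeg'

# Non-ASCII text fails loudly instead of being encoded some other way.
error = None
try:
    format_field_as_bytes('caf\u00e9', 's')
except UnicodeEncodeError as exc:
    error = exc
print(type(error).__name__)                      # UnicodeEncodeError
```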

IOW I want to treat the format mini-language as a language and thus not
have any special-casing or massive shifts in meaning between str.format()
and bytes.format() so my mental model doesn't have to contort based on
whether it's str or bytes. My preference is to not have any, but if Guido is
going to say PBP here then I want absolute consistency across the board in how
bytes.format() tweaks things.

As for %s for the % operator calling ascii(), I think that will be a
porting nightmare of finding out why your bytes suddenly stopped being
formatted properly and then having to crawl through all of your code for
that one use of %s which is getting bytes in. By raising a TypeError you
will very easily detect where your screw-up occurred thanks to the
traceback; doing otherwise feels too much like implicit type conversion, and
any JavaScript developer can tell you how that can be a bad thing.
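
The two behaviours being compared can be sketched side by side. Both functions are hypothetical stand-ins that handle a single %s, not a real or proposed API:

```python
def interpolate_lenient(template, value):
    """Non-bytes values are rendered via ascii(), as in the %s proposal."""
    if not isinstance(value, bytes):
        value = ascii(value).encode('ascii')
    return template.replace(b'%s', value, 1)


def interpolate_strict(template, value):
    """Non-bytes values are rejected with a TypeError."""
    if not isinstance(value, bytes):
        raise TypeError('%s requires bytes, not ' + type(value).__name__)
    return template.replace(b'%s', value, 1)


# Lenient: stray quotes end up in the output and nothing points at the cause.
print(interpolate_lenient(b'Content-Type: %s', 'image/jpeg'))
# b"Content-Type: 'image/jpeg'"

# Strict: the traceback identifies the offending call site immediately.
error = None
try:
    interpolate_strict(b'Content-Type: %s', 'image/jpeg')
except TypeError as exc:
    error = exc
print(error)            # %s requires bytes, not str
```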

-Brett




 (the only reason I used %s in PEP 460 is to allow a migration path
 from 2.x bytes-formatting to 3.x bytes-formatting; in a really pure
 proposal it would have been called something else)

 Regards

 Antoine.



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Daniel Holth
On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray rdmur...@bitdance.com wrote:
 On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou solip...@pitrou.net 
 wrote:
 On Sun, 12 Jan 2014 18:11:47 -0800
 Guido van Rossum gu...@python.org wrote:
  On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:
   On 01/12/2014 04:47 PM, Guido van Rossum wrote:
   %s seems the trickiest: I think with a bytes argument it should just
   insert those bytes (and the padding modifiers should work too), and
   for other types it should probably work like %a, so that it works as
   expected for numeric values, and with a string argument it will return
   the ascii()-variant of its repr(). Examples:
  
   b'%s' % 42 == b'42'
   b'%s' % 'x' == b'x' (i.e. the three-byte string containing an 'x'
   enclosed in single quotes)
  
   I'm not sure about the quotes.  Would anyone ever actually want those in 
   the
   byte stream?
 
  Perhaps not, but it's a hint that you should probably think about an
  encoding. It's symmetric with how '%s' % b'x' returns b'x'. Think of
  it as payback time. :-)

 What is the use case for embedding a quoted ASCII-encoded representation
 in a byte stream?

 There is no use case in the sense you are asking, just like there is no
 real use case for '%s' % b'x' producing b'x'.  But the real use case
 is exactly the same: to let you know your code is screwed up without
 actually blowing up with a encoding Exception.

 For the record, I like Guido's logic and proposal.  I don't understand
 Nick's objection, since I don't see the difference between the situation
 here where a string gets interpolated into bytes as 'xxx' and the
 corresponding situation where bytes gets interpolated into a string
 as b'xxx'.  Why struggle to keep bytes interpolation pure if string
 interpolation isn't?

 Guido's proposal makes the language more symmetric, and thus more
 consistent and less surprising.  Exactly the hallmarks of Python's design
 sense, IMO.  (Big surprise, right? :)

 Of course, this point of view *is* based on the idea that when you are
 doing interpolation using %/.format, you are in fact primarily concerned
 with ASCII compatible byte streams.  This is a Practicality sort of
 argument.  It is, after all, by far the most common use case when
 doing interpolation[*].

 If you wanted to do a purist version of this symmetry, you'd have bytes(x)
 calling __bytes__ if it was defined and falling back to calling a
 __brepr__ otherwise.

 But what would __brepr__ implement?  The variety of format codes in
 the struct module argues that there is no one obvious binary
 repr for most types.  (Those that have one would implement __bytes__).
 And what would be the __brepr__ of an arbitrary 'object'?

 Faced with the impracticality of defining __brepr__ usefully in any pure
 bytes form, it seems sensible to admit that the most useful __brepr__
 is the ascii() encoding of the __repr__.  Which naturally produces 'xxx'
 as the __brepr__ of a string.

 This does cause things to get a little un-pretty when you are operating
 at the python prompt:

  b'%s' % object
 b'class \\\'object\\\''

 But then again that is most likely really not what you mean to do, so
 it becomes a big red flag...just like b'xxx' is a small red flag when
 you accidentally interpolate unencoded bytes into a string.

 --David

 PS: When I first read Guido's remark that the result of interpolating a
 string should be 'xxx', I went Wah?  I had to reason my way through to
 it as above, but to him it was just the natural answer.  Guido isn't
 always right, but this kind of automatic language design consistency
 is one reason he's the BDFL.

 [*] I still think that you mostly want to design your library so that
 you are handling the text parts as text and the bytes parts as bytes,
 and encoding/gluing them as appropriate at the IO boundary.  But if Guido
 says his real code would benefit by being able to interpolate ASCII into
 bytes at certain points, I'll believe him.

<elided rant/>

If you think corrupted data is easier or more pleasant to track down
than encoding exceptions then I think you are strange. It makes
porting really difficult while you are still trying to figure out
where the bytes/str boundaries are. I am now deeply suspicious of all
% formatting.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Donald Stufft

On Jan 13, 2014, at 1:45 PM, Daniel Holth dho...@gmail.com wrote:

 On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray rdmur...@bitdance.com 
 wrote:
 On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou solip...@pitrou.net 
 wrote:
 On Sun, 12 Jan 2014 18:11:47 -0800
 Guido van Rossum gu...@python.org wrote:
 On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:
 On 01/12/2014 04:47 PM, Guido van Rossum wrote:
 %s seems the trickiest: I think with a bytes argument it should just
 insert those bytes (and the padding modifiers should work too), and
 for other types it should probably work like %a, so that it works as
 expected for numeric values, and with a string argument it will return
 the ascii()-variant of its repr(). Examples:
 
 b'%s' % 42 == b'42'
 b'%s' % 'x' == b'x' (i.e. the three-byte string containing an 'x'
 enclosed in single quotes)
 
 I'm not sure about the quotes.  Would anyone ever actually want those in 
 the
 byte stream?
 
 Perhaps not, but it's a hint that you should probably think about an
 encoding. It's symmetric with how '%s' % b'x' returns b'x'. Think of
 it as payback time. :-)
 
 What is the use case for embedding a quoted ASCII-encoded representation
 in a byte stream?
 
 There is no use case in the sense you are asking, just like there is no
 real use case for '%s' % b'x' producing b'x'.  But the real use case
 is exactly the same: to let you know your code is screwed up without
 actually blowing up with a encoding Exception.
 
 For the record, I like Guido's logic and proposal.  I don't understand
 Nick's objection, since I don't see the difference between the situation
 here where a string gets interpolated into bytes as 'xxx' and the
 corresponding situation where bytes gets interpolated into a string
 as b'xxx'.  Why struggle to keep bytes interpolation pure if string
 interpolation isn't?
 
 Guido's proposal makes the language more symmetric, and thus more
 consistent and less surprising.  Exactly the hallmarks of Python's design
 sense, IMO.  (Big surprise, right? :)
 
 Of course, this point of view *is* based on the idea that when you are
 doing interpolation using %/.format, you are in fact primarily concerned
 with ASCII compatible byte streams.  This is a Practicality sort of
 argument.  It is, after all, by far the most common use case when
 doing interpolation[*].
 
 If you wanted to do a purist version of this symmetry, you'd have bytes(x)
 calling __bytes__ if it was defined and falling back to calling a
 __brepr__ otherwise.
 
 But what would __brepr__ implement?  The variety of format codes in
 the struct module argues that there is no one obvious binary
 repr for most types.  (Those that have one would implement __bytes__).
 And what would be the __brepr__ of an arbitrary 'object'?
 
 Faced with the impracticality of defining __brepr__ usefully in any pure
 bytes form, it seems sensible to admit that the most useful __brepr__
 is the ascii() encoding of the __repr__.  Which naturally produces 'xxx'
 as the __brepr__ of a string.
 
 This does cause things to get a little un-pretty when you are operating
 at the python prompt:
 
 b'%s' % object
b'class \\\'object\\\''
 
 But then again that is most likely really not what you mean to do, so
 it becomes a big red flag...just like b'xxx' is a small red flag when
 you accidentally interpolate unencoded bytes into a string.
 
 --David
 
 PS: When I first read Guido's remark that the result of interpolating a
 string should be 'xxx', I went Wah?  I had to reason my way through to
 it as above, but to him it was just the natural answer.  Guido isn't
 always right, but this kind of automatic language design consistency
 is one reason he's the BDFL.
 
 [*] I still think that you mostly want to design your library so that
 you are handling the text parts as text and the bytes parts as bytes,
 and encoding/gluing them as appropriate at the IO boundary.  But if Guido
 says his real code would benefit by being able to interpolate ASCII into
 bytes at certain points, I'll believe him.
 
 <elided rant/>
 
 If you think corrupted data is easier or more pleasant to track down
 than encoding exceptions then I think you are strange. It makes
 porting really difficult while you are still trying to figure out
 where the bytes/str boundaries are. I am now deeply suspicious of all
 % formatting.
 ___
 Python-Dev mailing list
 Python-Dev@python.org
 https://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe: 
 https://mail.python.org/mailman/options/python-dev/donald%40stufft.io

For the record, I think %d and %f and such where the RHS is guaranteed to have a
certain set of “characters” that are guaranteed to be ascii compatible is fine
and it’s perfectly acceptable to have an implicit ASCII encode for them. The %s
code I’m not sure of, I think trying to ascii encode that (just using encode())
is dangerous, and I think that using ascii() and adding quotes to it

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 09:12 AM, Nick Coghlan wrote:

On 14 January 2014 01:54, Ethan Furman wrote:


Forgive me for being dense, but I don't understand your objection.  With
Guido's proposal, '%s' % bytes_data, bytes_data is passed through unchanged.
Did you mean something else by binary data?


I mean it will work, but it will mean you've introduced an implicit
assumption of ASCII compatibility into the structure your program


Okay, I'm still trying to understand.  Apparently we both mean the same thing by binary data / bytes, so the difference
must be the %s, yes?  And the concern is that because you have used %s as the format code, if somebody accidentally put,
say, 'stupid bug' on the RHS you would end up with b"'stupid bug'" instead of an exception, which you would get if you had
used %b instead.  Am I following?


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
Let me try rebooting the reboot.

My interpretation of Nick's argument is that he is asking for a bytes
formatting language that doesn't have an implicit ASCII assumption.

To me this feels absurd. The formatting codes (%s, %c) themselves are
expressed as ASCII characters. If you include anything else in the
format string besides formatting codes (e.g. b'%s'), you are giving
it as ASCII characters. I don't know what characters the EBCDIC codes
37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.
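
A quick way to see this from the interpreter, as a sketch: cp037 is used here as one representative EBCDIC code page from CPython's standard codecs (other EBCDIC variants exist and are not covered).

```python
# Byte values of the ASCII characters %, c and s.
ascii_codes = list(b"%cs")
print(ascii_codes)

# cp037 is one EBCDIC code page shipped with CPython's codecs.
ebcdic = "%cs".encode("cp037")
print(list(ebcdic))

# The same three characters land on different byte values, so a
# bytes % operator that scans for byte 37 would not find them.
print(37 in ebcdic)
```

In other words, the format string itself already commits you to ASCII, whatever the arguments contain.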

If I had some byte strings in an unknown encoding (but the same
encoding for all) that I needed to concatenate I would never think of
'%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)

If I see some code using *any* formatting operation (regardless of
whether it's %d, %r, %s or %c) I am going to assume that there is some
ASCII-ness, and if there isn't, the code's author has obscured their
goal to me.

I hear the objections against b'%s' % 'x' returning b"'x'" loud and
clear, and if the noise about that sub-issue is preventing folks from
seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
use %b which would require its argument to be bytes. Those bytes
should still probably be ASCII-ish, but there's no way to test that.
That's fine with me and should be fine to Nick as well -- PEP 460
doesn't check that your encodings match (how could it? :-), nor does
plain string concatenation using +.

In my head I make the following classification of situations where you
work with bytes and/or text.

(A) Pure binary formats (e.g. most IP-level packet formats, media
files, .pyc files, tar/zip files, compressed data, etc.). These are
handled using the struct module (e.g. tar/zip) and/or custom C
extensions (e.g. gzip).

(B) Encoded text. Here you should just decode everything into str
objects and parse your text at that level. If you really want to
manipulate the data as bytes (e.g. because you have a lot of data to
process and very light processing) you may be able to do it, but
unless it's a verbatim copy, you are probably going to make
assumptions about the encoding. You are also probably going to mess up
for some encodings (e.g. leave BOM turds in the middle of a file).

(C) Loosely text-based protocols and formats that have an ASCII
assumption in the spec. Most classic Internet protocols (FTP, SMTP,
HTTP, IRC, etc.) fall in this category; I expect there are also plenty
of file formats using similar conventions (e.g. mailbox files). These
protocols and formats often require text-ish manipulations, e.g. for
case-insensitive headers or commands, or to split things at
whitespace. This is where I find uses for the current ASCII-assuming
bytes operations (e.g. b.lower(), b.split(), but also int(b)) and
where the lack of number formatting (especially %d and %x) is most
painful. I see no benefit in forcing the programmer writing such
protocol-handling code to use more cumbersome ways of converting
between numbers and bytes, nor in forcing them to insert an
encoding/decoding layer -- these protocols often switch between text
and binary data at line boundaries, so the most basic part of parsing
(splitting the input into lines) must still happen in the realm of
bytes.
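
As an illustration of category (C), here is a minimal sketch of header handling done entirely in bytes. The function names parse_header and format_header are made up for this example, and the header shown is just a typical HTTP-style line:

```python
def parse_header(line):
    """Split one b'Name: value' line into (lowercased name, value)."""
    name, _, value = line.partition(b":")
    # bytes.lower() and bytes.strip() only touch ASCII letters/whitespace.
    return name.strip().lower(), value.strip()

def format_header(name, number):
    """Build a header line with a numeric value, without bytes % support."""
    # Without %d for bytes, the number takes a detour through str.
    return name + b": " + str(number).encode("ascii") + b"\r\n"

name, value = parse_header(b"Content-Length: 42\r\n")
length = int(value)  # int() accepts ASCII digits in bytes
print(name, length)
print(format_header(b"Content-Length", length + 1))
```

Parsing is comfortable because the ASCII-assuming methods exist; the str detour in format_header is the "cumbersome" part that number formatting for bytes would remove.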

IMO PEP 460 and the mindset that goes with it don't apply to any of
these three cases.

Also, IMO requiring a new type to handle (C) seems to add too
much complexity, and adds to porting efforts. I may have felt
differently in the past, but ATM I feel that if newer versions of
Python 3 make porting of Python 2 code easier, through minor
compromises, that's a *good* thing. (Example: adding u'...' literals
to 3.3.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Augie Fackler
Antoine Pitrou solipsis at pitrou.net writes:

 
 On Mon, 13 Jan 2014 09:34:39 -0800
 Guido van Rossum guido at python.org wrote:
  On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncoghlan at gmail.com wrote:
   On 13 January 2014 23:57, Augie Fackler raf at durin42.com wrote:
   1) What do we need in terms of functionality
  
   Best guess, %s, %d, and %f. I've not done a full audit of the code, but some
   limited looking over the grep hits for % in .py files suggests I'm right,
   and we could even do without %f (we only use that for 'hg --time' output,
   which we could do in unicode).
  
   I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
  
  I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
 
 Serhiy did a survey of formatting codes in the Mercurial sources:
 https://mail.python.org/pipermail/python-dev/2014-January/130969.html

Note that a lot of those are in debug code (eg the only %f I've spotted is),
or are time format specifiers (which can be unicode just fine). A few others
(eg %ln) are for our internal revset format-string language, so this
overstates what we'd need in bytes by a little. %f would probably be good
too, as I look a little more.

(Please don't remove me from the CC list - I could only respond via gmane
because I'm not subscribed to python-dev.)

 
 Regards
 
 Antoine.
 
 






Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Donald Stufft

On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote:

 I hear the objections against b'%s' % 'x' returning b"'x'" loud and
 clear, and if the noise about that sub-issue is preventing folks from
 seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
 use %b which would require its argument to be bytes. Those bytes
 should still probably be ASCII-ish, but there's no way to test that.
 That's fine with me and should be fine to Nick as well -- PEP 460
 doesn't check that your encodings match (how could it? :-), nor does
 plain string concatenation using +.

I think disallowing %s is the right thing to do, but I definitely think numbers
and %b should be allowed.

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 18:51:32 + (UTC)
Augie Fackler r...@durin42.com wrote:
 
 (Please don't remove me from the CC list - I could only respond via gmane
 because I'm not subscribed to python-dev.)

Responding via gmane is what I do, too :-)
My NNTP client doesn't allow SMTP / NNTP mixed postings, so I'm forced
to remove you from CC.

Regards

Antoine.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Terry Reedy

On 1/13/2014 1:40 PM, Brett Cannon wrote:


 So bytes formatting really needn't (and shouldn't, IMO) mirror str
 formatting.


This was my presumption in writing byteformat().


I think one of the things about Guido's proposal that bugs me is that it
breaks the mental model of the .format() method from str in terms of how
the mini-language works. For str.format() you have the conversion and
the format spec (e.g. {!r} and {:d}, respectively). You apply the
conversion by calling the appropriate built-in, e.g. 'r' calls repr().
The format spec semantically gets passed with the object to format()
which calls the object's __format__() method: ``format(number, 'd')``.

Now Guido's suggestion has two parts that affect the mini-language for
.format(). One is that for bytes.format() the default conversion is
bytes() instead of str(), which is fine (probably want to add 'b' as a
conversion value as well to be consistent). But the other bit is that
the format spec goes from semantically meaning ``format(thing,
format_spec)`` to ``format(thing, format_spec).encode('ascii',
'strict')`` for at least numbers. That implicitness bugs me as I have
always thought of format specs just leading to a call to format(). I
think I can live with it, though, as long as it is **consistently**
applied across the board for bytes.format(); every use of a format spec
leads to calling ``format(thing, format_spec).encode('ascii',
'strict')`` no matter what type 'thing' would be and it is clearly
documented that this is done to ease porting and handle the common case
then I can live with it.


This is how my byteformat function works, except that when no
format_spec is given, bytes and bytearray objects are left unchanged
rather than being decoded and encoded again.
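
The rule described in the two paragraphs above can be sketched as follows. This is not the actual byteformat code, only an illustration under the stated assumptions; format_field is a hypothetical name for the handling of a single replacement field:

```python
def format_field(thing, format_spec=""):
    """Hypothetical sketch of one bytes.format() replacement field."""
    if format_spec == "" and isinstance(thing, (bytes, bytearray)):
        # No format spec: bytes-like objects pass through untouched.
        return bytes(thing)
    # Otherwise format as text, then insist on pure ASCII.
    return format(thing, format_spec).encode("ascii", "strict")

print(b"Content-Type: " + format_field("image/jpeg", "s"))
print(b"Content-Length: " + format_field(42, "d"))
print(format_field(b"\xff\x00"))

try:
    format_field("caf\u00e9", "s")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

The strict encode is what turns non-ASCII text into an immediate exception rather than silently corrupted output.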



This even gives people in-place ASCII encoding for strings by always
using '{:s}' with text which they can do when they port their code to
run under both Python 2 and 3. So you should be able to do
``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
If you want more explicit encoding to latin-1 then you need to do it
explicitly and not rely on the mini-language to do tricks for you.

IOW I want to treat the format mini-language as a language and thus not
have any special-casing or massive shifts in meaning between
str.format() and bytes.format() so my mental model doesn't have to
contort based on whether it's str or bytes. My preference is to not have
any, but if Guido is going to say PBP here then I want absolute consistency
across the board in how bytes.format() tweaks things.

As for %s for the % operator calling ascii(), I think that will be a
porting nightmare of finding out why your bytes suddenly stopped being
formatted properly and then having to crawl through all of your code for
that one use of %s which is getting bytes in. By raising a TypeError you
will very easily detect where your screw-up occurred thanks to the
traceback; doing otherwise feels too much like implicit type conversion
and ask any JavaScript developer how that can be a bad thing.


I personally would not add 'bytes % whatever'.

--
Terry Jan Reedy



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Barry Warsaw
On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:


On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote:

 I hear the objections against b'%s' % 'x' returning b"'x'" loud and
 clear, and if the noise about that sub-issue is preventing folks from
 seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
 use %b which would require its argument to be bytes. Those bytes
 should still probably be ASCII-ish, but there's no way to test that.
 That's fine with me and should be fine to Nick as well -- PEP 460
 doesn't check that your encodings match (how could it? :-), nor does
 plain string concatenation using +.

I think disallowing %s is the right thing to do, but I definitely think
numbers and %b should be allowed.

I guess I agree.  The behavior of b'%s' % 'x' returning b"'x'" is almost
always useless at best.  (I would have thought maybe %a for ascii() but don't
care that strongly.)

-Barry




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Brett Cannon
On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote:

 On 1/13/2014 1:40 PM, Brett Cannon wrote:

   So bytes formatting really needn't (and shouldn't, IMO) mirror str
  formatting.


 This was my presumption in writing byteformat().


  I think one of the things about Guido's proposal that bugs me is that it
 breaks the mental model of the .format() method from str in terms of how
 the mini-language works. For str.format() you have the conversion and
 the format spec (e.g. {!r} and {:d}, respectively). You apply the
 conversion by calling the appropriate built-in, e.g. 'r' calls repr().
 The format spec semantically gets passed with the object to format()
 which calls the object's __format__() method: ``format(number, 'd')``.

 Now Guido's suggestion has two parts that affect the mini-language for
 .format(). One is that for bytes.format() the default conversion is
 bytes() instead of str(), which is fine (probably want to add 'b' as a
 conversion value as well to be consistent). But the other bit is that
 the format spec goes from semantically meaning ``format(thing,
 format_spec)`` to ``format(thing, format_spec).encode('ascii',
 'strict')`` for at least numbers. That implicitness bugs me as I have
 always thought of format specs just leading to a call to format(). I
 think I can live with it, though, as long as it is **consistently**
 applied across the board for bytes.format(); every use of a format spec
 leads to calling ``format(thing, format_spec).encode('ascii',
 'strict')`` no matter what type 'thing' would be and it is clearly
 documented that this is done to ease porting and handle the common case
 then I can live with it.


 This is how my byteformat function works, except that when no format_spec
 is given, bytes and bytearray objects are left unchanged rather than being
 decoded and encoded again.


Right, which is what the default conversion covers. And as your code shows
this can be made available today without having to wait for Python 3.5 and
so can go up on PyPI and be used **today**.




  This even gives people in-place ASCII encoding for strings by always
 using '{:s}' with text which they can do when they port their code to
 run under both Python 2 and 3. So you should be able to do
 ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
 If you want more explicit encoding to latin-1 then you need to do it
 explicitly and not rely on the mini-language to do tricks for you.

 IOW I want to treat the format mini-language as a language and thus not
 have any special-casing or massive shifts in meaning between
 str.format() and bytes.format() so my mental model doesn't have to
 contort based on whether it's str or bytes. My preference is to not have
 any, but if Guido is going to say PBP here then I want absolute consistency
 across the board in how bytes.format() tweaks things.

 As for %s for the % operator calling ascii(), I think that will be a
 porting nightmare of finding out why your bytes suddenly stopped being
 formatted properly and then having to crawl through all of your code for
 that one use of %s which is getting bytes in. By raising a TypeError you
 will very easily detect where your screw-up occurred thanks to the
 traceback; doing otherwise feels too much like implicit type conversion
 and ask any JavaScript developer how that can be a bad thing.


 I personally would not add 'bytes % whatever'.


Personally, neither would I; just focus on bytes.format() and let %
operator on strings slowly go away.

-Brett




 --
 Terry Jan Reedy




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Daniel Holth
I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"

Instead of %b could %j mean "I should have used + or join() here
but was too lazy" and work on str too?



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 11:57 AM, Barry Warsaw ba...@python.org wrote:

 On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:

On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote:

 I hear the objections against b'%s' % 'x' returning b"'x'" loud and
 clear, and if the noise about that sub-issue is preventing folks from
 seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
 use %b which would require its argument to be bytes. Those bytes
 should still probably be ASCII-ish, but there's no way to test that.
 That's fine with me and should be fine to Nick as well -- PEP 460
 doesn't check that your encodings match (how could it? :-), nor does
 plain string concatenation using +.

I think disallowing %s is the right thing to do, but I definitely think
numbers and %b should be allowed.

 I guess I agree.  The behavior of b'%s' % 'x' returning b"'x'" is almost
 always useless at best.  (I would have thought maybe %a for ascii() but don't
 care that strongly.)

Yeah, the %s behavior with a string argument was a messy attempt at
compromise. I was hoping to mimic a common use of %s in Python 2,
where it can be used with either an 8-bit string or a number as
argument, acting like %b in the former case and like %d in the latter
case. Not having %s at all in Python 3 means that porting requires
more thinking (== more opportunity for mistakes when you're converting
in bulk) and there's no easy way to write code that works in Python 2
and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding
%a for calling ascii() on the argument and then interpolating the
result after ASCII-encoding it.
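
For readers following along, the two codes as described here can be sketched as plain functions. The names interpolate_b and interpolate_a are hypothetical stand-ins for what %b and %a would do to a single argument; this is an illustration of the proposal as worded above, not an implementation of it:

```python
def interpolate_b(arg):
    """Hypothetical %b: accept bytes only, copy them verbatim."""
    if not isinstance(arg, (bytes, bytearray)):
        raise TypeError("%b requires bytes, not " + type(arg).__name__)
    return bytes(arg)

def interpolate_a(arg):
    """Hypothetical %a: ascii() of the argument, ASCII-encoded."""
    return ascii(arg).encode("ascii")

print(interpolate_b(b"raw \xff data"))
print(interpolate_a("x"))   # quotes are kept
print(interpolate_a(b"x"))  # so is the b prefix

try:
    interpolate_b("x")
except TypeError as exc:
    print("TypeError:", exc)
```

With this split, a str passed by mistake fails loudly under %b, while %a remains available for debugging-style output.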

If somehow (unlikely though it seems) we end up keeping %s (e.g.
strictly to ease porting), we could also keep %r as an alias for %a.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Yury Selivanov
On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:
  
 I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"

 Instead of %b could %j mean "I should have used + or join() here
 but was too lazy" and work on str too?

Isn’t this just error prone? Since it’s a new format character, many,
probably, would write %s by mistake. And, besides, there was no %j
in python2.

-
Yury


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon br...@python.org wrote:
 On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote:
 I personally would not add 'bytes % whatever'.

 Personally, neither would I; just focus on bytes.format() and let % operator
 on strings slowly go away.

Well, % has some very strong arguments in its favor still -- for
example, the sheer amount of code that currently uses it, the fact
that it's as close as we get to a cross-language standard, and the
fact that nobody wants to tackle its use in the logging module (since
logger objects are often shared between packages that don't know about
each other).
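
A small sketch of why logging is tied to %: the message template and its arguments travel separately, and the % operator is applied only when a record is actually emitted. The logger name below is made up for the example:

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo.percent")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)
logger.propagate = False

# The message and its arguments are passed separately; logging applies
# the % operator only if the record passes the level check.
logger.info("%d files changed in %s", 3, "repo")
logger.debug("skipped: %r", object())  # below the level, never formatted

print(stream.getvalue())
```

Since every package that calls a shared logger passes %-style templates, changing the template syntax would have to be coordinated across all of them.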

Anyway, the % or .format() issue seems completely orthogonal to the
issues that get people riled up (which are mostly about whether using
either implies some kind of ASCII compatibility).

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Daniel Holth
On Mon, Jan 13, 2014 at 3:11 PM, Yury Selivanov yselivanov...@gmail.com wrote:
 On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:

 I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"

 Instead of %b could %j mean "I should have used + or join() here
 but was too lazy" and work on str too?

 Isn’t this just error prone? Since it’s a new format character, many,
 probably, would write %s by mistake. And, besides, there was no %j
 in python2.

Merely a flesh wound.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Eric V. Smith
On 01/13/2014 03:09 PM, Guido van Rossum wrote:
 If we have %b for strictly interpolating bytes, I'm fine with adding
 %a for calling ascii() on the argument and then interpolating the
 result after ASCII-encoding it.
 
 If somehow (unlikely though it seems) we end up keeping %s (e.g.
 strictly to ease porting), we could also keep %r as an alias for %a.

Wouldn't %s as an alias for %b simplify porting from Python 2?




Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-13 Thread anatoly techtonik
That's cool, but historical heritage makes the "make" argument
somewhat confusing for new users. The immediate question I
can sense is "What is the difference between build and make?"

To make (this word again) the criticism constructive, let me pass
some ideas about ideal user experience as I see it.

--[installation]--
1   I install Sphinx. Two scenarios.
   1.1   I am not a Python user - use installer
  1.1.1   Installer should obviously install Python
  1.1.2   And install sphinx command
  1.1.3   And add sphinx to PATH
   1.2   I am a Python user - use pip
  1.2.1   pip should not alter my PATH (for virtualenv)

--[usage]--
2   Two scenarios
   2.1   sphinx as a system command from PATH
   2.2   python -m sphinx for current virtualenv / test config

--[user experience]--
3   These two invocations are equal
 sphinx
 python -m sphinx

4   They give the following output

Sphinx 1.2 Documentation Generator

Commands:

   build   build documentation
   init    start new project [also quickstart]
   make    helper for common build commands

Use sphinx -h <command> or sphinx <command> --help for details


I am not using sphinx ATM, otherwise I'd spend more time
designing an ideal command set to get rid of the build/make duality, but
it should work ok.

Actually sphinx is a new command, so you may rethink the
syntax for build arguments to contain html instead of dir names,
and move dir names into parameters, because it is how it is most
often used.

--
anatoly t.


On Sun, Jan 12, 2014 at 4:53 PM, Georg Brandl g.bra...@gmx.net wrote:
 That's also planned, see 
 https://bitbucket.org/birkenfeld/sphinx-new-make-mode/.

 Georg

 Am 12.01.2014 09:49, schrieb anatoly techtonik:
 And cross-platform automation tools in Python instead of make
 https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script
 --
 anatoly t.


 On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki songofaca...@gmail.com wrote:
 What about using venv and pip instead of svn?




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 12:02 PM, Brett Cannon wrote:


Personally, neither would I; just focus on bytes.format() and let % operator on 
strings slowly go away.


Hey, now, some of us like %!  ;)

--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-13 Thread Glenn Linderman

On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:

Glenn Linderman writes:

   On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
   Glenn Linderman writes:
   the proposals to embed binary in Unicode by abusing Latin-1
   encoding.

   Those aren't proposals, they are currently feasible
   techniques in Python 3 for *some* use cases. The question is why
   infecting Python 3 with the byte/character confoundance virus is
   preferable to such techniques, especially if their (serious!)
   deficiencies are removed by creating a new type such as
   asciistr.

   smuggled binary (great term borrowed from a different
   subthread) muddies the waters of what you are dealing with.

Not really.  The mud is one or more of the serious deficiencies.  It
can be removed, I believe (and Nick apparently does, too).  asciistr
is one way to try that.


Yes really. Use of smuggled binary means the str containing it can no 
longer be treated completely as a str. That is muddier than having a 
str that is only a str.



   When the mixture of text and binary is done as encoded text in
   binary, then it is obvious that only limited text processing can be
   performed,

Hardly.  After all, that's how all text processing was done for
decades.  Still is, in some programs, especially C programs.


I disagree, and so do you... text processing must be limited to the text 
subsets of the text that includes smuggled binary... that is limited... 
you can't just apply text searches, scans, and transformations over the 
complete str, when it contains smuggled binary.  You know that, but must 
not have considered it a limitation, because you know you can do any 
text processing on the text parts.  But it is a limitation to have to 
keep track of it, and apply the text processing only to the parts that 
are text. Yes, it has been done that way, and the limitations of doing 
it that way led to the plethora of encodings each of which was intended 
to be sufficient for some problem domain, but most of which were only 
sufficient for a smaller problem domain than intended, especially as 
communications became more global in nature.
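
The limitation described above can be sketched in a few lines. This is only an illustration: the payload bytes and the "name=" prefix are made up, and Latin-1 is used because it maps every byte value to exactly one code point.

```python
# Hypothetical binary payload (not text); the values are chosen for illustration.
payload = bytes([0x00, 0xE9, 0x80])

# "Smuggle" the bytes into a str by decoding them as Latin-1.
smuggled = "name=" + payload.decode("latin-1")

# A text operation applied to the whole str also touches the smuggled part:
# upper() turns U+00E9 into U+00C9.
mangled = smuggled.upper()

# Encoding the untouched str recovers the payload; the transformed one does not.
round_trip = smuggled.encode("latin-1")
corrupted = mangled.encode("latin-1")

print(round_trip[5:] == payload)   # payload survives if the str is left alone
print(corrupted[5:] == payload)    # payload is damaged by the text operation
```

Text processing therefore has to be restricted to the parts known to be text, which is the bookkeeping burden being argued about.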




   And there are no extra, confusing Latin-1 encode/decode operations
   required.

The extra encode/decode operations are mostly (perhaps all) due to
examples that started from bytes and end with bytes.  Of course if you
assume that API and propose to do the operations using Unicode, you'll
get extra decode/encode operations.


No, the extra encode/decode are from the requirement that smuggled 
binary use latin-1, and other binary flavors are not always latin-1.




   From a higher-level perspective, I think it would be great to have
   a module, perhaps called boundary (let's call it that for now),
   that allows some definition syntax (augmented BNF? augmented ABNF?)
   to explain the format of a binary blob.

We have struct, for one.  I'm not sure why you want more than that.  I
suppose you could go all the way to ASN.1.


struct is insufficient to capture a whole file format, with optional 
parts, although it suffices for fragments.
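
A small sketch of that point, using an invented record layout (the magic value, flag bit and field sizes are assumptions for the example, not any real format): each fixed fragment is described by a struct format, but the decision about whether the optional fragment is present has to live in ordinary Python code.

```python
import struct

# Hypothetical layout:
#   fixed header: 2-byte magic, 1-byte flags, 1 pad byte (big-endian)
#   optional part: 4-byte unsigned length, present only when flag bit 0 is set
HEADER = struct.Struct(">2sBx")
OPTIONAL = struct.Struct(">I")

def parse_record(blob):
    """Parse the fixed header, then the optional field if the flag says so."""
    magic, flags = HEADER.unpack_from(blob, 0)
    length = None
    if flags & 0x01:
        (length,) = OPTIONAL.unpack_from(blob, HEADER.size)
    return magic, flags, length

with_optional = parse_record(b"PK\x01\x00\x00\x00\x00\x2a")
without_optional = parse_record(b"PK\x00\x00")
print(with_optional, without_optional)
```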




Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-13 Thread Georg Brandl
[If you want to continue this discussion, please move it from python-dev to
sphinx-users.  It is now completely offtopic for the former.]

Anyway, just as a short explanation, you missed the point of the change:
-M is not meant to be used directly but still via a (very short)
Makefile.  This isn't a change meant to be visible to users.

Georg

Am 13.01.2014 20:56, schrieb anatoly techtonik:
 That's cool, but historical heritage makes the make argument
 somewhat confusing for new users. The immediate question I
 can sense is "What is the difference between build and make?"
 
 To make (this word again) the criticism constructive, let me pass
 some ideas about ideal user experience as I see it.
 
 --[installation]--
 1   I install Sphinx. Two scenarios.
   1.1   I am not a Python user - use installer
     1.1.1   Installer should obviously install Python
     1.1.2   And install sphinx command
     1.1.3   And add sphinx to PATH
   1.2   I am a Python user - use pip
     1.2.1   pip should not alter my PATH (for virtualenv)
 
 --[usage]--
 2   Two scenarios
   2.1   sphinx as a system command from PATH
   2.2   python -m sphinx for current virtualenv / test config
 
 --[user experience]--
 3   These two invocations are equal
 sphinx
 python -m sphinx
 
 4. They give the following output

 Sphinx 1.2 Documentation Generator
 
 Commands:
 
   build   build documentation
   init    start new project [also quickstart]
   make    helper for common build commands
 
 Use sphinx -h command or sphinx command --help for details
 
 
 I am not using Sphinx at the moment, otherwise I'd spend more time
 designing an ideal command set to get rid of the build/make duality,
 but it should work OK.
 
 Actually sphinx is a new command, so you may rethink the
 syntax for build arguments to contain html instead of dir names,
 and move dir names into parameters, because it is how it is most
 often used.
 
 --
 anatoly t.
 
 
 On Sun, Jan 12, 2014 at 4:53 PM, Georg Brandl g.bra...@gmx.net wrote:
 That's also planned, see 
 https://bitbucket.org/birkenfeld/sphinx-new-make-mode/.

 Georg

 Am 12.01.2014 09:49, schrieb anatoly techtonik:
 And cross-platform automation tools in Python instead of make
 https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script
 --
 anatoly t.


 On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki songofaca...@gmail.com 
 wrote:
 What about using venv and pip instead of svn?


 ___
 Python-Dev mailing list
 Python-Dev@python.org
 https://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe: 
 https://mail.python.org/mailman/options/python-dev/techtonik%40gmail.com
 




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 1:49 AM, Mark Shannon wrote:

So why not replace '%s' with '%a' for the ascii case and
with '%b' for directly inserting bytes.


Because %a and %b don't exist in Python 2.7?


I thought this was about 3.5, not 2.7 ;)
'%s' can't work in 3.5, as we must differentiate between
strings which need to be encoded and bytes which don't.


It's about migrating code to reach a point where it can work on both 2.7 
and 3.5.


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Serhiy Storchaka

13.01.14 15:57, Augie Fackler написав(ла):

1) What do we need in terms of functionality

Best guess, %s, %d, and %f. I've not done a full audit of the code, but
some limited looking over the grep hits for % in .py files suggests I'm
right, and we could even do without %f (we only use that for 'hg --time'
output, which we could do in unicode).


Most popular formatting codes in Mercurial sources (excluding %Y, %M, etc):

   2519 %s
    493 %d
    102 %r
     33 %i
     23 %ld
     19 %ln
     12 %.3f
     10 %.1f
      9 %(val)r
      9 %p
      9 %.2f

%s covers almost 80% of use cases and %d covers almost 20%. %r covers 
about 3%, %f covers less than 1%. So I think anything except %s and %d 
can be ignored.
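
The message does not say how the tally was produced. One way such a count could be approximated is sketched below; the regular expression is a deliberately crude assumption (it also matches strftime-style codes such as %Y, which would need to be excluded by hand), and the sample line is invented.

```python
import re
from collections import Counter

# Crude pattern for printf-style codes: optional (key), flags, width,
# precision, length modifier, then a conversion letter.
FORMAT_CODE = re.compile(r"%(?:\([^)]*\))?[-#0+]*\d*(?:\.\d+)?[hlL]?[a-zA-Z]")

def count_format_codes(source_text):
    """Return a Counter mapping each format code found to its frequency."""
    return Counter(FORMAT_CODE.findall(source_text))

# Hypothetical line of source code, used only to exercise the function.
sample = "msg = '%s: %d' % (name, n); t = '%.3f %(val)r %s' % args"
counts = count_format_codes(sample)

for code, n in counts.most_common():
    print(n, code)
```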




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Greg Ewing

Guido van Rossum wrote:

On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:


On 01/12/2014 04:47 PM, Guido van Rossum wrote:



b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
enclosed in single quotes)


I'm not sure about the quotes.  Would anyone ever actually want those in the
byte stream?


Perhaps not, but it's a hint that you should probably think about an
encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
it as payback time. :-)


If it's never useful, wouldn't it be better to raise an
exception in this case?

That way, someone porting code from py2 that does this
without appropriate modification will find out about
the problem immediately, rather than have spurious
quotes inserted into their binary data, which -- being
binary data -- will likely go unnoticed until something
else tries to read the data.

I don't think the rule against operations that work on
all-but-one-type really applies here, because the mistake
it's intended to catch is not an obscure corner case.
If your program's logic includes interpolating strings
into bytes objects, then you're going to be testing
that.
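
The two behaviours being compared can be sketched with ordinary functions. Both helper names below are hypothetical and exist only for this illustration; neither is a proposed or existing API.

```python
def strict_bytes_arg(value):
    """Accept only bytes-like values; anything else raises immediately."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    raise TypeError("expected bytes, got %s" % type(value).__name__)

def quoted_ascii_arg(value):
    """Copy bytes through; for anything else, interpolate ascii(value)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    return ascii(value).encode("ascii")

record_ok = b"NAME=" + strict_bytes_arg(b"x")
record_quoted = b"NAME=" + quoted_ascii_arg("x")   # quotes end up in the data

try:
    strict_bytes_arg("x")
except TypeError as exc:
    print("caught at write time:", exc)

print(record_ok, record_quoted)
```

With the strict variant the mistake surfaces where the data is written; with the quoting variant it surfaces only when something reads the data back.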

--
Greg


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Paul Moore
On 13 January 2014 18:58, Guido van Rossum gu...@python.org wrote:
 I hear the objections against b'%s' % 'x' returning b"'x'" loud and
 clear, and if the noise about that sub-issue is preventing folks from
 seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
 use %b which would require its argument to be bytes. Those bytes
 should still probably be ASCII-ish, but there's no way to test that.
 That's fine with me and should be fine to Nick as well -- PEP 460
 doesn't check that your encodings match (how could it? :-), nor does
 plain string concatenation using +.

For the record, Guido's reboot posting and rationale has convinced me,
and I am essentially in favour of his proposal.

Nick's remaining objection seems to me to have some validity if the
format string is a user-supplied variable, but this type of usage is
vanishingly small in my experience, and shouldn't dictate the whole
design.

I don't like b'%s' % 'x' behaviour, and would prefer one of the
alternatives. I'm not entirely clear about the details of the
alternative proposals, so I won't try to pick one.

I think this should be for 3.5, and should not involve an accelerated
release of 3.5 - we should get it into the 3.5 code early and let
people thrash out the details during the 3.5 release cycle.

Paul.

PS For all the heated arguments and occasional frayed tempers, this
has been an impressively civil debate. I think that's one of the best
things about python-dev, that discussions like these never degenerate
into flamewars. Kudos to all concerned!


Re: [Python-Dev] PEP 460 reboot and a bitter fight

2014-01-13 Thread Glenn Linderman

On 1/13/2014 5:06 AM, Nick Coghlan wrote:

I figured out tonight that it's only positioning ASCII interpolation
as an *alternative* to adding binary interpolation that I have a
problem with. It isn't, because you lose the structural assurance that
you haven't inadvertently introduced an assumption of ASCII
compatibility when you didn't need to. However, interpolation support
is a convenient enough interface that I can see a version that *only*
supports ASCII compatible interpolation being an attractive nuisance
that becomes a source of hard to detect and fix data corruption bugs
(just like the str type in Python 2).

If we add both, my objections go away: people like me can use the
Python 3 only formatb and formatb_map methods and be confident we
haven't inadvertently introduced any assumptions regarding ASCII
compatibility, while folks that know they're dealing with an ASCII
compatible format can use the ASCII assuming versions that are
designed to be source compatible with Python 2.

If someone incorrectly uses format() or format_map() when they should
be using the pure binary versions, that's a trivial bug fix (adding
the necessary b, and perhaps some explicit encoding calls) rather
than a major restructuring of the code.

If they use mod-formatting, that's a slightly bigger fix, but still
just switching to a different spelling of the formatting operation.

Both use cases (binary only and ASCII compatible) get covered cleanly,
and nobody has to lose out.

Cheers,
Nick.


As part of that, what about an alternate spelling of  %  to allow 
binary-only interpolation operations using the handy syntax of % ? 
Doesn't seem like  /  is defined for bytes or str on the LHS.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 10:40 AM, Brett Cannon wrote:
This even gives people in-place ASCII encoding for strings by always 
using '{:s}' with text which they can do when they port their code to 
run under both Python 2 and 3. So you should be able to do 
``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. 
If you want more explicit encoding to latin-1 then you need to do it 
explicitly and not rely on the mini-language to do tricks for you.


My preference is to not have any, but if Guido is going to say PBP here then 
I want absolute consistency across the board in how bytes.format() 
tweaks things.


As for %s for the % operator calling ascii(), I think that will be a 
porting nightmare of finding out why your bytes suddenly stopped being 
formatted properly and then having to crawl through all of your code 
for that one use of %s which is getting bytes in. By raising a 
TypeError you will very easily detect where your screw-up occurred 
thanks to the traceback; doing otherwise feels too much like implicit 
type conversion, and ask any JavaScript developer how that can be a bad 
thing.




So quote 3 is necessarily a violation of quote 1.  But if quote 2 can 
allow for one exception to its absolute consistency... that is probably 
the best solution overall...


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 9:38 AM, Ethan Furman wrote:

On 01/13/2014 09:31 AM, Antoine Pitrou wrote:

On Mon, 13 Jan 2014 08:36:05 -0800
Ethan Furman wrote:


You mean crash all the time?  I'd be fine with that for both the str
case and the bytes case.  But it's probably too late to change the str
case, and the bytes case should mirror what str does.


Let me add something else: str and bytes don't have to be symmetrical.
In Python 2, str and unicode were symmetrical, they allowed exactly the
same operations and were composable.
In Python 3, str and bytes are different beasts; they have different
operations *and* different semantics (for example, bytes interoperates
with bytearray and memoryview, while str doesn't).


This makes sense to me.

So I guess I'm fine with either the quoted ascii repr or the always 
blowing up method, with a leaning towards the blowing up method. 


+1 - what Ethan said. A real death, instead of death by inappropriately 
transformed data, is fine by me, if b"%s" % str(...) doesn't have the 
appropriate .encode(...) call. But I could live with either.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Mark Lawrence

On 13/01/2014 21:01, Paul Moore wrote:


I think this should be for 3.5, and should not involve an accelerated
release of 3.5 - we should get it into the 3.5 code early and let
people thrash out the details during the 3.5 release cycle.


I disagree, it should be on pypi now so people can start trying it out, 
or as others have suggested incorporate it into the six module.  Surely 
that'd make the job of getting it into 3.5 far easier?




Paul.

PS For all the heated arguments and occasional frayed tempers, this
has been an impressively civil debate. I think that's one of the best
things about python-dev, that discussions like these never degenerate
into flamewars. Kudos to all concerned!



+1

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Greg Ewing

Glenn Linderman wrote:

Quotes in the stream are a great debug hint, without blowing up.


But do you really want those quotes turning up in
a *binary* stream, where they're somewhere between
awkward and near-impossible to spot by eyeballing,
and may only be discovered when something else --
likely a different program, possibly being run
by a different person -- tries to read the data
back, and blows up because the binary format is
corrupted?

I'd much rather it blew up at the writing stage,
myself. Corrupted binary data is *much* harder to
debug than corrupted text, because binary formats
typically have little to no margin for error
before they become complete garbage.

--
Greg


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
I will doggedly keep posting to this thread rather than creating more threads.

In another thread, Nick has said he's okay with my proposal (not sure
if that includes %s or not, but it now seems of lesser importance) as
long as we simultaneously introduce formatb() and formatb_map() (the
latter is just a minor variation of the former, so I won't mention it
further).

But formatb() feels absurd to me. PEP 460 has neither a precise
specification nor any actual examples, so I can't tell whether the
intention is that the format string can *only* contain {...} sequences
or whether it can also contain regular characters. Translating to
formatb(), my question comes down to the legality of the following
example:

  b'Hello, {}'.formatb(name)  # Where name is some bytes object

If this is allowed, it reintroduces the ASCII bias (since the
substring 'Hello' is clearly ASCII).

If this isn't allowed, it feels like a perversion of the notion of a
formatting language, and I really don't see the attraction over
using a combination of concatenation and the struct module, perhaps
augmented with some use of bytes([i]) as an alternative to %c or {!c}
(if that is what is meant by PEP 460 with 'c modifier' -- I can't find
the word 'modifier' in the docs for format()).

Note that I honestly don't understand which of these PEP 460 means.

Either way, PEP 460's motivation seems kind of subjective and esthetic:


While there are reasonably efficient ways to accumulate binary data
(such as using a bytearray object, the bytes.join method or even
io.BytesIO), none of them leads to the kind of readable and intuitive
code that is produced by a %-formatted or {}-formatted template and a
formatting operation.


I would buy this if a binary format string could contain embedded text
(like 'Hello' in my example above), but then the argument about
avoiding ASCII bias seems to fall apart so I am at a loss about what
Nick actually wants, and even about what PEP 460 actually specifies.
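
For comparison, the alternative mentioned above (concatenation plus the struct module plus bytes([i])) might look like the following sketch. The variable names and the framing layout are invented for the example.

```python
import struct

name = b"World"   # hypothetical bytes payload
version = 3       # hypothetical small integer field

# Fixed fragments and payloads are joined by plain concatenation.
greeting = b"Hello, " + name

# bytes([i]) yields the single byte with value i, standing in for %c.
tagged = bytes([version]) + greeting

# struct.pack covers fixed-width numeric fields, here a big-endian
# 16-bit length prefix.
framed = struct.pack(">H", len(tagged)) + tagged

print(framed)
```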

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 12:09 PM, Guido van Rossum wrote:

Yeah, the %s behavior with a string argument was a messy attempt at
compromise. I was hoping to mimic a common use of %s in Python 2,
where it can be used with either an 8-bit string or a number as
argument, acting like %b in the former case and like %d in the latter
case. Not having %s at all in Python 3 means that porting requires
more thinking (== more opportunity for mistakes when you're converting
in bulk) and there's no easy way to write code that works in Python 2
and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding
%a for calling ascii() on the argument and then interpolating the
result after ASCII-encoding it.

If somehow (unlikely though it seems) we end up keeping %s (e.g.
strictly to ease porting), we could also keep %r as an alias for %a.


%s for strictly interpolating bytes eases porting. Sad name, but good 
for compatibility. When the blowup happens, due to having a str type 
passed, the porter adds the appropriate .encode(...) to the parameter, 
so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it 
not?


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 13:32:28 -0800
Guido van Rossum gu...@python.org wrote:
 
 But formatb() feels absurd to me. PEP 460 has neither a precise
 specification nor any actual examples, so I can't tell whether the
 intention is that the format string can *only* contain {...} sequences
 or whether it can also contain regular characters. Translating to
 formatb(), my question comes down to the legality of the following
 example:
 
   b'Hello, {}'.formatb(name)  # Where name is some bytes object

Yes, it's allowed. But so is:

  b'\xff\x00{}\x85{}'.formatb(payload, trailer)

The ASCII bias is because of the bytes literal notation.

Regards

Antoine.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 01:08 PM, Glenn Linderman wrote:


+1 - what Ethan said. A real death, instead of death by inappropriately
transformed data, is fine by me, if b"%s" % str(...) doesn't have the
appropriate .encode(...) call. But I could live with either.


You mean instead of death by a thousand quotes?  *ducks and runs*

--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
Terminology. Let's use the official terminology rather than making stuff up.

The docs at http://docs.python.org/3/library/string.html#formatspec
use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec
in that order, all optional.

Field name: either a decimal integer (referring to an argument by
position) or an identifier (by name), or omitted (uses the next
available position).

Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
value, and then the format spec applies to the resulting string.

Format spec: colon, bunch of stuff, type; the type is a letter such as
d (decimal) or s (string), and the stuff between the colon and the
type is used to specify field width, alignment, sign, padding and
such.


Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
this leaves for interpolating bytes if we don't want to use {:s}. The
docs at 
http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
don't show %b so it could still be used there, but it would be nicer
to be consistent.
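
The terminology can be seen at work in existing str.format() calls. The values below are arbitrary; only documented str.format() behaviour is used.

```python
value = "x"

# Field name 0, conversion !r, format spec ">6" (right-align in width 6).
padded = "{0!r:>6}".format(value)

# Field name "count", no conversion, format spec "03d" (type d, zero padded).
by_name = "{count:03d}".format(count=7)

# Omitted field name, format spec "b": the integer rendered in base 2.
base_two = "{:b}".format(5)

# Conversion !a applies ascii() to the value before formatting.
escaped = "{!a}".format("\u00e9")

print(repr(padded), by_name, base_two, escaped)
```

The third line is the reason {:b} is already taken for str: it means "binary number", not "bytes".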

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Greg Ewing

Nick Coghlan wrote:

By allowing format characters that *do* assume ASCII, the entire
construct is rendered unsafe - you have to look inside the format
string to determine if it is assuming ASCII compatibility or not, thus
the entire construct must be deemed as assuming ASCII compatibility at
the level of static semantic analysis.


I don't see how any of the currently proposed formatting
operations make a data-dependent ASCII assumption.

When you write b"%d" % x, you're
not assuming that x is ASCII, you're assuming that it's
an *integer*. The %d conversion of an integer is defined
to produce only ASCII characters, and it works on any
integer, so there's no data-dependent assumption there.

Something that *would* involve such an assumption would
be if b"%s" % 'hello' were defined to encode 'hello' as
ASCII. But Guido has proposed not doing that, and instead
interpolating ascii('hello'). Since ascii() is defined to
return only ASCII characters, and works on any string,
there is again no data-dependent assumption.

My preference would be for b"%s" % 'hello' to raise an
exception, but that would still be data-independent.

As for having to look inside the format string to know
what types are expected, that's no different from any
other formatting operation. All it means is that static
type analysis in Python is hard, but we already knew
that.
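
The distinction between data-independent and data-dependent behaviour can be checked with plain str operations, without any bytes formatting support. The sample values are arbitrary.

```python
samples_int = [0, -17, 2**64]
samples_str = ["hello", "h\u00e9llo", "\u65e5\u672c"]

# Decimal rendering of any integer is pure ASCII, so this never fails.
int_bytes = [str(n).encode("ascii") for n in samples_int]

# ascii() escapes every non-ASCII character, so this never fails either.
str_bytes = [ascii(s).encode("ascii") for s in samples_str]

# Encoding the string itself as ASCII *is* data-dependent: it works for
# some values and raises for others.
failures = []
for s in samples_str:
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        failures.append(s)

print(int_bytes, str_bytes, len(failures))
```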


Allowing these ASCII assuming format codes in the core bytes
interpolation introduces *exactly* the same problem as is present in
the Python 2 text model: code that *appears* to support arbitrary
binary data, but is in fact assuming ASCII compatibility.


Can you provide an example of code using Guido's
currently approved formatting semantics that would
fail when given arbitrary binary data? I don't see
how it can happen.

--
Greg


Re: [Python-Dev] PEP460 thoughts from a Mercurial dev

2014-01-13 Thread Nick Coghlan
On 14 Jan 2014 03:34, Guido van Rossum gu...@python.org wrote:

 On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote:
  On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote:
  1) What do we need in terms of functionality
 
  Best guess, %s, %d, and %f. I've not done a full audit of the code, but some
  limited looking over the grep hits for % in .py files suggests I'm right,
  and we could even do without %f (we only use that for 'hg --time' output,
  which we could do in unicode).
 
  I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+

 I'm confused on how PEP 460 would help -- Augie mentioned %d, which it
 excludes.

I meant your proposed more lenient version (since there's no need for the
binary only version to be in the common 2/3 subset).

Cheers,
Nick.


 --
 --Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Ethan Furman

On 01/13/2014 01:20 PM, Mark Lawrence wrote:

On 13/01/2014 21:01, Paul Moore wrote:


I think this should be for 3.5, and should not involve an accelerated
release of 3.5 - we should get it into the 3.5 code early and let
people thrash out the details during the 3.5 release cycle.


I disagree, it should be on pypi now so people can start trying it out, or as
others have suggested incorporate it into the six module.  Surely that'd make
the job of getting it into 3.5 far easier?


It's a bit harder to put a core feature on PyPI.  I'm not even sure how it would be done.  Fortunately, once it is in 
3.5 trunk the adventurous can build their own and try it out that way.


--
~Ethan~


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou solip...@pitrou.net wrote:
 On Mon, 13 Jan 2014 13:32:28 -0800
 Guido van Rossum gu...@python.org wrote:

 But formatb() feels absurd to me. PEP 460 has neither a precise
 specification nor any actual examples, so I can't tell whether the
 intention is that the format string can *only* contain {...} sequences
 or whether it can also contain regular characters. Translating to
 formatb(), my question comes down to the legality of the following
 example:

   b'Hello, {}'.formatb(name)  # Where name is some bytes object

 Yes, it's allowed. But so is:

   b'\xff\x00{}\x85{}'.formatb(payload, trailer)

 The ASCII bias is because of the bytes literal notation.

But it is nevertheless there. Including arbitrary hex bytes in the
ASCII range should be a liability, unless you have memorized the hex
codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

The above example (is it from a real protocol?) would be just as clear
or clearer written as

b'\xff\x00' + payload + b'\x85' + trailer

or

b''.join([b'\xff\x00', payload, b'\x85', trailer])

and reasoning about those versions requires no understanding of ASCII.
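
As a quick check that the two spellings are interchangeable, here is a sketch with made-up payload and trailer values. The payload deliberately contains the bytes for '{', '}' and '%' to show that neither spelling inspects its operands for formatting characters.

```python
payload = b"{}\x25"   # hypothetical payload: the bytes for '{', '}' and '%'
trailer = b"\xfe"     # hypothetical trailer

by_concat = b"\xff\x00" + payload + b"\x85" + trailer
by_join = b"".join([b"\xff\x00", payload, b"\x85", trailer])

print(by_concat == by_join, by_concat)
```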

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 On 1/13/2014 12:09 PM, Guido van Rossum wrote:

 Yeah, the %s behavior with a string argument was a messy attempt at
 compromise. I was hoping to mimic a common use of %s in Python 2,
 where it can be used with either an 8-bit string or a number as
 argument, acting like %b in the former case and like %d in the latter
 case. Not having %s at all in Python 3 means that porting requires
 more thinking (== more opportunity for mistakes when you're converting
 in bulk) and there's no easy way to write code that works in Python 2
 and 3.

 If we have %b for strictly interpolating bytes, I'm fine with adding
 %a for calling ascii() on the argument and then interpolating the
 result after ASCII-encoding it.

 If somehow (unlikely though it seems) we end up keeping %s (e.g.
 strictly to ease porting), we could also keep %r as an alias for %a.


 %s for strictly interpolating bytes eases porting. Sad name, but good for
 compatibility. When the blowup happens, due to having a str type passed, the
 porter adds the appropriate .encode(...) to the parameter, so it doesn't
 blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

Lots of code uses %s with numbers too, and probably the occasional
None or list (relying on the Python 2 near-guarantee that most
objects' str() is their repr() and that repr() nearly guarantees to
return only ASCII).

E.g. I'm sure you can find live code doing something like

headers.append('Content-Length: %s\r\n' % len(body))
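
For a bytes-based version of that header line that works without any bytes formatting support, the integer has to be rendered and encoded explicitly. A minimal sketch, with an invented body value:

```python
body = b"hello world"   # hypothetical response body
headers = []

# str() of an int is pure ASCII, so the explicit encode cannot fail.
headers.append(b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n")

print(headers)
```

This is the extra ceremony that a %d or a lenient %s on bytes would remove.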

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Brett Cannon
On Mon, Jan 13, 2014 at 4:51 PM, Guido van Rossum gu...@python.org wrote:

 Terminology. Let's use the official terminology rather than making stuff
 up.

 The docs at http://docs.python.org/3/library/string.html#formatspec
 use the following terminology:

 Replacement field: {...}; contains field name, conversion, format spec
 in that order, all optional.

 Field name: either a decimal integer (referring to an argument by
 position) or an identifier (by name), or omitted (uses the next
 available position).

 Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
 value, and then the format spec applies to the resulting string.

 Format spec: colon, bunch of stuff, type; the type is a letter such as
 d (decimal) or s (string), and the stuff between the colon and the
 type is used to specify field width, alignment, sign, padding and
 such.


 Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
 this leaves for interpolating bytes if we don't want to use {:s}. The
 docs at
 http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
 don't show %b so it could still be used there, but it would be nicer
 to be consistent.


I have been going on the assumption that bytes.format() would change what
'{}' meant for itself and would only interpolate bytes. That's convenient
between Python 2 and 3 since it represents what we want it to (str and
bytes under the hood, respectively), so it just falls through. We could
also add a 'b' conversion for bytes() explicitly so as to help people not
accidentally mix up things in bytes.format() and str.format(). But I was
not suggesting adding a specific format spec for bytes but instead making
bytes.format() just do the .encode('ascii') automatically to help with
compatibility when a format spec was present. If people want fancy
formatting for bytes they can always do it themselves before calling
bytes.format().


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Daniel Holth
On Mon, Jan 13, 2014 at 4:59 PM, Guido van Rossum gu...@python.org wrote:
 On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com 
 wrote:
 On 1/13/2014 12:09 PM, Guido van Rossum wrote:

 Yeah, the %s behavior with a string argument was a messy attempt at
 compromise. I was hoping to mimic a common use of %s in Python 2,
 where it can be used with either an 8-bit string or a number as
 argument, acting like %b in the former case and like %d in the latter
 case. Not having %s at all in Python 3 means that porting requires
 more thinking (== more opportunity for mistakes when you're converting
 in bulk) and there's no easy way to write code that works in Python 2
 and 3.

 If we have %b for strictly interpolating bytes, I'm fine with adding
 %a for calling ascii() on the argument and then interpolating the
 result after ASCII-encoding it.

 If somehow (unlikely though it seems) we end up keeping %s (e.g.
 strictly to ease porting), we could also keep %r as an alias for %a.


 %s for strictly interpolating bytes eases porting. Sad name, but good for
 compatibility. When the blowup happens, due to having a str type passed, the
 porter adds the appropriate .encode(...) to the parameter, so it doesn't
 blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

 Lots of code uses %s with numbers too, and probably the occasional
 None or list (relying on the Python 2 near-guarantee that most
 objects' str() is their repr() and that repr() nearly guarantees to
 return only ASCII).

 E.g. I'm sure you can find live code doing something like

 headers.append('Content-Length: %s\r\n' % len(body))

But if the alternative is spurious quotes then the choice is clear...
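
A small sketch of where the spurious quotes come from, assuming the rule discussed above (call ascii() on a non-bytes argument, then ASCII-encode the result). The variable names are only for illustration.

```python
body = b'hello world'

# The Python 2 style header line, written here as text.
as_text = 'Content-Length: %s\r\n' % len(body)

# ascii() on a number is harmless ...
number_via_ascii = ascii(len(body)).encode('ascii')

# ... but ascii() on a str is its repr(), so the quotes become part of the data.
text_via_ascii = ascii('close').encode('ascii')
```

Here number_via_ascii is b'11', while text_via_ascii is b"'close'" with the quote characters included.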


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Brett Cannon
On Mon, Jan 13, 2014 at 4:36 PM, Ethan Furman et...@stoneleaf.us wrote:

 On 01/13/2014 01:20 PM, Mark Lawrence wrote:

 On 13/01/2014 21:01, Paul Moore wrote:


 I think this should be for 3.5, and should not involve an accelerated
 release of 3.5 - we should get it into the 3.5 code early and let
 people thrash out the details during the 3.5 release cycle.


 I disagree, it should be on pypi now so people can start trying it out,
 or as others have suggested incorporate it into
 the six module.  Surely that'd make the job of getting it into 3.5 far
 easier?


 It's a bit harder to put a core feature on PyPI.  I'm not even sure how it
 would be done.  Fortunately, once it is in 3.5 trunk the adventurous can
 build their own and try it out that way.


You make it a function that under Python 2 and Python 3 before 3.5 does what
needs to be done and on 3.5 and later just directly calls the underlying method. People will
still have to change their code, but the idea is it becomes a refactoring
instead of a change in how the code is structured.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou
On Mon, 13 Jan 2014 13:56:44 -0800
Guido van Rossum gu...@python.org wrote:
 On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou solip...@pitrou.net wrote:
  On Mon, 13 Jan 2014 13:32:28 -0800
  Guido van Rossum gu...@python.org wrote:
 
  But formatb() feels absurd to me. PEP 460 has neither a precise
  specification nor any actual examples, so I can't tell whether the
  intention is that the format string can *only* contain {...} sequences
  or whether it can also contain regular characters. Translating to
  formatb(), my question comes down to the legality of the following
  example:
 
b'Hello, {}'.formatb(name)  # Where name is some bytes object
 
  Yes, it's allowed. But so is:
 
b'\xff\x00{}\x85{}'.formatb(payload, trailer)
 
  The ASCII bias is because of the bytes literal notation.
 
 But it is nevertheless there. Including arbitrary hex bytes in the
 ASCII range should be a liability, unless you have memorized the hex
 codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

That's a good point. I hadn't really thought about that.

 The above example (is it from a real protocol?)

(no, it's cooked up)

 would be just as clear
 or clearer written as
 
 b'\xff\x00' + payload + b'\x85' + trailer
 
 or
 
 b''.join([b'\xff\x00', payload, b'\x85', trailer])
 
 and reasoning about those versions requires no understanding of ASCII.

Fair enough.
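
To make the point about hex escapes concrete, here is a short sketch. It uses the %b code that bytes gained in Python 3.5 (PEP 461) rather than the formatb() spelling discussed above, and the payload and trailer values are made up.

```python
# These escapes are ordinary ASCII characters in disguise.
assert b'\x25' == b'%'
assert b'\x7b' == b'{'

payload = b'\x01\x02'
trailer = b'\x03'

# Concatenation and join need no knowledge of ASCII at all.
concatenated = b'\xff\x00' + payload + b'\x85' + trailer
joined = b''.join([b'\xff\x00', payload, b'\x85', trailer])

# The same bytes via interpolation (Python 3.5+).
formatted = b'\xff\x00%b\x85%b' % (payload, trailer)

# A hex escape that happens to spell '%' is still treated as a format code.
surprise = b'\x25b' % (b'oops',)
```

All three spellings produce the same bytes, but only the interpolating one can be tripped up by b'\x25', as the last line shows.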

Regards

Antoine.


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Guido van Rossum
On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon br...@python.org wrote:
 I have been going on the assumption that bytes.format() would change what
 '{}' meant for itself and would only interpolate bytes. That convenient
 between Python 2 and 3 since it represents what we want it to (str and bytes
 under the hood, respectively), so it just falls through. We could also add a
 'b' conversion for bytes() explicitly so as to help people not accidentally
 mix up things in bytes.format() and str.format(). But I was not suggesting
 adding a specific format spec for bytes but instead making bytes.format()
 just do the .encode('ascii') automatically to help with compatibility when a
 format spec was present. If people want fancy formatting for bytes they can
 always do it themselves before calling bytes.format().

This seems hastily written (e.g. verb missing :-), and I'm not clear
on what you are (or were) actually proposing. When exactly would
bytes.format() need .encode('ascii')?

I would be happy to wait a few hours or days for you to write it up
clearly, rather than responding in a hurry.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Eric V. Smith
On 1/13/2014 4:59 PM, Guido van Rossum wrote:
 On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com 
 wrote:
 If somehow (unlikely though it seems) we end up keeping %s (e.g.
 strictly to ease porting), we could also keep %r as an alias for %a.


 %s for strictly interpolating bytes eases porting. Sad name, but good for
 compatibility. When the blowup happens, due to having a str type passed, the
 porter adds the appropriate .encode(...) to the parameter, so it doesn't
 blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
 
 Lots of code uses %s with numbers too, and probably the occasional
 None or list (relying on the Python 2 near-guarantee that most
 objects' str() is their repr() and that repr() nearly guarantees to
 return only ASCII).
 
 E.g. I'm sure you can find live code doing something like
 
 headers.append('Content-Length: %s\r\n' % len(body))
 

That's why I think we should support %s taking bytes, int, float. And
make %b mean the same thing, if you want. But I think we need to keep %s
(however limited) for compatibility with Python 2.

Personally, I'd be okay with %s not accepting str (by raising an exception).

I think that would give us a large compatibility surface in common
with Python 2.
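
As a rough model of that rule (bytes, int and float accepted, str rejected), here is a sketch in plain Python. The function percent_s is hypothetical and only mimics the proposal; it is not an existing API.

```python
def percent_s(value):
    """Hypothetical model of the proposed %s rule for a bytes format."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)                     # bytes are copied through as-is
    if isinstance(value, (int, float)):
        return str(value).encode('ascii')       # numbers always render as ASCII
    raise TypeError('%%s does not accept %s' % type(value).__name__)


body = b'0123456789'
header = b'Content-Length: ' + percent_s(len(body)) + b'\r\n'
```

Under this model the Content-Length example keeps working unchanged, while passing a str fails immediately instead of guessing an encoding.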

Eric.




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Donald Stufft

On Jan 13, 2014, at 5:25 PM, Eric V. Smith e...@trueblade.com wrote:

 On 1/13/2014 4:59 PM, Guido van Rossum wrote:
 On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com 
 wrote:
 If somehow (unlikely though it seems) we end up keeping %s (e.g.
 strictly to ease porting), we could also keep %r as an alias for %a.
 
 
 %s for strictly interpolating bytes eases porting. Sad name, but good for
 compatibility. When the blowup happens, due to having a str type passed, the
 porter adds the appropriate .encode(...) to the parameter, so it doesn't
 blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
 
 Lots of code uses %s with numbers too, and probably the occasional
 None or list (relying on the Python 2 near-guarantee that most
 objects' str() is their repr() and that repr() nearly guarantees to
 return only ASCII).
 
 E.g. I'm sure you can find live code doing something like
 
 headers.append('Content-Length: %s\r\n' % len(body))
 
 
 That's why I think we should support %s taking bytes, int, float. And
 make %b mean the same thing, if you want. But I think we need to keep %s
 (however limited) for compatibility with Python 2.
 
 Personally, I'd be okay with %s not accepting str (by raising an exception).
 
 I think that would give us a large compatibility surface in common
 with Python 2.

%s not accepting str is the major thing I’d personally be against. %s taking
numeric types and bytes would be fine. The main thing I’d be worried about is
where the RHS may possibly contain something non-ASCII that needs encoding
(such as the str case).


-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Donald Stufft

On Jan 13, 2014, at 5:31 PM, Donald Stufft don...@stufft.io wrote:

 %s not accepting str is the major thing I’d personally be against.

To be more clear

b"%s" % "abc" == No
b"%s" % 123 == Fine
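
For comparison, a quick check against bytes formatting as it later shipped in Python 3.5 (PEP 461), where %s is an alias for %b and numbers go through %d. The helper try_format is only a convenience for this sketch.

```python
def try_format(fmt, arg):
    """Return the formatted bytes, or the exception type name on failure."""
    try:
        return fmt % (arg,)
    except TypeError as exc:
        return type(exc).__name__


text_case = try_format(b'%s', 'abc')       # str is rejected
bytes_case = try_format(b'%s', b'abc')     # bytes pass straight through
number_with_s = try_format(b'%s', 123)     # int is rejected by %s as well
number_with_d = try_format(b'%d', 123)     # numbers use %d instead
```

So the first case above is refused on 3.5+, while the second needs %d rather than %s.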

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan
On 14 Jan 2014 04:58, Guido van Rossum gu...@python.org wrote:

 Let me try rebooting the reboot.

 My interpretation of Nick's argument is that he is asking for a bytes
 formatting language that doesn't have an implicit ASCII assumption.

 To me this feels absurd. The formatting codes (%s, %c) themselves are
 expressed as ASCII characters. If you include anything else in the
 format string besides formatting codes (e.g. b'%s'), you are giving
 it as ASCII characters. I don't know what characters the EBCDIC codes
 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
 it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.

Except we allow string escapes and programmatic creation of format strings,
so while ASCII snippets in formatting code are certainly easier to type,
they are by no means a mandatory feature of using interpolation operations.
I agree

Can you roll your own binary interpolation support with join() and simple
concatenation? Yes, but Antoine's proposal provides a clean and reliable
approach to flexible binary templating that isn't offered by the more
lenient version.

My problem is with telling Python users that if they're working with ASCII
compatible data, they get access to a clean interpolation mini-language for
templating purposes, but if they aren't, they don't.

That's the part I see as potentially breaking the text model: now you have
a convenient API on a core type encouraging you to treat your data as ASCII
compatible with implicit serialisation of semantic data as ASCII text, even
if that may not be appropriate.

If pure binary interpolation is added at the same time (regardless of the
exact spelling, so long as it's as easy to access as the ASCII templating),
that objection goes away.

That said, the fact that the interpolation mini-languages themselves assume
ASCII is the most compelling rationale I have heard so far for treating
interpolation as an operation that inherently assumes ASCII compatibility -
you can't use arbitrary bytes in your formatting strings without escaping
the formatting characters appropriately. While I don't see that as
substantially different to needing to escape them in order to retain them
in the output of text or ASCII formatting, it's at least a teachable
rationale for the absence of a pure binary equivalent.

 If I had some byte strings in an unknown encoding (but the same
 encoding for all) that I needed to concatenate I would never think of
 '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)

 If I see some code using *any* formatting operation (regardless of
 whether it's %d, %r, %s or %c) I am going to assume that there is some
 ASCII-ness, and if there isn't, the code's author has obscured their
 goal to me.

Right, that's a rationale I can explain to people. It also occurred to me
that it's easier to build pure binary interpolation on top of ASCII
interpolation than I previously thought: I can just check all the input
values are compatible with memoryview. At that point, attempting to pass in
anything that would trigger implicit encoding at the formatting stage will
fail.

(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the
bytes(int) trap)
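
A sketch of that memoryview check, with a hypothetical helper name (require_buffers) chosen for illustration:

```python
def require_buffers(*values):
    """Hypothetical guard: accept only objects that export a buffer.

    memoryview() raises TypeError for str, int and other non-buffer
    objects, so nothing that would need implicit encoding gets through.
    """
    return tuple(bytes(memoryview(value)) for value in values)


checked = require_buffers(b'abc', bytearray(b'def'))

# The bytes(int) trap: this is three NUL bytes, not b'3'.
zeros = bytes(3)
```

Calling require_buffers(3) or require_buffers('abc') raises TypeError, which is the point of routing everything through memoryview first.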

 I hear the objections against b'%s' % 'x' returning b'x' loud and
 clear, and if the noise about that sub-issue is preventing folks from
 seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
 use %b which would require its argument to be bytes. Those bytes
 should still probably be ASCII-ish, but there's no way to test that.
 That's fine with me and should be fine to Nick as well -- PEP 460
 doesn't check that your encodings match (how could it? :-), nor does
 plain string concatenation using +.

Plus there genuinely are formats where different parts have different
encodings and you rely on metadata or format definitions to know what they
are.

I would actually suggest something like Brett's approach for %s , but with
memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate
it directly, otherwise invoke normal string formatting and then do strict
ASCII encoding at the end.

That way people don't have to learn new formatting mini-languages and only
have two new behaviours to learn: buffer exporters are interpolated
directly, anything else is formatted normally and then implicitly encoded
as strict ASCII.
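
The two behaviours can be modelled in a few lines. This is only a sketch of the suggestion; interpolate_s is a hypothetical name, and ordinary str % formatting stands in for "normal string formatting".

```python
def interpolate_s(obj, spec='%s'):
    """Hypothetical model of the suggested %s rule for bytes formatting."""
    try:
        view = memoryview(obj)       # succeeds only for buffer exporters
    except TypeError:
        # Not a buffer: format as text, then encode as strict ASCII.
        return (spec % (obj,)).encode('ascii')
    return bytes(view)               # buffer exporters are interpolated directly


raw = interpolate_s(b'\xff\xfe')
number = interpolate_s(42)
rounded = interpolate_s(3.14159, '%.2f')
```

With this model interpolate_s('caf\xe9') raises UnicodeEncodeError, because the strict ASCII step refuses the non-ASCII character.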


 In my head I make the following classification of situations where you
 work with bytes and/or text.

 (A) Pure binary formats (e.g. most IP-level packet formats, media
 files, .pyc files, tar/zip files, compressed data, etc.). These are
 handled using the struct module (e.g. tar/zip) and/or custom C
 extensions (e.g. gzip).

 (B) Encoded text. Here you should just decode everything into str
 objects and parse your text at that level. If you really want to
 manipulate the data as bytes (e.g. because you have a lot of data to
 process and very light processing) you may be able to do it, but
 unless it's 

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Glenn Linderman

On 1/13/2014 1:59 PM, Guido van Rossum wrote:

On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote:

On 1/13/2014 12:09 PM, Guido van Rossum wrote:

Yeah, the %s behavior with a string argument was a messy attempt at
compromise. I was hoping to mimic a common use of %s in Python 2,
where it can be used with either an 8-bit string or a number as
argument, acting like %b in the former case and like %d in the latter
case. Not having %s at all in Python 3 means that porting requires
more thinking (== more opportunity for mistakes when you're converting
in bulk) and there's no easy way to write code that works in Python 2
and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding
%a for calling ascii() on the argument and then interpolating the
result after ASCII-encoding it.

If somehow (unlikely though it seems) we end up keeping %s (e.g.
strictly to ease porting), we could also keep %r as an alias for %a.


%s for strictly interpolating bytes eases porting. Sad name, but good for
compatibility. When the blowup happens, due to having a str type passed, the
porter adds the appropriate .encode(...) to the parameter, so it doesn't
blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

Lots of code uses %s with numbers too, and probably the occasional
None or list (relying on the Python 2 near-guarantee that most
objects' str() is their repr() and that repr() nearly guarantees to
return only ASCII).

E.g. I'm sure you can find live code doing something like

headers.append('Content-Length: %s\r\n' % len(body))


That's portably fixable by switching to %d... or by adding .encode('ascii')
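
Both fixes, sketched side by side. The first form runs on Python 2 and any Python 3; the second relies on bytes % support, which on Python 3 means 3.5 or later.

```python
body = b'payload'

# Fix 1: format as text, then encode explicitly.
header_encoded = ('Content-Length: %s\r\n' % len(body)).encode('ascii')

# Fix 2: keep a bytes format string and switch the code to %d.
header_percent_d = b'Content-Length: %d\r\n' % len(body)
```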


[Python-Dev] PEP 460 -- adding explicit assumptions

2014-01-13 Thread Jim J. Jewett


As best I can tell, some people (apparently including Guido
and PEP author Antoine) are taking some assumptions almost
for granted, while other people (including me, before Nick's
messages) were not assuming them at all.

Since these assumptions (or, possibly, rejections of them?)
are likely to decide the outcome, the assumptions should be
explicit in the PEP.

(1)  The bytes-related classes do include methods that
 are only useful when the already-contained data
 is encoded ASCII.

 They do not (and will not) include any operations
 that *require* an encoding assumption.  This
 implies that no non-bytes data can be added without
 an explicit encoding.

(1a) Not even by assuming ASCII with strict error handling.

(1b) Not even for numbers, where ASCII/strict really is
 sufficient.

Note that this doesn't rule out a solution where objects
(or maybe just numbers and ASCII-kind text) provide their own
encoding to bytes -- but that has to be done by the objects
themselves, not by the bytes container or by the interpreter.

(2)  Most python programmers are still in the future.

 So an API that confuses people who are still learning
 about Unicode and the text model is bad -- even if it
 would work fine for those who do already understand it.
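
The note under (1b), about objects providing their own encoding to bytes, matches the existing __bytes__ hook. A minimal sketch, with a made-up StatusCode class:

```python
class StatusCode:
    """Hypothetical object that knows its own wire encoding."""

    def __init__(self, code):
        self.code = code

    def __bytes__(self):
        # The object, not the bytes container, decides how it is encoded.
        return str(self.code).encode('ascii')


wire = b'HTTP/1.1 ' + bytes(StatusCode(200)) + b' OK'
```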

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Greg Ewing

Nick Coghlan wrote:


so the latter would be less of 
an attractive nuisance when writing code that needs to handle arbitrary 
binary formats and can't assume ASCII compatibility.


Hang on a moment. What do you mean by code that
handles arbitrary binary formats?

As far as I can see, the proposed features are for
code that handles *particular* binary formats. Ones
with well-defined fields that are specified to contain
ASCII-encoded text. It's the programmer's responsibility
to make sure that the fields he's treating as ASCII
really do contain ASCII, just as it's his responsibility
to make sure he reads and writes a text file using
the correct encoding.

Now, it's possible that if you were working from an
incomplete spec and some examples, you might be
led to believe that a particular field was ASCII
when in fact it was some ASCII superset such as
latin1 or utf8. In that case, if you parsed it
assuming ASCII, you would get into trouble of
some sort with bytes greater than 127.

However, the proposed formatting operations are
concerned only with *generating* binary data, not
parsing it. Under Guido's proposed semantics, all
of the ASCII formatting operations are guaranteed
to produce valid ASCII, regardless of what types
or values are thrown at them. So as long as the
field's true encoding is something ASCII-compatible,
you will always generate valid data.

Because I *want to use* the PEP 460 binary interpolation API, but 
wouldn't be able to use Guido's more lenient proposal, as it is a bug 
magnet in the presence of arbitrary binary data.


Where exactly is this arbitrary binary data that you
keep talking about? The only place that arbitrary
bytes come into the picture is through b"%s" % b"...",
and that's defined to just pass the bytes straight
through. I don't see how that could attract any
bugs that weren't already present in the data being
interpolated.

The LHS may or may not be tainted with assumptions about ASCII 
compatibility, which means it effectively *is* tainted with such 
assumptions, which means code that needs to handle arbitrary binary data 
can't use it and is left without a binary interpolation feature.


If I understand correctly, what concerns you here
is that you can't tell by looking at b"%s" % x
whether it encodes anything as ASCII without knowing
the type of x.

I'm not sure how serious a problem that would be.
Most of the time I think it will be fairly obvious
from the purpose of the code what the type of x
is *intended* to be. If it's not actually that type,
then clearly there's a bug somewhere.

Of all such possible bugs, the one most likely to
arise due to a confusion in the programmer's mind
between text and bytes would be for x to be a string
when it was meant to be bytes or vice versa.

Due to the still-very-strong separation between text
and bytes in Py3, this is unlikely to happen without
something else blowing up first.

Even if it does happen, it won't result in a data-dependent
failure. If b"%s" % 'hello' were defined to
interpolate 'hello'.encode('ascii'), then there *would*
be cause for concern. But this is not what Guido
proposes -- instead he proposes interpolating
ascii('hello') == "'hello'". This is almost certainly
*never* what the file spec calls for, so you'll find
out about it very soon one way or another.

Effectively this means that b"%s" % x where x is a
string is useless, so I'd much prefer it to just
raise an exception in that case to make the failure
immediately obvious. But either way, you're not
going to end up with a latent failure waiting for
some non-ASCII data to come along before you notice
it.
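
The difference between the two candidate rules can be seen without any bytes formatting at all. A small sketch, with illustrative function names:

```python
def via_ascii(value):
    """The rule attributed to Guido above: interpolate ascii(value)."""
    return ascii(value).encode('ascii')


def via_encode(value):
    """The alternative: interpolate value.encode('ascii')."""
    return value.encode('ascii')


plain_ascii = via_ascii('hello')         # quotes included, for every input
accented_ascii = via_ascii('h\xe9llo')   # still valid ASCII, still visibly wrong
plain_encode = via_encode('hello')       # looks right while the data is ASCII
```

via_encode('h\xe9llo') raises UnicodeEncodeError, so that rule only fails once non-ASCII data shows up, whereas via_ascii() is wrong in the same visible way for every string.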

To summarise, I think the idea of binary format strings
being too tainted for a program that does not want
to use ASCII formatting to rely on is mostly FUD.

--
Greg


[Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]

2014-01-13 Thread Jim J. Jewett


 So when it is time to guess [at the character encoding of a file],
 a source of good guesses is an important battery to include.

 The barrier for entry to the standard library is higher than mere
 usefulness.

Agreed.  But "most programs will need it, and people will either
include (the same) 3rd-party library themselves, or write their
own workaround, or have buggy code" *is* sufficient.

The points of contention are

(1)  How many programs have to deal with documents written
 outside their control -- and probably originating on
 another system.

I'm not ready to say most programs in general, but I think that
barrier is met for both web clients (for which we already supply
several batteries) and quick-and-dirty utilities.

(2)  How serious are the bugs / How annoying are the workarounds?

As someone who mostly sticks to English, and who tends to manually
ignore stray bytes when dealing with a semi-binary file format,
the bugs aren't that serious for me personally.  So I may well
choose to write buggy programs, and the bug may well never get
triggered on my own machine.

But having a batch process crash one run in ten (where it didn't
crash at all under Python 2) is a bad thing.  There are environments
where (once I knew about it) I would add chardet (if I could get
approval for the 3rd-party component).

(3)  How clearcut is the *right* answer?

As I said, at one point (several years ago), the w3c and whatwg
started to standardize the right answer.  They backed that out,
because vendors wanted the option to improve their detection in
the future without violating standards.

There are certainly situations where local knowledge can do
better than a global solution like chardet,  but ... the
right answer is clear most of the time.

Just ignoring the problem is still a 99% answer, because most text
in ASCII-mostly environments really is close enough.  But that
is harder (and the One Obvious Way is less reliable) under Python 3
than it was under Python 2.

An alias for open that defaulted to surrogate-escape (or returned
the new ASCIIstr bytes hybrid) would probably be sufficient to get
back (almost) to Python 2 levels of ease and reliability.  But it
would tend to encourage ASCII/English-only assumptions.
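
For readers unfamiliar with the error handler mentioned here, a minimal sketch of what surrogate-escape does, shown with decode() and encode(); open() accepts the same errors= argument. The sample bytes are made up.

```python
raw = b'caf\xe9 ok'     # latin-1 data, not valid UTF-8

# Undecodable bytes become lone surrogates instead of raising an error.
text = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode('utf-8', errors='surrogateescape')
```

The text can be processed as str in between, but the escaped characters are not real text, which is part of why this approach encourages ASCII-mostly assumptions.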

You could fix most of the remaining problems by scripting a web
browser, except that scripting the browser in a cross-platform
manner is slow and problematic, even with webbrowser.py.

Whatever a recent Firefox does is (almost by definition) good
enough, and is available ... but maybe not in a convenient form,
which is one reason that chardet was created as a port thereof.
Also note that firefox assumes you will update more often than
Python does.

Whatever chardet said at the time the Python release was cut
is almost certainly good enough too.

The browser makers go to great lengths to match each other even 
in bizarre corner cases.  (Which is one reason there aren't more
competing solutions.)  But that doesn't mean it is *impossible*
to construct a test case where they disagree -- or even one where
a recent improvement in the algorithms led to regressions for one
particular document.

That said, such regressions should be limited to documents that
were not properly labeled in the first place, and should be rare
even there.  Think of the changes as obscure bugfixes, akin to
a program starting to handle NaN properly, in a place where it
should not ever see one.

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]

2014-01-13 Thread Chris Angelico
On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett jimjjew...@gmail.com wrote:
 The barrier for entry to the standard library is higher than mere
 usefulness.

 Agreed.  But "most programs will need it, and people will either
 include (the same) 3rd-party library themselves, or write their
 own workaround, or have buggy code" *is* sufficient.

Well, no, that's not sufficient on its own either. But yes, it's a
stronger argument.

 But having a batch process crash one run in ten (where it didn't
 crash at all under Python 2) is a bad thing.  There are environments
 where (once I knew about it) I would add chardet (if I could get
 approval for the 3rd-party component).

Having it *do the wrong thing* one run in ten is even worse.

If you need chardet, then get approval for the third-party component.
That's a political issue, not a technical one. "This needs to be in
the stdlib because I'm not allowed to install anything else"? I hope
not. Also, a PyPI package is free to update independently of the
Python version schedule. The stdlib is bound.

ChrisA


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-13 Thread Greg Ewing

Stephen J. Turnbull wrote:

PBP doesn't think it's a great idea to pass around bytes that are
implicitly some other type, but didn't mind it (or got used to it) in
Python 2, and so they're not looking at that as a problem that Python
3 can solve.  They're looking at Python 3 as the problem that prevents
them from doing what worked fine in Python 2.


While some people may think that way, I don't think
it's fair to characterise *all* proponents of bytes
formatting as luddites that refuse to get with the
Python 3 way.

Some of us *do* understand the principles of text/
bytes separation in Python 3 and agree that they're
a good idea. We just don't agree that the proposed
formatting operations violate those principles to
any degree worth worrying about.

I don't think of my viewpoint as being PBP. That term
assumes there is purity there to be beaten. To my mind,
any notion of purity with respect to bytes objects
went out the window as soon as it was given a pile
of text methods -- together with a text-like literal
syntax and default repr(), even though at least half
the time they're completely inappropriate!

--
Greg


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Greg Ewing

Nick Coghlan wrote:
Arbitrary binary data and ASCII compatible binary data are *different 
things* and the only argument in favour of modelling them with a single 
type is because Python 2 did it that way.


I would say that ASCII compatible binary data is a
*subset* of arbitrary binary data. As such, a type
designed for arbitrary binary data is a perfectly good
way of representing ASCII compatible binary data.

What are you saying -- that there should be one type
for ASCII compatible binary data, and another type
for all binary data *except* when it's ASCII compatible?

That makes no sense to me.

The Python 3 text model was built on the notion of no implicit encoding 
and decoding


This is nonsense. There are plenty of implicit
encoding and decoding operations in Python 3.

When you open a text file, it gets an encoding. After
that, anything you write to it is implicitly encoded
using that encoding. There's even a default encoding
when you open the file, so you don't even have to be
explicit about that.

It's more correct to say that it was built on the
notion of using separate types for encoded and
decoded data, so that it's *possible* to keep track
of the difference. It doesn't mean that there can't
be conversions between the two types that are
implicit to one degree or another.
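
The text-file case can be reproduced in memory. This sketch uses io.TextIOWrapper, the type that open() returns in text mode, over a BytesIO so that no real file is needed.

```python
import io

raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding='utf-8')

# The caller writes str; the encoding chosen at open time is applied implicitly.
stream.write('caf\xe9')
stream.flush()

encoded = raw.getvalue()
```

The write call never mentions an encoding, yet the underlying buffer ends up holding UTF-8 bytes.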

--
Greg


[Python-Dev] Test failures when running as root

2014-01-13 Thread Chris Angelico
And now for something completely different.

My root buildbot is finally now able to telnet out and get "Connection
refused" errors. (For the curious, the VirtualBox "NAT" mode doesn't
work properly, but the new "NAT Network" mode does. Why? I have no
idea. But if anyone else is having the same problem, upgrade to the
latest VirtualBox and set up a NAT Network. All I care is, it now
works.) The test suite is now failing at another point, and this
applies to 2.7, 3.3, and 3.x.

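======================================================================
ERROR: test_initgroups (test.test_posix.PosixGroupsTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/buildarea/3.x.angelico-debian-amd64/build/Lib/test/test_posix.py", line 1143, in test_initgroups
    g = max(self.saved_groups) + 1
ValueError: max() arg is an empty sequence

----------------------------------------------------------------------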

The saved_groups value comes from posix.getgroups(), and it's being
used to try to get a group that this user doesn't have (I think). When
I run Python as root, posix.getgroups() returns [0], but apparently
it's not returning any groups when the test runs.
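
A small sketch of why the line in the traceback fails when the group
list is empty, and one possible way to guard it. The helper name
next_unused_gid is hypothetical; this is not the fix applied to
test_posix, and max(..., default=...) assumes Python 3.4 or later.

```python
import os


def next_unused_gid(groups):
    """Return a group id one above the largest in `groups` (hypothetical helper)."""
    # max() raises ValueError on an empty sequence; default=-1 makes an
    # empty list yield 0 instead.
    return max(groups, default=-1) + 1


# The unguarded form used in the traceback fails on an empty list.
try:
    max([]) + 1
except ValueError:
    unguarded_failed = True
else:
    unguarded_failed = False

# os.getgroups() is only available on Unix-like platforms.
if hasattr(os, "getgroups"):
    print(os.getgroups(), next_unused_gid(os.getgroups()))
```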

So, two questions. Firstly, is this a problem that needs to be fixed
in Python, or is it a configuration change that I made? It began
failing recently, so possibly when I rebooted the VM as part of
VirtualBox changes I mucked something up.

And secondly, how can I run the tests manually? I can't find a binary
inside the buildarea tree. Does it get deleted afterward?

Apologies if these are dumb questions; hopefully they're a small
distraction from PEP 460 arguments!

ChrisA


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Terry Reedy

On 1/13/2014 3:13 PM, Guido van Rossum wrote:

On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon br...@python.org wrote:

On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote:

I personally would not add 'bytes % whatever'.


Personally, neither would I; just focus on bytes.format() and let the %
operator on strings slowly go away.


Well, % has some very strong arguments in its favor still -- for


If I shift from a 'personal' to a 'BDFL' viewpoint, I have to agree.


example, the sheer amount of code that currently uses it, the fact
that it's as close as we get to a cross-language standard, and the


This much I know.


fact that nobody wants to tackle its use in the logging module (since
logger objects are often shared between packages that don't know about
each other).


This I did not know.


Anyway, the % or .format() issue seems completely orthogonal to the
issues that get people riled up (which are mostly about whether using
either implies some kind of ASCII compatibility).


A possibly important difference between '%s' and '{:s}' is that the 's'
is required in the former and optional in the latter. So in
byteformat(), b'{:s}' continues to format a string (as encoded bytes)
while b'{:}' 'formats' bytes without having to invent a new code that
does not exist in 2.7. That particular solution to the question "does
's' mean bytes or string?" does not work for % formatting. (And that
lack, in turn, is part of what lay behind the inclination expressed
above.)
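
To illustrate the required-versus-optional type code, here is a short
sketch using ordinary str formatting only. It does not model the
proposed byteformat(), whose behaviour was still under discussion; the
sample value is arbitrary.

```python
value = "spam"

# With % formatting, a conversion type such as 's' must follow the '%'.
with_type_percent = "%s" % value

# With str.format(), the 's' type is optional: all three forms agree.
with_type_format = "{:s}".format(value)
empty_spec = "{:}".format(value)
no_spec = "{}".format(value)

# A bare '%' with no type code is rejected.
try:
    "%" % value
except ValueError:
    percent_needs_type = True
else:
    percent_needs_type = False

print(with_type_percent, with_type_format, empty_spec, no_spec)
```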


For % formatting, I would be inclined to start with 'what does Mercurial
need?' or even 'does anything even really work for hg?'. Hg is part of
our development ecosystem, and we have an hg rep who expressed a desire
to experiment.


Terry



Re: [Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]

2014-01-13 Thread Terry Reedy

On 1/13/2014 7:06 PM, Chris Angelico wrote:

On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett jimjjew...@gmail.com wrote:

Agreed.  But "most programs will need it, and people will either
include (the same) 3rd-party library themselves, or write their
own workaround, or have buggy code" *is* sufficient.


Well, no, that's not sufficient on its own either. But yes, it's a
stronger argument.


But having a batch process crash one run in ten (where it didn't
crash at all under Python 2) is a bad thing.  There are environments
where (once I knew about it) I would add chardet (if I could get
approval for the 3rd-party component).


Having it *do the wrong thing* one run in ten is even worse.

If you need chardet, then get approval for the third-party component.
That's a political issue, not a technical one. "This needs to be in
the stdlib because I'm not allowed to install anything else"? I hope
not. Also, a PyPI package is free to update independently of the
Python version schedule. The stdlib is bound.
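
The difference between "crash" and "do the wrong thing" can be sketched
with two mismatched decodes. The sample text and the choice of latin-1
and UTF-8 are assumptions for illustration; no detection library is
involved here.

```python
text = "caf\u00e9"

# Bytes written as latin-1 but read as UTF-8: decoding fails loudly.
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    crashed = True
else:
    crashed = False

# Bytes written as UTF-8 but read as latin-1: decoding never fails,
# it just silently produces the wrong characters.
utf8_bytes = text.encode("utf-8")  # b'caf\xc3\xa9'
wrong = utf8_bytes.decode("latin-1")

print(crashed)
print(wrong == text)  # False
```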


This discussion strikes me as more appropriate for python-ideas. That 
said, I am leery of a heuristics module in the stdlib. When is a change
a 'bug fix', and when is it an 'enhancement'?


--
Terry Jan Reedy



Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread MRAB

On 2014-01-13 21:51, Guido van Rossum wrote:

Terminology. Let's use the official terminology rather than making stuff up.

The docs at http://docs.python.org/3/library/string.html#formatspec
use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec
in that order, all optional.

Field name: either a decimal integer (referring to an argument by
position) or an identifier (by name), or omitted (uses the next
available position).

Conversion: !r, !s, !a; these apply repr(), str(), ascii() to the
value, and then the format spec applies to the resulting string.


If all you wanted to do was interpolate bytes then you could define a
new conversion !b. This would, however, mean that the format spec would
be applied to bytes.


Format spec: colon, bunch of stuff, type; the type is a letter such as
d (decimal) or s (string), and the stuff between the colon and the
type is used to specify field width, alignment, sign, padding and
such.
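
A brief sketch of the pieces named above (field name, conversion, format
spec) using str.format(). The values and widths are arbitrary examples.

```python
# Field name 0 (by position), conversion !r, format spec '>8'.
# repr() runs first, then the resulting string is right-aligned to width 8.
by_position = "{0!r:>8}".format("ab")

# Field name by identifier, conversion !s, left-aligned to width 6.
by_name = "{name!s:<6}|".format(name=42)

# Field names omitted: arguments are taken in order.
# '+d' forces a sign on a decimal; '05.1f' zero-pads a float to width 5.
auto = "{:+d} {:05.1f}".format(7, 3.14159)

# Conversion !a applies ascii(), escaping non-ASCII characters.
ascii_conv = "{!a}".format("\u00e9")

print(by_position)
print(by_name)
print(auto)
print(ascii_conv)
```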


Also, {:b} means binary (i.e. numbers in base 2). I'm not sure what
this leaves for interpolating bytes if we don't want to use {:s}. The
docs at 
http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
don't show %b so it could still be used there, but it would be nicer
to be consistent.
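
A quick sketch of the clash described above, for str formatting only:
the 'b' type in a format spec already means base 2, while printf-style
formatting of a str has no %b code.

```python
# In str.format(), the 'b' type renders an integer in base 2.
binary = "{:b}".format(5)

# printf-style formatting of a str does not define %b.
try:
    "%b" % 5
except ValueError:
    percent_b_rejected = True
else:
    percent_b_rejected = False

print(binary)              # 101
print(percent_b_rejected)  # True
```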




