Re: [Python-Dev] PEP 460 reboot
On 13/01/14 03:47, Guido van Rossum wrote:
> On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:
>> On 01/12/2014 06:16 PM, Ethan Furman wrote:
>>> If you do:
>>>
>>> --> b'%s' % 'some text'
>>
>> Ignore what I previously said. With no encoding the result would be:
>>
>> b"'some text'"
>>
>> So an encoding should definitely be specified.
>
> Yes, but the encoding is no business of %s or %. As far as the formatting operation cares, if the argument is bytes they will be copied literally, and if the argument is a str (or anything else) it will call ascii() on it.

It seems to me that what people want from '%s' is: Convert to a str then encode as ascii for non-bytes or copy directly for bytes.

So why not replace '%s' with '%a' for the ascii case and with '%b' for directly inserting bytes. That way, the encoding is explicit.

I think it is vital that the encoding is explicit in all cases where bytes <-> str conversion occurs.

Cheers,
Mark.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
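For readers following the '%a' versus '%b' suggestion, here is a small sketch of how those two codes behave in the bytes %-formatting that Python 3.5 and later provide. The variable names are illustrative only, and this shows released behaviour, not necessarily what any participant was proposing at this point in the thread.

```python
# Requires Python 3.5 or later, where bytes gained %-formatting.
payload = b"\xde\xad"

# %b copies a bytes-like argument into the result unchanged.
copied = b"data=%b" % payload

# %a applies ascii() to the argument and inserts that text as ASCII bytes,
# so a str argument arrives with its quotes and escape sequences.
quoted = b"name=%a" % "some text"
escaped = b"name=%a" % "caf\xe9"

# %b refuses a str outright; the caller has to choose an encoding first.
try:
    b"name=%b" % "some text"
except TypeError:
    str_rejected = True
else:
    str_rejected = False
```

With this split the str-to-bytes step is always visible in the source: either an explicit .encode() call feeding '%b', or a deliberate '%a'.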
Re: [Python-Dev] PEP 460 reboot
On 13.01.2014 07:51, Nick Coghlan wrote: [Using a new asciistr type] The key thing that the text model change in Python 3 enabled is for us to use the type system to *help* with managing the complexity of dealing with text encodings. We've got a long way with just the two pure types, and no additional types that straddle the binary/text boundary the way the Python 2 str type did. Unlike introducing *new* ASCII-only operations to the bytes type, adding new types specifically for dealing with ASCII compatible formats (especially starting life as a third party library) isn't compromising the Python 3 text model, it's embracing it and making it work for us (which is why I've been suggesting that it be considered since at least 2010). The problem with str in Python 2 was that one type was used to represent too many things with serious semantic differences. The ongoing attempts to reintroduce that ambiguity to the core bytes type rather than exploring the creation of new types and then filing bugs for any interoperability issues those attempts uncover in the core types represents one of the worst cases of paradigm lock that I have ever seen :P In theory this sounds nice, but in practice you often run into the issue that whenever you pass such a str-subtype to some function that works on str doesn't return the str-subtype as result, but instead a new str object. As a result, you have to keep track of which operations work on your str-subtype alone and which convert it back to a str, making the approach infeasible for all but the most basic uses. This is why we try to make the basic types as useful as possible for everyone. 
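The subtype problem described above can be reproduced in a few lines. This is a minimal sketch; `AsciiStr` is a hypothetical class invented for the illustration and is not the asciistr type under discussion.

```python
class AsciiStr(str):
    """Hypothetical str subclass meant to mark ASCII-only text."""


s = AsciiStr("header")

# Inherited str operations build plain str results, not AsciiStr ones.
upper = s.upper()
joined = s + ": value"
sliced = s[:3]

result_types = [type(upper), type(joined), type(sliced)]
```

Every result is a plain str, so the marker is lost unless each method is overridden to re-wrap its return value.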
It's also the main reason why subtyping 8-bit strings and Unicode in Python 2 wasn't a popular sport :-)

Leaving aside the discussion about str and bytes, I think PEP 460 has much potential of making life easier for people dealing with binary data: the formatting codes for the bytes format methods could be extended to include the struct module features - with the struct module then turning into a proxy for these new format methods (much like we did with the string module when string methods were introduced).

BTW: There's a little known trick in Python 2 which also lets you disable the string to Unicode coercion: all you have to do is set the default encoding to "undefined" (see site.py:setencoding()). Python 2 will then raise a UnicodeError whenever coercion would trigger. I added that codec to experiment with this scenario in the early days of the Unicode integration.

--
Marc-Andre Lemburg
eGenix.com
(#1, Jan 13 2014)
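As a reminder of what the struct module features look like today, here is a short sketch using the standard struct module. The field layout (a 16-bit version followed by a 32-bit count) is made up for the example.

```python
import struct

# ">HI" means big-endian, one unsigned 16-bit value, one unsigned 32-bit value.
record = struct.pack(">HI", 0x0102, 7)

# Unpacking with the same format string recovers the original integers.
version, count = struct.unpack(">HI", record)

# calcsize reports how many bytes the format occupies.
size = struct.calcsize(">HI")
```

The idea floated above would let format codes like these appear directly in a bytes format string; nothing in this sketch depends on that proposal.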
Re: [Python-Dev] PEP 460 reboot
On 13 Jan 2014 17:43, Ethan Furman et...@stoneleaf.us wrote: On 01/12/2014 10:51 PM, Nick Coghlan wrote: I am a strong -1 on the more lenient proposal, as it makes binary interpolation in Python 3 an *unsafe operation* for ASCII incompatible binary formats. No more unsafe that calling .upper() on ASCII incompatible streams. Right - Guido's proposal is *completely useless* for arbitrary binary data. You can't trust it. However, Python 3 has no equivalent binary interpolation feature that *is* safe for arbitrary binary data, so the lenient version *will* be a bug magnet if it is the only version of binary interpolation provided. However, if new formatb and formatb_map methods were included in the proposal with the current strict PEP 460 semantics, then my objections would be reduced substantially. In that case, we'd still be providing the new binary interpolation feature *in addition* to restoring the ASCII compatible interpolation feature, so the latter would be less of an attractive nuisance when writing code that needs to handle arbitrary binary formats and can't assume ASCII compatibility. With that approach, I'd even support the idea of implicit strict ASCII encoding of text inputs for the ASCII compatible version. The existing binary operations that assume ASCII do so *inherently* - they're not input driven, the operation itself assumes ASCII, so if you're working with data that may not be ASCII compatible, you simply don't use them (these are operations like title(), upper(), lower(), the default arguments for split() and strip(), etc). How is this different from not using % interpolation when the byte stream is incompatible? It isn't. Because I *want to use* the PEP 460 binary interpolation API, but wouldn't be able to use Guido's more lenient proposal, as it is a bug magnet in the presence of arbitrary binary data. 
Provide both APIs and my objections go away - ASCII interpolation just becomes another way to translate between structured and text data, while binary interpolation would be a strictly binary only operation.

> And what do you mean by "input driven"? If the LHS is bytes, the result is bytes, no matter what the input is. This is not the Py2 world where you may end up with str or unicode; you always end up with bytes if the LHS is bytes.

The LHS may or may not be tainted with assumptions about ASCII compatibility, which means it effectively *is* tainted with such assumptions, which means code that needs to handle arbitrary binary data can't use it and is left without a binary interpolation feature.

That's why *adding* formatb to Guido's more lenient proposal resolves my objections: it provides the binary interpolation feature I want, and maintains Python 3's clear distinction between the text domain and the binary domain.

Cheers,
Nick.

> [snip the rest that seems to flow from these misunderstandings]
>
> --
> ~Ethan~
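The "bug magnet" concern can be made concrete with a sketch like the following, which assumes Python 3.5 or later (where b"%d" exists) and an invented UTF-16-LE field. It is an illustration of the failure mode, not code from the thread.

```python
# A field that is encoded as UTF-16-LE, which is not ASCII compatible.
field = "count=".encode("utf-16-le")

# ASCII-assuming interpolation yields the two bytes b"42" ...
number = b"%d" % 42

# ... which, appended to the UTF-16-LE data, decode as one unrelated character.
mixed = field + number
decoded = mixed.decode("utf-16-le")

# Encoding the digits explicitly keeps the stream consistent.
correct = (field + "42".encode("utf-16-le")).decode("utf-16-le")
```

No exception is raised in the mixed case; the corruption only shows up when someone reads the decoded text.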
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 12:46 AM, Mark Shannon wrote:
> On 13/01/14 03:47, Guido van Rossum wrote:
>> On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:
>>> On 01/12/2014 06:16 PM, Ethan Furman wrote:
>>>> If you do:
>>>>
>>>> --> b'%s' % 'some text'
>>>
>>> Ignore what I previously said. With no encoding the result would be:
>>>
>>> b"'some text'"
>>>
>>> So an encoding should definitely be specified.
>>
>> Yes, but the encoding is no business of %s or %. As far as the formatting operation cares, if the argument is bytes they will be copied literally, and if the argument is a str (or anything else) it will call ascii() on it.
>
> It seems to me that what people want from '%s' is: Convert to a str then encode as ascii for non-bytes or copy directly for bytes.

Maybe. But it only takes a small tweak to the parameter to get what they want... a tweak that works in both Python 2.7 and Python 3.whatever-version-gets-this.

Instead of

    b"%s" % foo

they must use

    b"%s" % foo.encode( explicitEncoding )

which is what they should have been doing in Python 2.7 all along, and if they were, they need make no change.

Oh, foo was a Python 2.7 str? Converted to Python 3.x str, by default conversion rules? Already in ASCII? No harm.

Oh, foo was a literal? Add b prefix, instead of the .encode("ASCII"), if you prefer.

> So why not replace '%s' with '%a' for the ascii case and with '%b' for directly inserting bytes.

Because %a and %b don't exist in Python 2.7?

> That way, the encoding is explicit.

The encoding is already explicit. If it is bytes encoded from str, that transformation had an explicit encoding. If it is "%s" % str(...), then there is no encoding, but rather a transformation into an ASCII representation of the Unicode code points, using escape sequences. Which isn't likely to be what they want, but see the parameter tweak above.

> I think it is vital that the encoding is explicit in all cases where bytes <-> str conversion occurs.

Since it is explicit, you have no concerns in this area.
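The "parameter tweak" can be sketched as below, assuming a Python version where bytes %-formatting exists (3.5 or later). The name `foo` comes from the message; the other names are illustrative.

```python
foo = "some text"

# Encoding at the call site means the format operation only ever sees bytes.
explicit = b"%s" % foo.encode("ascii")

# For a literal, a b prefix does the same job without an encode call.
literal = b"%s" % b"some text"

# Text outside the chosen encoding fails loudly at the encode step.
try:
    b"%s" % "caf\xe9".encode("ascii")
except UnicodeEncodeError:
    non_ascii_rejected = True
else:
    non_ascii_rejected = False
```

The same two formatting lines are valid Python 2.7 when `foo` is a unicode object, which is the portability point being made.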
Regarding the concern about implicit use of ASCII by certain bytes methods and proposed interpolations, I'm curious how many standard encodings exist that do not have an ASCII subset. I can enumerate a starting list, but if there are others in actual use, I'm unaware of them.

EBCDIC
UTF-16 (BE and LE)
UTF-32 (BE and LE)

Wikipedia: The vast majority of code pages in current use are supersets of ASCII http://en.wikipedia.org/wiki/ASCII, a 7-bit code representing 128 control codes and printable characters.
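One rough way to explore the question is to ask Python's own codecs. The helper below is a hypothetical sketch, not an authoritative survey: it only checks whether each of the 128 ASCII code points encodes to the identical single byte, and it uses cp500 as a representative EBCDIC code page.

```python
def is_ascii_superset(codec: str) -> bool:
    """Return True if every ASCII code point encodes to the same single byte."""
    for code_point in range(128):
        try:
            encoded = chr(code_point).encode(codec)
        except UnicodeEncodeError:
            return False
        if encoded != bytes([code_point]):
            return False
    return True


candidates = ["utf-8", "latin-1", "cp1251", "cp500", "utf-16-le", "utf-32-be"]
report = {name: is_ascii_superset(name) for name in candidates}
```

Stateful or multi-byte encodings can pass a byte-for-byte check like this and still be unsafe to treat as ASCII, so a True result is a necessary condition rather than a guarantee.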
Re: [Python-Dev] PEP 460 reboot
On 13/01/14 09:19, Glenn Linderman wrote:
> On 1/13/2014 12:46 AM, Mark Shannon wrote:
>> On 13/01/14 03:47, Guido van Rossum wrote:
>>> On Sun, Jan 12, 2014 at 6:24 PM, Ethan Furman et...@stoneleaf.us wrote:
>>>> On 01/12/2014 06:16 PM, Ethan Furman wrote:
>>>>> If you do:
>>>>>
>>>>> --> b'%s' % 'some text'
>>>>
>>>> Ignore what I previously said. With no encoding the result would be:
>>>>
>>>> b"'some text'"
>>>>
>>>> So an encoding should definitely be specified.
>>>
>>> Yes, but the encoding is no business of %s or %. As far as the formatting operation cares, if the argument is bytes they will be copied literally, and if the argument is a str (or anything else) it will call ascii() on it.
>>
>> It seems to me that what people want from '%s' is: Convert to a str then encode as ascii for non-bytes or copy directly for bytes.
>
> Maybe. But it only takes a small tweak to the parameter to get what they want... a tweak that works in both Python 2.7 and Python 3.whatever-version-gets-this.
>
> Instead of
>
>     b"%s" % foo
>
> they must use
>
>     b"%s" % foo.encode( explicitEncoding )
>
> which is what they should have been doing in Python 2.7 all along, and if they were, they need make no change.
>
> Oh, foo was a Python 2.7 str? Converted to Python 3.x str, by default conversion rules? Already in ASCII? No harm.
>
> Oh, foo was a literal? Add b prefix, instead of the .encode("ASCII"), if you prefer.
>
>> So why not replace '%s' with '%a' for the ascii case and with '%b' for directly inserting bytes.
>
> Because %a and %b don't exist in Python 2.7?

I thought this was about 3.5, not 2.7 ;)

'%s' can't work in 3.5, as we must differentiate between strings which need to be encoded and bytes which don't.

>> That way, the encoding is explicit.
>
> The encoding is already explicit. If it is bytes encoded from str, that transformation had an explicit encoding. If it is "%s" % str(...), then there is no encoding, but rather a transformation into an ASCII representation of the Unicode code points, using escape sequences. Which isn't likely to be what they want, but see the parameter tweak above.
>> I think it is vital that the encoding is explicit in all cases where bytes <-> str conversion occurs.
>
> Since it is explicit, you have no concerns in this area.
>
> Regarding the concern about implicit use of ASCII by certain bytes methods and proposed interpolations, I'm curious how many standard encodings exist that do not have an ASCII subset. I can enumerate a starting list, but if there are others in actual use, I'm unaware of them.
>
> EBCDIC
> UTF-16 (BE and LE)
> UTF-32 (BE and LE)
>
> Wikipedia: The vast majority of code pages in current use are supersets of ASCII http://en.wikipedia.org/wiki/ASCII, a 7-bit code representing 128 control codes and printable characters.
Re: [Python-Dev] PEP 460 reboot and a bitter fight
On 01/12/2014 11:15 PM, Guido van Rossum wrote:
> (It's too late here to write more, but it looks like we are in for a bitter fight. :-( )

It's already been a bitter fight.

The opponents of %-interpolation (Nick, Antoine, Turnbull, D'Aprano, et al*) all seem to be arguing basically what Nick said.

The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing that bytes already has an ASCII bias, already has ASCII string methods, that it isn't the same as the Py2 world because if you combine a bytes object with a str object outside of interpolation (such as b'hello' + 'world') it doesn't work, that only bytes would ever be returned, etc, etc.

With the possible exception of the question I just asked Nick, I don't think we're going to get any new information.

I suppose you're used to not being able to please everybody. :/

--
~Ethan~

* et al means everyone whose name I couldn't remember, or figure out which camp you were in in the wee hours of the night.
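The point about mixing bytes and str outside of interpolation is easy to check in any Python 3 interpreter. A minimal sketch, with illustrative variable names:

```python
# Concatenating bytes with str is refused in either order.
try:
    b"hello" + "world"
except TypeError:
    bytes_plus_str_fails = True
else:
    bytes_plus_str_fails = False

try:
    "hello" + b"world"
except TypeError:
    str_plus_bytes_fails = True
else:
    str_plus_bytes_fails = False

# Combining bytes with bytes always yields bytes.
combined = b"hello" + b" " + b"world"
```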
Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
Ethan Furman writes: The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes. I haven't expressed myself well, but I don't much care about that. It's what Knuth would classify as a seminumerical method. What I do care about is that the methods that convert other types to text (including format) not work for bytes. That's where I consider text to start. is *exactly* the Python 2 model of text. But you deny that the effect of your proposals (eg, b%d % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you? Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as severe, but I acknowledge that there could be some. I fear it will be quite severe where I live, in Shift JIS/GB18030 land. (The two most obnoxious encodings known to man, except perhaps the syntax of Brainf!ck.) *My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it. But how is Python supposed to know that? Python doesn't need to. ... because you know it. But the ideal of object-oriented programming (and duck-typing) is that you shouldn't need to; the object should know how to produce appropriate behavior itself. But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context. Exactly so. I even have to do that in Py2. Even. This is exactly where PBP and EIBTI part company, I think. EIBTI thinks its a bad idea to pass around bytes that are implicitly some other type, and Python 3 *should be good enough to make that unnecessary*. I'm convinced, and Nick is convinced, that we can make that true for 90% of the cases that it isn't now, if we could just figure out what's hard about the use cases where Python 3 isn't up to snuff yet (and figure out which use cases we need to handle to get us up to 90%!) 
PBP doesn't think it's a great idea to pass around bytes that are implicitly some other type, but didn't mind it (or got used to it) in Python 2, and so they're not looking at that as a problem that Python 3 can solve. They're looking at Python 3 as the problem that prevents them from doing what worked fine in Python 2. I understand that point of view, I just think we should be able to do better in Python 3, and should give it a serious try before giving in. Remember, Special cases aren't special enough to break the rules comes *before* Although practicality beats purity. Not to forget that Explicit is better than implicit is second[1] on the list. ;-) After looking at this thread, I feel that (due to misunderstandings on both sides) purity hasn't really been tried yet. If that particular configuration of bytes is because it's ASCII-encoded text, then sure. Once again, you are advocate precisely the Python 2 model of text. Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2. I think this is a key difference. You're in good company there; that was Guido's rationale for not worrying, too. I agree it's key (and I'm sure Nick will, on reflection if not already). But I worry (a lot) that it's not enough. This confuses me somewhat. It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text, Not really OK. In theory, because it doesn't require serialization/ encoding of a primitive type, it doesn't matter. In practice, without powerful formatting, it isn't even a major attraction. In practice, with powerful formatting, it adds to the attraction. Note that regex doesn't require type conversions (matches have methods to return positions in the target or subsequences of the target, not values of other types), which is why I (and I suspect Nick for the same reason) am comfortable with polymorphic regex but not with bytes formatting. 
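The distinction drawn here between ASCII-assuming methods and polymorphic regex can be seen in a short sketch. The sample byte strings are invented for the illustration.

```python
import re

# bytes.upper() only touches the ASCII letters a-z; other bytes pass through.
ascii_only = b"ethan".upper()
mixed = b"\xe9than".upper()

# A bytes pattern matched against bytes returns positions and bytes slices;
# no value is converted to or from another type.
match = re.search(rb"\d+", b"len=42;")
span = match.span()
matched = match.group()
```

Formatting differs from both because it has to turn an int or float into some byte representation, and that is the step where an encoding gets assumed.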
(Aside, I'm perfectly comfortable with ASCII-encoded text because if you took u'ethan'.encode('ascii') you would get b'ethan'. If it was some other encoding, such as cp1251, I would call that particular byte stream cp1251-encoded text. Even though ethan is perfectly good ASCII-encoded text (as well as the integer 435,744,694,638 on a bigendian machine with 5-byte words, and you have no way of knowing whether it was user data (CP1251) or a metadata keyword (ASCII) or be the US national debt in 1967 dollars (integer) when b'ethan' shows up in a trace? And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text.) I was afraid of that: all of those methods (except the case methods[2]) will work fine on a cp1251-encoded text. And because they only know that the string is bytes, the case methods will silently corrupt your text as
Re: [Python-Dev] PEP 460 reboot
On 13 Jan 2014 17:14, Donald Stufft don...@stufft.io wrote: On Jan 13, 2014, at 1:59 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 16:52, Donald Stufft don...@stufft.io wrote: On Jan 13, 2014, at 12:45 AM, Glenn Linderman v+pyt...@g.nevcal.com wrote: So then the question is whether to proceed with 3.4, delay this feature to 3.5, or to delay 3.4 to include this feature, both have been discussed, with the justification for the latter being to make 3.4 the ultimate Python 3 porting target for recalcitrant module authors, sooner than later. I really hope this can make it in 3.4, needing to wait another 2 years or so until this is available would be a shame. Indeed, it would be a shame to have to wait. Fortunately, people don't even need to wait until the release of Python 3.4, they can instead try to help out with the asciicompat project, which aims to provide this functionality in Python 3.3+: https://github.com/jeamland/asciicompat All it takes is to let go of the idea I wish the Python 3 bytes type was more like the Python 2 str type and instead think hmm, the Python 3 bytes type doesn't seem like a great fit for my use case, maybe I need a different type”. It’s almost a fine fit for the usecase afaict the major thing it’s missing is an easy way to handle this last use case. I don’t see how this proposal is any different than cases such as int(b”1”). ASCII is already special, giving an area that Python3 has made things worse a better way forward isn’t comprising the text model, it’s recognizing the realities of the world. The difference between this and int() is that there's no structural ambiguity introduced in the case of int(): the output is always an integer, regardless of the input type. Arbitrary binary data and ASCII compatible binary data are *different things* and the only argument in favour of modelling them with a single type is because Python 2 did it that way. 
The Python 3 text model was built on the notion of no implicit encoding and decoding, and Guido's more lenient proposal brings that back by stealth: the semantics proposed for the integer codes are that they be essentially equivalent to performing the operation in the text domain and then encoding with ASCII.

However, I'm OK with the idea if there are separate formatb/formatb_map APIs that allow the encoding support to be bypassed entirely - that way, using mod-formatting, format or format_map *is* explicit, since the only reason to use them over formatb/formatb_map would be for the implicit ASCII encoding support, eliminating the ambiguity.

Regards,
Nick.
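The equivalence described above (format in the text domain, then encode as ASCII) can be checked against the numeric codes that bytes %-formatting supports in Python 3.5 and later. This sketch only compares results; it says nothing about how CPython implements the operation.

```python
value = 42
ratio = 3.14159

# Text-domain formatting followed by an explicit ASCII encode ...
int_via_text = ("%d" % value).encode("ascii")
float_via_text = ("%5.2f" % ratio).encode("ascii")

# ... gives the same bytes as the numeric codes applied to a bytes template.
int_via_bytes = b"%d" % value
float_via_bytes = b"%5.2f" % ratio

# int() accepts ASCII digits from either type and always returns an int.
parsed = (int(b"1"), int("1"))
```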
Re: [Python-Dev] PEP 460 reboot
On Sun, 12 Jan 2014 18:11:47 -0800 Guido van Rossum gu...@python.org wrote:
> On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote:
>> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>>> %s seems the trickiest: I think with a bytes argument it should just insert those bytes (and the padding modifiers should work too), and for other types it should probably work like %a, so that it works as expected for numeric values, and with a string argument it will return the ascii()-variant of its repr(). Examples:
>>>
>>> b'%s' % 42 == b'42'
>>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' enclosed in single quotes)
>>
>> I'm not sure about the quotes. Would anyone ever actually want those in the byte stream?
>
> Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-)

What is the use case for embedding a quoted ASCII-encoded representation in a byte stream?

Regards

Antoine.
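The proposed '%s' rule can be sketched as a plain function. `lenient_percent_s` is a hypothetical helper written for this illustration; it mirrors the rule as described in the quoted message and is not an API that exists in Python.

```python
def lenient_percent_s(arg) -> bytes:
    """Hypothetical model of the proposed %s rule for a bytes template."""
    # Bytes-like arguments are copied through untouched.
    if isinstance(arg, (bytes, bytearray)):
        return bytes(arg)
    # Everything else goes through ascii(), then becomes ASCII bytes.
    return ascii(arg).encode("ascii")


from_int = lenient_percent_s(42)
from_str = lenient_percent_s("x")
from_bytes = lenient_percent_s(b"\xff")

# The existing text-side counterpart mentioned in the message.
text_side = "%s" % b"x"
```

The quotes in `from_str` are the "hint" being debated: they make an unencoded str visible in the output instead of silently accepting it.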
Re: [Python-Dev] Trying to focus the whole bytes/str formatting discussion
On 13 January 2014 08:46, Brett Cannon br...@python.org wrote:
> I don't know about the rest of you but I feel like the discussion is heading off the rails (if it hasn't already jumped the tracks). Let's try to bring this back around to something actionable which people can focus their energy on as the amount of developer time spent arguing could have led to several coded-up solutions.
>
> I see it as a practicality-beats-purity vs. explicit-is-better-than-implicit.
>
> The PBP group want bytes.format() (just assume I include interpolation support if you want that) to work as close to a drop-in replacement for current str.format() use in Python 2 to ease porting. The argument is that code looks cleaner and the amount of changes in Python 2 code being ported to Python 3 is much smaller.
>
> The EIBTI group are willing to support PEP 460 but beyond that don't want to have in Python itself anything for bytes.format() which takes in a string and spits out bytes. It's bytes in-bytes out and not bytes str in-bytes out as the PBP group is after. The EIBTI group are arguing that letting str into bytes.format() and then automatically be converted to strict ASCII leads to conflating the text/bytes divide as well as being too magical, e.g. what if you actually wanted UTF-16 for your number string instead of ASCII; the EIBTI group **wants** to force people to make a decision. They are also less concerned with making users update Python 2 code to handle this as it already needs to be updated for other Python 3 things anyway.
>
> From where I'm sitting, the EIBTI group and their PEP 460 proposal from Antoine (and no longer Victor) are not controversial. Everyone seems to agree that PEP 460 **at minimum** is acceptable and should happen for Python 3.5. The people with the uphill battle and something to prove are those arguing for str in-bytes out support in bytes.format(). The added features that the PBP group want are the ones being argued over.
> As the onus is on the PBP group to convince the EIBTI group (or Guido), I think the PBP group should code up a solution that does what they want and put it on PyPI to see what the community thinks. If the PBP group wants to convince the EIBTI group that str in-bytes out for bytes.format() is critical in getting a key group of users to start using Python 3 then I think that needs to be demonstrated through real-world usage by some people.

Note that I am now fine with Guido's more lenient proposal *so long as* explicitly bytes-only formatb and formatb_map methods are also included. That would give us the following situation in 3.5:

Text interpolation: str.__mod__, str.format, str.format_map
ASCII compatible interpolation: bytes.__mod__, bytes.format, bytes.format_map
Arbitrary binary interpolation: bytes.formatb, bytes.formatb_map

Those are all reasonable operations for the language to support natively, and by providing convenient access to all three, we avoid the attractive nuisance that would be created by providing *only* ASCII interpolation without providing strict binary interpolation (since people would inevitably use the former when they should really be using the latter, because interpolation is such a convenient construct), while still addressing the interests of both groups (people like me and Antoine that like PEP 460 as it stands, as well as those that favour the ASCII encoding features).

It's only the introduction of ASCII compatible interpolation support *without* binary interpolation support that I am adamantly opposed to - that's the kind of attractive nuisance that leads to people inappropriately using ASCII compatible only APIs and then discovering that their code breaks when confronted with ASCII incompatible encodings like UTF-16, ShiftJIS and ISO-2022.
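To make "arbitrary binary interpolation" more tangible, here is a rough sketch of what a strictly bytes-in, bytes-out helper could look like. The function `formatb` below is a hypothetical stand-in written for this illustration: it supports only bare `{}` placeholders, and it should not be read as the semantics PEP 460 specified or as a method that bytes actually has.

```python
def formatb(template: bytes, *args) -> bytes:
    """Hypothetical bytes-only interpolation: fill b'{}' slots with bytes."""
    parts = template.split(b"{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    out = bytearray(parts[0])
    for arg, tail in zip(args, parts[1:]):
        # Anything that is not already binary data is rejected, never encoded.
        if not isinstance(arg, (bytes, bytearray, memoryview)):
            raise TypeError("formatb arguments must be bytes-like")
        out += arg
        out += tail
    return bytes(out)


framed = formatb(b"{}|{}", b"\xff\xfe", b"\x00")

try:
    formatb(b"{}", "text")
except TypeError:
    str_rejected = True
else:
    str_rejected = False
```

Because nothing is ever encoded on the caller's behalf, a helper of this shape can be used with data in any encoding, which is the property being asked for.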
Originally I was opposed to the idea entirely, but then Antoine wrote the binary only version of PEP 460 and I found it to be a *very* elegant solution that didn't compromise the Python 3 text model. As long as this pure API remains available in some form (such as formatb and formatb_map methods), then I'm OK with the ASCII only version existing in parallel - at that point, it *is* analogous to all the other existing bytes methods that assume the use of ASCII compatible data. ** The caveat ** However, note that there were *two* significant issues that were raised in the recent broader discussions. PEP 460 only tackles the more tractable of the two: the fact that Twisted and Mercurial both consider bytes.__mod__ support a blocker for switching to Python 3. That's a useful discussion to have, but it's important for people to realise that the mod-formatting feature is utterly irrelevant to the concerns Armin Ronacher raised in http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ that kicked off this whole recent spate of interest in the topic. Obviously, I disagree with his conclusions (and personally wish Python 2 Unicode experts would show a little more humility in trying to
Re: [Python-Dev] PEP 460 reboot
On 13 January 2014 17:15, Guido van Rossum gu...@python.org wrote: On Sun, Jan 12, 2014 at 10:59 PM, Nick Coghlan ncogh...@gmail.com wrote: All it takes is to let go of the idea I wish the Python 3 bytes type was more like the Python 2 str type and instead think hmm, the Python 3 bytes type doesn't seem like a great fit for my use case, maybe I need a different type. Maybe you're letting your excitement about asciistr get the better of you? IMO we don't need more types. If you can refrain from using int(b), b.lower() and b += 'abc' when b isn't ASCII-encoded, why couldn't you also refrain from b += b'%s' % 42? It's the fact I'd feel obliged to refrain from using *any* of the proposed interpolation methods when dealing with arbitrary binary data if they include the assumption of ASCII compatibility. The reason Antoine's updates to PEP 460 earned an immediate +1 from me (even though I was initially dubious about the PEP in general) is that it aligns *exactly* with how I usually use the bytes type in Python 3 - as a pure container of arbitrary binary data, without making assumptions about whether it is ASCII compatible or not. While I still occasionally have reservations about it, I think on balance it's a good thing that the bytes type has a much support for ASCII compatible data , but my specific concern with your more lenient proposal is that it takes something that I liked and would use (the current PEP 460 API) and turned it into something I would have to avoid because it doesn't correctly support arbitrary binary data. I'll suppress the urge to quote verbatim from my first message in this thread (about the motivation for bytes) but I'll just recommend you re-read it. (It's too late here to write more, but it looks like we are in for a bitter fight. :-( ) I realised my problem was specifically with providing the ASCII compatible version *without* providing a pure binary equivalent that *doesn't* involve making the assumption of ASCII compatibility. 
This means that adding formatb and formatb_map methods with the current semantics of format and format_map from PEP 460 would cover the use cases I care about, and I can then happily ignore the debates about what the semantics of the ASCII compatible version will be.

The semantics of binary interpolation could potentially even be simplified further, since the ASCII assuming versions would be responsible for handling the 2/3 source compatibility problem. {}.formatb(other) would also provide an alternative to calling the bytes constructor that doesn't suffer from the unexpected-int-input-is-handled-as-a-length failure mode.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
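The "unexpected int input is handled as a length" failure mode of the bytes constructor looks like this in practice. A small sketch with illustrative names:

```python
# An int passed to bytes() is a length, so this is five NUL bytes, not b"5".
zeros = bytes(5)

# A single byte with value 5 has to be spelled as a one-element iterable.
single = bytes([5])

# The ASCII digits come from formatting or from encoding the text form.
digits = str(5).encode("ascii")
```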
Re: [Python-Dev] PEP 460 reboot and a bitter fight
On 13 January 2014 17:59, Ethan Furman et...@stoneleaf.us wrote: On 01/12/2014 11:15 PM, Guido van Rossum wrote: (It's too late here to write more, but it looks like we are in for a bitter fight. :-( ) It's already been a bitter fight. The opponents of %-interpolation (Nick, Antoine, Turnbull, D'Aprano, et al*) all seem to be arguing basically what Nick said. The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing that bytes already has an ASCII bias, already has ASCII string methods, that it isn't the same as the Py2 world because if you combine a bytes object with a str object outside of interpolation (such as b'hello' + 'world') it doesn't work, that only bytes would ever be returned, etc, etc. With the possible exception of the question I just asked Nick, I don't think we're going to get any new information. I figured out tonight that it's only positioning ASCII interpolation as an *alternative* to adding binary interpolation that I have a problem with. It isn't, because you lose the structural assurance that you haven't inadvertently introduced an assumption of ASCII compatibility when you didn't need to. However, interpolation support is a convenient enough interface that I can see a version that *only* supports ASCII compatible interpolation being an attractive nuisance that becomes a source of hard to detect and fix data corruption bugs (just like the str type in Python 2). If we add both, my objections go away: people like me can use the Python 3 only formatb and formatb_map methods and be confident we haven't inadvertently introduced any assumptions regarding ASCII compatibility, while folks that know they're dealing with an ASCII compatible format can use the ASCII assuming versions that are designed to be source compatible with Python 2. 
If someone incorrectly uses format() or format_map() when they should be using the pure binary versions, that's a trivial bug fix (adding the necessary b, and perhaps some explicit encoding calls) rather than a major restructuring of the code. If they use mod-formatting, that's a slightly bigger fix, but still just switching to a different spelling of the formatting operation. Both use cases (binary only and ASCII compatible) get covered cleanly, and nobody has to lose out. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python advanced debug support (update frame code)
On 13 January 2014 09:08, Fabio Zadrozny fabi...@gmail.com wrote: Hi Python-dev. I'm playing a bit with the concept of live-coding during a debug session, and one of the most annoying things is that although I can reload the code for a function (using something close to xreload), it seems it's not possible to change the code for the current frame (i.e.: I need to get out of the function call and then back in to a call to the method from that frame to see the changes). I took a look at the frame object and it seems it would be possible to set frame.f_code to another code object -- and set the line number to the start of the new object, which would cover the most common situation, which would be restarting the current frame -- provided the arguments remain the same (which is close to what the Java debugger in Eclipse does when it drops the current frame -- in Python, provided I'm not in a try..except block, I can do even better by setting frame.f_lineno, but without being able to change the frame's f_code it loses a lot of its usefulness). So, I'd like to ask for feedback from people with more knowledge on whether it'd actually be feasible to change frame.f_code, and the possible implications of doing that. Huh, I would have sworn there was already an issue on the tracker about that, but it appears not (Eric Snow has one about adding a reference to the running function, but nothing about trying to switch an executing frame: http://bugs.python.org/issue12857). Anyway, your main problem isn't the reference to the code object from the frame: it's the fact that the main eval loop has a reference to that code object from a C level stack variable, and stores a bunch of other state directly on the C stack. I don't see anything *intrinsically* impossible about the idea, it just wouldn't be easy, since you'd have to come up with a way of dealing with that C level state that didn't slow down normal operation. Cheers, Nick.
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
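A small probe of the limitation discussed in this exchange, assuming CPython (sys._getframe is an implementation detail and may be missing elsewhere). The function names try_swap_code and other are made up for the illustration; the sketch only checks whether rebinding f_code on a running frame is accepted, it does not attempt the restart itself.

```python
import sys


def try_swap_code(replacement):
    """Try to rebind f_code on the currently running frame; report the outcome."""
    frame = sys._getframe()
    try:
        frame.f_code = replacement.__code__
    except (AttributeError, TypeError) as exc:
        # The interpreter refuses the assignment: the attribute is read-only.
        return False, type(exc).__name__
    return True, None


def other():
    return "new body"


swapped, error_name = try_swap_code(other)
print(swapped, error_name)
```

Even if the attribute were writable, the point made above still applies: the eval loop keeps its own C-level reference to the code object, so the Python-level attribute is only part of the state involved.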
Re: [Python-Dev] PEP 460 reboot and a bitter fight
On 13/01/2014 07:59, Ethan Furman wrote: On 01/12/2014 11:15 PM, Guido van Rossum wrote: The proponents (myself, you, Stufft, Eric Smith, et al*) are arguing that bytes already has an ASCII bias, already has ASCII string methods, that it isn't the same as the Py2 world because if you combine a bytes object with a str object outside of interpolation (such as b'hello' + 'world') it doesn't work, that only bytes would ever be returned, etc, etc. -- ~Ethan~ ASCII bias seems to me an understatement. From http://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operations Due to the common use of ASCII text as the basis for binary protocols, bytes and bytearray objects provide almost all methods found on text strings. Can you get any clearer than that, or have I been completely swamped by the massive tsunami that these PEP 460 threads are? Note that I'm *NOT* taking sides here, I'd just like to see a peaceful settlement without any bloodshed :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
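Both halves of the argument quoted here can be checked directly. The sketch below uses a made-up header value; it shows str-like methods operating on bytes, and the refusal to mix bytes and str outside of interpolation.

```python
header = b"content-type: text/plain"

# str-like methods on bytes work on ASCII letters and return bytes.
name, _, value = header.partition(b": ")
upper_name = name.upper()

# Combining bytes with str outside of interpolation is refused outright.
try:
    b"hello" + "world"
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(upper_name, value, mixed_ok)
```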
Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
Glenn Linderman writes: On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote: Glenn Linderman writes: the proposals to embed binary in Unicode by abusing Latin-1 encoding. Those aren't proposals, they are currently feasible techniques in Python 3 for *some* use cases. The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr. smuggled binary (great term borrowed from a different subthread) muddies the waters of what you are dealing with. Not really. The mud is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). asciistr is one way to try that. When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed, Hardly. After all, that's how all text processing was done for decades. Still is, in some programs, especially C programs. And there are no extra, confusing Latin-1 encode/decode operations required. The extra encode/decode operations are mostly (perhaps all) due to examples that started from bytes and end with bytes. Of course if you assume that API and propose to do the operations using Unicode, you'll get extra decode/encode operations. From a higher-level perspective, I think it would be great to have a module, perhaps called boundary (let's call it that for now), that allow some definition syntax (augmented BNF? augmented ABNF?) to explain the format of a binary blob. We have struct, for one. I'm not sure why you want more than that. I suppose you could go all the way to ASN.1. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
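For readers unfamiliar with the struct module mentioned at the end of this message, a minimal sketch follows. The record layout (a 2-byte type, a 4-byte length and a 4-byte ASCII tag, all big-endian) is invented for the illustration and does not come from the thread.

```python
import struct

# Hypothetical layout: unsigned short, unsigned int, 4-byte tag; ">" means
# big-endian with standard sizes and no padding.
RECORD = struct.Struct(">HI4s")

packed = RECORD.pack(1, 512, b"DATA")
kind, length, tag = RECORD.unpack(packed)

print(RECORD.size, packed)
print(kind, length, tag)
```

struct describes fixed-width fields only; variable-length or nested formats are where a richer description language of the kind Glenn suggests would come in.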
Re: [Python-Dev] PEP 460 reboot
On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote: Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) Which unfortunately causes no end of headaches, often difficult to debug. https://wiki.python.org/moin/PortingToPy3k/BilingualQuickRef (see 'doctests' for one such impact). -Barry
Re: [Python-Dev] PEP 460 reboot
On Jan 12, 2014, at 09:45 PM, Glenn Linderman wrote: Quotes in the stream are a great debug hint, without blowing up. They're actually terrible for debugging for exactly the same reason as coercion in Python 2. It's rarely what you really want, it silently succeeds, and it means that the user visible error is far removed from the actual bug, both in code distance and time. So yes, it tells you Something Went Wrong, but is actually a hindrance to finding and fixing the problem. -Barry
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:49 AM, Mark Shannon wrote: '%s' can't work in 3.5, as we must differentiate between strings which need to be encoded and bytes which don't. I don't understand this objection:

    def __mod__(self, other):
        if isinstance(other, bytes):
            pass  # no encoding necessary
        elif isinstance(other, str):
            other = ascii(other)  # payback time!

Where is the problem? -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 07:59:10 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. But why is it better to give a hint by producing undesired output (which may actually go unnoticed for some time and produce issues down the road), rather than simply by raising TypeError? By that token we may simply insert an error string (CAUTION: YOU MISS AN ENCODING HERE), rather than the ascii() representation of the argument. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] PEP460 thoughts from a Mercurial dev
(sorry for not piling on any existing threads - I don't subscribe to python-dev due to lack of time) Brett Cannon asked me to chime in - I haven't actually read the very long thread at this point, I'm just providing responses to things Brett mentioned: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). We also need some way to emit raw bytes (in potentially mixed encodings, yes I know this is doing it wrong) to stdout/stderr (example: someone changes a file from latin1 to utf8, and then wants to see the resulting diff). 2) Would having it as an external library that worked with Python 2 help? Probably, IF it came with 2.4 support (RHEL support, basically), and we could bundle it in our source tree. It's been extremely valuable to have the install only depend on a working C compiler and Python. 3) If this does go in, how long would it take us to port Mercurial to py3? Would it being in 3.5 hold us up? I'm honestly not sure. I'm still in the outermost layers of this yak shave: fixing cyclic imports. I'll know more when I can at least get 'hg version' to print its own version, because at that point the testsuite failures might be informative. I'd honestly _rather_ this went into 3.5 *and* got lots of validation by both us and twisted (the other folks that care?) before becoming set in stone by a release. Does that make sense? 4) Do we care if it's .format()/%, or could it be in the stdlib? It'd be really nice to not have to boil the oceans as far as editing everyplace in the codebase that does % today. 
If we do have to do that, it's not going to be much more helpful than something like:

    def maybestr(a):
        if isinstance(a, bytes):
            return a.decode('latin1')
        return a

    def sprintf(fmt, *args):
        return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')

or similar. That was (roughly) what I was figuring I'd do today without any formal bytes-string-formatting support. He also mentioned that some are calling for a shortened 3.5 release cycle - I'd rather not see that happen, for the aforementioned reason of wanting time to make sure this is Right - it'd be a shame to do the work and rush it out only to find something missing in an important way. Feel free to ask further questions - I'll try to respond promptly. AF (For those curious: my hg-on-py3 repo isn't published at the moment because I rebuilt the server it lived on and I forgot to publish it. I'll rectify that sometime this week, I hope, but it's really totally nonfunctional due to cyclic imports.)
Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
On 01/13/2014 02:48 AM, Stephen J. Turnbull wrote: Ethan Furman writes: The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes. I haven't expressed myself well, but I don't much care about that. You don't care that there are str-like methods on bytes? Whether you do or not, they are there, and they impact how people think about bytes and what is (and what should be) allowed. It's what Knuth would classify as a seminumerical method. I do not see how that's relevant. What matters is not how we can manipulate the data (everything is reduced to numbers at some point), but what the data represents. [snip] *My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it. But how is Python supposed to know that? Python doesn't need to. ... because you know it. But the ideal of object-oriented programming (and duck-typing) is that you shouldn't need to; the object should know how to produce appropriate behavior itself. The ideal, sure. But if you're stuck with using a list to hold data for your higher-order recursive function are you going to expect the list data type to know which pops and inserts are allowed and which are not? Of course not. And you'd probably build a proper class on top of the list so those things could be checked. Now imagine that the list type didn't offer insert and pop, and you had to use slice replacement -- what a pain that would be! [snip] But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context. Exactly so. I even have to do that in Py2. Even. This is exactly where PBP and EIBTI part company, I think. EIBTI thinks its a bad idea to pass around bytes that are implicitly some other type bytes are /always/ implicitly some other type. They are basically raw data. They are given meaning by how we interpret them. 
[snip] Even though "ethan" is perfectly good ASCII-encoded text (as well as the integer 435,744,694,638 on a big-endian machine with 5-byte words), and you have no way of knowing whether it was user data (CP1251) or a metadata keyword (ASCII) or the US national debt in 1967 dollars (integer) when b'ethan' shows up in a trace? Context is everything. If b'ethan' shows up in a trace I would have to examine the surrounding code to see how those bytes were being used. And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text. I was afraid of that: all of those methods (except the case methods) will work fine on a cp1251-encoded text. Really? Huh. They wouldn't work fine with the Spanish alphabet. I should've used that for my example. :/ And because they only know that the string is bytes, the case methods will silently corrupt your text as soon as they get a chance. Inevitably there are methods that will work even if given the wrong data type, while others will either corrupt or blow up if not given exactly what they expect. You tell me that some ASCII methods will work okay on cp1251 text, and others will not. So I'm not going to use any of them on cp1251 as that is not what they are intended for. That bothers me, even if it doesn't bother you. Purity again, if you like. (But you'd take a safe .upper if you got it for free, no?) Well, there is no such thing as free. ;) And there already is a safe .upper -- str.upper. And if I don't know that my bytes are ASCII, but I did know they were text, I wouldn't use ASCII methods, I'd convert to str and work there. -- ~Ethan~
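A short sketch of the two points argued over in this message: the same five bytes read as an integer or as ASCII text, and what the bytes case methods do to cp1251-encoded text. The Cyrillic sample word is chosen only for the illustration. In Python 3, bytes.upper() is documented to change ASCII lowercase letters only, so the practical hazard shown here is a result that silently differs from real text upper-casing.

```python
raw = b"ethan"

# The same five bytes read as a big-endian integer or as ASCII text.
as_int = int.from_bytes(raw, "big")
as_text = raw.decode("ascii")

# cp1251-encoded Cyrillic: bytes.upper() leaves these bytes untouched,
# while str.upper() followed by encoding produces different bytes.
word = "привет"
encoded = word.encode("cp1251")
bytes_upper = encoded.upper()
text_upper = word.upper().encode("cp1251")

print(as_int, as_text)
print(bytes_upper == encoded, text_upper == encoded)
```

Nothing raises in either path, which is why the interpretation of the bytes has to come from context, as both correspondents agree.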
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 08:09 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 07:59:10 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. But why is it better to give a hint by producing undesired output (which may actually go unnoticed for some time and produce issues down the road), rather than simply by raising TypeError? You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. By that token we may simply insert an error string (CAUTION: YOU MISS AN ENCODING HERE), rather than the ascii() representation of the argument. Well, the ascii repr is at least some clue as to where. A generic message would be no clue at all. -- ~Ethan~
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: (sorry for not piling on any existing threads - I don't subscribe to python-dev due to lack of time) Brett Cannon asked me to chime in - I haven't actually read the very long thread at this point, I'm just providing responses to things Brett mentioned: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ We also need some way to emit raw bytes (in potentially mixed encodings, yes I know this is doing it wrong) to stdout/stderr (example: someone changes a file from latin1 to utf8, and then wants to see the resulting diff). Writing to sys.stdout.buffer may work for that, or else being able to change the encoding of an existing stream. For the latter, Victor had a working patch to _pyio at http://bugs.python.org/issue15216 and general consensus that the semantics were sensible, but it needs to be worked up into a full patch that covers the C version as well (I tried to muster some helpers for that in the leadup to 3.4 feature freeze, but unfortunately without any luck) 2) Would having it as an external library that worked with Python 2 help? Probably, IF it came with 2.4 support (RHEL support, basically), and we could bundle it in our source tree. It's been extremely valuable to have the install only depend on a working C compiler and Python. asciicompat.asciistr is just an alias for str on Python 2.x, so if we get that working, it may be something you could vendor into Mercurial for Python 3.3+ support. 
(There will likely be gaps in what asciistr can do due to interoperability issues in the core types, but the PEP 393 changes to the internal representation mean it should be able to get us pretty close) 3) If this does go in, how long would it take us to port Mercurial to py3? Would it being in 3.5 hold us up? I'm honestly not sure. I'm still in the outermost layers of this yak shave: fixing cyclic imports. I'll know more when I can at least get 'hg version' to print its own version, because at that point the testsuite failures might be informative. I'd honestly _rather_ this went into 3.5 *and* got lots of validation by both us and twisted (the other folks that care?) before becoming set in stone by a release. Does that make sense? Yes, that actually makes a lot of sense to me - there's no point in us rushing to get this into 3.4 and then you folks discovering in 6 months it doesn't quite work for you, and then having to wait for 3.5 anyway (or, worse, Python 3 being locked into a solution that doesn't work for you by its own internal backwards compatibility requirements). 4) Do we care if it's .format()/%, or could it be in the stdlib? It'd be really nice to not have to boil the oceans as far as editing everyplace in the codebase that does % today. If we do have to do that, it's not going to be much more helpful than something like:

    def maybestr(a):
        if isinstance(a, bytes):
            return a.decode('latin1')
        return a

    def sprintf(fmt, *args):
        return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')

or similar. That was (roughly) what I was figuring I'd do today without any formal bytes-string-formatting support. Agreed - I think the two solutions that potentially make the most sense are PEP 460 and an interoperable third party type like asciistr. They each have different pros and cons, so I'm actually currently a fan of doing both (if Guido is amenable to my suggestion of providing both ASCII compatible and binary interpolation).
He also mentioned that some are calling for a shortened 3.5 release cycle - I'd rather not see that happen, for the aforementioned reason of wanting time to make sure this is Right - it'd be a shame to do the work and rush it out only to find something missing in an important way. By shortened, we're mostly talking about ensuring 3.5 is published before the 2.7.9 maintenance release. So early-to-mid 2015 rather than the more typical late 2015. Feel free to ask further questions - I'll try to respond promptly. Thanks for the contribution! I found it very helpful :) Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
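A minimal sketch of the sys.stdout.buffer suggestion made earlier in this reply, for emitting raw bytes in mixed encodings. The function name emit_raw and the sample diff lines are made up for the illustration, and an in-memory stream stands in for stdout so the bytes can be inspected. It assumes the text stream actually exposes a .buffer attribute, which is not guaranteed when sys.stdout has been replaced.

```python
import io
import sys


def emit_raw(chunks, stream=None):
    """Write byte chunks to the binary layer beneath a text stream."""
    stream = sys.stdout if stream is None else stream
    stream.flush()  # keep ordering with text written earlier
    for chunk in chunks:
        stream.buffer.write(chunk)
    stream.buffer.flush()


# Stand-in for stdout: a text wrapper over an in-memory binary sink.
sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(sink, encoding="utf-8", newline="\n")
fake_stdout.write("diff:\n")
emit_raw(["-café\n".encode("latin-1"), "+café\n".encode("utf-8")], fake_stdout)
captured = sink.getvalue()

print(captured)
```

The two diff lines keep their original encodings byte for byte, because nothing passes through the text layer's encoder.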
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 07:52 AM, Barry Warsaw wrote: On Jan 12, 2014, at 09:45 PM, Glenn Linderman wrote: Quotes in the stream are a great debug hint, without blowing up. They're actually terrible for debugging for exactly the same reason as coercion in Python 2. It's rarely what you really want, it silently succeeds, and it means that the user visible error is far removed from the actual bug, both in code distance and time. So yes, it tells you Something Went Wrong, but is actually a hindrance to finding and fixing the problem. You mean like this is?

    --> '%s' % b'abc'
    "b'abc'"

I agree, but we're stuck with it with str, we may as well be stuck with it for bytes, too. :/ -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:13 AM, Nick Coghlan wrote: On 13 Jan 2014 17:43, Ethan Furman wrote: On 01/12/2014 10:51 PM, Nick Coghlan wrote: I am a strong -1 on the more lenient proposal, as it makes binary interpolation in Python 3 an *unsafe operation* for ASCII incompatible binary formats. No more unsafe than calling .upper() on ASCII incompatible streams. Right - Guido's proposal is *completely useless* for arbitrary binary data. You can't trust it. Forgive me for being dense, but I don't understand your objection. With Guido's proposal, '%s' % bytes_data, bytes_data is passed through unchanged. Did you mean something else by binary data? -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On 14 January 2014 01:54, Ethan Furman et...@stoneleaf.us wrote: On 01/13/2014 01:13 AM, Nick Coghlan wrote: On 13 Jan 2014 17:43, Ethan Furman wrote: On 01/12/2014 10:51 PM, Nick Coghlan wrote: I am a strong -1 on the more lenient proposal, as it makes binary interpolation in Python 3 an *unsafe operation* for ASCII incompatible binary formats. No more unsafe than calling .upper() on ASCII incompatible streams. Right - Guido's proposal is *completely useless* for arbitrary binary data. You can't trust it. Forgive me for being dense, but I don't understand your objection. With Guido's proposal, '%s' % bytes_data, bytes_data is passed through unchanged. Did you mean something else by binary data? I mean it will work, but it will mean you've introduced an implicit assumption of ASCII compatibility into the structure of your program, with no straightforward way of removing it (you would have to rewrite your code to not rely on interpolation). This becomes most obvious when the formatting string is passed as a variable, rather than being provided as a literal, or when you don't know the type of the *value* provided and some types may involve implicit encoding operations (I don't think Guido proposed that, but others have). That's the kind of data driven uncertainty I don't like in Python 2, and I find its categorical elimination to be one of the best features of Python 3 - there are certain kinds of data manipulation bugs that simply *can't exist* because the types don't work that way any more. However, that's also why *adding* formatb/formatb_map to the proposal (with Antoine's stricter semantics) would resolve my concerns - you can ensure you don't introduce an implicit assumption of ASCII compatibility by using those for interpolation rather than the ASCII compatible __mod__/format/format_map that the bytes type will share with the str type.
The combination of the two is completely in keeping with the Python 3 text model - we would offer text interpolation, hybrid ASCII compatible interpolation *and* pure binary interpolation. Offering only the first two would mean relegating the pure binary domain to a lower status again, since assuming ASCII compatibility would grant you access to an interpolation API, so people would be inclined to use it even when doing so opens the door to data corruption bugs. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/13/2014 08:09 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 07:59:10 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. But why is it better to give a hint by producing undesired output (which may actually go unnoticed for some time and produce issues down the road), rather than simply by raising TypeError? You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. No, there's a good reason for the str case: it's that every Python object should have a working __str__ (for debugging, REPL use, etc.). So bytes has a __str__ too and that's why '%s' % (some_bytes_object) succeeds. Conversely, though, str needn't and shouldn't have a __bytes__, so there's no good reason for b'%s' % (some_str_object) to succeed. (moreover, I don't think "we did it wrong here" should be a good reason for doing it wrong there too) Regards Antoine.
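The asymmetry Antoine describes can be observed without any of the proposed formatting features. The sketch below assumes a default interpreter run (with the -bb flag, str() of a bytes object is turned into an error instead).

```python
# Every object has a working str(), so text interpolation of bytes succeeds
# and embeds the repr-style form.
text = "%s" % b"abc"

# The reverse conversion has no implicit path: bytes() refuses a str
# unless an encoding is supplied.
try:
    bytes("abc")
    converted = True
except TypeError:
    converted = False

explicit = bytes("abc", "ascii")

print(text, converted, explicit)
```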
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/13/2014 08:09 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 07:59:10 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. But why is it better to give a hint by producing undesired output (which may actually go unnoticed for some time and produce issues down the road), rather than simply by raising TypeError? You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). So bytes formatting really needn't (and shouldn't, IMO) mirror str formatting. (the only reason I used %s in PEP 460 is to allow a migration path from 2.x bytes-formatting to 3.x bytes-formatting; in a really pure proposal it would have been called something else) Regards Antoine.
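A brief illustration of the interoperability Antoine mentions: bytes composes with bytearray and memoryview, while str composes with neither. The literal values are arbitrary.

```python
# bytes accepts other buffer-protocol objects in concatenation and join.
joined = b"ab" + bytearray(b"cd") + memoryview(b"ef")
via_join = b"-".join([b"x", bytearray(b"y"), memoryview(b"z")])

# str accepts none of them.
try:
    "ab" + bytearray(b"cd")
    str_mixed = True
except TypeError:
    str_mixed = False

print(joined, via_join, str_mixed)
```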
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 07:49 AM, Barry Warsaw wrote: On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote: Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) Which unfortunately causes no end of headaches, often difficult to debug. Is it, in fact, too late to change that behavior? -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 08:39 AM, Ethan Furman wrote: On 01/13/2014 07:49 AM, Barry Warsaw wrote: On Jan 12, 2014, at 06:11 PM, Guido van Rossum wrote: Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) Which unfortunately causes no end of headaches, often difficult to debug. Is it, in fact, too late to change that behavior? Never mind, Antoine explained it for me. :) -- ~Ethan~
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler r...@durin42.com wrote: On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. Yes - not having %d makes this much much less useful to me. For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time). Would it be okay if instead of %s you had to use %b for those semantics? (%d would still exist) -- --Guido van Rossum (python.org/~guido)
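A rough sketch of the semantics Augie asks for, written without relying on any bytes-formatting support: integers are rendered through text formatting and encoded as ASCII, while bytes are inserted untouched. The helper name ascii_int and the sample message are hypothetical.

```python
def ascii_int(value, spec="%d"):
    """Format an integer with a %-style spec and return ASCII bytes."""
    if not isinstance(value, int):
        raise TypeError("expected int, got %s" % type(value).__name__)
    return (spec % value).encode("ascii")


# Bytes pass through as-is, whatever their encoding; only the number is
# rendered, and its rendering is always ASCII.
line = b"changeset " + ascii_int(42) + b": " + b"\xff\xfe raw"
padded = ascii_int(7, "%05d")

print(line, padded)
```

Manual concatenation like this is the editing burden the message wants to avoid; the sketch only pins down the intended result.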
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, 13 Jan 2014 09:34:39 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. Serhiy did a survey of formatting codes in the Mercurial sources: https://mail.python.org/pipermail/python-dev/2014-January/130969.html Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou solip...@pitrou.net wrote: On Sun, 12 Jan 2014 18:11:47 -0800 Guido van Rossum gu...@python.org wrote: On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote: On 01/12/2014 04:47 PM, Guido van Rossum wrote: %s seems the trickiest: I think with a bytes argument it should just insert those bytes (and the padding modifiers should work too), and for other types it should probably work like %a, so that it works as expected for numeric values, and with a string argument it will return the ascii()-variant of its repr(). Examples: b'%s' % 42 == b'42' b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' enclosed in single quotes) I'm not sure about the quotes. Would anyone ever actually want those in the byte stream? Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) What is the use case for embedding a quoted ASCII-encoded representation in a byte stream?
There is no use case in the sense you are asking, just like there is no real use case for '%s' % b'x' producing "b'x'". But the real use case is exactly the same: to let you know your code is screwed up without actually blowing up with an encoding Exception. For the record, I like Guido's logic and proposal. I don't understand Nick's objection, since I don't see the difference between the situation here where a string gets interpolated into bytes as "'xxx'" and the corresponding situation where bytes gets interpolated into a string as "b'xxx'". Why struggle to keep bytes interpolation pure if string interpolation isn't? Guido's proposal makes the language more symmetric, and thus more consistent and less surprising. Exactly the hallmarks of Python's design sense, IMO. (Big surprise, right? :)
Of course, this point of view *is* based on the idea that when you are doing interpolation using %/.format, you are in fact primarily concerned with ASCII compatible byte streams. This is a Practicality sort of argument. It is, after all, by far the most common use case when doing interpolation[*]. If you wanted to do a purist version of this symmetry, you'd have bytes(x) calling __bytes__ if it was defined and falling back to calling a __brepr__ otherwise. But what would __brepr__ implement? The variety of format codes in the struct module argues that there is no one obvious binary repr for most types. (Those that have one would implement __bytes__). And what would be the __brepr__ of an arbitrary 'object'? Faced with the impracticality of defining __brepr__ usefully in any pure bytes form, it seems sensible to admit that the most useful __brepr__ is the ascii() encoding of the __repr__. Which naturally produces "'xxx'" as the __brepr__ of a string. This does cause things to get a little un-pretty when you are operating at the python prompt: b'%s' % object gives b'<class \'object\'>'. But then again that is most likely really not what you mean to do, so it becomes a big red flag...just like "b'xxx'" is a small red flag when you accidentally interpolate unencoded bytes into a string. --David
PS: When I first read Guido's remark that the result of interpolating a string should be "'xxx'", I went Wah? I had to reason my way through to it as above, but to him it was just the natural answer. Guido isn't always right, but this kind of automatic language design consistency is one reason he's the BDFL.
[*] I still think that you mostly want to design your library so that you are handling the text parts as text and the bytes parts as bytes, and encoding/gluing them as appropriate at the IO boundary. But if Guido says his real code would benefit by being able to interpolate ASCII into bytes at certain points, I'll believe him.
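The point about the struct module can be illustrated with a short sketch: one integer has several equally plausible byte representations, which is why no single "binary repr" stands out. The dictionary labels are illustrative only, and the last line shows the ascii()-of-repr fallback described above, emulated by hand since `__brepr__` does not exist.

```python
import struct

n = 10

# Several defensible answers to "what are the bytes of 10?"
candidates = {
    'ascii text': str(n).encode('ascii'),
    'one unsigned byte': struct.pack('B', n),
    'little-endian 32-bit int': struct.pack('<i', n),
    'big-endian 32-bit int': struct.pack('>i', n),
    'big-endian double': struct.pack('>d', float(n)),
}
for label, data in candidates.items():
    print('%-26s %r' % (label, data))

# The fallback discussed above: the ASCII encoding of ascii(obj).
fallback = ascii(object).encode('ascii')
print(fallback)
```

All five candidates differ, so any implicit choice would be a guess; the fallback at least always exists.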
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. Yes - not having %d makes this much much less useful to me. For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time).
Re: [Python-Dev] PEP 460 reboot
On January 13, 2014 at 12:45:40 PM, R. David Murray (rdmur...@bitdance.com) wrote: [snip] There is no use case in the sense you are asking, just like there is no real use case for '%s' % b'x' producing "b'x'". But the real use case is exactly the same: to let you know your code is screwed up without actually blowing up with an encoding Exception. Blowing up with an encoding exception is the *only* sane method of making you aware that something is wrong. It's much better than continuing to produce broken output until someone notices. What's the point of writing a piece of software that works incorrectly without crashing? For the record, I like Guido's logic and proposal. I don't understand Nick's objection, since I don't see the difference between the situation here where a string gets interpolated into bytes as "'xxx'" and the corresponding situation where bytes gets interpolated into a string as "b'xxx'". Why struggle to keep bytes interpolation pure if string interpolation isn't? Isn't the whole point of this discussion to make Python 2 people who want to migrate to Python 3 happier? What's the point for them of having ported Python 2 code that produces "Status: b'42'" for "b'Status: %d' % 42"? And if you want to call str() on 42 and then encode the output in latin-1/ascii, then you're just turning Python 3 into Python 2. - Yury
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, Jan 13, 2014 at 12:39 PM, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler r...@durin42.com wrote: On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. Yes - not having %d makes this much much less useful to me. For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time). Would it be okay if instead of %s you had to use %b for those semantics? (%d would still exist) Probably, but it'd be quite painful, since we'd have to do some kind of .sub() call all over the place to remain compatible with 2.4 and 2.6. Dropping 2.4 might be possible in the 3.5 timeframe - 2.6 almost certainly not.
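For code that cannot rely on bytes % formatting at all, one workaround is to format numbers as text and encode the result explicitly. The sketch below is an assumption about how such a helper might look; `format_header` is a hypothetical name and is not taken from Mercurial. The same expressions should also be valid on Python 2.6 and later, where `bytes` is an alias for `str`.

```python
def format_header(name, number):
    """Build b'<name>: <number>\\n' without using bytes % formatting.

    Hypothetical helper: `name` must already be bytes; `number` is rendered
    with the text-mode %d and then encoded explicitly as ASCII.
    """
    if not isinstance(name, bytes):
        raise TypeError('name must be bytes, got %s' % type(name).__name__)
    digits = ('%d' % number).encode('ascii')
    return name + b': ' + digits + b'\n'


print(format_header(b'size', 42))
```

This is the "more cumbersome" style the thread is trying to avoid: every numeric field needs its own format-then-encode step.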
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 09:31 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote: You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). This makes sense to me. So I guess I'm fine with either the quoted ascii repr or the always blowing up method, leaning towards the blowing up method. -- ~Ethan~
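The interoperability mentioned above can be checked with a few built-in operations; this is only a small illustration, not an exhaustive list of the differences between str and bytes.

```python
head = b'ab'

# bytes combines directly with other bytes-like objects.
with_bytearray = head + bytearray(b'cd')
with_memoryview = head + memoryview(b'ef')
joined = b'-'.join([bytearray(b'x'), memoryview(b'y')])
print(with_bytearray, with_memoryview, joined)

# str refuses to mix with bytes.
try:
    'ab' + b'cd'
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print('TypeError:', exc)
```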
Re: [Python-Dev] PEP 460 reboot
On 13.01.2014 18:38, Ethan Furman wrote: On 01/13/2014 09:31 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote: You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). This makes sense to me. So I guess I'm fine with either the quoted ascii repr or the always blowing up method, leaning towards the blowing up method. +1. Georg
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:31 PM, Antoine Pitrou solip...@pitrou.net wrote: On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/13/2014 08:09 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 07:59:10 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou solip...@pitrou.net wrote: What is the use case for embedding a quoted ASCII-encoded representation in a byte stream? It doesn't crash but produces undesired output (always, not only when the data is non-ASCII) that gives the developer a hint to think about encoding to bytes. But why is it better to give a hint by producing undesired output (which may actually go unnoticed for some time and produce issues down the road), rather than simply by raising TypeError? You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). This is also why the int type doesn't have a __bytes__ method (ignoring the use of an integer as an argument to bytes()): it's universally defined what str(10) should return, but who knows what you want when you ask for the bytes of 10 (e.g. base-2, ASCII, UTF-16, etc.). So bytes formatting really needn't (and shouldn't, IMO) mirror str formatting. I think one of the things about Guido's proposal that bugs me is that it breaks the mental model of the .format() method from str in terms of how the mini-language works. For str.format() you have the conversion and the format spec (e.g. {!r} and {:d}, respectively).
You apply the conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The format spec semantically gets passed with the object to format() which calls the object's __format__() method: ``format(number, 'd')``. Now Guido's suggestion has two parts that affect the mini-language for .format(). One is that for bytes.format() the default conversion is bytes() instead of str(), which is fine (probably want to add 'b' as a conversion value as well to be consistent). But the other bit is that the format spec goes from semantically meaning ``format(thing, format_spec)`` to ``format(thing, format_spec).encode('ascii', 'strict')`` for at least numbers. That implicitness bugs me as I have always thought of format specs just leading to a call to format(). I think I can live with it, though, as long as it is **consistently** applied across the board for bytes.format(): every use of a format spec leads to calling ``format(thing, format_spec).encode('ascii', 'strict')`` no matter what type 'thing' would be, and it is clearly documented that this is done to ease porting and handle the common case. This even gives people in-place ASCII encoding for strings by always using '{:s}' with text, which they can do when they port their code to run under both Python 2 and 3. So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want encoding to latin-1 then you need to do it explicitly and not rely on the mini-language to do tricks for you. IOW I want to treat the format mini-language as a language and thus not have any special-casing or massive shifts in meaning between str.format() and bytes.format() so my mental model doesn't have to contort based on whether it's str or bytes. My preference is to not have any, but if Guido is going to say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things.
As for %s for the % operator calling ascii(), I think that will be a porting nightmare of finding out why your bytes suddenly stopped being formatted properly and then having to crawl through all of your code for that one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; doing otherwise feels too much like implicit type conversion, and ask any JavaScript developer how that can be a bad thing. -Brett (the only reason I used %s in PEP 460 is to allow a migration path from 2.x bytes-formatting to 3.x bytes-formatting; in a really pure proposal it would have been called something else) Regards Antoine.
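The consistent rule described in this message, where every format spec means ``format(thing, format_spec).encode('ascii', 'strict')``, can be sketched as a standalone function. `format_field` is a hypothetical name used only for illustration; bytes.format() itself is a proposal under discussion here, not an existing method.

```python
def format_field(thing, format_spec=''):
    """Hypothetical: apply a format spec, then strictly encode as ASCII."""
    return format(thing, format_spec).encode('ascii', 'strict')


print(format_field(42, 'd'))               # b'42'
print(format_field(3.14159, '.2f'))        # b'3.14'
print(b'Content-Type: ' + format_field('image/jpeg', 's'))

# Non-ASCII text fails loudly instead of producing corrupted output.
try:
    format_field('caf\u00e9', 's')
    failed = False
except UnicodeEncodeError:
    failed = True
print('non-ASCII rejected:', failed)
```

Under this rule the only implicit step is the strict ASCII encode, so anything that cannot be represented raises at the call site rather than leaking into the byte stream.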
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray rdmur...@bitdance.com wrote: [snip: full quote of R. David Murray's message of Mon, 13 Jan 2014, which appears earlier in this thread]
<elided rant/> If you think corrupted data is easier or more pleasant to track down than encoding exceptions then I think you are strange. It makes porting really difficult while you are still trying to figure out where the bytes/str boundaries are. I am now deeply suspicious of all % formatting.
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 1:45 PM, Daniel Holth dho...@gmail.com wrote: On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray rdmur...@bitdance.com wrote: [snip: full quote of R. David Murray's message of Mon, 13 Jan 2014, which appears earlier in this thread]
<elided rant/> If you think corrupted data is easier or more pleasant to track down than encoding exceptions then I think you are strange. It makes porting really difficult while you are still trying to figure out where the bytes/str boundaries are. I am now deeply suspicious of all % formatting. For the record, I think %d and %f and such, where the RHS is guaranteed to produce a certain set of “characters” that are guaranteed to be ASCII compatible, are fine, and it's perfectly acceptable to have an implicit ASCII encode for them. The %s code I'm not sure of. I think trying to ASCII encode that (just using encode()) is dangerous, and I think that using ascii() and adding quotes to it
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 09:12 AM, Nick Coghlan wrote: On 14 January 2014 01:54, Ethan Furman wrote: Forgive me for being dense, but I don't understand your objection. With Guido's proposal, '%s' % bytes_data, bytes_data is passed through unchanged. Did you mean something else by "binary data"? I mean it will work, but it will mean you've introduced an implicit assumption of ASCII compatibility into the structure of your program. Okay, I'm still trying to understand. Apparently we both mean the same thing by binary data / bytes, so the difference must be the %s, yes? And the concern is that because you have used %s as the format code, if somebody accidentally put, say, "stupid bug" on the RHS you would end up with b"'stupid bug'" instead of an exception, which you would get if you had used %b instead. Am I following? -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
Let me try rebooting the reboot. My interpretation of Nick's argument is that he is asking for a bytes formatting language that doesn't have an implicit ASCII assumption. To me this feels absurd. The formatting codes (%s, %c) themselves are expressed as ASCII characters. If you include anything else in the format string besides formatting codes (e.g. b'%s'), you are giving it as ASCII characters. I don't know what characters the EBCDIC codes 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded. If I had some byte strings in an unknown encoding (but the same encoding for all) that I needed to concatenate I would never think of '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.) If I see some code using *any* formatting operation (regardless of whether it's %d, %r, %s or %c) I am going to assume that there is some ASCII-ness, and if there isn't, the code's author has obscured their goal to me.
I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +.
In my head I make the following classification of situations where you work with bytes and/or text.
(A) Pure binary formats (e.g. most IP-level packet formats, media files, .pyc files, tar/zip files, compressed data, etc.). These are handled using the struct module (e.g. tar/zip) and/or custom C extensions (e.g. gzip).
(B) Encoded text. Here you should just decode everything into str objects and parse your text at that level. If you really want to manipulate the data as bytes (e.g. because you have a lot of data to process and very light processing) you may be able to do it, but unless it's a verbatim copy, you are probably going to make assumptions about the encoding. You are also probably going to mess up for some encodings (e.g. leave BOM turds in the middle of a file).
(C) Loosely text-based protocols and formats that have an ASCII assumption in the spec. Most classic Internet protocols (FTP, SMTP, HTTP, IRC, etc.) fall in this category; I expect there are also plenty of file formats using similar conventions (e.g. mailbox files). These protocols and formats often require text-ish manipulations, e.g. for case-insensitive headers or commands, or to split things at whitespace. This is where I find uses for the current ASCII-assuming bytes operations (e.g. b.lower(), b.split(), but also int(b)) and where the lack of number formatting (especially %d and %x) is most painful. I see no benefit in forcing the programmer writing such protocol-handling code to use more cumbersome ways of converting between numbers and bytes, nor in forcing them to insert an encoding/decoding layer -- these protocols often switch between text and binary data at line boundaries, so the most basic part of parsing (splitting the input into lines) must still happen in the realm of bytes.
IMO PEP 460 and the mindset that goes with it don't apply to any of these three cases. Also, IMO requiring a new type to handle (C) seems to add too much complexity, and adds to porting efforts. I may have felt differently in the past, but ATM I feel that if newer versions of Python 3 make porting of Python 2 code easier, through minor compromises, that's a *good* thing. (Example: adding u"..." literals to 3.3.) -- --Guido van Rossum (python.org/~guido)
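The category (C) operations named above can be seen on a pair of HTTP-style lines. This is a minimal sketch using made-up sample input; it only exercises the existing ASCII-assuming bytes methods (lower(), split(), partition(), strip()) and int() on bytes.

```python
request_line = b'GET /index.html HTTP/1.1'
header_line = b'Content-Length: 42\r\n'

# Split the request line at ASCII whitespace, entirely in the bytes domain.
method, path, version = request_line.split()

# Case-insensitive header name and numeric value, still without decoding.
name, _, value = header_line.partition(b':')
key = name.strip().lower()
length = int(value.strip())   # int() accepts ASCII digits given as bytes

print(method, path, version)
print(key, length)
```

Parsing works without any decode step; it is only the reverse direction, turning `length` back into bytes, that lacks a direct formatting operation.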
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
Antoine Pitrou solipsis at pitrou.net writes: On Mon, 13 Jan 2014 09:34:39 -0800 Guido van Rossum guido at python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncoghlan at gmail.com wrote: On 13 January 2014 23:57, Augie Fackler raf at durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. Serhiy did a survey of formatting codes in the Mercurial sources: https://mail.python.org/pipermail/python-dev/2014-January/130969.html Note that a lot of those are in debug code (eg the only %f I've spotted is), or are time format specifiers (which can be unicode just fine). A few others (eg %ln) are for our internal revset format-string language, so this overstates what we'd need in bytes by a little. %f would probably be good too, as I look a little more. (Please don't remove me from the CC list - I could only respond via gmane because I'm not subscribed to python-dev.) Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote: I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +. I think disallowing %s is the right thing to do, but I definitely think numbers and %b should be allowed. - Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On Mon, 13 Jan 2014 18:51:32 + (UTC) Augie Fackler r...@durin42.com wrote: (Please don't remove me from the CC list - I could only respond via gmane because I'm not subscribed to python-dev.) Responding via gmane is what I do, too :-) My NNTP client doesn't allow SMTP / NNTP mixed postings, so I'm forced to remove you from CC. Regards Antoine.
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:40 PM, Brett Cannon wrote: So bytes formatting really needn't (and shouldn't, IMO) mirror str formatting. This was my presumption in writing byteformat(). I think one of the things about Guido's proposal that bugs me is that it breaks the mental model of the .format() method from str in terms of how the mini-language works. For str.format() you have the conversion and the format spec (e.g. {!r} and {:d}, respectively). You apply the conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The format spec semantically gets passed with the object to format() which calls the object's __format__() method: ``format(number, 'd')``. Now Guido's suggestion has two parts that affect the mini-language for .format(). One is that for bytes.format() the default conversion is bytes() instead of str(), which is fine (probably want to add 'b' as a conversion value as well to be consistent). But the other bit is that the format spec goes from semantically meaning ``format(thing, format_spec)`` to ``format(thing, format_spec).encode('ascii', 'strict')`` for at least numbers. That implicitness bugs me as I have always thought of format specs just leading to a call to format(). I think I can live with it, though, as long as it is **consistently** applied across the board for bytes.format(); every use of a format spec leads to calling ``format(thing, format_spec).encode('ascii', 'strict')`` no matter what type 'thing' would be and it is clearly documented that this is done to ease porting and handle the common case then I can live with it. This is how my byteformat function works, except that when no format_spec is given, bytes and bytearray objects are left unchanged rather than being decoded and encoded again. This even gives people in-place ASCII encoding for strings by always using '{:s}' with text which they can do when they port their code to run under both Python 2 and 3.
So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want more explicit encoding to latin-1 then you need to do it explicitly and not rely on the mini-language to do tricks for you. IOW I want to treat the format mini-language as a language and thus not have any special-casing or massive shifts in meaning between str.format() and bytes.format() so my mental model doesn't have to contort based on whether it's str or bytes. My preference is not to have any, but if Guido is going to say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things. As for %s for the % operator calling ascii(), I think that will be a porting nightmare of finding out why your bytes suddenly stopped being formatted properly and then having to crawl through all of your code for that one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; doing otherwise feels too much like implicit type conversion, and ask any JavaScript developer how that can be a bad thing. I personally would not add 'bytes % whatever'. -- Terry Jan Reedy
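The rule discussed above (every format spec means format(thing, format_spec).encode('ascii', 'strict'), with bytes passed through untouched when no spec is given) can be sketched in a few lines. This is only an illustration of that rule, not Terry's byteformat() itself; the name format_ascii is hypothetical.

```python
def format_ascii(thing, format_spec=""):
    """Hypothetical helper: apply a format spec, then ASCII-encode strictly."""
    if not format_spec and isinstance(thing, (bytes, bytearray)):
        # No spec given: binary data is left unchanged, not decoded/re-encoded
        return bytes(thing)
    # Same meaning as in str formatting, followed by a strict ASCII encode
    return format(thing, format_spec).encode("ascii", "strict")


header = b"Content-Type: " + format_ascii("image/jpeg", "s")
number = format_ascii(255, "x")
raw = format_ascii(b"\xff\x00")
```

Because the encode step is strict, non-ASCII text raises UnicodeEncodeError instead of being written out in some guessed encoding, which matches the "do it explicitly" advice for latin-1.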
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote: On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote: I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +. I think disallowing %s is the right thing to do, but I definitely think numbers and %b should be allowed. I guess I agree. The behavior of b'%s' % 'x' returning b"'x'" is almost always useless at best. (I would have thought maybe %a for ascii() but don't care that strongly.) -Barry
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote: On 1/13/2014 1:40 PM, Brett Cannon wrote: So bytes formatting really needn't (and shouldn't, IMO) mirror str formatting. This was my presumption in writing byteformat(). I think one of the things about Guido's proposal that bugs me is that it breaks the mental model of the .format() method from str in terms of how the mini-language works. For str.format() you have the conversion and the format spec (e.g. {!r} and {:d}, respectively). You apply the conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The format spec semantically gets passed with the object to format() which calls the object's __format__() method: ``format(number, 'd')``. Now Guido's suggestion has two parts that affect the mini-language for .format(). One is that for bytes.format() the default conversion is bytes() instead of str(), which is fine (probably want to add 'b' as a conversion value as well to be consistent). But the other bit is that the format spec goes from semantically meaning ``format(thing, format_spec)`` to ``format(thing, format_spec).encode('ascii', 'strict')`` for at least numbers. That implicitness bugs me as I have always thought of format specs just leading to a call to format(). I think I can live with it, though, as long as it is **consistently** applied across the board for bytes.format(); every use of a format spec leads to calling ``format(thing, format_spec).encode('ascii', 'strict')`` no matter what type 'thing' would be and it is clearly documented that this is done to ease porting and handle the common case then I can live with it. This is how my byteformat function works, except that when no format_spec is given, byte and bytearrary objects are left unchanged rather than being decoded and encoded again. Right, which is what the default conversion covers. 
And as your code shows this can be made available today without having to wait for Python 3.5 and so can go up on PyPI and be used **today**. This even gives people in-place ASCII encoding for strings by always using '{:s}' with text which they can do when they port their code to run under both Python 2 and 3. So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want more explicit encoding to latin-1 then you need to do it explicitly and not rely on the mini-language to do tricks for you. IOW I want to treat the format mini-language as a language and thus not have any special-casing or massive shifts in meaning between str.format() and bytes.format() so my mental model doesn't have to contort based on whether it's str or bytes. My preference is not have any, but if Guido is going say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things. As for %s for the % operator calling ascii(), I think that will be a porting nightmare of finding out why your bytes suddenly stopped being formatted properly and then having to crawl through all of your code for that one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; do so otherwise feels too much like implicit type conversion and ask any JavaScript developer how that can be a bad thing. I personally would not add 'bytes % whatever'. Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away. -Brett -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ brett%40python.org ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar" Instead of %b could %j mean "I should have used + or join() here but was too lazy" and work on str too? On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote: On 1/13/2014 1:40 PM, Brett Cannon wrote: So bytes formatting really needn't (and shouldn't, IMO) mirror str formatting. This was my presumption in writing byteformat(). I think one of the things about Guido's proposal that bugs me is that it breaks the mental model of the .format() method from str in terms of how the mini-language works. For str.format() you have the conversion and the format spec (e.g. {!r} and {:d}, respectively). You apply the conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The format spec semantically gets passed with the object to format() which calls the object's __format__() method: ``format(number, 'd')``. Now Guido's suggestion has two parts that affect the mini-language for .format(). One is that for bytes.format() the default conversion is bytes() instead of str(), which is fine (probably want to add 'b' as a conversion value as well to be consistent). But the other bit is that the format spec goes from semantically meaning ``format(thing, format_spec)`` to ``format(thing, format_spec).encode('ascii', 'strict')`` for at least numbers. That implicitness bugs me as I have always thought of format specs just leading to a call to format(). I think I can live with it, though, as long as it is **consistently** applied across the board for bytes.format(); every use of a format spec leads to calling ``format(thing, format_spec).encode('ascii', 'strict')`` no matter what type 'thing' would be and it is clearly documented that this is done to ease porting and handle the common case then I can live with it. This is how my byteformat function works, except that when no format_spec is given, bytes and bytearray objects are left unchanged rather than being decoded and encoded again.
This even gives people in-place ASCII encoding for strings by always using '{:s}' with text which they can do when they port their code to run under both Python 2 and 3. So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want more explicit encoding to latin-1 then you need to do it explicitly and not rely on the mini-language to do tricks for you. IOW I want to treat the format mini-language as a language and thus not have any special-casing or massive shifts in meaning between str.format() and bytes.format() so my mental model doesn't have to contort based on whether it's str or bytes. My preference is not have any, but if Guido is going say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things. As for %s for the % operator calling ascii(), I think that will be a porting nightmare of finding out why your bytes suddenly stopped being formatted properly and then having to crawl through all of your code for that one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; do so otherwise feels too much like implicit type conversion and ask any JavaScript developer how that can be a bad thing. I personally would not add 'bytes % whatever'. -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 11:57 AM, Barry Warsaw ba...@python.org wrote: On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote: On Jan 13, 2014, at 1:58 PM, Guido van Rossum gu...@python.org wrote: I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +. I think disallowing %s is the right thing to do, but I definitely think numbers and %b should be allowed. I guess I agree. The behavior of b'%s' % 'x' returning b"'x'" is almost always useless at best. (I would have thought maybe %a for ascii() but don't care that strongly.) Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimic a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3. If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. -- --Guido van Rossum (python.org/~guido)
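For readers who want to see the %b / %a split in action: bytes %-formatting along these lines was later added to Python (3.5 and newer, via PEP 461). The snippet below is a small sketch that assumes such an interpreter; it is not what any interpreter did at the time of this thread.

```python
# %b copies a bytes argument verbatim
copied = b"%b" % b"x"

# %a interpolates ascii(arg), ASCII-encoded, so the quotes are included
quoted = b"%a" % "x"

# Numeric codes produce ASCII digits directly in the bytes result
numbers = b"%d %x" % (42, 255)

# A str argument to %b is rejected instead of being silently encoded
try:
    b"%b" % "x"
    error = None
except TypeError as exc:
    error = exc
```

The TypeError in the last case is the "find out immediately from the traceback" behaviour several posters ask for.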
Re: [Python-Dev] PEP 460 reboot
On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote: I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar" Instead of %b could %j mean "I should have used + or join() here but was too lazy" and work on str too? Isn’t this just error prone? Since it’s a new format character, many, probably, would write %s by mistake. And, besides, there was no %j in python2. - Yury
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon br...@python.org wrote: On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote: I personally would not add 'bytes % whatever'. Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away. Well, % has some very strong arguments in its favor still -- for example, the sheer amount of code that currently uses it, the fact that it's as close as we get to a cross-language standard, and the fact that nobody wants to tackle its use in the logging module (since logger objects are often shared between packages that don't know about each other). Anyway, the % or .format() issue seems completely orthogonal to the issues that get people riled up (which are mostly about whether using either implies some kind of ASCII compatibility). -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 3:11 PM, Yury Selivanov yselivanov...@gmail.com wrote: On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote: I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar" Instead of %b could %j mean "I should have used + or join() here but was too lazy" and work on str too? Isn’t this just error prone? Since it’s a new format character, many, probably, would write %s by mistake. And, besides, there was no %j in python2. Merely a flesh wound.
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 03:09 PM, Guido van Rossum wrote: If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. Wouldn't %s as an alias for %b simplify porting from Python 2?
Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.
That's cool, but historical heritage makes the make argument somewhat confusing for new users. The immediate question I can sense is "What is the difference between build and make?" To make (this word again) the criticism constructive, let me pass some ideas about ideal user experience as I see it.

--[installation]--
1 I install Sphinx. Two scenarios.
1.1 I am not a Python user - use installer
1.1.1 Installer should obviously install Python
1.1.2 And install sphinx command
1.1.3 And add sphinx to PATH
1.2 I am a Python user - use pip
1.2.1 pip should not alter my PATH (for virtualenv)

--[usage]--
2 Two scenarios
2.1 sphinx as a system command from PATH
2.2 python -m sphinx for current virtualenv / test config

--[user experience]--
3 These two invocations are equal
    sphinx
    python -m sphinx
4 They give the following output
    Sphinx 1.2 Documentation Generator
    Commands:
      build   build documentation
      init    start new project [also quickstart]
      make    helper for common build commands
    Use sphinx -h command or sphinx command --help for details

I am not using sphinx ATM, otherwise I'd spend more time to design an ideal command set to get rid of the build/make duality, but it should work ok. Actually sphinx is a new command, so you may rethink the syntax for build arguments to contain html instead of dir names, and move dir names into parameters, because it is how it is most often used. -- anatoly t.

On Sun, Jan 12, 2014 at 4:53 PM, Georg Brandl g.bra...@gmx.net wrote: That's also planned, see https://bitbucket.org/birkenfeld/sphinx-new-make-mode/. Georg On 12.01.2014 09:49, anatoly techtonik wrote: And cross-platform automation tools in Python instead of make https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script -- anatoly t. On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki songofaca...@gmail.com wrote: What about using venv and pip instead of svn?
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 12:02 PM, Brett Cannon wrote: Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away. Hey, now, some of us like %! ;) -- ~Ethan~
Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote: Glenn Linderman writes: On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote: Glenn Linderman writes: the proposals to embed binary in Unicode by abusing Latin-1 encoding. Those aren't proposals, they are currently feasible techniques in Python 3 for *some* use cases. The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr. smuggled binary (great term borrowed from a different subthread) muddies the waters of what you are dealing with. Not really. The mud is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). asciistr is one way to try that. Yes really. Use of smuggled binary means the str containing it can no longer be treated completely as a str. That is muddier than having a str that is only a str. When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed, Hardly. After all, that's how all text processing was done for decades. Still is, in some programs, especially C programs. I disagree, and so do you... text processing must be limited to the text subsets of the text that includes smuggled binary... that is limited... you can't just apply text searches, scans, and transformations over the complete str, when it contains smuggled binary. You know that, but must have not considered it a limitation, because you know you can do any text processing on the text parts. But it is a limitation to have to keep track of it, and apply the text processing only to the parts that are text. 
Yes, it has been done that way, and the limitations of doing it that way led to the plethora of encodings each of which was intended to be sufficient for some problem domain, but most of which were only sufficient for a smaller problem domain than intended, especially as communications became more global in nature. And there are no extra, confusing Latin-1 encode/decode operations required. The extra encode/decode operations are mostly (perhaps all) due to examples that started from bytes and end with bytes. Of course if you assume that API and propose to do the operations using Unicode, you'll get extra decode/encode operations. No, the extra encode/decode are from the requirement that smuggled binary use latin-1, and other binary flavors are not always latin-1. From a higher-level perspective, I think it would be great to have a module, perhaps called boundary (let's call it that for now), that allow some definition syntax (augmented BNF? augmented ABNF?) to explain the format of a binary blob. We have struct, for one. I'm not sure why you want more than that. I suppose you could go all the way to ASN.1. struct is insufficient to capture a whole file format, with optional parts, although it suffices for fragments. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
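The "smuggled binary" technique and its limitation can be shown in a few lines. This is a sketch with made-up sample data: Latin-1 maps every byte to one code point, so the round trip is lossless, but a text operation applied to the whole str also rewrites the binary part.

```python
payload = bytes(range(256))            # arbitrary binary data
smuggled = payload.decode("latin-1")   # every byte becomes one code point
round_trip = smuggled.encode("latin-1")

# Hypothetical record: an ASCII text prefix followed by three binary bytes
binary_part = b"\xe9\x00\x01"
record = "Subject: " + binary_part.decode("latin-1")

# A whole-string text transformation does not know where the text ends:
# upper() turns U+00E9 into U+00C9, so the first binary byte changes.
mangled = record.upper().encode("latin-1")
```

This is the bookkeeping burden described above: text processing is only safe on the parts of the str that really are text, and the program has to remember which parts those are.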
Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.
[If you want to continue this discussion, please move it from python-dev to sphinx-users. It is now completely offtopic for the former.] Anyway, just as a short explanation, you missed the point of the change: -M is not meant to be used directly but still via a (very short) Makefile. This isn't a change meant to be visible to users. Georg On 13.01.2014 20:56, anatoly techtonik wrote: That's cool, but historical heritage makes the make argument somewhat confusing for new users. The immediate question I can sense is "What is the difference between build and make?" To make (this word again) the criticism constructive, let me pass some ideas about ideal user experience as I see it. --[installation]-- 1 I install Sphinx. Two scenarios. 1.1 I am not a Python user - use installer 1.1.1 Installer should obviously install Python 1.1.2 And install sphinx command 1.1.3 And add sphinx to PATH 1.2 I am a Python user - use pip 1.2.1 pip should not alter my PATH (for virtualenv) --[usage]-- 2 Two scenarios 2.1 sphinx as a system command from PATH 2.2 python -m sphinx for current virtualenv / test config --[user experience]-- 3 These two invocations are equal sphinx python -m sphinx 4. They give the following output Sphinx 1.2 Documentation Generator Commands: build build documentation init start new project [also quickstart] make helper for common build commands Use sphinx -h command or sphinx command --help for details I am not using sphinx ATM, otherwise I'd spend more time to design an ideal command set to get rid of the build/make duality, but it should work ok. Actually sphinx is a new command, so you may rethink the syntax for build arguments to contain html instead of dir names, and move dir names into parameters, because it is how it is most often used. -- anatoly t. On Sun, Jan 12, 2014 at 4:53 PM, Georg Brandl g.bra...@gmx.net wrote: That's also planned, see https://bitbucket.org/birkenfeld/sphinx-new-make-mode/. 
Georg Am 12.01.2014 09:49, schrieb anatoly techtonik: And cross-platform automation tools in Python instead of make https://bitbucket.org/birkenfeld/sphinx/issue/456/makepy-command-script -- anatoly t. On Sun, Jan 12, 2014 at 11:12 AM, INADA Naoki songofaca...@gmail.com wrote: What about using venv and pip instead of svn? ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/techtonik%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:49 AM, Mark Shannon wrote: So why not replace '%s' with '%a' for the ascii case and with '%b' for directly inserting bytes. Because %a and %b don't exist in Python 2.7? I thought this was about 3.5, not 2.7 ;) '%s' can't work in 3.5, as we must differentiate between strings which need to be encoded and bytes which don't. It's about migrating code to reach a point where it can work on both 2.7 and 3.5.
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
13.01.14 15:57, Augie Fackler wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).

Most popular formatting codes in Mercurial sources (excluding %Y, %M, etc):

2519 %s
 493 %d
 102 %r
  33 %i
  23 %ld
  19 %ln
  12 %.3f
  10 %.1f
   9 %(val)r
   9 %p
   9 %.2f

%s covers almost 80% of use cases and %d covers almost 20%. %r covers about 3%, %f covers less than 1%. So I think anything except %s and %d can be ignored.
Re: [Python-Dev] PEP 460 reboot
Guido van Rossum wrote: On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman et...@stoneleaf.us wrote: On 01/12/2014 04:47 PM, Guido van Rossum wrote: b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' enclosed in single quotes) I'm not sure about the quotes. Would anyone ever actually want those in the byte stream? Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) If it's never useful, wouldn't it be better to raise an exception in this case? That way, someone porting code from py2 that does this without appropriate modification will find out about the problem immediately, rather than have spurious quotes inserted into their binary data, which -- being binary data -- will likely go unnoticed until something else tries to read the data. I don't think the rule against operations that work on all-but-one-type really applies here, because the mistake it's intended to catch is not an obscure corner case. If your program's logic includes interpolating strings into bytes objects, then you're going to be testing that. -- Greg
Re: [Python-Dev] PEP 460 reboot
On 13 January 2014 18:58, Guido van Rossum gu...@python.org wrote: I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +. For the record, Guido's reboot posting and rationale has convinced me, and I am essentially in favour of his proposal. Nick's remaining objection seems to me to have some validity if the format string is a user-supplied variable, but this type of usage is vanishingly small in my experience, and shouldn't dictate the whole design. I don't like b'%s' % 'x' behaviour, and would prefer one of the alternatives. I'm not entirely clear about the details of the alternative proposals, so I won't try to pick one. I think this should be for 3.5, and should not involve an accelerated release of 3.5 - we should get it into the 3.5 code early and let people thrash out the details during the 3.5 release cycle. Paul. PS For all the heated arguments and occasional frayed tempers, this has been an impressively civil debate. I think that's one of the best things about python-dev, that discussions like these never degenerate into flamewars. Kudos to all concerned!
Re: [Python-Dev] PEP 460 reboot and a bitter fight
On 1/13/2014 5:06 AM, Nick Coghlan wrote: I figured out tonight that it's only positioning ASCII interpolation as an *alternative* to adding binary interpolation that I have a problem with. It isn't, because you lose the structural assurance that you haven't inadvertently introduced an assumption of ASCII compatibility when you didn't need to. However, interpolation support is a convenient enough interface that I can see a version that *only* supports ASCII compatible interpolation being an attractive nuisance that becomes a source of hard to detect and fix data corruption bugs (just like the str type in Python 2). If we add both, my objections go away: people like me can use the Python 3 only formatb and formatb_map methods and be confident we haven't inadvertently introduced any assumptions regarding ASCII compatibility, while folks that know they're dealing with an ASCII compatible format can use the ASCII assuming versions that are designed to be source compatible with Python 2. If someone incorrectly uses format() or format_map() when they should be using the pure binary versions, that's a trivial bug fix (adding the necessary b, and perhaps some explicit encoding calls) rather than a major restructuring of the code. If they use mod-formatting, that's a slightly bigger fix, but still just switching to a different spelling of the formatting operation. Both use cases (binary only and ASCII compatible) get covered cleanly, and nobody has to lose out. Cheers, Nick. As part of that, what about an alternate spelling of % to allow binary-only interpolation operations using the handy syntax of % ? Doesn't seem like / is defined for bytes or str on the LHS.
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 10:40 AM, Brett Cannon wrote: This even gives people in-place ASCII encoding for strings by always using '{:s}' with text which they can do when they port their code to run under both Python 2 and 3. So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want more explicit encoding to latin-1 then you need to do it explicitly and not rely on the mini-language to do tricks for you. My preference is not have any, but if Guido is going say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things. As for %s for the % operator calling ascii(), I think that will be a porting nightmare of finding out why your bytes suddenly stopped being formatted properly and then having to crawl through all of your code for that one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; do so otherwise feels too much like implicit type conversion and ask any JavaScript developer how that can be a bad thing. So quote 3 is necessarily a violation of quote 1. But if quote 2 can allow for one exception to its absolute consistency... that is probably the best solution overall... ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 9:38 AM, Ethan Furman wrote: On 01/13/2014 09:31 AM, Antoine Pitrou wrote: On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote: You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But it's probably too late to change the str case, and the bytes case should mirror what str does. Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). This makes sense to me. So I guess I'm fine with either the quoted ascii repr or the always blowing up method, leaning towards the blowing up method. +1 - what Ethan said. A real death, instead of death by inappropriately transformed data, is fine by me, if b'%s' % str(...) doesn't have the appropriate .encode(...) call. But I could live with either.
Re: [Python-Dev] PEP 460 reboot
On 13/01/2014 21:01, Paul Moore wrote: I think this should be for 3.5, and should not involve an accelerated release of 3.5 - we should get it into the 3.5 code early and let people thrash out the details during the 3.5 release cycle. I disagree, it should be on pypi now so people can start trying it out, or as others have suggested incorporate it into the six module. Surely that'd make the job of getting it into 3.5 far easier? Paul. PS For all the heated arguments and occasional frayed tempers, this has been an impressively civil debate. I think that's one of the best things about python-dev, that discussions like these never degenerate into flamewars. Kudos to all concerned! +1 -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
Glenn Linderman wrote: Quotes in the stream are a great debug hint, without blowing up. But do you really want those quotes turning up in a *binary* stream, where they're somewhere between awkward and near-impossible to spot by eyeballing, and may only be discovered when something else -- likely a different program, possibly being run by a different person -- tries to read the data back, and blows up because the binary format is corrupted? I'd much rather it blew up at the writing stage, myself. Corrupted binary data is *much* harder to debug than corrupted text, because binary formats typically have little to no margin for error before they become complete garbage. -- Greg ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
I will doggedly keep posting to this thread rather than creating more threads. In another thread, Nick has said he's okay with my proposal (not sure if that includes %s or not, but it now seems of lesser importance) as long as we simultaneously introduce formatb() and formatb_map() (the latter is just a minor variation of the former, so I won't mention it further). But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the intention is that the format string can *only* contain {...} sequences or whether it can also contain regular characters. Translating to formatb(), my question comes down to the legality of the following example: b'Hello, {}'.formatb(name) # Where name is some bytes object If this is allowed, it reintroduces the ASCII bias (since the substring 'Hello' is clearly ASCII). If this isn't allowed, it feels like a perversion of the notion of a formatting language, and I really don't see the attraction over using a combination of concatenation and the struct module, perhaps augmented with some use of bytes([i]) as an alternative to %c or {!c} (if that is what is meant by PEP 460 with 'c modifier' -- I can't find the word 'modifier' in the docs for format(). Note that I honestly don't understand which of these PEP 460 means. Either way, PEP 460's motivation seems kind of subjective and esthetic: While there are reasonably efficient ways to accumulate binary data (such as using a bytearray object, the bytes.join method or even io.BytesIO), none of them leads to the kind of readable and intuitive code that is produced by a %-formatted or {}-formatted template and a formatting operation. I would buy this if a binary format string could contain embedded text (like 'Hello' in my example above), but then the argument about avoiding ASCII bias seems to fall apart so I am at a loss about what Nick actually wants, and even about what PEP 460 actually specifies. 
-- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 12:09 PM, Guido van Rossum wrote: Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimick a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3. If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 13:32:28 -0800 Guido van Rossum gu...@python.org wrote: But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the intention is that the format string can *only* contain {...} sequences or whether it can also contain regular characters. Translating to formatb(), my question comes down to the legality of the following example: b'Hello, {}'.formatb(name) # Where name is some bytes object Yes, it's allowed. But so is: b'\xff\x00{}\x85{}'.formatb(payload, trailer) The ASCII bias is because of the bytes literal notation. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:08 PM, Glenn Linderman wrote: +1 - what Ethan said. A real death, instead of death by inappropriately transformed data, is fine by me, if b'%s' % str(...) doesn't have the appropriate .encode(...) call. But I could live with either. You mean instead of death by a thousand quotes? *ducks and runs* -- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
Terminology. Let's use the official terminology rather than making stuff up. The docs at http://docs.python.org/3/library/string.html#formatspec use the following terminology: Replacement field: {...}; contains field name, conversion, format spec in that order, all optional. Field name: either a decimal integer (referring to an argument by position) or an identifier (by name), or omitted (uses the next available position). Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the value, and then the format spec applies to the resulting string. Format spec: colon, bunch of stuff, type; the type is a letter such as d (decimal) or s (string), and the stuff between the colon and the type is used to specify field width, alignment, sign, padding and such. Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what this leaves for interpolating bytes if we don't want to use {:s}. The docs at http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting don't show %b so it could still be used there, but it would be nicer to be consistent. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
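The terminology above maps onto str.format() as follows. This is only an illustration of the existing str mini-language described in the linked docs, not of any proposed bytes behaviour; the variable names are made up.

```python
# Replacement field with all three parts:
# field name 0, conversion !r (repr), format spec >8 (right-align, width 8).
padded = '{0!r:>8}'.format('ab')

# Field names omitted: each {} takes the next positional argument.
auto = '{} {}'.format('a', 'b')

# Field name given as an identifier, format spec with zero padding and type d.
named = '{count:05d}'.format(count=42)

# The b type is already taken: it renders an integer in base 2.
binary = '{:b}'.format(5)
```

The last line is the clash Guido points out: {:b} already means "binary number", so it cannot also mean "interpolate bytes".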
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote: By allowing format characters that *do* assume ASCII, the entire construct is rendered unsafe - you have to look inside the format string to determine if it is assuming ASCII compatibility or not, thus the entire construct must be deemed as assuming ASCII compatibility at the level of static semantic analysis. I don't see how any of the currently proposed formatting operations make a data-dependent ASCII assumption. When you write b'%d' % x, you're not assuming that x is ASCII, you're assuming that it's an *integer*. The %d conversion of an integer is defined to produce only ASCII characters, and it works on any integer, so there's no data-dependent assumption there. Something that *would* involve such an assumption would be if b'%s' % 'hello' were defined to encode 'hello' as ASCII. But Guido has proposed not doing that, and instead interpolating ascii('hello'). Since ascii() is defined to return only ASCII characters, and works on any string, there is again no data-dependent assumption. My preference would be for b'%s' % 'hello' to raise an exception, but that would still be data-independent. As for having to look inside the format string to know what types are expected, that's no different from any other formatting operation. All it means is that static type analysis in Python is hard, but we already knew that. Allowing these ASCII assuming format codes in the core bytes interpolation introduces *exactly* the same problem as is present in the Python 2 text model: code that *appears* to support arbitrary binary data, but is in fact assuming ASCII compatibility. Can you provide an example of code using Guido's currently approved formatting semantics that would fail when given arbitrary binary data? I don't see how it can happen. -- Greg
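Greg's data-independence argument can be checked directly. The sketch below assumes Python 3.5 or later for the bytes % operator; the sample values are arbitrary and chosen only for illustration.

```python
# %d accepts any integer and always yields ASCII digits (plus an optional sign),
# so the result does not depend on the data being "ASCII compatible".
values = [0, -17, 2 ** 70]
rendered = [b'%d' % v for v in values]
all_ascii = all(byte < 128 for chunk in rendered for byte in chunk)

# ascii() works on any str and escapes everything outside ASCII,
# so encoding its result as ASCII cannot fail.
escaped = ascii('caf\u00e9')
encoded = escaped.encode('ascii')
```

What can still fail is a type mismatch (passing a non-integer to %d), which is the kind of error Greg describes as data-independent.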
Re: [Python-Dev] PEP460 thoughts from a Mercurial dev
On 14 Jan 2014 03:34, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan ncogh...@gmail.com wrote: On 13 January 2014 23:57, Augie Fackler r...@durin42.com wrote: 1) What do we need in terms of functionality Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode). I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+ I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes. I meant your proposed more lenient version (since there's no need for the binary only version to be in the common 2/3 subset). Cheers, Nick. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:20 PM, Mark Lawrence wrote: On 13/01/2014 21:01, Paul Moore wrote: I think this should be for 3.5, and should not involve an accelerated release of 3.5 - we should get it into the 3.5 code early and let people thrash out the details during the 3.5 release cycle. I disagree, it should be on pypi now so people can start trying it out, or as others have suggested incorporate it into the six module. Surely that'd make the job of getting it into 3.5 far easier? It's a bit harder to put a core feature on PyPI. I'm not even sure how it would be done. Fortunately, once it is in 3.5 trunk the adventurous can build their own and try it out that way. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou solip...@pitrou.net wrote: On Mon, 13 Jan 2014 13:32:28 -0800 Guido van Rossum gu...@python.org wrote: But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the intention is that the format string can *only* contain {...} sequences or whether it can also contain regular characters. Translating to formatb(), my question comes down to the legality of the following example: b'Hello, {}'.formatb(name) # Where name is some bytes object Yes, it's allowed. But so is: b'\xff\x00{}\x85{}'.formatb(payload, trailer) The ASCII bias is because of the bytes literal notation. But it is nevertheless there. Including arbitrary hex bytes in the ASCII range should be a liability, unless you have memorized the hex codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'. The above example (is it from a real protocol?) would be just as clear or clearer written as b'\xff\x00' + payload + b'\x85' + trailer or b''.join([b'\xff\x00', payload, b'\x85', trailer]) and reasoning about those versions requires no understanding of ASCII. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
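A small sketch of the two points made here. The payload and trailer values are invented, and the last part assumes Python 3.5 or later, where bytes % formatting is available.

```python
payload = b'\x01\x02'
trailer = b'\x03'

# The two spellings Guido suggests; neither needs any knowledge of ASCII.
by_concat = b'\xff\x00' + payload + b'\x85' + trailer
by_join = b''.join([b'\xff\x00', payload, b'\x85', trailer])

# The liability: hex escapes in the ASCII range are ordinary characters,
# including the ones the formatting languages treat as special.
percent_is_x25 = (b'\x25' == b'%')
brace_is_x7b = (b'\x7b' == b'{')

# Assumption: Python 3.5+. A template written with \x25 is still a format
# code, so b'\x25b' behaves exactly like b'%b'.
surprise = b'\x25b' % b'z'
```

In other words, a binary template containing the byte 0x25 must be written as %% to survive formatting, whereas concatenation and join leave every byte alone.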
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote: On 1/13/2014 12:09 PM, Guido van Rossum wrote: Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimick a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3. If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII). E.g. I'm sure you can find live code doing something like headers.append('Content-Length: %s\r\n' % len(body)) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
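To make the porting concern concrete, here is a sketch of the Content-Length case on the bytes side. It assumes Python 3.5 or later, where %s on bytes ended up accepting only bytes-like objects; the body value is made up.

```python
body = b'hello world'

# %d is the direct spelling when the argument is known to be an integer.
header = b'Content-Length: %d\r\n' % len(body)

# Keeping %s means converting and encoding the number explicitly.
ported = b'Content-Length: %s\r\n' % str(len(body)).encode('ascii')

# Passing the integer straight to %s is rejected rather than converted.
try:
    b'Content-Length: %s\r\n' % len(body)
    int_accepted = True
except TypeError:
    int_accepted = False
```

Code that relied on %s for numbers therefore needs a mechanical change to %d or an explicit encode, which is the bulk-conversion cost Guido is describing.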
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:51 PM, Guido van Rossum gu...@python.org wrote: Terminology. Let's use the official terminology rather than making stuff up. The docs at http://docs.python.org/3/library/string.html#formatspec use the following terminology: Replacement field: {...}; contains field name, conversion, format spec in that order, all optional. Field name: either a decimal integer (referring to an argument by position) or an identifier (by name), or omitted (uses the next available position). Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the value, and then the format spec applies to the resulting string. Format spec: colon, bunch of stuff, type; the type is a letter such as d (decimal) or s (string), and the stuff between the colon and the type is used to specify field width, alignment, sign, padding and such. Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what this leaves for interpolating bytes if we don't want to use {:s}. The docs at http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting don't show %b so it could still be used there, but it would be nicer to be consistent. I have been going on the assumption that bytes.format() would change what '{}' meant for itself and would only interpolate bytes. That convenient between Python 2 and 3 since it represents what we want it to (str and bytes under the hood, respectively), so it just falls through. We could also add a 'b' conversion for bytes() explicitly so as to help people not accidentally mix up things in bytes.format() and str.format(). But I was not suggesting adding a specific format spec for bytes but instead making bytes.format() just do the .encode('ascii') automatically to help with compatibility when a format spec was present. If people want fancy formatting for bytes they can always do it themselves before calling bytes.format(). 
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:59 PM, Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote: On 1/13/2014 12:09 PM, Guido van Rossum wrote: Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimick a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3. If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII). E.g. I'm sure you can find live code doing something like headers.append('Content-Length: %s\r\n' % len(body)) But if the alternative is spurious quotes then the choice is clear... ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:36 PM, Ethan Furman et...@stoneleaf.us wrote: On 01/13/2014 01:20 PM, Mark Lawrence wrote: On 13/01/2014 21:01, Paul Moore wrote: I think this should be for 3.5, and should not involve an accelerated release of 3.5 - we should get it into the 3.5 code early and let people thrash out the details during the 3.5 release cycle. I disagree, it should be on pypi now so people can start trying it out, or as others have suggested incorporate it into the six module. Surely that'd make the job of getting it into 3.5 far easier? It's a bit harder to put a core feature on PyPI. I'm not even sure how it would be done. Fortunately, once it is in 3.5 trunk the adventurous can build their own and try it out that way. You make it a function that under Python 2 and Python 3 before 3.5 does what needs to be done and on 3.5 just directly calls the underlying method. People will still have to change their code, but the idea is it becomes a refactoring instead of a change in how the code is structured.
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 13:56:44 -0800 Guido van Rossum gu...@python.org wrote: On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou solip...@pitrou.net wrote: On Mon, 13 Jan 2014 13:32:28 -0800 Guido van Rossum gu...@python.org wrote: But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the intention is that the format string can *only* contain {...} sequences or whether it can also contain regular characters. Translating to formatb(), my question comes down to the legality of the following example: b'Hello, {}'.formatb(name) # Where name is some bytes object Yes, it's allowed. But so is: b'\xff\x00{}\x85{}'.formatb(payload, trailer) The ASCII bias is because of the bytes literal notation. But it is nevertheless there. Including arbitrary hex bytes in the ASCII range should be a liability, unless you have memorized the hex codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'. That's a good point. I hadn't really thought about that. The above example (is it from a real protocol?) (no, it's cooked up) would be just as clear or clearer written as b'\xff\x00' + payload + b'\x85' + trailer or b''.join([b'\xff\x00', payload, b'\x85', trailer]) and reasoning about those versions requires no understanding of ASCII. Fair enough. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon br...@python.org wrote: I have been going on the assumption that bytes.format() would change what '{}' meant for itself and would only interpolate bytes. That convenient between Python 2 and 3 since it represents what we want it to (str and bytes under the hood, respectively), so it just falls through. We could also add a 'b' conversion for bytes() explicitly so as to help people not accidentally mix up things in bytes.format() and str.format(). But I was not suggesting adding a specific format spec for bytes but instead making bytes.format() just do the .encode('ascii') automatically to help with compatibility when a format spec was present. If people want fancy formatting for bytes they can always do it themselves before calling bytes.format(). This seems hastily written (e.g. verb missing :-), and I'm not clear on what you are (or were) actually proposing. When exactly would bytes.format() need .encode('ascii')? I would be happy to wait a few hours or days for you to to write it up clearly, rather than responding in a hurry. -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 4:59 PM, Guido van Rossum wrote: On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote: If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII). E.g. I'm sure you can find live code doing something like headers.append('Content-Length: %s\r\n' % len(body)) That's why I think we should support %s taking bytes, int, float. And make %b mean the same thing, if you want. But I think we need to keep %s (however limited) for compatibility with Python 2. Personally, I'd be okay with %s not accepting str (by raising an exception). I think that would give us a large compatibility surface in common with Python 2. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 5:25 PM, Eric V. Smith e...@trueblade.com wrote: On 1/13/2014 4:59 PM, Guido van Rossum wrote: On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote: If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII). E.g. I'm sure you can find live code doing something like headers.append('Content-Length: %s\r\n' % len(body)) That's why I think we should support %s taking bytes, int, float. And make %b mean the same thing, if you want. But I think we need to keep %s (however limited) for compatibility with Python 2. Personally, I'd be okay with %s not accepting str (by raising an exception). I think that would give us a large compatibility surface in common with Python 2. %s not accepting str is the major thing I’d personally be against. %s taking numeric types and bytes would be fine. The main thing i’d be worried about is where the RHS may possibly contain something non ASCII that needs encoding (such as the str case). - Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA signature.asc Description: Message signed with OpenPGP using GPGMail ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 5:31 PM, Donald Stufft don...@stufft.io wrote: %s not accepting str is the major thing I’d personally be against. To be more clear: b"%s" % "abc" == No; b"%s" % 123 == Fine - Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] PEP 460 reboot
On 14 Jan 2014 04:58, Guido van Rossum gu...@python.org wrote: Let me try rebooting the reboot. My interpretation of Nick's argument is that he are asking for a bytes formatting language that doesn't have an implicit ASCII assumption. To me this feels absurd. The formatting codes (%s, %c) themselves are expressed as ASCII characters. If you include anything else in the format string besides formatting codes (e.g. b'%s'), you are giving it as ASCII characters. I don't know what characters the EBCDIC codes 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded. Except we allow string escapes and programmatic creation of format strings, so while ASCII snippets in formatting code are certainly easier to type, they are by no means a mandatory feature of using interpolation operations. I agree Can you roll your own binary interpolation support with join() and simple concatenation? Yes, but Antoine's proposal provides a clean and reliable approach to flexible binary templating that isn't offered by the more lenient version. My problem is with telling Python users that if they're working with ASCII compatible data, they get access to a clean interpolation mini-language for templating purposes, but if they aren't, they don't. That's the part I see as potentially breaking the text model: now you have a convenient API on a core type encouraging you to treat your data as ASCII compatible with implicit serialisation of semantic data as ASCII text, even if that may not be appropriate. If pure binary interpolation is added at the same time (regardless of the exact spelling, so long as it's as easy to access as the ASCII templating), that objection goes away. 
That said, the fact that the interpolation mini-languages themselves assume ASCII is the most compelling rationale I have heard so far for treating interpolation as an operation that inherently assumes ASCII compatibility - you can't use arbitrary bytes in your formatting strings without escaping the formatting characters appropriately. While I don't see that as substantially different to needing to escape them in order to retain them in the output of text or ASCII formatting, it's at least a teachable rationale for the absence of a pure binary equivalent. If I had some byte strings in an unknown encoding (but the same encoding for all) that I needed to concatenate I would never think of '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.) If I see some code using *any* formatting operation (regardless of whether it's %d, %r, %s or %c) I am going to assume that there is some ASCII-ness, and if there isn't, the code's author has obscured their goal to me. Right, that's a rationale I can explain to people. It also occurred to me that it's easier to build pure binary interpolation on top of ASCII interpolation than I previously thought: I can just check all the input values are compatible with memoryview. At that point, attempting to pass in anything that would trigger implicit encoding at the formatting stage will fail. (Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the bytes(int)) trap) I hear the objections against b'%s' % 'x' returning b'x' loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +. 
Plus there genuinely are formats where different parts have different encodings and you rely on metadata or format definitions to know what they are. I would actually suggest something like Brett's approach for %s , but with memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate it directly, otherwise invoke normal string formatting and then do strict ASCII encoding at the end. That way people don't have to learn new formatting mini-languages and only have two new behaviours to learn: buffer exporters are interpolated directly, anything else is formatted normally and then implicitly encoding as strict ASCII. In my head I make the following classification of situations where you work with bytes and/or text. (A) Pure binary formats (e.g. most IP-level packet formats, media files, .pyc files, tar/zip files, compressed data, etc.). These are handled using the struct module (e.g. tar/zip) and/or custom C extensions (e.g. gzip). (B) Encoded text. Here you should just decode everything into str objects and parse your text at that level. If you really want to manipulate the data as bytes (e.g. because you have a lot of data to process and very light processing) you may be able to do it, but unless it's
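Nick's two ideas above (checking that values are memoryview-compatible, and a %s that passes buffer exporters through while encoding everything else as strict ASCII) might look roughly like the following. It is a sketch under stated assumptions: `is_buffer` and `interpolate` are hypothetical names, and "normal string formatting" is approximated here by `str()`.

```python
def is_buffer(obj):
    """Return True if obj exports the PEP 3118 buffer interface."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


def interpolate(value):
    """Hypothetical %s rule: buffers pass through, the rest is strict ASCII."""
    if is_buffer(value):
        return bytes(memoryview(value))
    # Raises UnicodeEncodeError if the text form is not pure ASCII.
    return str(value).encode("ascii")


# The bytes(int) trap mentioned in the aside: an int is read as a length.
zero_filled = bytes(3)          # three NUL bytes, not b"3"
int_is_buffer = is_buffer(3)    # False, so bytes(memoryview(3)) would raise
```

A pure-binary wrapper would simply refuse any argument for which `is_buffer` returns False, so nothing that needs implicit encoding can slip through.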
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:59 PM, Guido van Rossum wrote: On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman v+pyt...@g.nevcal.com wrote: On 1/13/2014 12:09 PM, Guido van Rossum wrote: Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimick a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3. If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it. If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a. %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not? Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII). E.g. I'm sure you can find live code doing something like headers.append('Content-Length: %s\r\n' % len(body)) That's portably fixable by switching to %d... or by adding .encode('ascii') ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
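The portable fix suggested at the end of this message can be sketched as follows: format the header as text with %d, then encode it explicitly. The helper name `content_length_header` is made up for illustration.

```python
def content_length_header(body):
    """Build a Content-Length header line as bytes.

    The number is formatted as text with %d and the result is encoded
    explicitly, so no implicit str/bytes mixing is needed.
    """
    return ("Content-Length: %d\r\n" % len(body)).encode("ascii")


headers = []
headers.append(content_length_header(b"hello world"))
```

Because the encoding step is explicit, the same line should work without relying on any bytes formatting support.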
[Python-Dev] PEP 460 -- adding explicit assumptions
As best I can tell, some people (apparently including Guido and PEP author Antoine) are taking some assumptions almost for granted, while other people (including me, before Nick's messages) were not assuming them at all. Since these assumptions (or, possibly, rejections of them?) are likely to decide the outcome, the assumptions should be explicit in the PEP. (1) The bytes-related classes do include methods that are only useful when the already-contained data is encoded ASCII. They do not (and will not) include any operations that *require* an encoding assumption. This implies that no non-bytes data can be added without an explicit encoding. (1a) Not even by assuming ASCII with strict error handling. (1b) Not even for numbers, where ASCII/strict really is sufficient. Note that this doesn't rule out a solution where objects (or maybe just numbers and ASCII-kind text) provide their own encoding to bytes -- but that has to be done by the objects themselves, not by the bytes container or by the interpreter. (2) Most python programmers are still in the future. So an API that confuses people who are still learning about Unicode and the text model is bad -- even if it would work fine for those who do already understand it. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote: so the latter would be less of an attractive nuisance when writing code that needs to handle arbitrary binary formats and can't assume ASCII compatibility. Hang on a moment. What do you mean by code that handles arbitrary binary formats? As far as I can see, the proposed features are for code that handles *particular* binary formats. Ones with well-defined fields that are specified to contain ASCII-encoded text. It's the programmer's responsibility to make sure that the fields he's treating as ASCII really do contain ASCII, just as it's his responsibility to make sure he reads and writes a text file using the correct encoding. Now, it's possible that if you were working from an incomplete spec and some examples, you might be led to believe that a particular field was ASCII when in fact it was some ASCII superset such as latin1 or utf8. In that case, if you parsed it assuming ASCII, you would get into trouble of some sort with bytes greater than 127. However, the proposed formatting operations are concerned only with *generating* binary data, not parsing it. Under Guido's proposed semantics, all of the ASCII formatting operations are guaranteed to produce valid ASCII, regardless of what types or values are thrown at them. So as long as the field's true encoding is something ASCII-compatible, you will always generate valid data. Because I *want to use* the PEP 460 binary interpolation API, but wouldn't be able to use Guido's more lenient proposal, as it is a bug magnet in the presence of arbitrary binary data. Where exactly is this arbitrary binary data that you keep talking about? The only place that arbitrary bytes come into the picture is through b"%s" % b"...", and that's defined to just pass the bytes straight through. I don't see how that could attract any bugs that weren't already present in the data being interpolated.
The LHS may or may not be tainted with assumptions about ASCII compatibility, which means it effectively *is* tainted with such assumptions, which means code that needs to handle arbitrary binary data can't use it and is left without a binary interpolation feature. If I understand correctly, what concerns you here is that you can't tell by looking at b"%s" % x whether it encodes anything as ASCII without knowing the type of x. I'm not sure how serious a problem that would be. Most of the time I think it will be fairly obvious from the purpose of the code what the type of x is *intended* to be. If it's not actually that type, then clearly there's a bug somewhere. Of all such possible bugs, the one most likely to arise due to a confusion in the programmer's mind between text and bytes would be for x to be a string when it was meant to be bytes or vice versa. Due to the still-very-strong separation between text and bytes in Py3, this is unlikely to happen without something else blowing up first. Even if it does happen, it won't result in a data-dependent failure. If b"%s" % 'hello' were defined to interpolate 'hello'.encode('ascii'), then there *would* be cause for concern. But this is not what Guido proposes -- instead he proposes interpolating ascii('hello') == "'hello'". This is almost certainly *never* what the file spec calls for, so you'll find out about it very soon one way or another. Effectively this means that b"%s" % x where x is a string is useless, so I'd much prefer it to just raise an exception in that case to make the failure immediately obvious. But either way, you're not going to end up with a latent failure waiting for some non-ASCII data to come along before you notice it. To summarise, I think the idea of binary format strings being too tainted for a program that does not want to use ASCII formatting to rely on is mostly FUD.
-- Greg
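Greg's point about ascii() versus .encode('ascii') can be checked directly with the built-ins. The variable names below are illustrative only.

```python
# ascii() returns the repr with quotes included, always pure ASCII.
quoted = ascii("hello")            # the 7-character string 'hello' with quotes
escaped = ascii("h\u00e9llo")      # non-ASCII characters become backslash escapes

# Interpolating ascii(x) therefore never fails, but the quotes show up at once.
payload = b"name=" + quoted.encode("ascii")

# By contrast, .encode('ascii') works until non-ASCII data arrives,
# which is the latent, data-dependent failure described above.
try:
    "h\u00e9llo".encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The stray quotes in `payload` are visible on the very first test run, whereas the strict-encoding failure only appears once non-ASCII input turns up.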
[Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]
So when it is time to guess [at the character encoding of a file], a source of good guesses is an important battery to include. The barrier for entry to the standard library is higher than mere usefulness. Agreed. But most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code *is* sufficient. The points of contention are (1) How many programs have to deal with documents written outside their control -- and probably originating on another system. I'm not ready to say most programs in general, but I think that barrier is met for both web clients (for which we already supply several batteries) and quick-and-dirty utilities. (2) How serious are the bugs / How annoying are the workarounds? As someone who mostly sticks to English, and who tends to manually ignore stray bytes when dealing with a semi-binary file format, the bugs aren't that serious for me personally. So I may well choose to write buggy programs, and the bug may well never get triggered on my own machine. But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component). (3) How clearcut is the *right* answer? As I said, at one point (several years ago), the w3c and whatwg started to standardize the right answer. They backed that out, because vendors wanted the option to improve their detection in the future without violating standards. There are certainly situations where local knowledge can do better than a global solution like chardet, but ... the right answer is clear most of the time. Just ignoring the problem is still a 99% answer, because most text in ASCII-mostly environments really is close enough. But that is harder (and the One Obvious Way is less reliable) under Python 3 than it was under Python 2. 
An alias for open that defaulted to surrogate-escape (or returned the new ASCIIstr bytes hybrid) would probably be sufficient to get back (almost) to Python 2 levels of ease and reliability. But it would tend to encourage ASCII/English-only assumptions. You could fix most of the remaining problems by scripting a web browser, except that scripting the browser in a cross-platform manner is slow and problematic, even with webbrowser.py. Whatever a recent Firefox does is (almost by definition) good enough, and is available ... but maybe not in a convenient form, which is one reason that chardet was created as a port thereof. Also note that firefox assumes you will update more often than Python does. Whatever chardet said at the time the Python release was cut is almost certainly good enough too. The browser makers go to great lengths to match each other even in bizarre corner cases. (Which is one reason there aren't more competing solutions.) But that doesn't mean it is *impossible* to construct a test case where they disagree -- or even one where a recent improvement in the algorithms led to regressions for one particular document. That said, such regressions should be limited to documents that were not properly labeled in the first place, and should be rare even there. Think of the changes as obscure bugfixes, akin to a program starting to handle NaN properly, in a place where it should not ever see one. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
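The "alias for open that defaulted to surrogate-escape" mentioned above could be as small as the sketch below. `open_lenient` is a hypothetical name, and the choice of ASCII as the base encoding is an assumption made for the example.

```python
import os
import tempfile


def open_lenient(path, mode="r"):
    """Hypothetical alias: text-mode open that never fails on stray bytes."""
    return open(path, mode, encoding="ascii", errors="surrogateescape")


raw = b"caf\xe9 au lait"  # one Latin-1 byte in otherwise ASCII data

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)
    # The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
    with open_lenient(path) as f:
        text = f.read()
    # Writing the text back restores the original byte.
    with open_lenient(path, "w") as f:
        f.write(text)
    with open(path, "rb") as f:
        round_tripped = f.read()
```

The data round-trips unchanged, but the program never learns what the stray byte meant, which is the ASCII/English-only bias the message warns about.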
Re: [Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]
On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett jimjjew...@gmail.com wrote: The barrier for entry to the standard library is higher than mere usefulness. Agreed. But most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code *is* sufficient. Well, no, that's not sufficient on its own either. But yes, it's a stronger argument. But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component). Having it *do the wrong thing* one run in ten is even worse. If you need chardet, then get approval for the third-party component. That's a political issue, not a technical one. This needs to be in the stdlib because I'm not allowed to install anything else? I hope not. Also, a PyPI package is free to update independently of the Python version schedule. The stdlib is bound. ChrisA ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake
Stephen J. Turnbull wrote: PBP doesn't think it's a great idea to pass around bytes that are implicitly some other type, but didn't mind it (or got used to it) in Python 2, and so they're not looking at that as a problem that Python 3 can solve. They're looking at Python 3 as the problem that prevents them from doing what worked fine in Python 2. While some people may think that way, I don't think it's fair to characterise *all* proponents of bytes formatting as luddites that refuse to get with the Python 3 way. Some of us *do* understand the principles of text/bytes separation in Python 3 and agree that they're a good idea. We just don't agree that the proposed formatting operations violate those principles to any degree worth worrying about. I don't think of my viewpoint as being PBP. That term assumes there is purity there to be beaten. To my mind, any notion of purity with respect to bytes objects went out the window as soon as it was given a pile of text methods -- together with a text-like literal syntax and default repr(), even though at least half the time they're completely inappropriate! -- Greg
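The text-flavoured behaviour of bytes that Greg refers to is easy to see in an interpreter. The sample values below are chosen for illustration.

```python
# Four bytes that happen to be ASCII letters: repr shows them as text.
token = bytes([0x48, 0x54, 0x54, 0x50])
token_repr = repr(token)
token_lower = token.lower()        # text-style method on a bytes object

# Three bytes of a binary header: the same repr and methods still apply,
# they just have nothing useful to say.
blob = bytes([0xFF, 0xD8, 0xFF])
blob_repr = repr(blob)
blob_upper = blob.upper()          # non-ASCII bytes are left untouched
```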
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote: Arbitrary binary data and ASCII compatible binary data are *different things* and the only argument in favour of modelling them with a single type is because Python 2 did it that way. I would say that ASCII compatible binary data is a *subset* of arbitrary binary data. As such, a type designed for arbitrary binary data is a perfectly good way of representing ASCII compatible binary data. What are you saying -- that there should be one type for ASCII compatible binary data, and another type for all binary data *except* when it's ASCII compatible? That makes no sense to me. The Python 3 text model was built on the notion of no implicit encoding and decoding This is nonsense. There are plenty of implicit encoding and decoding operations in Python 3. When you open a text file, it gets an encoding. After that, anything you write to it is implicitly encoded using that encoding. There's even a default encoding when you open the file, so you don't even have to be explicit about that. It's more correct to say that it was built on the notion of using separate types for encoded and decoded data, so that it's *possible* to keep track of the difference. It doesn't mean that there can't be conversions between the two types that are implicit to one degree or another. -- Greg ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
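The implicit encoding Greg describes for text files can be shown with an in-memory stream standing in for a file. This is a minimal sketch; the encoding is passed explicitly here so the result does not depend on the platform default.

```python
import io

raw_stream = io.BytesIO()
writer = io.TextIOWrapper(raw_stream, encoding="utf-8")

# The caller hands over str; the wrapper encodes it without being asked again.
writer.write("na\u00efve")
writer.flush()

encoded = raw_stream.getvalue()
```

Calling open() in text mode builds the same kind of wrapper, and when no encoding argument is given it picks a default, which is the implicit step being pointed out.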
[Python-Dev] Test failures when running as root
And now for something completely different. My root buildbot is finally now able to telnet out and get Connection refused errors. (For the curious, the VirtualBox NAT mode doesn't work properly, but the new NAT Network mode does. Why? I have no idea. But if anyone else is having the same problem, upgrade to the latest VirtualBox and set up a NAT Network. All I care is, it now works.) The test suite is now failing at another point, and this applies to 2.7, 3.3, and 3.x.

======================================================================
ERROR: test_initgroups (test.test_posix.PosixGroupsTester)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/buildarea/3.x.angelico-debian-amd64/build/Lib/test/test_posix.py", line 1143, in test_initgroups
    g = max(self.saved_groups) + 1
ValueError: max() arg is an empty sequence
----------------------------------------------------------------------

The saved_groups value comes from posix.getgroups(), and it's being used to try to get a group that this user doesn't have (I think). When I run Python as root, posix.getgroups() returns [0], but apparently it's not returning any groups when the test runs. So, two questions. Firstly, is this a problem that needs to be fixed in Python, or is it a configuration change that I made? It began failing recently, so possibly when I rebooted the VM as part of VirtualBox changes I mucked something up. And secondly, how can I run the tests manually? I can't find a binary inside the buildarea tree. Does it get deleted afterward? Apologies if these are dumb questions, hopefully they're a small distraction from PEP 460 arguments! ChrisA
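The failure itself is simply max() being handed an empty list. One possible guard is sketched below; it is not the fix that was applied to test_posix, and `unused_gid` is a made-up name. It assumes Python 3.4 or later for the `default` argument of max().

```python
def unused_gid(groups):
    """Hypothetical helper: return a gid larger than every gid in `groups`.

    max() raises ValueError on an empty sequence; default=0 covers the case
    where getgroups() returned nothing.
    """
    return max(groups, default=0) + 1


# Reproduce the error from the traceback with a plain empty list.
try:
    max([]) + 1
    reproduced = False
except ValueError:
    reproduced = True
```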
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 3:13 PM, Guido van Rossum wrote: On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon br...@python.org wrote: On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy tjre...@udel.edu wrote: I personally would not add 'bytes % whatever'. Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away. Well, % has some very strong arguments in its favor still -- for If I shift from a 'personal' to a 'BDFL' viewpoint, I have to agree. example, the sheer amount of code that currently uses it, the fact that it's as close as we get to a cross-language standard, and the This much I know. fact that nobody wants to tackle its use in the logging module (since logger objects are often shared between packages that don't know about each other). This I did not know. Anyway, the % or .format() issue seems completely orthogonal to the issues that get people riled up (which are mostly about whether using either implies some kind of ASCII compatibility). A possibly important difference between '%s' and '{:s}' is that the 's' is required in the former and optional in the latter. So in byteformat(), b'{:s}' continues to format a string (as encoded bytes) while '{:}' 'formats' a byte without having to invent a new code that does not exist in 2.7. That particular solution to does 's' mean bytes or string does not work for % formatting. (And that lack, in turn, is part of what lay behind the inclination expressed above.) For % formatting, I would be inclined to start with 'what does mecurial need?' or even 'does anything even really work for hg?'. Hg is part of our development ecosystem, and we have an hg rep who expressed a desire to experiment. Terry ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]
On 1/13/2014 7:06 PM, Chris Angelico wrote: On Tue, Jan 14, 2014 at 10:48 AM, Jim J. Jewett jimjjew...@gmail.com wrote: Agreed. But most programs will need it, and people will either include (the same) 3rd-party library themselves, or write their own workaround, or have buggy code *is* sufficient. Well, no, that's not sufficient on its own either. But yes, it's a stronger argument. But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet (if I could get approval for the 3rd-party component). Having it *do the wrong thing* one run in ten is even worse. If you need chardet, then get approval for the third-party component. That's a political issue, not a technical one. This needs to be in the stdlib because I'm not allowed to install anything else? I hope not. Also, a PyPI package is free to update independently of the Python version schedule. The stdlib is bound. This discussion strikes me as more appropriate for python-ideas. That said, I am leery of a heuristics module in the stdlib. When is a change a 'bug fix'? and when is it an 'enhancement'? -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 2014-01-13 21:51, Guido van Rossum wrote: Terminology. Let's use the official terminology rather than making stuff up. The docs at http://docs.python.org/3/library/string.html#formatspec use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec in that order, all optional.

Field name: either a decimal integer (referring to an argument by position) or an identifier (by name), or omitted (uses the next available position).

Conversion: !r, !s, !a; these apply repr(), str(), ascii() to the value, and then the format spec applies to the resulting string.

If all you wanted to do was interpolate bytes then you could define a new conversion !b. This would, however, mean that the format spec would be applied to bytes.

Format spec: colon, bunch of stuff, type; the type is a letter such as d (decimal) or s (string), and the stuff between the colon and the type is used to specify field width, alignment, sign, padding and such.

Also, {:b} means binary (i.e. numbers in base 2). I'm not sure what this leaves for interpolating bytes if we don't want to use {:s}. The docs at http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting don't show %b so it could still be used there, but it would be nicer to be consistent.
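The terminology above maps onto str.format() as it exists today; the short examples below use only documented behaviour, with variable names chosen for illustration.

```python
# Conversion: !r, !s and !a apply repr(), str() and ascii() first.
with_repr = "{!r}".format("abc")
with_ascii = "{!a}".format("\u00e9")

# Conversion followed by a format spec: the spec applies to the resulting str.
padded = "{!s:>5}".format(42)

# Field names: by position and by keyword.
by_name = "{0}-{key}".format("a", key="b")

# Format spec type 'b' is already taken: it means base 2.
binary = "{:b}".format(5)
```

Since `{:b}` already means binary, a bytes-interpolation code would need another spelling in the format-spec mini-language, which is the conflict noted in the message.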