Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, 12 Jan 2014 17:51:41 +1000, Nick Coghlan ncogh...@gmail.com wrote:
> On 12 January 2014 04:38, R. David Murray rdmur...@bitdance.com wrote:
>> But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3? That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems. Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available.
>
> And, as has been the case for a long time, the PSF stands ready to help with funding credible grant proposals for Python 3 porting efforts. I believe some of the core devs (including David?) do freelance and contract work, so that's an option definitely worth considering if a project would like to support Python 3, but is having difficulty getting there with purely volunteer effort.

Yes, I do contract programming, as part of Murray and Walker, Inc (web site coming soon but not there yet). And yes I currently have time available in my schedule.

--David
___
Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:
> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>> AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
>
> Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.

Just to check I understood what you are saying. Instead of writing:

    content = b'\n'.join([
        b'header',
        b'part 2 %.3f' % number,
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])

it should now look like:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.encode('utf-16be').decode('latin-1'),
        'trailer']).encode('latin-1')

Correct?
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 12 Jan 2014 21:53, Juraj Sukop juraj.su...@gmail.com wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:
>> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>>> AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
>>
>> Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])
>
> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')

Why are you proposing to do the *join* in text space? Encode all the parts separately, concatenate them with b'\n'.join() (or whatever separator is appropriate). It's only the *text formatting operation* that needs to be done in text space and then explicitly encoded (and this example doesn't even need latin-1, ASCII is sufficient):

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])

> Correct?

My updated version above is the reasonable way to do it in Python 3, and the one I consider clearly superior to reintroducing implicit encoding to ASCII as part of the core text model. This is why I *don't* have a problem with PEP 460 as it stands - it's just syntactic sugar for something you can already do with b''.join(), and thus not particularly controversial.

It's only proposals that add any form of implicit encoding that silently switches from the text domain to the binary domain that conflict with the core Python 3 text model (although third party types remain largely free to do whatever they want).

Cheers, Nick.
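A minimal, runnable sketch of the approach described in the message above: format in the text domain, encode explicitly, and join in the binary domain. The values of number, binary_image_data and utf16_string are made up for illustration; only the shape of the join comes from the thread.

```python
# Illustrative inputs (assumptions, not taken from the thread).
number = 3.14159
binary_image_data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])
utf16_string = 'h\u00e9llo \u4e16\u754c'  # a str with non-Latin-1 code points

content = b'\n'.join([
    b'header',
    # Text formatting happens on str, then is explicitly encoded.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,                # already bytes, passed through untouched
    utf16_string.encode('utf-16be'),  # text encoded with a known encoding
    b'trailer',
])
```

Nothing here round-trips through Latin-1: each part becomes bytes exactly once, using the encoding that part actually needs.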
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan ncogh...@gmail.com wrote:
> Why are you proposing to do the *join* in text space? Encode all the parts separately, concatenate them with b'\n'.join() (or whatever separator is appropriate). It's only the *text formatting operation* that needs to be done in text space and then explicitly encoded (and this example doesn't even need latin-1, ASCII is sufficient):

I apparently misunderstood what Steven was suggesting, thanks for the clarification.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote:
>> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>>> AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
>>
>> Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])

Which doesn't work, since bytes don't support %f in Python 3.

> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
>
> Correct?

Not quite as you show. First, utf16_string confuses me. What is it? If it is a Unicode string, i.e.:

    # Python 3 semantics
    type(utf16_string)  => returns str

then the name is horribly misleading, and it is best handled like this:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])

Note that since it's text, and content is text, there is no need to encode then decode.

UTF-16 is not another name for Unicode. Unicode is a character set. UTF-16 is just one of a number of different encodings which map the distinct Unicode characters (actually code points, U+0000 through U+10FFFF) to bytes. UTF-16 is one possible way to implement Unicode strings in memory, but not the only way. Python has used, or does use, four distinct implementations:

1) UTF-16 in narrow builds
2) UTF-32 in wide builds
3) a hybrid approach starting in Python 3.3, where strings are stored as either:
   3a) Latin-1
   3b) UCS-2
   3c) UTF-32
   depending on the content of the string.

So calling an arbitrary string utf16_string is misleading or wrong.

On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:

    type(utf16_string)  => returns bytes

and those bytes were generated by some text.encode('utf-16'), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])

Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:

    content.encode('utf-8')

(Don't use Latin-1, since it cannot handle the full range of text characters.)

If that's not the case, then perhaps this is better suited to what you are doing:

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,       # already bytes
        b'trailer'])

-- Steven
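The Latin-1 "smuggling" idea discussed above relies on one property: Latin-1 maps each byte value 0..255 to the code point with the same number, so any bytes object decodes without error and re-encodes to the identical bytes. A small sketch of that property, with made-up variable names:

```python
blob = bytes(range(256))            # every possible byte value
smuggled = blob.decode('latin-1')   # one code point (U+0000..U+00FF) per byte

# The code point of each character equals the original byte value.
code_points = [ord(ch) for ch in smuggled]

# Encoding with the same codec restores the original bytes exactly.
restored = smuggled.encode('latin-1')

# Real text outside U+0000..U+00FF cannot be encoded with Latin-1.
try:
    '\u4e16'.encode('latin-1')
except UnicodeEncodeError:
    latin1_rejects_cjk = True
else:
    latin1_rejects_cjk = False
```

The round trip only holds when the same codec is used in both directions, and only while every character in the string stays within U+0000..U+00FF.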
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 11:16:37PM +1000, Nick Coghlan wrote:
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
>
> Why are you proposing to do the *join* in text space?

In defence of that, doing the join as text may be useful if you have additional text processing that you want to do after assembling the whole string, but before calling encode. Even if you intend to encode to bytes at the end, you might prefer to work in the text domain right until just before the end:

- no need for b'' prefixes;
- indexing a string returns a 1-char string, not an int;
- can use the full range of % formatting, etc.

-- Steven
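The indexing difference mentioned in the list above is easy to see in a Python 3 session. A short sketch (the variable names are illustrative):

```python
text = 'header'
data = b'header'

first_char = text[0]     # a length-1 str
first_byte = data[0]     # an int: the value of the first byte
first_slice = data[0:1]  # slicing, unlike indexing, keeps the bytes type
```

Code ported from Python 2 that compares data[0] against a one-character string therefore needs either a slice or an integer comparison when data is bytes.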
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Wait a second, this is how I understood it but what Nick said made me think otherwise... On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano st...@pearwood.infowrote: On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote: On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano st...@pearwood.info wrote: Just to check I understood what you are saying. Instead of writing: content = b'\n'.join([ b'header', b'part 2 %.3f' % number, binary_image_data, utf16_string.encode('utf-16be'), b'trailer']) Which doesn't work, since bytes don't support %f in Python 3. I know and this was an example of the ideal (for me, anyway) way of formatting bytes. First, utf16_string confuses me. What is it? If it is a Unicode string, i.e.: It is a Unicode string which happens to contain code points outside U+00FF (as with the TTF example above), so that it triggers the (at least) 2-bytes memory representation in CPython 3.3+. I agree, I chose the variable name poorly, my bad. content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string, # Misleading name, actually Unicode string 'trailer']) Which, because of that horribly-named-variable, prevents the use of simple memcpy and makes the image data occupy way more memory than as when it was in simple bytes. Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending: Not really, I was interested to compare it to bytes formatting, hence it included the encode() as well. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Daniel Holth writes:
> -1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.

What do you mean by default? It was quite explicit in the code I posted, and it's the only reasonable thing to do with text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk. If you leave it as bytes, it will barf as soon as you try to mix it with text even if it is pure ASCII!
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/12/2014 12:39 PM, Stephen J. Turnbull wrote:
> Daniel Holth writes:
>> -1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.
>
> What do you mean by default? It was quite explicit in the code I posted, and it's the only reasonable thing to do with text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk. If you leave it as bytes, it will barf as soon as you try to mix it with text even if it is pure ASCII!

Which is why some (including myself) are asking to be able to stay in bytes land and do any necessary interpolation there. No resulting unicode, no barfing. ;)

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Why not just use six.byte_format(fmt, *args)? It works on both Python2 and Python3 and accepts the numerical format specifiers, plus '%b' for inserting bytes and '%a' for converting text to ascii.

Admittedly it doesn't exist yet, but it could and it would save a lot of arguing :)

(Apologies to anyone who doesn't appreciate my mischievous sense of humour)

Cheers, Mark.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/12/2014 01:59 PM, Mark Shannon wrote:
> Why not just use six.byte_format(fmt, *args)? It works on both Python2 and Python3 and accepts the numerical format specifiers, plus '%b' for inserting bytes and '%a' for converting text to ascii.

Sounds like the second best option!

> Admittedly it doesn't exist yet, but it could and it would save a lot of arguing :)

:)

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Mon, Jan 13, 2014 at 4:57 AM, Juraj Sukop juraj.su...@gmail.com wrote: On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano st...@pearwood.info wrote: First, utf16_string confuses me. What is it? If it is a Unicode string, i.e.: It is a Unicode string which happens to contain code points outside U+00FF (as with the TTF example above), so that it triggers the (at least) 2-bytes memory representation in CPython 3.3+. I agree, I chose the variable name poorly, my bad. When I'm talking about Unicode strings based on their maximum codepoint, I usually call them something like ASCII string, Latin-1 string, BMP string, and SMP string. Still not wholly accurate, but less confusing than naming an encoding... oh wait, two of those _are_ encodings :| But you could use narrow string for the first two. Or string(0..127) for ASCII, string(0..255) for Latin-1, and then for consistency string(0..65535) and string(0..1114111) for the others, except that I doubt that'd be helpful :) At any rate, BMP as a term for includes characters outside of Latin-1 but all on the Basic Multilingual Plane would probably be close enough to get away with. ChrisA ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano writes:
> then the name is horribly misleading, and it is best handled like this:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string,  # Misleading name, actually Unicode string
>         'trailer'])

This loses bigtime, as any encoding that can handle non-latin1 in utf16_string will corrupt binary_image_data. OTOH, latin1 will raise on non-latin1 characters. utf16_string must be encoded appropriately then decoded by latin1 to be reencoded by latin1 on output.

> On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:
>
>     type(utf16_string)  => returns bytes
>
> and those bytes were generated by some text.encode('utf-16'), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.decode('latin-1'),
>         'trailer'])
>
> Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:
>
>     content.encode('utf-8')
>
> (Don't use Latin-1, since it cannot handle the full range of text characters.)

This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes. In the second case, you can use latin1 to encode, if it gives you what you want. This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
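The corruption described above can be reproduced in a few lines. This is a sketch with a made-up four-byte stand-in for binary_image_data; it shows what reaches the wire when Latin-1-smuggled bytes are encoded as UTF-8 rather than Latin-1.

```python
binary_image_data = bytes([0x00, 0x7F, 0x80, 0xFF])  # illustrative payload
smuggled = binary_image_data.decode('latin-1')

as_utf8 = smuggled.encode('utf-8')      # what a UTF-8 encode would send
as_latin1 = smuggled.encode('latin-1')  # what a Latin-1 encode would send

# Bytes 0x00 and 0x7F survive; 0x80 and 0xFF each become a two-byte sequence.
utf8_len = len(as_utf8)
```

A consumer that reads the UTF-8 output as raw bytes sees six bytes instead of four. Only a consumer that decodes as UTF-8 and then re-encodes as Latin-1 gets the original payload back.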
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/12/2014 02:31 PM, Stephen J. Turnbull wrote:
> This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes. In the second case, you can use latin1 to encode, if it gives you what you want. This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.

And why I've been fighting Steven D'Aprano on it.

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote: Steven D'Aprano writes: then the name is horribly misleading, and it is best handled like this: content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string, # Misleading name, actually Unicode string 'trailer']) This loses bigtime, as any encoding that can handle non-latin1 in utf16_string will corrupt binary_image_data. OTOH, latin1 will raise on non-latin1 characters. utf16_string must be encoded appropriately then decoded by latin1 to be reencoded by latin1 on output. Of course you're right, but I have understood the above as being a sketch and not real code. (E.g. does header really mean the literal string header, or does it stand in for something which is a header?) In real code, one would need to have some way of telling where the binary image data ends and the Unicode string begins. If I have misunderstood the situation, then my apologies for compounding the error [...] Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending: content.encode('utf-8') (Don't use Latin-1, since it cannot handle the full range of text characters.) This corrupts binary_image_data. Each byte 127 will be replaced by two bytes. And reading it back using decode('utf-8') will replace those two bytes with a single byte, round-tripping exactly. Of course if you encode to UTF-8 and then try to read the binary data as raw bytes, you'll get corrupted data. But do people expect to do this? That's a genuine question -- again, I assumed (apparently wrongly) that the idea was to write the content out as *text* containing smuggled bytes, and read it back the same way. In the second case, you can use latin1 to encode, it it gives you what you want. This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes. How would you smuggle a chunk of arbitrary bytes into a text string? 
Short of doing something like uuencoding it into ASCII, or equivalent.

-- Steven
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Ethan Furman writes: This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes. And why I've been fighting Steven D'Aprano on it. No, I think you haven't been fighting Steven d'A on it. You're talking about parsing and generating structured binary files, he's talking about techniques for parsing and generating streams with no real structure above the byte or encoded character level. Of course you can implement the former with the latter using Python 3 str, but it's ugly, maybe even painful if you need to encode binary blobs back to binary to process them. (More discussion in my other post, although I suspect you're not going to be terribly happy with that, either. ;-) This generally *is not* the case for the wire protocol guys. AFAICT they really do want to process things as streams of ASCII-compatible text, with the non-ASCII stuff treated as runs of uninterpreted bytes that are just passed through. So when you talk about we, I suspect you are not the we everybody else is arguing with. In particular, AIUI your use case is not included in the use cases most of us -- including Steven -- are thinking about. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/12/2014 04:02 PM, Stephen J. Turnbull wrote:
> So when you talk about "we", I suspect you are not the "we" everybody else is arguing with. In particular, AIUI your use case is not included in the use cases most of us -- including Steven -- are thinking about.

Ah, so even in the minority I'm in the minority. :/

The "we" I am usually referring to are those of us who have to deal with the mixed ASCII/binary/encoded text files (a couple have spoken up about PDFs, and I have DBF).

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano writes: Of course you're right, but I have understood the above as being a sketch and not real code. (E.g. does header really mean the literal string header, or does it stand in for something which is a header?) In real code, one would need to have some way of telling where the binary image data ends and the Unicode string begins. Sure, but I think in Ethan's case it's probably out of band. I have been assuming out of band. This corrupts binary_image_data. Each byte 127 will be replaced by two bytes. And reading it back using decode('utf-8') will replace those two bytes with a single byte, round-tripping exactly. True, but I'm assuming Ethan himself didn't choose DBF format. Of course if you encode to UTF-8 and then try to read the binary data as raw bytes, you'll get corrupted data. But do people expect to do this? People? Real People use Python, they wouldn't do that. :-) But the app that forced Ethan to deal with DBF might. This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes. How would you smuggle a chunk of arbitrary bytes into a text string? Short of doing something like uuencoding it into ASCII, or equivalent. Arbitary bytes as a chunk? I wouldn't do that, probably (see below), and it's not possible in Python 3 at present (in str ASCII codes always represent the corresponding ASCII character, they are never uninterpreted bytes). But if I know where the bytes are going to be in the str, I'd use latin1 or (encoding='ascii', errors='surrogateescape') depending on how well-controlled the processing is. If I really own those bytes, I might use latin1, and just forget all of the string-processing functions that care about character identity (eg, case manipulation). If the bytes might somehow end up leaking into the rest of the program, I'd use surrogateescape and live with the doubled space usage. 
But really, if it's not a wire-to-wire protocol kind of thing, I'd go ahead and create a proper model for the data, and text would be text, and chunks of arbitrary bytes would be bytes and integers would be integers.
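A sketch contrasting the two options named above, latin1 versus (encoding='ascii', errors='surrogateescape'), on a made-up ASCII-compatible byte string. The sample data and variable names are assumptions for illustration only.

```python
raw = b'Subject: caf\xe9 \xff\n'  # ASCII-compatible, encoding unknown

# surrogateescape: non-ASCII bytes become lone surrogates U+DC80..U+DCFF.
text = raw.decode('ascii', errors='surrogateescape')
back = text.encode('ascii', errors='surrogateescape')

# Case manipulation leaves the escaped bytes alone.
shouted = text.upper().encode('ascii', errors='surrogateescape')

# If the escaped bytes leak into a strict encode, it fails loudly.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False

# latin1: the same bytes become ordinary characters, so string methods
# that care about character identity can silently change them.
mangled = raw.decode('latin-1').upper()
```

With latin1, upper() turns byte 0xE9 into 0xC9 and maps 0xFF to U+0178, which no longer fits in Latin-1 at all. That is the character-identity hazard the message warns about.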
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11 January 2014 08:58, Ethan Furman et...@stoneleaf.us wrote: On 01/10/2014 02:42 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 17:33:57 -0500 Eric V. Smith e...@trueblade.com wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? It's emphatically *NOT* a binary interpolation operation though - the binary representation of the integer 1 is the byte value 1, not the byte value 49. If you want the byte value 49 to appear in the stream, then you need to interpolate the *ASCII encoding* of the string 1, not the integer 1. If you want to manipulate text representations, do it in the text domain. If you want to manipulate binary representations, do it in the binary domain. 
The *whole point* of the text model change in Python 3 is to force programmers to *decide* which domain they're operating in at any given point in time - while the approach of blurring the boundaries between the two can be convenient for wire protocol and file format manipulation, it is a horrendous bug magnet everywhere else.

PEP 460 is just about adding back some missing functionality in the binary domain (interpolating binary sequences together), not about bringing back the problematic text model that allows particular text representations to be interpreted as if they were also binary data.

That said, I actually think there's a valid use case for a Python 3 type that allows the bytes/text boundary to be blurred in making it easier to port certain kinds of Python 2 code to Python 3 (specifically, working with wire protocols and file formats that contain a mixture of encodings, but all encodings are *known* to at least be ASCII compatible). It is highly unlikely that such a type will *ever* be part of the standard library, though - idiomatic Python 3 code shouldn't need it, affected Python 2 code *can* be ported without it (but may look more complicated due to the use of explicit decoding and encoding operations, rather than relying on implicit ones), and it should be entirely possible to implement it as an extension module (modulo one bug in CPython that may impact the approach, but we won't know for sure until people actually try it out).

Fortunately, after years of my suggesting the idea to almost everyone that complained about the move away from the broken POSIX text model in Python 3, Benno Rice has started experimenting with such a type based on a preliminary test case I wrote at linux.conf.au last week: https://github.com/jeamland/asciicompat/blob/master/tests/ncoghlan.py

Cheers, Nick.
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
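The distinction drawn in the message above, between the binary representation of the integer 1 and the ASCII encoding of its text representation, can be shown directly. A small sketch; the big-endian 32-bit layout in the last line is just one example of a binary format, chosen for illustration.

```python
import struct

n = 1

binary_form = bytes([n])             # a single byte with value 1
ascii_form = str(n).encode('ascii')  # the ASCII digit: a byte with value 49
packed = struct.pack('>I', n)        # one possible fixed-width binary layout
```

Which of these belongs in the output stream depends on the protocol, which is why the choice has to be made explicitly.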
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11 January 2014 12:28, Ethan Furman et...@stoneleaf.us wrote:
> On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
>> On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote:
>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>>
>> Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
>
> Heh, and here I thought it was stubborn opposition in the name of purity. ;)

No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".

Yes, we know we changed the text model and knocked wire protocols off their favoured perch, and we're (thoroughly) aware of the fact that wire protocol developers don't like the fact that the default model now strongly favours the vastly more common case of application development.

However, until Benno volunteered to start experimenting with implementing an asciistr type yesterday, there have been *zero* meaningful attempts at trying to solve the issues with wire protocol manipulation outside the Python 3 core - instead there has just been a litany of whining that Python 3 is different from Python 2, and a complete and total refusal to attempt to understand *why* we changed the text model.

The answer *should* be obvious: the POSIX based text model in Python 2 makes web frameworks easier to write at the expense of making web applications *harder* to write, and the same is true for every other domain where the wire protocol and file format handling is isolated to widely used frameworks and support libraries, with the application code itself operating mostly on text and structured data.

With the Python 3 text model, we decided that was a terrible trade-off, so the core text model now *strongly* favours application code. This means that it is now *boundary* code that may need additional helper types, because the core types aren't necessarily going to cover all those use cases any more. In particular, the bytes type is, and always will be, designed for pure binary manipulation, while the str type is designed for text manipulation. The weird kinda-text-kinda-binary 8-bit builtin type is gone, and *deliberately* so.

I've been saying for years that people should experiment with creating a Python 3 extension type that behaves more like the Python 2 str type. For the standard library, we've never hit a case where the explicit encoding and decoding was so complicated that creating such a type seemed simpler, so *we're* not going to do it.

After discussing it with me at LCA, Benno Rice offered to try out the idea, just to determine whether or not it was actually possible. If there are any CPython bugs that mean the idea *doesn't* currently work (such as interoperability issues in the core types), then I'm certainly happy for us to fix *those*. But we're never ever going to change the core text model back to the broken POSIX one, or even take steps in that direction.

Regards, Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
For not caring much, your own stubbornness is quite notable throughout this discussion. Stones and glass houses. :)

That said: Twisted and Mercurial aren't the only ones who are hurt by this, at all. I'm aware of at least two other projects that are actively hindered in their support of, or migration to, Python 3 by the bytes type not having some basic functionality that strings had in 2.0.

The purity crowd in here has brought up that it was an important and serious decision to split Text from Bytes in Py3, and I actually agree with that. However, it is missing some very real and very concrete use-cases -- there are multiple situations where there are byte streams which have a known text-subset which they really, really do need to operate on.

There's been a number of examples given: PDF, HTTP, network streams that switch inline from text-ish to binary and back again. But we can focus that down to a very narrow and not at all uncommon situation in the latter. Look at the HTTP Content-Length header.

HTTP headers are fuzzy. My understanding is, per the RFCs, their body can be arbitrary octets to the exclusion of line feeds and DELs -- my understanding may be a bit off here, and please feel free to correct me -- but the relevant specifications are a bit fuzzy to begin with. To my understanding of the spec, the header field name is essentially an ASCII text field (sans separator), and the body is... anything, or nearly anything.

This is HTTP, which is surely one of the most used protocols in the world. The need to be able to assemble and disassemble such streams is a real, valid use-case.

With that in mind, look at the Content-Length header I mentioned. It seems those who are declaring a purity priority in bytes/string separation think it reasonable to do things like:

    headers.append(b"Content-Length: " + ("%d" % len(content)).encode("ascii"))

Or something. In the middle of processing a stream, you need to convert this number into a string, then encode it into bytes, just to represent the number as the extremely common, widely-accessible 7-bit ASCII form of its numerical value.

This isn't some rare, grandiose or fiendish undertaking, or trying to merge Strings and Bytes back together: this is the simple practical recognition that representing a number as its ASCII-numerical value is actually not at all uncommon.

This position seems utterly astonishing in its ridiculousness to me. That the number 123 may be represented as b"123" surprises me as a controversial thing, considering how often I see it in real life.

There is a LOT of code out there which needs a little bit of a middle ground between bytes and strings; it doesn't mean you are giving way and allowing strings and bytes to merge and giving up on the Edict of Separation. But there are real-world use-cases where you simply need to be able to do many basic String-like operations on byte streams. The removal of the ability to use interpolation to construct such byte strings was a major regression in Python 3 and is a big hurdle for more than a few projects to upgrade.

I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ASCII-compatible and can't perform text-ish operations on them --

    Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"stephen hansen".title()
    b'Stephen Hansen'

How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here.

There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.

Personally, I won't be converting my day job's codebase to Python 3 anytime soon (where 'soon' is defined as 'within five years, assuming a best-case scenario that a number of third-party issues are resolved'). But! I'm aware of and involved with other projects, and this has bitten two of them specifically. I'm sure there are others who are not aware of this list or don't feel comfortable talking on it (as it is, I encouraged one of those projects' coders to speak up, but they thought the question was a lost one due to previous responses on the original issue ticket and gave up).

On Fri, Jan 10, 2014 at 6:04 PM, Antoine Pitrou solip...@pitrou.net wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote:
>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>
> Then we might as well not do anything, since any
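For illustration, the Content-Length workaround described in the message above can be written out as a short runnable sketch. The names `content`, `headers` and `message` are made up for the example and do not come from any real HTTP library; the point is only to show the "format as str, then encode to ASCII" step that the message objects to.

```python
# Hypothetical names for illustration; not taken from any real HTTP library.
content = b"\x89PNG\r\n\x1a\n" + bytes(16)   # arbitrary binary payload, 24 bytes
headers = []

# Format the number as text, then encode it explicitly to ASCII bytes.
headers.append(b"Content-Length: " + ("%d" % len(content)).encode("ascii"))

# The same thing without %-formatting.
headers.append(b"Content-Length: " + str(len(content)).encode("ascii"))

# Assemble a message: one header line, a blank line, then the binary body.
message = headers[0] + b"\r\n\r\n" + content
```

Both spellings produce the same bytes; the disagreement in the thread is about whether this detour through str should be necessary, not about whether it works.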
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/11/2014 1:44 AM, Stephen Hansen wrote:
> There's been a number of examples given: PDF, HTTP, network streams that switch inline from text-ish to binary and back-again.. But, we can focus that down to a very narrow and not at all uncommon situation in the latter.

PDF has been mentioned a few times. ReportLab recently decided to convert to Python 3, and fairly quickly (from my perspective, it took them a _long_ time to decide to port, but once they decided to, then it seemed quick) produced an alpha version that passes many of their tests. I've not tried it yet, although it interests me, as I have some Python 2 code written only because ReportLab didn't support Python 3, and I wanted to generate some PDF files. I'll be glad to get rid of the Python 2 code, once they are released.

But I guess they figured out a solution that wasn't onerous. I'd have to go re-read the threads to be sure, but it seems they are running one code base for both... not sure of the details of what techniques they used, or if they ever used the % operator :)

But I'm wondering, since they did what they did so quickly, if the mixed bytes and str use case is mostly, in fact, a mind-set issue... yes, likely some code has to change, but maybe the changes really aren't all that significant.

I wouldn't want to drag them into this discussion, I'd rather they get the port complete, but it would be interesting to know what they did, and how they did it, and what problems they had, etc. If anyone here knows that code a bit, perhaps the diffs could be examined in their repository to figure out what they did, and how much it impacted their code. I do know they switched XML parsers along the way, as well as dealing with string handling differences.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
I don't know what the fuss is about. This isn't about breaking the text model. It's about a convenient way to turn text into bytes using a default, lenient, way. Not the other way round.

Here's my proposal: b'foo%sbar' % (a) would implicitly apply the equivalent of the following function to every object in the tuple:

    def coerce_ascii(o):
        if has_bytes_interface(o):
            return o
        return o.encode('ascii', 'strict')

There's no need for special %d or %f formatting. If more fanciful formatting is required, e.g. exponents or precision, then by all means, do it in the str domain:

    b'foo%sbar' % ("%.15f" % (42.2,))

Basically, let's just support simple bytes interpolation that will support coercing into bytes by means of strict ASCII. It's a one-way convenience, explicitly requested, and for consenting adults.

-Original Message-
From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames@python.org] On Behalf Of Nick Coghlan
Sent: 11 January 2014 08:43
To: Ethan Furman
Cc: python-dev@python.org
Subject: Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".
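A minimal sketch of the proposal above, to make its behaviour concrete. The message does not define `has_bytes_interface`, so the version here (bytes, bytearray or memoryview) is an assumption, and `interpolate` is a hypothetical helper that only stands in for the proposed `b'...%s...' % args`; it handles plain `%s` and nothing else (no `%%`, no other format codes).

```python
def has_bytes_interface(o):
    # Assumption: "bytes interface" is read here as "is a bytes-like object".
    return isinstance(o, (bytes, bytearray, memoryview))


def coerce_ascii(o):
    # Bytes-like objects pass through; text must be pure ASCII or this raises.
    if has_bytes_interface(o):
        return bytes(o)
    return o.encode("ascii", "strict")


def interpolate(template, args):
    """Hypothetical stand-in for the proposed bytes %s interpolation."""
    parts = template.split(b"%s")
    if len(parts) - 1 != len(args):
        raise TypeError("number of %s placeholders and arguments differ")
    out = [parts[0]]
    for arg, tail in zip(args, parts[1:]):
        out.append(coerce_ascii(arg))
        out.append(tail)
    return b"".join(out)


# Fancier formatting is done in the str domain first, as the message suggests.
framed = interpolate(b"foo%sbar", ("%.3f" % 42.2,))
mixed = interpolate(b"<%s|%s>", (b"\xff\x00", "len=2"))
```

Under these assumptions, non-ASCII text fails loudly with UnicodeEncodeError instead of being guessed at, which is the "strict ASCII" part of the proposal.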
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson c...@zip.com.au wrote:
> Hi Juraj,

Hello Cameron.

> data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )

Thanks for the suggestion! The problem with bytify is that some items might require different formatting than other items. For example, in Cross-Reference Table there are three different formats: non-padded integer (1), 10- and 15digit integer, (03, 65535).
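To make the "different items need different formatting" point concrete, here is a hedged sketch of writing PDF cross-reference entries with today's tools. The field widths used (a zero-padded 10-digit byte offset and a zero-padded 5-digit generation number, 20 bytes per entry) are my reading of the PDF format, not something spelled out in the message, and `xref_entry` and `xref_subsection` are hypothetical helper names.

```python
def xref_entry(offset, generation, in_use=True):
    # Assumed layout: 10-digit offset, space, 5-digit generation, space,
    # "n" (in use) or "f" (free), then a two-byte end-of-line: 20 bytes total.
    text = "%010d %05d %s \n" % (offset, generation, "n" if in_use else "f")
    return text.encode("ascii")


def xref_subsection(first_obj, entries):
    # The subsection header uses plain, non-padded integers.
    header = ("%d %d\n" % (first_obj, len(entries))).encode("ascii")
    return header + b"".join(xref_entry(*entry) for entry in entries)


table = xref_subsection(0, [(3, 65535, False), (17, 0, True)])
```

Each value needs its own format code, which is why a single generic "bytify everything" helper does not fit this case well.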
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano st...@pearwood.info wrote:
> I'm sorry, I don't understand what you mean here. I'm honestly not trying to be difficult, but you sound confident that you understand what you are doing, but your description doesn't make sense to me. To me, it looks like you are conflating bytes and ASCII characters, that is, assuming that characters are in some sense identical to their ASCII representation. Let me explain:
>
> The integer that in English is written as "100" is represented in memory as bytes 0x0064 (assuming a big-endian C short), so when you say "an integer is written down AS-IS" (emphasis added), to me that says that the PDF file includes the bytes 0x0064. But then you go on to write the three character string "100", which (assuming ASCII) is the bytes 0x313030. Going from the C short to the ASCII representation 0x313030 is nothing like inserting the int as-is. To put it another way, the Python 2 '%d' format code does not just copy bytes.

Sorry, I should've included an example: when I said "as-is" I meant "1", "0", "0", so that would be your 0x313030.

> If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:
>
> Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.

This is similar to what Chris Barker suggested. I'm also not trying to be difficult here, but please explain one thing to me. To treat bytes as if they were Latin-1 is a bad idea, and that's why %f got dropped in the first place, right? How is it then all right to put an image inside a Unicode string?

Also, apart from the in/out conversions, do any other difficulties come to your mind?

> Please also take note that in Python 3.3 and better, the internal representation of Unicode strings containing only code points up to 255 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte per character.

I guess you meant [C]Python...

In any case, thanks for the detailed reply.
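The distinction argued over in the exchange above (the integer 100 as a big-endian C short versus the three ASCII characters "100") can be checked directly. This is only an illustrative sketch; the variable names are made up.

```python
import binascii
import struct

n = 100

# The number as a big-endian 16-bit C short: bytes 0x00 0x64.
as_short = struct.pack(">h", n)

# The number as ASCII digits, which is what '%d' produces: bytes 0x31 0x30 0x30.
as_ascii = ("%d" % n).encode("ascii")

short_hex = binascii.hexlify(as_short)
ascii_hex = binascii.hexlify(as_ascii)
```

The "as-is" in the reply refers to the second form: the decimal digits of the number written out as ASCII bytes.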
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 09:43, Nick Coghlan wrote:
> On 11 January 2014 12:28, Ethan Furman et...@stoneleaf.us wrote:
>> On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
>>> On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote:
>>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>>>
>>> Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
>>
>> Heh, and here I thought it was stubborn opposition in the name of purity. ;)
>
> No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".
>
> Yes, we know we changed the text model and knocked wire protocols off their favoured perch, and we're (thoroughly) aware of the fact that wire protocol developers don't like the fact that the default model now strongly favours the vastly more common case of application development.
>
> However, until Benno volunteered to start experimenting with implementing an asciistr type yesterday, there have been *zero* meaningful attempts at trying to solve the issues with wire protocol manipulation outside the Python 3 core

Can we please also include pseudo-binary file formats? It's not just wire protocols.

Georg
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 10:44, Stephen Hansen wrote:
> I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
>
>     Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> b"stephen hansen".title()
>     b'Stephen Hansen'
>
> How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here.
>
> There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.

This. Exactly. Thanks for putting it so nicely, Stephen.

Georg
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 14:49, Georg Brandl wrote:
> On 11.01.2014 10:44, Stephen Hansen wrote:
>> I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
>>
>>     Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>>     Type "help", "copyright", "credits" or "license" for more information.
>>     >>> b"stephen hansen".title()
>>     b'Stephen Hansen'
>>
>> How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here.
>>
>> There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
>
> This. Exactly. Thanks for putting it so nicely, Stephen.

To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated asciistr type right now. But it has the functionality, and it's way too late to remove it.

Georg
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 14:54, Georg Brandl wrote:
> On 11.01.2014 14:49, Georg Brandl wrote:
>> On 11.01.2014 10:44, Stephen Hansen wrote:
>>> I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
>>>
>>>     Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>>>     Type "help", "copyright", "credits" or "license" for more information.
>>>     >>> b"stephen hansen".title()
>>>     b'Stephen Hansen'
>>>
>>> How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here.
>>>
>>> There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
>>
>> This. Exactly. Thanks for putting it so nicely, Stephen.
>
> To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated asciistr type right now. But it has the functionality, and it's way too late to remove it.

I think we need to step back a little from the purist view of things and give more emphasis to the "practicality beats purity" Zen.

I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes. We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.

If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways, many of which will probably miss a few edge cases.

bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.

BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without a known (but ASCII compatible) encoding, or with multiple different encodings in a single data chunk, is part of life? Most HTTP packets fall into this category, many email messages as well.

And let's not forget that we don't live in a perfect world. Broken encodings are everywhere around you - just have a look at your spam folder for a decent chunk of example data :-)

-- Marc-Andre Lemburg eGenix.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, 11 Jan 2014 08:26:57 +0100 Georg Brandl g.bra...@gmx.net wrote:
> On 11.01.2014 03:04, Antoine Pitrou wrote:
>> On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote:
>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>
> I agree.
>
>> Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
>>
>> (I don't care much personally, I think the issue is quite overblown anyway)
>
> So you wouldn't mind another overhaul of the PEP including a bit more functionality again? :) I really think that practicality beats purity here. (I'm not advocating free mixing bytes and str, mind you!)

The PEP already proposes a certain amount of practicality. I personally *would* mind adding %d and friends to it. But of course someone can fork the PEP or write another one.

Regards

Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 12 January 2014 01:15, M.-A. Lemburg m...@egenix.com wrote:
> On 11.01.2014 14:54, Georg Brandl wrote:
>> On 11.01.2014 14:49, Georg Brandl wrote:
>>> On 11.01.2014 10:44, Stephen Hansen wrote:
>>>> I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
>>>>
>>>>     Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>>>>     Type "help", "copyright", "credits" or "license" for more information.
>>>>     >>> b"stephen hansen".title()
>>>>     b'Stephen Hansen'
>>>>
>>>> How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here.
>>>>
>>>> There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
>>>
>>> This. Exactly. Thanks for putting it so nicely, Stephen.
>>
>> To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated asciistr type right now. But it has the functionality, and it's way too late to remove it.
>
> I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen. I complete agree with Stephen, that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes. We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.

Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.

We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX:
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.

> If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways of which many will probably miss a few edge cases. bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.

Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea.

Anyone that is pushing for this should be checking out Benno's first draft experimental prototype for asciistr and be working on getting it passing the test suite I created:
https://github.com/jeamland/asciicompat

The "Wah, you broke it and now I have completely forgotten how to create custom types, so I'm just going to piss and moan until somebody else fixes it" infantilism of the past five years in this regard has frankly pissed me off.

Regards, Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
> On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano st...@pearwood.info wrote:
>> If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:
>>
>>     10 0 obj
>>     << /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 >>
>>     stream
>>     ...binary image data...
>>     endstream
>>     endobj
>>
>> Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.
>
> This is similar to what Chris Barker suggested. I also don't try to be difficult here but please explain to me one thing. To treat bytes as if they were Latin-1 is bad idea,

Correct. Bytes are not Latin-1. Here are some bytes which represent a word I extracted from a text file on my computer:

    b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word is a C1 control character ("VTS", or Vertical Tabulation Set) followed by "u§erst", but it is not. It is actually the German word "äußerst" ("extremely"), and the text file was generated on a 1990s vintage Macintosh using the MacRoman extended ASCII code page.

> that's why %f got dropped in the first place, right? How is it then alright to put an image inside an Unicode string?

The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes. This gives you:

- convenient syntax, no need to prefix strings with b;
- mostly avoid needing to decode and encode strings, except at a few points in your code;
- the full set of string methods;
- can easily include arbitrary octal or hex byte values, using \o and \x escapes;
- error checking: when you finally encode the text to bytes before writing to a file, or sending over a wire, any code-point greater than U+00FF will give you an exception unless explicitly silenced.

No need to wait for Python 3.5 to come out, you can do this *right now*.

Of course, this is a little bit unclean, it breaks the separation of text and bytes by treating bytes *as if* they were Unicode code points, which they are not, but I believe that this is a practical technique which is not too hard to deal with. For instance, suppose I have a mixed format which consists of an ASCII tag, a number written in ASCII, a NULL separator, and some binary data:

    # Using bytes
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob
    => gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'

That's a bit ugly, but not too ugly. I could write code like that. But if bytes had % formatting, I might write this instead:

    data = b"Tag:%d\0%s" % (len(values), blob)

This is a small improvement, but I can't use it until Python 3.5 comes out. Or I could do this right now:

    # Using text
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))
    => gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I encode, and get:

    data.encode('latin-1')
    => b'Tag:4\x00s\x14q\xd9yzi\xf3'

which is exactly the same as I got in the first place. In this case, I'm not using Latin-1 for the semantics of bytes to characters (e.g. byte \xf3 = char ó), but for the useful property that all 256 distinct bytes are valid in Latin-1. Any other encoding with the same property will do.

It is a little unfortunate that struct gives bytes rather than a str, but you can hide that with a simple helper function:

    def b2s(bytes):
        return bytes.decode('latin1')

    data = "Tag:%d\0%s" % (len(values), b2s(blob))

> Also, apart from the in/out conversions, do any other difficulties come to your mind?

No. If you accidentally introduce a non-Latin1 code point, when you encode you'll get an exception.

-- Steven
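The three spellings compared in the message above can be run side by side. This is a sketch under two assumptions: the struct format is taken to be big-endian (">h"), which is what reproduces the byte string quoted in the message, and the bytes % form is guarded because it only runs on interpreters where bytes support % formatting (Python 3.5 and later). `encode_fails` is a hypothetical helper added to show the error-checking property.

```python
import struct
import sys

values = [29460, 29145, 31098, 27123]
# Assumption: big-endian signed shorts, matching the output quoted above.
blob = b"".join(struct.pack(">h", n) for n in values)

# 1. Pure bytes: encode the number explicitly and concatenate.
data_bytes = b"Tag:" + str(len(values)).encode("ascii") + b"\0" + blob


# 2. Text domain: smuggle the binary blob through str via Latin-1.
def b2s(raw):
    return raw.decode("latin-1")


data_text = "Tag:%d\0%s" % (len(values), b2s(blob))
data_latin1 = data_text.encode("latin-1")

# 3. bytes % formatting, only where the interpreter supports it.
if sys.version_info >= (3, 5):
    data_percent = b"Tag:%d\0%s" % (len(values), blob)
else:
    data_percent = data_bytes


def encode_fails(text):
    # Hypothetical helper: True if text cannot be written out as Latin-1.
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return True
    return False
```

All three produce identical bytes. The safety net in the Latin-1 approach sits at the final encode step: a code point above U+00FF, such as the euro sign, raises UnicodeEncodeError there.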
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, 12 Jan 2014 01:34:26 +1000 Nick Coghlan ncogh...@gmail.com wrote:
> Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.

+1 to what Nick says. Forcing some constructs to be explicit leads people to know about the issue and understand it, rather than sweep it under the carpet as Python 2 encouraged them to do.

Yes, if you're dealing with a file format or network protocol, you'd better know in which charset its textual information is being expressed. It's a very sane question to ask yourself!

Regards

Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
> The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes.

So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few extra seat belts thrown in).

Besides being a bit awkward, this also means that any encoded text (even the plain ASCII stuff) is now being transformed three times instead of one:

    unicode to bytes
    bytes to unicode using latin1
    unicode to bytes

Even if the cost of moving those bytes around is cheap, it's not free. When you're creating hundreds of PDFs at a time that's going to make a difference.

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 16:34, Nick Coghlan wrote:
> On 12 January 2014 01:15, M.-A. Lemburg m...@egenix.com wrote:
>> On 11.01.2014 14:54, Georg Brandl wrote:
>>> On 11.01.2014 14:49, Georg Brandl wrote:
>>>> On 11.01.2014 10:44, Stephen Hansen wrote:
>>>>> I mean, it's not like the bytes type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
>>>>>
>>>>>     Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>>>>>     Type "help", "copyright", "credits" or "license" for more information.
>>>>>     >>> b"stephen hansen".title()
>>>>>     b'Stephen Hansen'
>>>>>
>>>>> How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine? I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
>>>>
>>>> This. Exactly. Thanks for putting it so nicely, Stephen.
>>>
>>> To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated asciistr type right now. But it has the functionality, and it's way too late to remove it.
>>
>> I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen. I complete agree with Stephen, that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes. We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
>
> Yes, it bloody well does.
> The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.
>
> We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
>
> While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.

FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)

Seriously, Unicode has always caused heated discussions and I don't expect this to change in the next 5-10 years. The point is: there is no 100% perfect solution either way and when you acknowledge this, things don't look black and white anymore, but instead full of colors :-)

Python 3 forces people to actually use Unicode; in Python 2 they could easily avoid it. It's good to educate people on how it's used and the issues you can run into, but let's not forget that people are trying to get work done and we all love readable code.

PEP 460 just adds two more methods to the bytes object which come in handy when formatting binary data; I don't think it has potential to muddy the Python 3 text model, given that the bytes object already exposes a dozen other ASCII text methods :-)

>> If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways of which many will probably miss a few edge cases.
bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data. Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea. Anyone that is pushing for this should be checking out Benno's first draft experimental prototype for asciistr and be working on getting it passing the test suite I created: https://github.com/jeamland/asciicompat The Wah, you broke it and now I have completely forgotten how to create custom types, so I'm just going to piss and moan until somebody else fixes it infantilism of the past five years in
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 2014-01-11, 10:56 GMT, you wrote: I don't know what the fuss is about. I just cannot resist: When you are calm while everybody else is in the state of panic, you haven’t understood the problem. -- one of many collections of Murphy’s Laws Matěj
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 12:43 AM, Nick Coghlan wrote:
> In particular, the bytes type is, and always will be, designed for pure binary manipulation [...]

I apologize for being blunt, but this is a lie.

Let's take a look at the methods defined by bytes:

    >>> dir(b'')
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods?

Let's take a look at the repr of bytes:

    >>> bytes([48, 49, 50, 51])
    b'0123'

Wow, that sure doesn't look like binary data!

Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes). If the aim was indeed for pure binary manipulation, we failed. We left in bunches of methods which can *only* be interpreted as supporting ASCII manipulation.

Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class screaming "I want to be ASCII! I want to be ASCII!" or add back the missing functionality.

-- ~Ethan~
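A short sketch of the behaviour discussed above, assuming Python 3.3 or later. It shows that the text-like methods on bytes are ASCII-only, while Python 3 still refuses to mix bytes and str implicitly. The sample strings are made up for illustration.

```python
data = bytes([48, 49, 50, 51])
assert data == b"0123"            # the repr shows ASCII where it can
assert data.isdigit()             # an ASCII-aware predicate on bytes

assert b"stephen hansen".title() == b"Stephen Hansen"

# The case methods only touch ASCII letters: byte 0xDF is left alone,
# whereas the str method applies the full Unicode rule for U+00DF.
assert b"stra\xdfe".upper() == b"STRA\xdfE"
assert "stra\xdfe".upper() == "STRASSE"

# There is no implicit coercion between the two types.
try:
    b"abc" + "def"
except TypeError:
    mixing_allowed = False
else:
    mixing_allowed = True
```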
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 07:34 AM, Nick Coghlan wrote:
> On 12 January 2014 01:15, M.-A. Lemburg wrote:
>> We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
>
> Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works . . .

We are not proposing a change to the unicode string type in any way.

> We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX . . .

We are not asking for that.

>> bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
>
> Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea.

True, if this were a new API; but it isn't, it's the Py2 str API that was stripped out. The one big difference being that if the result of %s (or %d or any other %) is not in the 0-127 range it errors out.

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
>> The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes.
>
> So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few extra seat belts thrown in).

I'm not blurring anything. The people who designed the file format that mixes textual data and binary data did the blurring. Given that such formats exist, it is inevitable that we need to put text into bytes, or bytes into text. The situation is already blurred, we just have to decide how to handle it. There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as strings, but do byte operations (like bit mask operations) on the parts that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the binary parts of your data as bytes. Do your text operations on text, and your byte operations on bytes.

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects will be more text-like, and then use bytes as the base, and (with luck) it Should Just Work.

There's another disadvantage with the second: treating bytes as if they were ASCII by default reinforces the same old harmful paradigm that "text is ASCII" that we're trying to get away from. That's a bad, painful idea that causes a lot of problems and buggy code, and should be resisted.

On the other hand, embedding arbitrary binary data in Unicode text doesn't reinforce any common or harmful paradigms. It just requires the programmer to forget about characters and concentrate on code points, since Latin-1 maps bytes to code points in a very convenient way:

    Byte 0x00 maps to code point U+0000
    Byte 0x01 maps to code point U+0001
    Byte 0x02 maps to code point U+0002
    ...
    Byte 0xFF maps to code point U+00FF

So to embed the binary data 0xDEADBEEF in your string, you can just use '\xDE\xAD\xBE\xEF' regardless of what character those code points happen to be.

If we are manipulating data *as if it were text*, then we ought to treat it as text, not add methods to bytes that make bytes text-like. If we are manipulating data *as if it were bytes*, doing byte-manipulation operations like bit-masking, then we ought to treat it as numeric bytes, not add numeric methods to text. Is that really a controversial opinion?

> Besides being a bit awkward, this also means that any encoded text (even the plain ASCII stuff) is now being transformed three times instead of one:
>
>     unicode to bytes
>     bytes to unicode using latin1
>     unicode to bytes

Where do you get this from? I don't follow your logic. Start with a text template:

    template = "\xDE\xAD\xBE\xEF Name:\0\0\0%s Age:\0\0\0\0%d Data:\0\0\0%s blah blah blah"
    data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk. And when we do, since all the code points are in the range U+0000 to U+00FF, encoding it to Latin-1 ought to be a fast, efficient operation, possibly even just a mem copy.

It's true that the individual binary data fields will need to be decoded from bytes, but unless you want Python to guess an encoding (which is the old broken Python 2 model), you're going to have to do that regardless.

> Even if the cost of moving those bytes around is cheap, it's not free. When you're creating hundreds of PDFs at a time that's going to make a difference.

You've profiled it? Unless you've measured it, it doesn't exist. I'm not going to debate performance penalties of code you haven't written yet.

-- Steven
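The performance question raised above can only be settled by measurement. The following is a rough sketch of how one might time the two approaches with the standard `timeit` module, assuming Python 3.3 or later. The 16 KiB blob, the `Tag:` layout and the function names are made up for illustration, and results will vary by machine and interpreter version, so no timings are claimed here.

```python
import timeit

# Hypothetical test input: 16 KiB of arbitrary binary data.
blob = bytes(range(256)) * 64

def bytes_only():
    # Build the record purely with bytes operations.
    return b"Tag:" + str(len(blob)).encode("ascii") + b"\0" + blob

def via_latin1():
    # Decode the blob, format as text, encode once at the end.
    text = "Tag:%d\0%s" % (len(blob), blob.decode("latin-1"))
    return text.encode("latin-1")

# Both routes must produce identical output before timing means anything.
assert bytes_only() == via_latin1()

for fn in (bytes_only, via_latin1):
    seconds = timeit.timeit(fn, number=2000)
    print("%-12s %.4f s for 2000 calls" % (fn.__name__, seconds))
```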
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
tl;dr: At the end I'm volunteering to look at real code that is having porting problems.

On Sat, 11 Jan 2014 17:33:17 +0100, M.-A. Lemburg m...@egenix.com wrote:
> asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2). At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would, then you'd write
>
>     ...
>     headers += asciistr('Length: %i bytes\n' % 123)
>     headers += b'\n\n'
>     body = b'...'
>     socket.send(headers + body)
>     ...
>
> With PEP 460, you could write the above as:
>
>     ...
>     headers += b'Length: %i bytes\n' % 123
>     headers += b'\n\n'
>     body = b'...'
>     socket.send(headers + body)
>     ...
>
> IMO, that's more readable. Both variants essentially do the same thing: they implicitly coerce ASCII text strings to bytes, so conceptually, there's little difference.

And if we are explicit:

    headers = u'Length: %i bytes\n' % 123
    headers += u'\n\n'
    body = b'...'
    socket.send(headers.encode('ascii') + body)

(I included the 'u' prefix only because we are talking about shared-codebase python2/python3 code.)

That looks pretty readable to me, and it is explicit about what parts are text and what parts are binary.

But of course we'd never do exactly that in any but the simplest of protocols and scripts. Instead we'd write a library that had one or more objects that modeled our wire/file protocol. For the text parts, the API would accept input as text strings. For the binary parts, it would accept input as bytes. Then, when reading or writing the data stream, we perform the appropriate conversions on the appropriate parts. Our library does a more complex analog of 'socket.send(headers.encode('ascii') + body)', one that understands the various parts and glues them together, encoding the text parts to the appropriate encoding (often-but-not-always ascii) as it does so.

And yes, I have written code that does this in Python3. What I haven't done is written that code to run in both Python3 and Python2. I *think* the only missing thing I would need to back-port it is the surrogateescape error handler, but I haven't tried it. And I could probably conditionalize the code to use latin1 on python2 instead and get away with it.

And please note that email is probably the messiest of messy binary wire protocols. Not only do you have bytes and text mixed in the same data stream, with internal markers (in the text parts) that specify how to interpret the binary, including what encodings each part of that binary data is in for cases where that matters, you *also* have to deal with the possibility of there being *invalid* binary data mixed in with the ostensibly text parts, that you nevertheless are expected to both preserve and parse around.

When I started adding back binary support to the email package, I was really annoyed by the lack of certain string features in the bytes type. But in the end, it turned out to be really simple to instead think of the text-with-invalid-bytes parts as *text*-with-invalid-bytes (surrogateescaped bytes). Now, if I was designing from the ground up I'd store the stuff that was really binary as bytes in the model object instead of storing it as surrogateescaped text, but that problem is a consequence of how we got from there to here (python2-email to python3-email-that-didn't-handle-8bit-data to python3-email-that-works) rather than a problem with the python3 core data model.

So it seems like I'm with Nick and Antoine and company here. The byte-interpolation proposed by Antoine seems reasonable, but I don't see the *need* for the other stuff. I think that programs will be cleaner if the text parts of the protocol are handled *as text*.

On the other hand, Ethan's point that bytes *does* have text methods is true. However, other than the perfectly-sensible-for-bytes split, strip, and ends/startswith, I don't think I actually use any of them.

But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3?

That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems.

Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available.

--David
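The "library that models the wire protocol" idea described above might look something like the following minimal sketch, assuming Python 3. The class name `WireMessage`, its attributes and the header layout are all hypothetical and are not taken from the email package or any real library. The point is only that text parts stay `str`, binary parts stay `bytes`, and encoding happens once at the output boundary.

```python
class WireMessage:
    """Hypothetical protocol model: text headers plus a binary body."""

    def __init__(self, headers, body, header_encoding="ascii"):
        if not isinstance(body, bytes):
            raise TypeError("body must be bytes")
        self.headers = list(headers)          # (str, str) pairs
        self.body = body                      # raw binary payload
        self.header_encoding = header_encoding

    def to_wire(self):
        # Text parts are assembled as text and encoded in one place.
        head = "".join("%s: %s\n" % (name, value)
                       for name, value in self.headers)
        return (head + "\n").encode(self.header_encoding) + self.body


msg = WireMessage([("Length", "%i bytes" % 3)], b"\x00\xff\x10")
wire = msg.to_wire()

# A header that does not fit the declared encoding fails loudly.
bad = WireMessage([("Name", "\u0441")], b"")
try:
    bad.to_wire()
except UnicodeEncodeError:
    bad_header_rejected = True
else:
    bad_header_rejected = False
```

Callers never encode anything themselves; the model decides which encoding applies to which part.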
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
M.-A. Lemburg writes:
> I complete agree with Stephen, that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.

We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393).

> BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life ?

Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes? Why do these text chunks need to be bytes in the first place? That's why we ask for use cases.

AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back.
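A small sketch of the `surrogateescape` approach mentioned above, assuming Python 3.3 or later. The sample header bytes are made up. Undecodable bytes are carried through `str` as lone surrogates (U+DC80 to U+DCFF) and restored exactly on the way out.

```python
# ASCII-compatible data containing bytes of unknown encoding (made-up sample).
raw = b"Subject: caf\xe9 \xff\xfe\n"

text = raw.decode("ascii", errors="surrogateescape")

# The full set of str methods is available on the result.
assert text.startswith("Subject:")
name, _, value = text.partition(": ")

# Each undecodable byte 0xNN became the lone surrogate U+DCNN.
assert "\udce9" in value

# Encoding with the same error handler restores the original bytes.
assert text.encode("ascii", errors="surrogateescape") == raw

# A strict encode refuses the smuggled bytes, so they cannot leak silently.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False
```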
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:
> I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen. I complete agree with Stephen, that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.

Later in your post, you talk about the masses of broken encodings found everywhere (not just in your spam folder). How do the C lib standard string APIs help programmers to avoid broken encodings?

> We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.

On the contrary, it helps a lot. To the extent that people keep that clean bytes/text separation, it helps avoid bugs. It prevents problems like this Python 2 nonsense:

    s = "Straße"
    assert len(s) == 6  # fails
    assert s[5] == 'e'  # fails

Most problematic, printing s may (depending on your terminal settings) actually look like StraÃŸe.

Not only is having a clean bytes/text separation the pedantic thing to do, it's also the right thing to do nearly always (notwithstanding a few exceptions, allegedly).

> If you give programmers the choice they will - most of the time - do the right thing.

Unicode has been available in Python since version 2.2, more than a decade ago. And yet here we are, five point releases later (2.7), and the majority of text processing code is still using bytes. I'm not just pointing the finger at others. My 2.x only code almost always uses byte strings for text processing, and not always because it was old code I wrote before I knew better. The coders I work with do the same, only you can remove the word "almost". The code I see posted on comp.lang.python and Reddit and the tutor mailing list invariably uses byte strings. The beginners on the tutor list at least have an excuse that they are beginners.

A quarter of a century after Unicode was first published, nearly 28 years since IBM first introduced the concept of code pages to PC users, and we still have programmers writing ASCII only string-handling code that, if it works with extended character sets, only works by accident. The majority of programmers still have *no idea* of even the most basic parts of Unicode. They've had the right tools for a decade, and ignored them. Python 3 forces the issue, and my code is better for it.

> bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.

I personally think it was a mistake to keep text operations like upper() and lower() on bytes. I think it will compound the mistake to add even more text operations.

-- Steven
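For contrast with the Python 2 behaviour described above, here is a sketch of the same checks in Python 3, where `str` holds code points and the UTF-8 form is a separate `bytes` object. It assumes the Python 2 source in the example was UTF-8 encoded, which the message does not state.

```python
s = "Straße"

# On text, the assertions from the example hold.
assert len(s) == 6
assert s[5] == "e"

# The encoded form is what Python 2's str would have contained
# (assuming a UTF-8 source file): ß occupies two bytes.
encoded = s.encode("utf-8")
assert encoded == b"Stra\xc3\x9fe"
assert len(encoded) == 7
assert encoded[5] != ord("e")

# Misreading those bytes with a single-byte codec produces mojibake.
mojibake = encoded.decode("latin-1")
assert mojibake == "Stra\xc3\x9fe"
assert len(mojibake) == 7
```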
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 05:33:17PM +0100, M.-A. Lemburg wrote:
> FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)

/incredulous

I would really love to see you justify that claim. How do you use the Python 2 string type to make processing Unicode text easier?

-- Steven
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 2014-01-11 05:36, Steven D'Aprano wrote:
> [snip]
> Latin-1 has the nice property that every byte decodes into the character with the same code point, and vice versa. So:
>
>     for i in range(256):
>         assert bytes([i]).decode('latin-1') == chr(i)
>         assert chr(i).encode('latin-1') == bytes([i])
>
> passes. It seems to me that your problem goes away if you use Unicode text with embedded binary data, rather than binary data with embedded ASCII text. Then when writing the file to disk, of course you encode it to Latin-1, either explicitly:
>
>     pdf = ...  # Unicode string containing the PDF contents
>     with open("outfile.pdf", "wb") as f:
>         f.write(pdf.encode("latin-1"))
>
> or implicitly:
>
>     with open("outfile.pdf", "w", encoding="latin-1") as f:
>         f.write(pdf)
>
> [snip]

The second example won't work because you're forgetting about the handling of line endings in text mode.

Suppose you have some binary data bytes([10]). You convert it into a Unicode string using Latin-1, giving '\n'. You write it out to a file opened in text mode. On Windows, that string '\n' will be written to the file as b'\r\n'.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
> On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
>>     unicode to bytes
>>     bytes to unicode using latin1
>>     unicode to bytes
>
> Where do you get this from? I don't follow your logic. Start with a text template:
>
>     template = "\xDE\xAD\xBE\xEF Name:\0\0\0%s Age:\0\0\0\0%d Data:\0\0\0%s blah blah blah"
>     data = template % ("George", 42, blob.decode('latin-1'))
>
> Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.

And what if your name field has data not representable in latin-1?

    --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
    u'\u0441\u0440\u0403'
    --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

So really your example should be:

    data = template % ("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))

Which is a mess.

-- ~Ethan~
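A Python 3 sketch of the objection above, reusing the template and the Cyrillic name from the example. The choice of UTF-8 for the name field is an assumption made for illustration; the message only says "some non-ASCII encoding". It shows both the failure and the encode-then-decode step needed to get such a field through a Latin-1 text template.

```python
template = "\xDE\xAD\xBE\xEF Name:\0\0\0%s Age:\0\0\0\0%d Data:\0\0\0%s"
blob = b"\xde\xad\xbe\xef"
name = "\u0441\u0440\u0403"   # the Cyrillic text from the example

# Naive use: the name goes in as text and the final encode fails.
data = template % (name, 42, blob.decode("latin-1"))
try:
    data.encode("latin-1")
except UnicodeEncodeError:
    naive_failed = True
else:
    naive_failed = False

# The fix: encode the field in its own encoding (UTF-8 assumed here),
# then carry those bytes through the template as Latin-1 code points.
name_bytes = name.encode("utf-8")
assert name_bytes == b"\xd1\x81\xd1\x80\xd0\x83"
data = template % (name_bytes.decode("latin-1"), 42, blob.decode("latin-1"))
wire = data.encode("latin-1")
```

The result is correct, but every non-Latin-1 text field needs the extra pair of calls, which is the awkwardness being pointed out.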
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
MRAB writes:
>>     with open("outfile.pdf", "w", encoding="latin-1") as f:
>>         f.write(pdf)
>> [snip]
> The second example won't work because you're forgetting about the handling of line endings in text mode.

Not so fast! Forgot, yes (me too!), but not work? Not quite:

    with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
        f.write(pdf)

should do the trick.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
> MRAB writes:
>>>     with open("outfile.pdf", "w", encoding="latin-1") as f:
>>>         f.write(pdf)
>>> [snip]
>> The second example won't work because you're forgetting about the handling of line endings in text mode.
>
> Not so fast! Forgot, yes (me too!), but not work? Not quite:
>
>     with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
>         f.write(pdf)
>
> should do the trick.

Well, it's good that there is a workaround. Are we going to have a document listing all the workarounds needed to program a bytes-oriented style using unicode?

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, 11 Jan 2014 11:54:26 -0800, Ethan Furman et...@stoneleaf.us wrote:
> On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
>> MRAB writes:
>>>>     with open("outfile.pdf", "w", encoding="latin-1") as f:
>>>>         f.write(pdf)
>>>> [snip]
>>> The second example won't work because you're forgetting about the handling of line endings in text mode.
>>
>> Not so fast! Forgot, yes (me too!), but not work? Not quite:
>>
>>     with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
>>         f.write(pdf)
>>
>> should do the trick.
>
> Well, it's good that there is a workaround. Are we going to have a document listing all the workarounds needed to program a bytes-oriented style using unicode?

That's not a work-around (if you are talking specifically about the newline=""). That's just the way the python3 IO library works. If you want to preserve the newlines in your data, but still have the text-io machinery count them for deciding when to trigger io/buffering behavior, you use newline=''. It's not the most intuitive API, so I won't be surprised if a lot of people don't know about it or get confused by it when they see it. I first learned about it in the context of csv files, another one of those legacy file protocols that are mostly-text-but-not-entirely.

--David
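The newline behaviour discussed above can be demonstrated on any platform with an in-memory stream. This is a sketch assuming Python 3. The helper name `text_mode_bytes` is made up, and passing `newline="\r\n"` explicitly is used here only to simulate what the default text mode does on Windows.

```python
import io

def text_mode_bytes(text, newline):
    """Return the bytes a Latin-1 text-mode writer emits for `text`."""
    raw = io.BytesIO()
    writer = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    writer.write(text)
    writer.flush()
    data = raw.getvalue()
    writer.detach()   # leave the BytesIO alone when the wrapper goes away
    return data

# Binary byte 10 smuggled into text via Latin-1 becomes '\n'.
pdf = bytes([10]).decode("latin-1")

# Simulated Windows default: '\n' is translated, corrupting the data.
translated = text_mode_bytes(pdf, "\r\n")

# newline="" disables translation on output, so the byte survives.
untranslated = text_mode_bytes(pdf, "")
```

With `newline=""` the round trip through text mode preserves every byte, including embedded `\r` and `\n` values.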
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Jan 11, 2014, at 10:34 AM, Nick Coghlan ncogh...@gmail.com wrote:
> Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.

FWIW, as one of the people for whom it took Python 3 to finally figure out how to actually use unicode, it was the absence of encode on bytes and decode on str that actually did it. Giving bytes a format method would not have affected that either way, I don’t believe.

- Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote: We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393). BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life? Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes? Why do these text chunks need to be bytes in the first place? That's why we ask for use cases. AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back. The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid. -- Terry Jan Reedy
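A small sketch of the (encoding='ascii', errors='surrogateescape') approach mentioned above. The header line is invented for illustration; the point is only that undecodable bytes survive a trip through str.

```python
# ASCII-compatible data containing one byte (0xE9) of unknown encoding.
raw = b"Subject: caf\xe9\n"

text = raw.decode("ascii", errors="surrogateescape")
# The byte 0xE9 is carried through as the lone surrogate U+DCE9.

# Ordinary str methods can now be used on the ASCII parts.
edited = text.replace("Subject", "Topic")

# Encoding with the same error handler restores the original byte.
restored = edited.encode("ascii", errors="surrogateescape")
```

The escaped code points are not valid text, so passing such a string to a codec without the surrogateescape handler raises an error, which is the tracking-down problem raised in the reply below.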
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 4:28 PM, Terry Reedy tjre...@udel.edu wrote: On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote: We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393). BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life? Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes? Why do these text chunks need to be bytes in the first place? That's why we ask for use cases. AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back. The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid. -1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 12:45 PM, Donald Stufft wrote: FWIW as one of the people who it took Python3 to finally figure out how to actually use unicode, it was the absence of encode on bytes and decode on str that actually did it. Giving bytes a format method would not have affected that either way I don’t believe. My biggest hurdle was realizing that ASCII was an encoding. -- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 9:13 PM, Juraj Sukop juraj.su...@gmail.com wrote: On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou solip...@pitrou.net wrote: Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs. Let me clarify: one does not think in writing text in Unicode-terms in PDF. Instead, one records the sequence of character codes which correspond to glyphs or the glyph IDs directly. That's because one Unicode character may have more than one glyph and more characters can be shown as one glyph. AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: /Encoding /WinAnsiEncoding (mostly latin1 standard fonts) /Encoding /Identity-H (generally for unicode UTF-16 True Type embedded fonts) For example, in PyFPDF (a PHP library ported to python), the following code writes out text that could be encoded in two different encodings: s = sprintf('BT %.2f %.2f Td (%s) Tj ET', x*self.k, (self.h-y)*self.k, txt) https://code.google.com/p/pyfpdf/source/browse/fpdf/fpdf.py#602 In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case. Best regards Mariano Reingart http://www.sistemasagiles.com.ar http://reingart.blogspot.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 07:22:30PM +, MRAB wrote: with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf) [snip] The second example won't work because you're forgetting about the handling of line endings in text mode. So I did! Thank you for the correction. -- Steven
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 2014-01-11, 18:09 GMT, you wrote: We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX . . . We are not asking for that. Yes, you do. Maybe not you personally, but the number of people here on this list (for F...k sake, this is for DEVELOPERS of the language, not some bloody users!) for whom the current suggestion is just a way to avoid Unicode and keep alive all those broken scripts which barf at me all the time is quite non-zero, I am afraid. Best, Matěj
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 04:28:34PM -0500, Terry Reedy wrote: The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid. While there are definitely performance savings (particularly of memory) regarding the FSR in Python 3.3, for the use-case we're talking about, Python 3.1 and 3.2 (and for that matter, 2.2 through 2.7) Unicode strings should be perfectly adequate. The textual data being used is ASCII, and the binary blobs are encoded to Latin-1, so everything is a subset of Unicode, namely U+0000 to U+00FF. That means there are no astral characters, and no behavioural differences between wide and narrow builds (apart from memory use). -- Steven
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote: AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case. Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary. Here's one way to get a blob of binary data: # encode four C shorts into a fixed-width struct struct.pack(, 23, 42, 17, 99) Here's another way: # encode a text string into UTF-16 "My name is Steven".encode("utf-16be") Both examples return a bytes object containing arbitrary bytes. How do you combine those arbitrary bytes with a string template while still keeping all code-points under U+0100? By decoding to Latin-1. -- Steven
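The Latin-1 technique Steven describes might look roughly like this. It is a sketch under stated assumptions: the struct format ">4h" (four big-endian shorts) is a guess, since the format string did not survive in the archive, and the template text is hypothetical.

```python
import struct

# Assumed format: four big-endian C shorts (the original format string is lost).
packed = struct.pack(">4h", 23, 42, 17, 99)
name_utf16 = "My name is Steven".encode("utf-16be")

# Hypothetical text template; the blobs are decoded as Latin-1 so that each
# byte maps to exactly one code point below U+0100.
template = "Record:%s Name:%s End"
record_text = template % (packed.decode("latin-1"), name_utf16.decode("latin-1"))

all_below_0x100 = all(ord(ch) < 0x100 for ch in record_text)

# Encoding the finished string as Latin-1 gives back the exact bytes.
record_bytes = record_text.encode("latin-1")
expected = b"Record:" + packed + b" Name:" + name_utf16 + b" End"
```

Note that the intermediate str is not meaningful text: the UTF-16 portion would display as mojibake if printed, which is part of Ethan's objection later in the thread.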
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11Jan2014 13:15, Juraj Sukop juraj.su...@gmail.com wrote: On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson c...@zip.com.au wrote: data = b' '.join( bytify( [ 10, 0, 'obj', binary_image_data, ... ] ) ) Thanks for the suggestion! The problem with bytify is that some items might require different formatting than other items. For example, in Cross-Reference Table there are three different formats: non-padded integer (1), 10- and 15digit integer, (03, 65535). Well, this is partly my point: you probably want to exert more control than is reasonable for the PEP to offer, and you're better off with a helper function of your own. In particular, aside from passing in a default char=bytes encoding, you can provide your own format hooks. In particular, str already provides a completish % suite and you have no issue with encodings in that phase because it is all Unicode. So the points where you're treating PDF as text are probably best tackled as text and then encoded with a helper like bytify when you have to glom bytes and textish stuff together. Crude example, hacked up from yours: data = b''.join( bytify( ("%d %d obj ... stream" % (10, 0)), binary_image_data, "endstream endobj", )) where bytify swallows your encoding decisions. Since encoding anything-not-bytes into a bytes sequence inherently involves an encoding decision, I think I'm +1 on the PEP's aim of never mixing bytes with non-bytes, keeping all the encoding decisions in the caller's hands. I quite understand not wanting to belabour the code with .encode('ascii') but that should be said somewhere, so best to do so yourself in as compact and ergonomic fashion as possible. Cheers, -- Cameron Simpson c...@zip.com.au Serious error. All shortcuts have disappeared. Screen. Mind. Both are blank.
- Haiku Error Messages http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html
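Cameron never shows an implementation of bytify, so the following is only one possible sketch of such a helper. The name comes from the thread; the signature, the default encoding, and the handling of ints are assumptions made here for illustration.

```python
def bytify(items, encoding="ascii"):
    """Hypothetical helper: yield every item as bytes.

    bytes-like items pass through untouched; str and int items are encoded
    explicitly, so the encoding decision lives in exactly one place.
    """
    for item in items:
        if isinstance(item, (bytes, bytearray)):
            yield bytes(item)
        elif isinstance(item, str):
            yield item.encode(encoding)
        elif isinstance(item, int) and not isinstance(item, bool):
            yield str(item).encode("ascii")
        else:
            raise TypeError("cannot bytify %r" % (item,))


binary_image_data = b"\x89PNG\x00\xff"  # stand-in for real image bytes

# Format the textual parts with str's own % machinery, then join as bytes.
data = b" ".join(bytify([
    "%d %d obj" % (10, 0),
    "stream",
    binary_image_data,
    "endstream endobj",
]))
```

Items that need special formatting, such as the zero-padded cross-reference entries Juraj mentions, would be formatted as str first and passed in already shaped.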
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 12 Jan 2014 03:29, Ethan Furman et...@stoneleaf.us wrote: On 01/11/2014 12:43 AM, Nick Coghlan wrote: In particular, the bytes type is, and always will be, designed for pure binary manipulation [...] I apologize for being blunt, but this is a lie. Lets take a look at the methods defined by bytes: dir(b'') ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'] Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods? Do you think I don't know that? However, those are all *in-place* modifications. Yes, they assume ASCII compatible formats, but they're a far cry from encouraging combination of data from potentially different sources. I'm also on record as considering this a design decision I regret, precisely because it has resulted in experienced Python 2 developers failing to understand that the Python 3 text model is *different* and they may need to create a new type. Let's take a look at the repr of bytes: bytes([48, 49, 50, 51]) b'0123' Wow, that sure doesn't look like binary data! 
Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes). If the aim was indeed for pure binary manipulation, we failed. We left in bunches of methods which can *only* be interpreted as supporting ASCII manipulation. No, no, no. We made some concessions in the design of the bytes type to *ease* development and debugging of ASCII compatible protocols *where we believed we could do so without compromising the underlying text model changes*. Many experienced Python 2 developers are now suffering one of the worst cases of paradigm lock I have ever seen as they keep trying to make the Python 3 text model the same as the Python 2 one instead of actually learning how Python 3 works and recognising that they may actually need to create a new type for their use case and then potentially seek core dev assistance if that type reveals new interoperability bugs in the core types (or encounters old ones). Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class screaming "I want to be ASCII! I want to be ASCII!" or add back the missing functionality. No, we don't - we treat the core bytes type as PEP 460 does, by adding a *new* feature proposed by a couple of people writing native Python 3 libraries like asyncio that makes binary formats easier to deal with without carrying forward even *more* broken assumptions from the Python 2 text model. (Remember, I'm in favour of Antoine's updated PEP, because it's a real spec for a new feature, rather than yet another proposal to bolt on even more text specific formatting features from someone that has never bothered to understand the reasons for the differences between the two versions). People that want a full hybrid type back can then pursue the custom extension type approach. Cheers, Nick.
-- ~Ethan~
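For context on the disagreement above, this sketch shows how the text-like methods on bytes behave: they act on ASCII values only and leave every other byte alone. The sample values are chosen here for illustration.

```python
# The repr shows printable ASCII bytes as characters.
digits = bytes([48, 49, 50, 51])
digits_repr = repr(digits)

# upper() only touches the bytes for ASCII a-z; the UTF-8 bytes for "e with
# acute accent" (0xC3 0xA9) pass through unchanged.
mixed = b"caf\xc3\xa9"
upper_mixed = mixed.upper()

# isdigit() on bytes knows only ASCII 0-9, whereas str.isdigit() also
# accepts other Unicode digits such as SUPERSCRIPT TWO (U+00B2).
bytes_superscript_is_digit = b"\xb2".isdigit()
str_superscript_is_digit = "\u00b2".isdigit()
```

Whether this counts as "pure binary manipulation" or as an ASCII concession is exactly the point being argued; the code only shows what the methods do.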
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/11/2014 06:29 PM, Steven D'Aprano wrote: On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote: On 01/11/2014 10:36 AM, Steven D'Aprano wrote: On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote: unicode to bytes bytes to unicode using latin1 unicode to bytes Where do you get this from? I don't follow your logic. Start with a text template: template = \xDE\xAD\xBE\xEF Name:\0\0\0%s Age:\0\0\0\0%d Data:\0\0\0%s blah blah blah data = template % (George, 42, blob.decode('latin-1')) Since the use-cases people have been speaking about include only ASCII (or at most, Latin-1) text and arbitrary binary bytes, my example is limited to showing only ASCII text. But it will work with any text data, so long as you have a well-defined format that lets you tell which parts are interpreted as text and which parts as binary data. Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text. And no, your example will not work with any text, it would completely moji-bake my dbf files. Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk. No! When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till the end! And what if your name field has data not representable in latin-1? -- '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8') u'\u0441\u0440\u0403' Where did you get those bytes from? You got them from somewhere. For the sake of argument, pretend a user entered them in. Who knows? Who cares? Once you have bytes, you can treat them as a blob of arbitrary bytes and write them to the record using the Latin-1 trick. No, I can't. See above. 
If you're reading those bytes from some stream that gives you bytes, you don't have to care where they came from. You're kidding, right? If I don't know where they came from (a graphics field? a note field?) how am I going to know how to treat them? But what if you don't start with bytes? If you start with a bunch of floats, you'll probably convert them to bytes using the struct module. Yup, and I do. If you start with non-ASCII text, you have to convert them to bytes too. No difference here. Really? You just said above that it will work with any text data -- you can't have it both ways. You ask the user for their name, they answer срЃ which is given to you as a Unicode string, and you want to include it in your data record. The specifications of your file format aren't clear, so I'm going to assume that: 1) ASCII text is allowed as-is (that is, the name George will be in the final data file as b'George'); User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII. The user text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII. 2) any other non-ASCII text will be encoded as some fixed encoding which we can choose to suit ourselves; Well, the user chooses it, we have to abide by their choice. (It's kept in the file metadata.) 3) arbitrary binary data is allowed as-is (i.e. byte N has to end up being written as byte N, for any value of N between 0 and 255). In a couple field types, yes. Usually the binary data is numeric or date related and there is conversion going on there, too, to give me the bytes I need. [snip] -- '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1') Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256) That is backwards to what I've shown. 
Look at my earlier example again: And you are not paying attention: in '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1') the .decode('utf8') part gives a non-ascii compatible unicode string, and the .encode('latin1') part tries to take it to latin1 bytes. In ("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1')) the .encode(...) part is getting the actual bytes I need, and the .decode('latin-1') part goes back into unicode until I write them later. You did say to use a *text* template to manipulate my data, and then write it later, no? Well, this is what it would look like. Bytes get DECODED to latin-1, not encoded. Bytes -> text is *decoding*. Text -> bytes is *encoding*. Pretend for a moment I know that, and look at my examples again. I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible: It must be ENcoded using the appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the
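The encode/decode/encode sequence Ethan is describing can be spelled out as follows. This is a sketch only: cp1251 stands in for "whichever codec the file metadata specifies", and the record template is hypothetical.

```python
name = "\u0441\u0440\u0403"  # the three Cyrillic characters from the thread

# Encoding the name directly as Latin-1 fails, as in Ethan's traceback.
try:
    name.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Step 1: text -> bytes with the codec the file format demands (assumed cp1251).
field_bytes = name.encode("cp1251")
# Step 2: bytes -> str via Latin-1, purely so it fits in a text template.
smuggled = field_bytes.decode("latin-1")
# Step 3: build the record as text, then str -> bytes via Latin-1 for output.
record_text = "Name:%s Age:%d" % (smuggled, 42)
record_bytes = record_text.encode("latin-1")
```

The output bytes are correct, but the intermediate str holds mojibake rather than the user's name, which is the contortion being objected to.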
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 12 January 2014 02:33, M.-A. Lemburg m...@egenix.com wrote: On 11.01.2014 16:34, Nick Coghlan wrote: While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution. FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-) Right, I tried to capture that in http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3 by pointing out that there are two *very* different kinds of code to consider when discussing text modelling. Application code lives in a nice clean world of structured data, text data and binary data, with clean conversion functions for switching between them. Boundary code, by contrast, has to deal with the messy task of translating between them all. The Python 2 text model is a convenient model for boundary code, because it implicitly allows switching between binary and text interpretations of a data stream, and that's often useful due to the way protocols and file formats are designed. However, that kind of implicit switching is thoroughly inappropriate for *application* code. So Python 3 switches the core text model to one where implicitly switching between the binary domain and the text domain is considered a *bad* thing, and we object strongly to any proposals which suggest blurring the boundaries again, since that is going back to a boundary code model rather than an application code one. I've been saying for years that we may need a third type, but it has been nigh on impossible to get boundary code developers to say anything more useful than "I preferred the Python 2 model, that was more convenient for me".
Yes, we know it was (we do maintain both of them, after all, and did the update for the standard library's own boundary code), but application developers are vastly more common, so boundary code developers lost out on that one and we need to come up with solutions that *respect* the Python 3 text model, rather than trying to change it back to the Python 2 one. Seriously, Unicode has always caused heated discussions and I don't expect this to change in the next 5-10 years. The point is: there is no 100% perfect solution either way and when you acknowledge this, things don't look black and white anymore, but instead full of colors :-) It would be nice if more boundary code developers actually did that rather than coming out with accusatory hyperbole and pining for the halcyon days of Python 2 where the text model favoured their use case over that of normal application developers. Python 3 forces people to actually use Unicode; in Python 2 they could easily avoid it. It's good to educate people on how it's used and the issues you can run into, but let's not forget that people are trying to get work done and we all love readable code. PEP 460 just adds two more methods to the bytes object which come in handy when formatting binary data; I don't think it has potential to muddy the Python 3 text model, given that the bytes object already exposes a dozen of other ASCII text methods :-) I dropped my objections to PEP 460 once Antoine fixed it to respect the boundaries between binary and text data. It's now a pure binary interpolation proposal, and one I think is a fine idea - there's no implicit encoding or decoding involved, it's just a tool for manipulating binary data. That leaves the implicit encoding and decoding to the third party asciistr type, as it should be. asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2). Not quite - the idea of asciistr is that it is designed to be a *hybrid* type, like str was in Python 2. 
If it interacts with binary objects, it will give a binary result, if it interacts with text objects, it will give a text result. This makes it potentially suitable for use for constants in hybrid binary/text APIs like urllib.parse, allowing them to be implemented using a shared code path once again. The initial experimental implementation only works with 7 bit ASCII, but the UTF-8 caching in the PEP 393 implementation opens up the possibility of offering a non-strict mode in the future, as does the option of allowing arbitrary 8-bit data and disallowing interoperation with text strings in that case. At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would, Right, I suspect we have some overbroad PyUnicode_Check() calls in CPython that will need to be addressed before this substitution works seamlessly - that's one of the reasons I've been asking people to experiment with the idea since at least 2010 and let us know what doesn't work (nobody did though, until Benno agreed to try it out because it sounded like an interesting puzzle
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 12 January 2014 04:38, R. David Murray rdmur...@bitdance.com wrote: But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3? That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems. Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available. And, as has been the case for a long time, the PSF stands ready to help with funding credible grant proposals for Python 3 porting efforts. I believe some of the core devs (including David?) do freelance and contract work, so that's an option definitely worth considering if a project would like to support Python 3, but is having difficulty getting there with purely volunteer effort. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 11:32:05 +1000 Nick Coghlan ncogh...@gmail.com wrote: It's consistent with bytearray.join's behaviour:
    >>> x = bytearray()
    >>> x.join([b"abc"])
    bytearray(b'abc')
    >>> x
    bytearray(b'')
Yeah, I guess I'm OK with us being consistent on that one. It's still weird, but also clearly useful :) Will the new binary format ever call __format__? I assume not, but it's probably best to make that absolutely explicit in the PEP. Not indeed. I'll add that to the PEP, thanks. cheers Antoine.
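The bytearray.join behaviour referred to above, as a short runnable sketch: the method returns a new bytearray and leaves the (mutable) separator untouched.

```python
x = bytearray()                      # empty, mutable separator
joined = x.join([b"abc", b"def"])    # returns a new bytearray
x_after = bytes(x)                   # the separator itself is unchanged

sep = bytearray(b", ")
listed = sep.join([b"a", b"b", b"c"])
```

The "weird" part is that a method on a mutable object returns a fresh object of the same mutable type instead of modifying the receiver.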
[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
(Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.) Dear list, newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this. As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example 100 or 1.23. However, the proposal drops %d, %f and %x formats and the suggested workaround for writing down a number is to use .encode('ascii'), which I think has two problems: One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers. The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I wanted to format a part I had to use str instead. For example: content = b''.join([ b'header', b'some dictionary structure', b'part 1 abc', ('part 2 %.3f' % number).encode('ascii'), b'trailer']) In the case of PDF, the embedding of an image into PDF looks like: 10 0 obj /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 stream ...binary image data... endstream endobj Because of the image it makes sense to store such structure inside bytes. On the other hand, there may well be another obj which contains the coordinates of Bezier paths: 11 0 obj ... stream 0.5 0.1 0.2 RG 300 300 m 300 400 400 400 400 300 c b endstream endobj To summarize, there are cases which mix binary and text and, in my opinion, dropping the bytes-formatting of numbers makes it more complicated than it was. 
I would appreciate any explanation on how: b'%.1f %.1f %.1f RG' % (r, g, b) is more confusing than: b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b))) Similar situation exists for HTTP (Content-Length: 123) and ASCII STL (vertex 1.0 0.0 0.0). Thanks and have a nice day, Juraj Sukop PS: In the case the proposal will not include the number formatting, it would be nice to list there a set of guidelines or examples on how to proceed with porting Python 2 formats to Python 3. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
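The two spellings Juraj compares can be run side by side on Python 3.5 or later, where PEP 461 (the successor to this proposal) added %-interpolation to bytes, numeric codes included. The colour values are arbitrary; the join-based variant is an extra shown here for interpreters that lack bytes interpolation.

```python
r, g, b = 0.5, 0.1, 0.2

# Workaround from the thread: format as str, encode each piece, then
# interpolate the resulting bytes with %s (needs Python 3.5+ as well).
workaround = b"%s %s %s RG" % tuple(
    ("%.1f" % x).encode("ascii") for x in (r, g, b)
)

# Direct numeric formatting on bytes, available since Python 3.5 (PEP 461).
direct = b"%.1f %.1f %.1f RG" % (r, g, b)

# Variant that avoids bytes interpolation altogether.
joined = b" ".join(("%.1f" % x).encode("ascii") for x in (r, g, b)) + b" RG"
```

All three produce the same PDF operator line; they differ only in how many intermediate objects are created and how readable the call site is.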
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 12:17 PM, Juraj Sukop wrote: (Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.) Dear list, newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this. As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example 100 or 1.23. However, the proposal drops %d, %f and %x formats and the suggested workaround for writing down a number is to use .encode('ascii'), which I think has two problems: One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers. The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I wanted to format a part I had to use str instead. For example: content = b''.join([ b'header', b'some dictionary structure', b'part 1 abc', ('part 2 %.3f' % number).encode('ascii'), b'trailer']) I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. Since converting int and float to strings generates a very small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase versions), what problem is introduced by allowing int and float? The original str.format() work relied on this fact in its stringlib implementation. Eric. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 06/01/2014 13:24, Victor Stinner wrote: Hi, bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the new features. Here is a try as a PEP. Apologies if this has already been said, but Terry Reedy attached a proof of concept to issue 3982 which might be worth taking a look at if you haven't yet done so. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Am 10.01.2014 18:56, schrieb Eric V. Smith: On 1/10/2014 12:17 PM, Juraj Sukop wrote: (Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.) Dear list, newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this. As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example 100 or 1.23. However, the proposal drops %d, %f and %x formats and the suggested workaround for writing down a number is to use .encode('ascii'), which I think has two problems: One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers. The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I wanted to format a part I had to use str instead. For example: content = b''.join([ b'header', b'some dictionary structure', b'part 1 abc', ('part 2 %.3f' % number).encode('ascii'), b'trailer']) I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. Since converting int and float to strings generates a very small range of ASCII characters, ([0-9a-fx.-=], plus the uppercase versions), what problem is introduced by allowing int and float? The original str.format() work relied on this fact in its stringlib implementation. I agree. I would have needed bytes-formatting (with numbers) recently writing .rtf files. 
Georg
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote: As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example 100 or 1.23.

Just to be clear here -- is PDF specifically bytes+ascii? Or could there be some-other-encoding unicode in there? If so, then you really have a mess! If it is bytes+ascii, then it seems you could use a unicode object and encode/decode to latin-1. Perhaps still a bit klunkier than formatting directly into a bytes object, but workable.

b'%.1f %.1f %.1f RG' % (r, g, b) is more confusing than: b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

Let's see, I think that would be:

    u'%.1f %.1f %.1f RG' % (r, g, b)

then when you want to write it out:

    .encode('latin-1')

Dumping the binary data in would be a bit uglier. For the image example:

stream ...binary image data... endstream endobj

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

I think. Not too bad, though if nothing else an alias for latin-1 that made it clear it worked for this would be nice. Maybe ascii_plus_binary or something?

-Chris
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
2014/1/10 Juraj Sukop juraj.su...@gmail.com: In the case of PDF, the embedding of an image into PDF looks like: 10 0 obj /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 stream ...binary image data... endstream endobj

Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:

    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))

Victor
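The suggestion above can be tried as written once the string quotes are in place. The following is a sketch only: binary_image_data is a made-up four-byte placeholder rather than real image data, and the "..." inside the header string is kept exactly as elided in the message.

```python
# Placeholder payload (assumption); a real PDF would carry image bytes here.
binary_image_data = bytes([0x00, 0xFF, 0x80, 0x7F])

# Text parts are formatted as str and encoded to ASCII; the binary part
# is passed through untouched, then everything is joined as bytes.
data = b''.join((
    ('%d %d obj ... stream' % (10, 0)).encode('ascii'),
    binary_image_data,
    'endstream endobj'.encode('ascii'),
))

print(data)
```

Every numeric field costs one extra str object and one .encode() call, which is the point later replies in the thread push back on.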
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 5:12 PM, Victor Stinner wrote: 2014/1/10 Juraj Sukop juraj.su...@gmail.com: In the case of PDF, the embedding of an image into PDF looks like: 10 0 obj /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 stream ...binary image data... endstream endobj

Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:

    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))

Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is there really existing code like this in 2.x? I think what we're trying to do is to make code that looks like:

    b'%d %d obj ... stream' % (10, 0)

work in both 2.x and 3.5. But correct me if I'm wrong. I'll admit to not following 100% of these emails.

Eric.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. Eric.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 17:20:32 -0500 Eric V. Smith e...@trueblade.com wrote: Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is there really existing code like this in 2.x? No, but so what? The point of the PEP is not to allow arbitrary Python 2 code to run without modification under Python 3. There's a reason we broke compatibility, and there's no way we're gonna undo that. I think what we're trying to do is to make code that looks like: b'%d %d obj ... stream' % (10, 0) work in both 2.x and 3.5. That's not what *I* am trying to do. As far as I'm concerned the aim of the PEP is to ease bytes interpolation, not to provide some kind of magical construct that will solve everyone's porting problems. Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 17:33:57 -0500 Eric V. Smith e...@trueblade.com wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 02:42 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 17:33:57 -0500 Eric V. Smith e...@trueblade.com wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 14:58:15 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/10/2014 02:42 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 17:33:57 -0500 Eric V. Smith e...@trueblade.com wrote: On 1/10/2014 5:29 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 12:56:19 -0500 Eric V. Smith e...@trueblade.com wrote: I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? Again, if you're representing ASCII, you're representing text and should use a str object. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 18:14:45 -0500 Eric V. Smith e...@trueblade.com wrote: Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? Again, if you're representing ASCII, you're representing text and should use a str object. Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to help that code out? Or do we want to make porters first change to using %d or %f instead of %s? I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and %f on bytes objects. I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason. Not only. We don't want to do anything that actually asks for a *textual* representation of something. %d and %f ask for a textual representation of a number, so they're right out. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker chris.bar...@noaa.gov wrote: On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop juraj.su...@gmail.com wrote: As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example 100 or 1.23. Just to be clear here -- is PDF specifically bytes+ascii? Or could there be some-other-encoding unicode in there?

From the specs: At the most fundamental level, a PDF file is a sequence of 8-bit bytes. But it is also possible to represent a PDF using printable ASCII + whitespace by using escapes and filters. Then, there are also text strings which might be encoded in UTF-16. What this all means is that the PDF objects are expressed in ASCII, stream objects like images and fonts may have a binary part, and I never saw those UTF-16 strings.

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

The argument for dropping %f et al. has been that if something is text, then it should be Unicode. Conversely, if it is not text, then it should not be Unicode.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner victor.stin...@gmail.com wrote: Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:

    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))

The key is "encode to ASCII", which means that the result is bytes. Then, there is this "11 0 obj" which should also be bytes. But it has no binary_image_data, only lots of numbers waiting to be somehow converted to bytes. I already mentioned the problems with .encode('ascii'), but it does not stop there. Numbers may appear not only inside streams but almost anywhere: in the header there is the PDF version, an image has to have a width and height, and at the end of the PDF there is a structure containing offsets to all of the objects in the file. Basically, having to .encode('ascii') every possible number is not exactly simple or pretty.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, 11 Jan 2014 00:43:39 +0100 Juraj Sukop juraj.su...@gmail.com wrote: Basically, to .encode('ascii') every possible number is not exactly simple or pretty. Well it strikes me that the PDF format itself is not exactly simple or pretty. It might be convenient that Python 2 allows you, in certain cases, to ignore encoding issues because the main text type is actually a bytestring, but under the Python 3 model there's no reason to allow the same shortcuts. Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs. Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 3:40 PM, Juraj Sukop juraj.su...@gmail.com wrote: What this all means is that the PDF objects are expressed in ASCII, stream objects like images and fonts may have a binary part and I never saw those UTF-16 strings.

Hmm -- I wonder if they are out there in the wild, though.

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

The argument for dropping %f et al. has been that if something is a text, then it should be Unicode. Conversely, if it is not text, then it should not be Unicode.

What I'm trying to demonstrate / test is that you can use unicode objects for mixed binary + ascii, if you make sure to encode/decode using latin-1 (any others?). The idea is that ascii can be seen/used as text, and other bytes are preserved, and you can ignore whatever meaning latin-1 gives them. Using unicode objects means that you can use the existing string formatting (%s), and if you want to pass in binary blobs, you need to decode them as latin-1, creating a unicode object, which will get interpolated into your unicode object. When that unicode object is then encoded back to latin-1, the original bytes are preserved. I think this is confusing, as we are calling it latin-1 but not really using it that way, but it seems it should work.

-Chris
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
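The latin-1 round trip described above can be checked directly: latin-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding is lossless. The sketch below is illustrative only, and the sample binary_data value is made up.

```python
# Every possible byte value survives decode -> encode with latin-1.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

# Made-up binary blob (assumption), including non-ASCII byte values.
binary_data = b'\x89PNG\r\n\x1a\n\x00\xff'

# Interpolate the blob into a str template, then encode the whole thing.
text = 'stream\n%s\nendstream\nendobj' % binary_data.decode('latin-1')
payload = text.encode('latin-1')

print(payload)
```

One caveat: if any character above U+00FF ends up in the str (for example real Unicode text that was never meant to be a byte blob), the final .encode('latin-1') raises UnicodeEncodeError.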
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou solip...@pitrou.net wrote: Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs. Let me clarify: one does not think in writing text in Unicode-terms in PDF. Instead, one records the sequence of character codes which correspond to glyphs or the glyph IDs directly. That's because one Unicode character may have more than one glyph and more characters can be shown as one glyph.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/08/2014 02:42 PM, Antoine Pitrou wrote: With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

From the PEP: Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you can want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque textual values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.

All other features present in formatting of str objects (either through the percent operator or the str.format() method) are unsupported. Those features imply treating the recipient of the operator or method as text, which goes counter to the text / bytes separation (for example, accepting %d as a format code would imply that the bytes object really is an ASCII-compatible text string).

No, it implies that portion of the byte stream is ASCII compatible. And we have several examples: PDF, HTML, DBF, just about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of.

-1 on the PEP as it stands now.

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 16:23:53 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/08/2014 02:42 PM, Antoine Pitrou wrote: With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

From the PEP: Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you can want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque textual values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str objects, in the same way that it is forbidden to concatenate "a" and b"b" together, or to write b"".join(["abc"]). Python 3 was made *precisely* because the implicit conversion between ASCII unicode and bytes is deemed harmful. It's completely counter-productive and misleading for our users to start muddying the message by introducing exceptions to that rule.

Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 1/10/2014 8:12 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 16:23:53 -0800 Ethan Furman et...@stoneleaf.us wrote: On 01/08/2014 02:42 PM, Antoine Pitrou wrote: With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

From the PEP: Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you can want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque textual values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str objects, in the same way that it is forbidden to concatenate "a" and b"b" together, or to write b"".join(["abc"]).

I think:

    'a' + b'b'

is different from:

    b'Content-Length: %d' % 42

The former we want to prevent, but I see nothing wrong with the latter. So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Eric.
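The two cases being contrasted can be made concrete. The sketch below shows that mixing str and bytes by concatenation already fails in Python 3, and builds the header through str formatting plus an explicit encode, which is the route available without bytes formatting. The variable names are illustrative only.

```python
# Mixing str and bytes is rejected outright in Python 3.
try:
    'a' + b'b'
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

# The header case: format the number as text, then encode explicitly.
header = ('Content-Length: %d' % 42).encode('ascii')

print(type(mixed_error).__name__, header)
```

The disagreement in the thread is whether the second case should need the explicit .encode('ascii') step at all.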
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote: So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough". (I don't care much personally, I think the issue is quite overblown anyway.)

Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:04 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote: So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 . Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".

Heh, and here I thought it was stubborn opposition in the name of "purity". ;)

(I don't care much personally, I think the issue is quite overblown anyway)

Is it safe to assume you don't use Python for the use-cases under discussion? Specifically, mixed ASCII, binary, and encoded-text byte streams?

-- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 18:28:41 -0800 Ethan Furman et...@stoneleaf.us wrote: Is it safe to assume you don't use Python for the use-cases under discussion? You know, I've done quite a bit of network programming. I've also done an experimental port of Twisted to Python 3. I know what a network protocol with ill-defined encodings looks like. Regards Antoine.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
To avoid implicit conversion between str and bytes, I propose adding only a limited %-format, not .format() or .format_map(). Limited %-format means: %c accepts an integer or a bytes object of length one. %r is not supported. %s accepts only bytes. %a is the only format that accepts arbitrary objects. All other formats behave the same as for str.

On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou solip...@pitrou.net wrote: On Fri, 10 Jan 2014 18:14:45 -0500 Eric V. Smith e...@trueblade.com wrote: Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? Again, if you're representing ASCII, you're representing text and should use a str object. Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to help that code out? Or do we want to make porters first change to using %d or %f instead of %s? I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and %f on bytes objects. I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason. Not only. We don't want to do anything that actually asks for a *textual* representation of something. %d and %f ask for a textual representation of a number, so they're right out. Regards Antoine.

-- INADA Naoki songofaca...@gmail.com
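As a point of reference, the bytes %-formatting that later shipped in Python 3.5 (PEP 461) behaves much like the limited set listed above for %c, %s and %a, and also accepts the numeric codes. The sketch below runs on Python 3.5 or newer and is only an illustration of that later behavior, not of PEP 460 as drafted; the variable names are made up.

```python
one_char = b'%c%c' % (65, b'B')             # int or length-1 bytes
joined = b'%s/%s' % (b'usr', b'bin')        # bytes arguments only
described = b'%a' % ('caf\xe9',)            # ascii() of any object
numbers = b'%d %.3f %x' % (100, 1.23, 255)  # numeric codes

# A str argument to %s is rejected rather than silently encoded.
try:
    b'%s' % ('text',)
except TypeError:
    str_rejected = True
else:
    str_rejected = False

print(one_char, joined, described, numbers, str_rejected)
```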
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:39 PM, Antoine Pitrou wrote: On Fri, 10 Jan 2014 18:28:41 -0800 Ethan Furman wrote: Is it safe to assume you don't use Python for the use-cases under discussion? You know, I've done quite a bit of network programming. No, I didn't, that's why I asked. I've also done an experimental port of Twisted to Python 3. I know what a network protocol with ill-defined encodings looks like. Can you give a code sample of what you think, for example, the PDF generation code should look like? (If you already have, I apologize -- I missed it in all the ruckus.) -- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 01/10/2014 06:39 PM, Antoine Pitrou wrote: I know what a network protocol with ill-defined encodings looks like. For the record, I've been (and I suspect Eric and some others have also been) talking about well-defined encodings. For the DBF files that I work with, there is binary, ASCII, and a third encoding that is specified in the file header. -- ~Ethan~
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11Jan2014 00:43, Juraj Sukop juraj.su...@gmail.com wrote:
> On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner victor.stin...@gmail.com wrote:
>> Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:
>>
>>     data = b''.join((
>>         ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
>>         binary_image_data,
>>         ("endstream endobj").encode('ascii'),
>>     ))
>
> The key is "encode to ASCII", which means that the result is bytes. Then, there is this "11 0 obj" which should also be bytes. But it has no binary_image_data - only lots of numbers waiting to be somehow converted to bytes. I already mentioned the problems with .encode('ascii') but it does not stop here. Numbers may appear not only inside streams but almost anywhere: in the header there is the PDF version, an image has to have width and height, at the end of the PDF there is a structure containing offsets to all of the objects in the file. Basically, to .encode('ascii') every possible number is not exactly simple or pretty.

Hi Juraj,

Might I suggest a helper function (outside the PEP scope) instead of arguing for support for %f et al? Thus:

    def bytify(things, encoding='ascii'):
        # Pass bytes through untouched; convert everything else via str().
        for thing in things:
            if isinstance(thing, bytes):
                yield thing
            else:
                yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

    data = b' '.join( bytify( [ 10, 0, 'obj', binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities might suit your needs.

Cheers,
--
Cameron Simpson c...@zip.com.au

We tend to overestimate the short-term impact of technological change and underestimate its long-term impact. - Amara's Law
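For reference, a self-contained and runnable version of the helper idea from this message. The value of binary_image_data and the 'endobj' token are hypothetical stand-ins chosen so the example has something to join; they are not taken from any real PDF.

```python
def bytify(things, encoding='ascii'):
    """Yield each item as bytes; non-bytes items go through str() and encode()."""
    for thing in things:
        if isinstance(thing, bytes):
            yield thing                        # already binary, pass through
        else:
            yield str(thing).encode(encoding)  # numbers, keywords, etc.


# Hypothetical stand-in for real image data.
binary_image_data = b'\x89\x00\xff'

data = b' '.join(bytify([10, 0, 'obj', binary_image_data, 'endobj']))
```

Note that str(thing) gives no control over float precision; a caller needing "%.1f"-style output would have to format the number before handing it to bytify.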
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:
> As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not trying to be difficult, but you sound confident that you understand what you are doing, but your description doesn't make sense to me. To me, it looks like you are conflating bytes and ASCII characters, that is, assuming that characters are in some sense identical to their ASCII representation. Let me explain:

The integer that in English is written as 100 is represented in memory as bytes 0x0064 (assuming a big-endian C short), so when you say "an integer is written down AS-IS" (emphasis added), to me that says that the PDF file includes the bytes 0x0064. But then you go on to write the three character string "100", which (assuming ASCII) is the bytes 0x313030. Going from the C short to the ASCII representation 0x313030 is nothing like inserting the int "as-is". To put it another way, the Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary file which includes some ASCII-formatted text fields. So when writing an integer 100, rather than writing it "as is", which would be byte 0x64 (with however many leading null bytes needed for padding), it is converted to ASCII representation 0x313030 first, and that's what needs to be inserted.

If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:

> In the case of PDF, the embedding of an image into PDF looks like:
>
>     10 0 obj
>     /Type /XObject
>     /Width 100
>     /Height 100
>     /Alternates 15 0 R
>     /Length 2167
>     stream
>     ...binary image data...
>     endstream
>     endobj

Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.

Latin-1 has the nice property that every byte decodes into the character with the same code point, and vice versa. So:

    for i in range(256):
        assert bytes([i]).decode('latin-1') == chr(i)
        assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode text with embedded binary data, rather than binary data with embedded ASCII text. Then when writing the file to disk, of course you encode it to Latin-1, either explicitly:

    pdf = ...  # Unicode string containing the PDF contents
    with open("outfile.pdf", "wb") as f:
        f.write(pdf.encode("latin-1"))

or implicitly:

    with open("outfile.pdf", "w", encoding="latin-1") as f:
        f.write(pdf)

There may be a few wrinkles I haven't thought of, I don't claim to be an expert on PDF. But I see no reason why PDF files ought to be an exception to the rule:

* work internally with Unicode text;
* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and better, the internal representation of Unicode strings containing only code points up to 255 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte per character.

Another advantage is that using text rather than bytes means that your example:

[...]
> dropping the bytes-formatting of numbers makes it more complicated than it was. I would appreciate any explanation on how:
>
>     b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

    '%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

    u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. That's *much* nicer than your suggestion:

> is more confusing than:
>
>     b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

--
Steven
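A small end-to-end sketch of the "text with embedded binary" approach described in this message: build the PDF fragment as str, smuggle the binary payload through Latin-1, and encode once at the end. The image bytes, colour values, and the trimmed-down object header are hypothetical stand-ins, not a valid PDF.

```python
# Latin-1 maps every byte 0..255 to the code point with the same value,
# so arbitrary bytes survive a decode/encode round trip unchanged.
for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])

binary_image_data = bytes(range(256))   # hypothetical stand-in for image bytes
r, g, b = 0.5, 0.2, 1.0                 # hypothetical colour values

# Work in text: ordinary str formatting handles all the numbers.
header = '10 0 obj\n/Width %d /Height %d /Length %d\nstream\n' % (
    100, 100, len(binary_image_data))
pdf_text = header + binary_image_data.decode('latin-1') + '\nendstream\nendobj\n'
pdf_text += '%.1f %.1f %.1f RG\n' % (r, g, b)

# Convert to bytes only at the output boundary.
pdf_bytes = pdf_text.encode('latin-1')
```

One wrinkle worth keeping in mind: if the text is written through a text-mode file instead of being encoded explicitly, newline translation could alter '\n' bytes inside the binary payload on some platforms, so passing newline='' to open() (or writing pdf_bytes in 'wb' mode, as here) is the safer route.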
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 11.01.2014 03:04, Antoine Pitrou wrote:
> On Fri, 10 Jan 2014 20:53:09 -0500 Eric V. Smith e...@trueblade.com wrote:
>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>>
>> I agree.
>
> Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
>
> (I don't care much personally, I think the issue is quite overblown anyway)

So you wouldn't mind another overhaul of the PEP including a bit more functionality again? :) I really think that practicality beats purity here. (I'm not advocating free mixing of bytes and str, mind you!)

Georg
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 9 Jan 2014 11:29, INADA Naoki songofaca...@gmail.com wrote:
>> And I think everyone was well intentioned - and python3 covers most of the bases, but working with binary data is not only a wire-protocol programmer's problem.

If you're working with binary data, use the binary API offered by bytes, bytearray and memoryview.

>> Needing a library to wrap bytesthing.format('ascii', 'surrogateescape') or some such thing makes python3 less approachable for those who haven't learned that yet - which was almost all of us at some point when we started programming.
>
> Totally agree with you.

If you're on a relatively modern OS, everything should be UTF-8 and you should be fine as a beginner. When you start encountering malformed data, Python 3 should throw an error, and provide an opportunity to learn more (by looking up the error message), where Python 2 would silently corrupt the data stream.

Python 2 enshrined a data model eminently suitable for boundary code that dealt with ASCII compatible binary protocols (like web frameworks) as the default text model. Application code then needed to take special steps to get correct behaviour for the full Unicode range. In essence, the Python 2 text model is the POSIX text model with Unicode support bolted on to the side to make it at least *possible* to write correct application code.

This is completely backwards. Web applications vastly outnumber web frameworks, and the same goes for every other domain: applications are vastly more common than the libraries and frameworks that handle data transformations at system boundaries on their behalf, so making the latter easier to write at the expense of the former is a deeply flawed design choice.

So Python 3 reverses the situation: the core text model is now more appropriate for the central application code, *after* the boundary code has cleaned up the murky details of wire protocols and file formats.
This is pretty easy to deal with for *new* Python 3 code, since you just write things to deal with either bytes or text as appropriate.

However, there is some code written for Python 2 that relies more heavily on the ability to treat ascii compatible binary data as both binary data *and* as text. This is the use case that Python 3 treats as a more specialised use case (perhaps benefitting from a specialised third party type), whereas Python 2 supports it by default. This is also the use case that relied most heavily on implicit encoding and decoding, since that's the mechanism that allows the 8-bit and Unicode paths to share string literals.

Cheers,
Nick.

> --
> INADA Naoki songofaca...@gmail.com
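To make the "throw an error instead of silently corrupting" point concrete, here is a short sketch of how Python 3 treats bytes that are not valid in the expected encoding, together with the 'surrogateescape' error handler mentioned earlier in this message. The sample bytes and the helper name strict_decode_fails are chosen purely for illustration.

```python
raw = b'caf\xe9'   # Latin-1 encoded bytes; not valid UTF-8


def strict_decode_fails(data):
    """Return True if a strict UTF-8 decode of data raises UnicodeDecodeError."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False


# Boundary code that must pass unknown bytes through can opt in explicitly:
# undecodable bytes become lone surrogates and are restored on encode.
text = raw.decode('ascii', 'surrogateescape')
round_tripped = text.encode('ascii', 'surrogateescape')
```

The strict path fails loudly, while the surrogateescape path is a deliberate, visible choice made at the boundary rather than an implicit default.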
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Thu, 09 Jan 2014 03:54:13 + MRAB pyt...@mrabarnett.plus.com wrote:
> I'm thinking that the i format could be used for signed integers and the u for unsigned integers. The width would be the number of bytes. You would also need to have a way of specifying the endianness.
>
> For example:
>
>     b'{:2i}'.format(256)
>     b'\x01\x00'
>     b'{:2i}'.format(256)
>     b'\x00\x01'

The goal is not to add an alternative to the struct module. If you need binary packing/unpacking, just use struct.

Regards

Antoine.
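For comparison, a brief sketch of the struct-based packing that this reply points to, producing the same two byte strings as the quoted example. The format codes used here are struct's own ('>' big-endian, '<' little-endian, 'h' signed 16-bit); the variable names are illustrative.

```python
import struct

big = struct.pack('>h', 256)      # signed 16-bit, big-endian
little = struct.pack('<h', 256)   # signed 16-bit, little-endian

# int.to_bytes offers the same result without a format string.
also_big = (256).to_bytes(2, 'big', signed=True)

# Unpacking reverses the operation and returns a tuple.
(value,) = struct.unpack('>h', big)
```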
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Jan 08, 2014, at 01:51 PM, Stephen J. Turnbull wrote:
> Benjamin Peterson writes:
>> I agree. This is a very important, much-requested feature for low-level networking code.
>
> I hear it's much-requested, but is there any description of typical use cases?

The two unported libraries that are preventing me from switching Mailman 3 to Python 3 are restish and storm. For storm, there's a viable alternative in SQLAlchemy though I haven't looked at how difficult it will be to port the model layer (even though we once did use SA).

restish is tougher. I've investigated flask, pecan, wsme, and a few others that already have Python 3 support and none of them provide an API that I consider as nice a fit as restish for our standalone WSGI-based REST admin server. That's not to denigrate those other projects, it's just that I think restish hit the sweet spot, and porting Mailman 3 to some other framework so far has proven unworkable (I've tried with each of them).

restish is plumbing so I think it's a good test case for Nick's observations of a wire-protocol layer library, and it's obvious that it Just Works in Python 2 but doesn't work at all in Python 3.

There have been at least 3 attempts to port restish to Python 3 and all of them get stuck in various places where you actually *can't* decide whether some data structure should be a bytes or str. Make one choice and you get stuck over here, make the other choice and you get stuck over there.

I've got two abandoned branches on github with (rather old) porting attempts, and I know other developers have some branches as well. Having given up on trying to switch to a different framework, I'm starting over again with restish (really, it's wonderful :). I plan on keeping more detailed notes this time specifically so that I can help contribute to this discussion.

If anybody wants to pitch in, both for the specific purpose of porting the library, and for the more general insights it could provide for this thread, please get in touch.

Cheers,
-Barry
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On 9 Jan 2014 06:43, Antoine Pitrou solip...@pitrou.net wrote:
> Hi,
>
> With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

+1

I was initially dubious about the idea, but the proposed semantics look good to me.

We should probably include format_map for consistency with the str API.

> However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).

So we'd define the *format* string as mutable to get a mutable result out of the formatting operations? This seems a little weird to me. It also seems weird for a format method on a mutable type to *not* perform in-place mutation.

On the other hand, I don't see another obvious way to control the output type.

Cheers,
Nick.

> Regards
>
> Antoine.
>
> On Mon, 6 Jan 2014 14:24:50 +0100 Victor Stinner victor.stin...@gmail.com wrote:
>> Hi,
>>
>> bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the new features. Here is a try as a PEP.
>>
>> The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only adding bytes%args? Then, the following points must be decided to define the complete list of supported features (formatters):
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, 10 Jan 2014 05:26:04 +1000 Nick Coghlan ncogh...@gmail.com wrote:
> We should probably include format_map for consistency with the str API.

Yes, you're right.

>> However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).
>
> So we'd define the *format* string as mutable to get a mutable result out of the formatting operations? This seems a little weird to me. It also seems weird for a format method on a mutable type to *not* perform in-place mutation.

It's consistent with bytearray.join's behaviour:

    >>> x = bytearray()
    >>> x.join([b"abc"])
    bytearray(b'abc')
    >>> x
    bytearray(b'')

Regards

Antoine.
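The join behaviour referred to here can be checked directly, and the same "type of the receiver decides the type of the result" rule can be illustrated with %-formatting. The formatting line assumes Python 3.5 or later, where bytearray supports the % operator; the variable names are illustrative.

```python
x = bytearray()
joined = x.join([b'abc'])   # returns a new bytearray; x itself is not mutated

# Assumes Python 3.5+: a bytearray format string yields a bytearray result,
# again without modifying the format object in place.
template = bytearray(b'%d bytes')
formatted = template % 3
```

In both cases the receiver acts as a template that selects the output type, which is the consistency argument being made in this message.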