Re: Python 3.2 has some deadly infection
On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
> [...] But Linux Unicode support is much better than Windows. Unicode support in Windows is crippled by continued reliance on legacy code pages, and by the assumption deep inside the Windows APIs that Unicode means 16 bit characters. See, for example, the amount of space spent on fixing Windows Unicode handling here: http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it has some problems that raise doubts in my mind about taking everything the author says at face value. For example:

  Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character? [...] in fact UTF-8 is used just as often in those [Asian] countries.

That is not my experience, at least for Japan. See my comments in https://mail.python.org/pipermail/python-ideas/2012-June/015429.html where I show that utf8 files are a tiny minority of the text files found by Google.

He then gives a table with the size of utf8 and utf16 encoded contents (ie stripped of html stuff) of an unnamed Japanese wikipedia page to show that even without a lot of (html-mandated) ascii, the space savings are not very much compared to the theoretical 50% savings he stated:

  Dense text     (Δ UTF-8)
  UTF-8 ...      222 KB (0%)
  UTF-16 ...     176 KB (−21%)

Note that he calculates the space saving as (utf8-utf16)/utf8. Yet by that metric the theoretical saving is *NOT* 50%, it is 33%. For example, 1000 Japanese characters will use 2000 bytes in utf16 and 3000 in utf8.

I did the same test using http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7 I stripped html tags, javascript and redundant ascii whitespace characters. The stripped utf-8 file was 164946 bytes; the utf-16 encoded version of the same was 117756. That gives 40% (using the (utf8-utf16)/utf16 metric he used to claim 50% idealized savings), which is quite a bit closer to the idealized 50% than his 21%.

I would have more faith in his opinions about things I don't know about (such as unicode programming on Windows) if his other info were more trustworthy. IOW, just because it's on the internet doesn't mean it's true.

-- https://mail.python.org/mailman/listinfo/python-list
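The two savings metrics discussed above can be checked with a short sketch. The sample string below is made up for illustration (it is not the Wikipedia page from the post); only the two byte counts 164946 and 117756 are taken from the message.

```python
# Hypothetical sample: BMP-only Japanese text, so every character takes
# 3 bytes in UTF-8 and 2 bytes in UTF-16.
sample = "織田信長は戦国時代の武将である。" * 100

utf8_size = len(sample.encode("utf-8"))
utf16_size = len(sample.encode("utf-16-le"))  # "-le" avoids counting a BOM

# Metric used in the page's table: saving relative to the UTF-8 size.
saving_vs_utf8 = (utf8_size - utf16_size) / utf8_size
# Metric behind the "50%" claim: difference relative to the UTF-16 size.
diff_vs_utf16 = (utf8_size - utf16_size) / utf16_size

print(utf8_size, utf16_size)
print(f"{saving_vs_utf8:.0%} vs {diff_vs_utf16:.0%}")

# The byte counts reported in the post, run through both metrics.
post_utf8, post_utf16 = 164946, 117756
post_vs_utf8 = (post_utf8 - post_utf16) / post_utf8
post_vs_utf16 = (post_utf8 - post_utf16) / post_utf16
print(f"{post_vs_utf8:.0%} vs {post_vs_utf16:.0%}")
```

Which denominator is used matters a lot here, so any comparison between two percentages should use the same one.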
Re: Python 3.2 has some deadly infection
Marko Rauhamaa ma...@pacujo.net writes:
> Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
>> Nevertheless, there are important abstractions that are written on top of the bytes layer, and in the Unix and Linux world, the most important abstraction is *text*. In the Unix world, text formats and text processing is much more common in user-space apps than binary processing.
>
> That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_. It is correct that a 32-bit integer is enough to represent any Unicode codepoint: \u...\U0010

That says *nothing* about how Unicode strings are represented *internally* in Python. The representation may vary between versions and build options, and may even depend on the content of a string at runtime.

In the past, narrow builds could break the abstraction in some cases, which is why Linux distributions used wide Python builds.

_A Unicode codepoint is not a Python concept_. There is the Unicode standard, http://unicode.org. Though instead of following the self-referential web of definitions, I find it easier to learn from examples such as http://codepoints.net/U+0041 (A) or http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_ http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then you may get garbage, i.e., you can't assume that a character is a byte:

  $ echo Hyvää yötä | perl -pe's/.\K/ /g'
  H y v a � � � � y � � t � �

In general, you can't assume that a character is a Unicode codepoint:

  $ echo Hyvää yötä | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful in this case:

  $ echo Hyvää yötä | perl -C -pe's/\X\K/ /g'
  H y v ä ä y ö t ä

The \X pattern is supported by the `regex` module in Python, i.e., you can't even iterate over characters (as they are seen by a user) in Python using only the stdlib.

The \w+ pattern is also broken for Unicode text http://bugs.python.org/issue1693050 (it is fixed in the `regex` module), i.e., you can't select a word in Unicode text using only the stdlib.

\X alone is not enough in some cases, e.g., “ch” may be considered a grapheme cluster in Slovak for processes such as collation [1] (sorting order). The `PyICU` module might be useful here.

Knowing about Unicode normalization forms (NFC, NFKD, etc.) http://unicode.org/reports/tr15/, Unicode text segmentation [1], and the Unicode collation algorithm http://www.unicode.org/reports/tr10/ is also useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/

-- akira
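The point that one user-perceived character can be one or two codepoints can be sketched with the stdlib `unicodedata` module alone. This only covers normalization; it does not do grapheme-cluster segmentation, for which the post suggests the third-party `regex` module.

```python
import unicodedata

decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS

# Both render as the same user-perceived character, but len() counts codepoints.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)

# The byte counts differ as well once the text is encoded.
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Normalizing both sides to the same form before comparing strings avoids the surprise shown by the second `print`.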
Re: Python 3.2 has some deadly infection
On 05/06/2014 18:16, Ian Kelly wrote:
> How should e.g. bytes.upper() be implemented then? The correct behavior is entirely dependent on the encoding. Python 2 just assumes ASCII, which at best will correctly upper-case some subset of the string and leave the rest unchanged, and at worst could corrupt the string entirely. There are some things that were dropped that should not have been, but my impression is that those are being worked on, for example % formatting in PEP 461.

bytes.upper should have done exactly what str.upper in python 2 did; that way we could have at least continued to do the wrong thing :)

-- Robin Becker
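For reference, Python 3 does provide `bytes.upper()`, and it behaves in the ASCII-only way described above: only the bytes for ASCII lowercase letters change. A small sketch, with the sample phrase borrowed from elsewhere in the thread:

```python
text = "hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "hyvää yötä", precomposed letters

# str.upper() knows it is dealing with text.
print(text.upper())

# bytes.upper() only touches ASCII a-z, whatever the encoding is.
latin1_upper = text.encode("latin-1").upper()
utf8_upper = text.encode("utf-8").upper()
print(latin1_upper)
print(utf8_upper.decode("utf-8"))
```

In both encodings the non-ASCII letters come back unchanged, which is the "correctly upper-case some subset" case from the quoted message.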
Re: Python 3.2 has some deadly infection
On Fri, 06 Jun 2014 02:21:54 +0300, Marko Rauhamaa wrote:
> Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
>> In any case, I reject your premise. ALL data types are constructed on top of bytes,
>
> Only in a very dull sense.

I agree with you that this is a very dull, unimportant sense. And I think its dullness applies equally to the situation you somehow think is meaningfully exciting:

Text is made of bytes! If you squint, you can see those bytes! Therefore text is not a first class data type!!!

To which my answer is: yes, text is made of bytes; yes, you can expose those bytes; and no, your conclusion doesn't follow.

>> and so long as you allow applications *any way* to coerce data types to different data types, you allow them to see inside the black box.
>
> I can't see the bytes inside Python objects, including strings, and that's how it is supposed to be.

That's because Python the language doesn't allow you to coerce types to other types, except possibly through its interface to the underlying C implementation, ctypes. But Python allows you to write extensions in C, and that gives you the full power to take any data structure and turn it into any other data structure. Even bytes.

> Similarly, I can't (easily) see how files are laid out on hard disks. That's a true abstraction. Nothing in linux presents data, though, except through bytes.

Incorrect. Linux presents data as text all the time. Look at the prompt: it's treated as text, not numbers. You type commands using a text interface. The commands are made of words like ls, dd and ps, not numbers like 0x6C73, 0x6464 and 0x7073. Applications like grep are based on line-based files, and line is a text concept, not a byte concept.

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC

The assumption of *text* is so strong in the echo application that by default you cannot enter numeric escapes at all. Without the -e switch, echo assumes that numeric escapes represent themselves as character literals:

[steve@ando ~]$ echo '\x41\x42\x43'
\x41\x42\x43

-- Steven D'Aprano http://import-that.dreamwidth.org/
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
> Incorrect. Linux presents data as text all the time. Look at the prompt: it's treated as text, not numbers.

Of course there is a textual human interface. However, from the point of view of virtually every OS component, it's bytes.

> Consider:
>
> [steve@ando ~]$ echo -e '\x41\x42\x43'
> ABC

echo doesn't know it's emitting text. It would be perfectly happy to emit binary gibberish. The output goes to the pty which doesn't care about the textual interpretation, either. Finally, the terminal (emulation program) translates the incoming bytes to textual glyphs to the best of its capabilities.

Anyway, what interests me mostly is that I routinely build programs and systems that talk to each other over files, pipes, sockets and devices. I really need to micromanage that data. I'm fine with encoding text if that's the suitable interpretation. I just think Python is overreaching by making the text interpretation the default for the standard streams and files and guessing the correct encoding.

Note that subprocess.Popen() wisely assumes binary pipes. Unfortunately the subprocess might be a python program that opens the standard streams in the text mode...

Marko
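A minimal sketch of the binary-pipe behaviour mentioned above. The child program and the name `child_code` are made up for illustration; the child writes raw bytes through `sys.stdout.buffer` so that no text layer is involved on either side.

```python
import subprocess
import sys

# Hypothetical child: emits three ASCII bytes plus one byte that is not valid UTF-8.
child_code = "import sys; sys.stdout.buffer.write(bytes([0x41, 0x42, 0x43, 0xFF]))"

# By default Popen pipes carry bytes; no decoding is attempted.
proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
raw, _ = proc.communicate()
print(type(raw), raw)

# Interpreting the bytes as text is an explicit, separate step chosen by the caller.
as_text = raw.decode("utf-8", errors="replace")
print(as_text)
```

The same applies inside a program: `sys.stdin.buffer` and `sys.stdout.buffer` expose the underlying byte streams when the default text wrappers are not wanted.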
Re: Python 3.2 has some deadly infection
On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
> How text is represented is very different from whether text is a fundamental data type. A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go).

Of course they are. It may be an ASCII-encoding of some flavor or other, or something really (to me) strange -- but an encoding is most assuredly in effect.

ASCII is *not* the state of "this string has no encoding" -- that would be Unicode; a Unicode string, as a data type, has no encoding. To transport it, store it, etc., it must (usually?) be encoded into something -- utf-8, ASCII, turkish, or whatever subset is agreed upon and will hopefully contain all the Unicode characters needed for the string to be properly represented.

The realization that ASCII was, in fact, an encoding was a big paradigm shift for me, but a necessary one.

-- ~Ethan~
Re: Python 3.2 has some deadly infection
Ethan Furman et...@stoneleaf.us: On 06/05/2014 11:30 AM, Marko Rauhamaa wrote: A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go). Of course they are. How would you know? It may be an ASCII-encoding of some flavor or other, or something really (to me) strange -- but an encoding is most assuredly in affect. Outside metaphysics, that statement is only meaningful if you have access to the encoding. ASCII is *not* the state of this string has no encoding -- that would be Unicode; a Unicode string, as a data type, has no encoding. Huh? Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 06/05/2014 09:32 AM, Steven D'Aprano wrote: But whatever the situation, and despite our differences of opinion about Unicode, THANK YOU for having updated ReportLabs to 3.3. +1000 -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> Ethan Furman et...@stoneleaf.us:
>> ASCII is *not* the state of "this string has no encoding" -- that would be Unicode; a Unicode string, as a data type, has no encoding.
>
> Huh?

It's this very fact that trips up JMF in his rants about FSR. Thank you to Ethan for putting it so succinctly. What part of his statement are you saying "Huh?" about?
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 11:24 PM, Ethan Furman et...@stoneleaf.us wrote: On 06/05/2014 11:30 AM, Marko Rauhamaa wrote: How text is represented is very different from whether text is a fundamental data type. A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go). Of course they are. It may be an ASCII-encoding of some flavor or other, or something really (to me) strange -- but an encoding is most assuredly in affect. Allow me to explain what I think Marko's getting at here. In most file systems, a file exists on the disk as a set of sectors of data, plus some metadata including the file's actual size. When you ask the OS to read you that file, it goes to the disk, reads those sectors, truncates the data to the real size, and gives you those bytes. It's possible to mount a file as a directory, in which case the physical representation is very different, but the file still appears the same. In that case, the OS goes reading some part of the file, maybe decompresses it, and gives it to you. Same difference. These files still contain bytes. A fundamental text file would be one where, instead of reading and writing bytes, you read and write Unicode text. Since the hard disk still works with sectors and bytes, it'll still be stored as such, but that's an implementation detail; and you could format your disk UTF-8 or UTF-16 or FSR or anything you like, and the only difference you'd see is performance. This could certainly be done, in theory. I don't know how well it'd fit with any of the popular OSes of today, but it could be done. And these files would not have an encoding; their on-platter representations would, but that's purely implementation - the text that you wrote out and the text that you read in are the same text, and there's been no encoding visible. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Michael Torrie torr...@gmail.com: On 06/06/2014 08:10 AM, Marko Rauhamaa wrote: Ethan Furman et...@stoneleaf.us: ASCII is *not* the state of this string has no encoding -- that would be Unicode; a Unicode string, as a data type, has no encoding. Huh? [...] What part of his statement are you saying Huh? about? Unicode, like ASCII, is a code. Representing text in unicode is encoding. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa ma...@pacujo.net wrote:
> Michael Torrie torr...@gmail.com:
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman et...@stoneleaf.us:
>>>> ASCII is *not* the state of "this string has no encoding" -- that would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...] What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is encoding.

Yes and no. ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding.

Unicode, on the other hand, is only the first part. It maps all the same characters to the same numbers that ASCII does, and then adds a few more... a few followed by a few, followed by... okay, quite a lot more. Unicode specifies that the character OK HAND SIGN, which looks like 👌 if you have the right font, is number 1F44C in hex (128076 decimal). This is the Universal Character Set or UCS.

ASCII could specify a single encoding, because that encoding makes sense for nearly all purposes. (There are times when you transmit ASCII text and use the high bit to mean something else, like parity or "this is the end of a word" or something, but even then, you follow the same convention of packing a number into the low seven bits of a byte.) Unicode can't, because there are many different pros and cons to the different encodings, and so we have UCS Transformation Formats like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint to a sequence of bytes.

You can't represent text in Unicode in a computer. Somewhere along the way, you have to figure out how to store those codepoints as bytes, or something more concrete (you could, for instance, use a Python list of Python integers; I can't say that it would be in any way more efficient than alternatives, but it would be plausible); and that's the encoding.

ChrisA
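The two layers described above (character to number, then number to bytes) can be seen directly in Python 3. This is only an illustrative sketch using the characters named in the message:

```python
ok_hand = "\U0001F44C"   # OK HAND SIGN

# Layer 1: the character-to-number mapping (codepoints).
print(ord("A"), ord(ok_hand), hex(ord(ok_hand)))

# Layer 2: different encodings turn the same codepoint into different bytes.
encoded = {name: ok_hand.encode(name).hex()
           for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, hexbytes in encoded.items():
    print(name, hexbytes)

# For ASCII the two layers happen to coincide: codepoint 65 becomes byte 0x41.
print("A".encode("ascii"))
```

The "Python list of Python integers" idea from the message would be `[ord(c) for c in text]`, which is the first layer with no byte encoding applied.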
Re: Python 3.2 has some deadly infection
On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
> Michael Torrie torr...@gmail.com:
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman et...@stoneleaf.us:
>>>> ASCII is *not* the state of "this string has no encoding" -- that would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...] What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is encoding.

A Unicode string as an abstract data type has no encoding. It is a Platonic ideal, a pure form like the real numbers. There are no bytes, no bits, just code points. That is what Ethan means. A Unicode string like this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding, but as an array of code points. Eventually the abstraction will leak, all abstractions do, but not for a very long time.

-- Steven D'Aprano http://import-that.dreamwidth.org/
Re: Python 3.2 has some deadly infection
On Friday, June 6, 2014 9:27:51 PM UTC+5:30, Steven D'Aprano wrote: On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote: Michael Torri: On 06/06/2014 08:10 AM, Marko Rauhamaa wrote: Ethan Furman : ASCII is *not* the state of this string has no encoding -- that would be Unicode; a Unicode string, as a data type, has no encoding. Huh? [...] What part of his statement are you saying Huh? about? Unicode, like ASCII, is a code. Representing text in unicode is encoding. A Unicode string as an abstract data type has no encoding. It is a Platonic ideal, a pure form like the real numbers. There are no bytes, no bits, just code points. That is what Ethan means. A Unicode string like this: s = uNOBODY expects the Spanish Inquisition! should not be thought of as a bunch of bytes in some encoding, but as an array of code points. Eventually the abstraction will leak, all abstractions do, but not for a very long time. Should not be thought of yes thats the Python3 world view Not even the Python2 world view And very far from the classic Unix world view. As Ned Batchelder says in Unipain: http://nedbatchelder.com/text/unipain.html : Programmers should use the 'unicode sandwich'to avoid 'unipain': Bytes on the outside, Unicode on the inside, encode/decode at the edges. The discussion here is precisely about these edges Combine that with Chris': Yes and no. ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. and the situation appears quite the opposite of Ethan's description: In the 'old world' ASCII was both mapping and encoding and so there was never a justification to distinguish encoding from codepoint. 
It is unicode that demands these distinctions. If we could magically go to a world where the number of bits in a byte was 32 all this headache would go away. [Actually just 21 is enough!] -- https://mail.python.org/mailman/listinfo/python-list
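The "just 21 is enough" remark can be checked quickly, assuming Python 3.3 or later, where `sys.maxunicode` is always the top of the Unicode range:

```python
import sys

# Highest codepoint Python will accept, and the number of bits needed to hold it.
print(hex(sys.maxunicode), sys.maxunicode.bit_length())

# For comparison, the highest ASCII code needs 7 bits.
print((127).bit_length())
```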
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody rustompm...@gmail.com wrote: Combine that with Chris': Yes and no. ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. and the situation appears quite the opposite of Ethan's description: In the 'old world' ASCII was both mapping and encoding and so there was never a justification to distinguish encoding from codepoint. It is unicode that demands these distinctions. If we could magically go to a world where the number of bits in a byte was 32 all this headache would go away. [Actually just 21 is enough!] An ASCII mentality lets you be sloppy. That doesn't mean the distinction doesn't exist. When I first started programming in C, int was *always* 16 bits long and *always* little-endian (because I used only one compiler). I could pretend that those bits in memory actually were that integer, that there were no other ways that integer could be encoded. That doesn't mean that encodings weren't important. And as soon as I started working on a 32-bit OS/2 system, and my ints became bigger, I had to concern myself with that. Even more so when I got into networking, and byte order became important to me. And of course, these days I work with integers that are encoded in all sorts of different ways (a Python integer isn't just a puddle of bytes in memory), and I generally let someone else take care of the details, but the encodings are still there. ASCII was once your one companion, it was all that mattered. ASCII was once a friendly encoding, then your world was shattered. Wishing it were somehow here again, wishing it were somehow near... 
sometimes it seemed, if you just dreamed, somehow it would be here! Wishing you could use just bytes again, knowing that you never would... dreaming of it won't help you to do all that you dream you could! It's time to stop chasing the phantom and start living in the Raoul world... err, the real world. :) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
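The integer analogy above (one number, several possible byte layouts) can be sketched with `int.to_bytes` and `int.from_bytes`. The value 258 is an arbitrary choice for illustration:

```python
n = 258   # 0x0102, small enough to fit in a 16-bit int

# The same integer, encoded with two different byte orders.
little = n.to_bytes(2, "little")
big = n.to_bytes(2, "big")
print(little, big)

# Decoding with the matching byte order recovers the integer...
print(int.from_bytes(little, "little"))
# ...while decoding with the wrong one silently gives a different number.
wrong = int.from_bytes(little, "big")
print(wrong)
```

This is the same situation as decoding text bytes with the wrong codec: nothing fails, the result is just not what was written.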
Re: Python 3.2 has some deadly infection
Chris Angelico ros...@gmail.com: ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. The American Standard Code for Information Interchange [...] is a character-encoding scheme [...] URL: http://en.wikipedia.org/wiki/ASCII Unicode, on the other hand, is only the first part. It maps all the same characters to the same numbers that ASCII does, and then adds a few more... a few followed by a few, followed by... okay, quite a lot more. Unicode specifies that the character OK HAND SIGN, which looks like if you have the right font, is number 1F44C in hex (128076 decimal). This is the Universal Character Set or UCS. Unicode is a computing industry standard for the consistent encoding, representation and handling of text [...] URL: http://en.wikipedia.org/wiki/Unicode Each standard assigns numbers to letters and other symbols. In a word, each is a code. That's what their names say, too. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Friday, June 6, 2014 10:18:41 PM UTC+5:30, Chris Angelico wrote: On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody wrote: Combine that with Chris': Yes and no. ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. and the situation appears quite the opposite of Ethan's description: In the 'old world' ASCII was both mapping and encoding and so there was never a justification to distinguish encoding from codepoint. It is unicode that demands these distinctions. If we could magically go to a world where the number of bits in a byte was 32 all this headache would go away. [Actually just 21 is enough!] An ASCII mentality lets you be sloppy. That doesn't mean the distinction doesn't exist. When I first started programming in C, int was *always* 16 bits long and *always* little-endian (because I used only one compiler). I could pretend that those bits in memory actually were that integer, that there were no other ways that integer could be encoded. That doesn't mean that encodings weren't important. And as soon as I started working on a 32-bit OS/2 system, and my ints became bigger, I had to concern myself with that. Even more so when I got into networking, and byte order became important to me. And of course, these days I work with integers that are encoded in all sorts of different ways (a Python integer isn't just a puddle of bytes in memory), and I generally let someone else take care of the details, but the encodings are still there. ASCII was once your one companion, it was all that mattered. ASCII was once a friendly encoding, then your world was shattered. Wishing it were somehow here again, wishing it were somehow near... 
sometimes it seemed, if you just dreamed, somehow it would be here! Wishing you could use just bytes again, knowing that you never would... dreaming of it won't help you to do all that you dream you could! It's time to stop chasing the phantom and start living in the Raoul world... err, the real world. :) I thought that If only bytes were 21+ bits wide would sound sufficiently nonsensical, that I did not need to explicitly qualify it as a utopian dream! -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
> On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
>> Unicode, like ASCII, is a code. Representing text in unicode is encoding.
>
> A Unicode string as an abstract data type has no encoding.

Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100

> It is a Platonic ideal, a pure form like the real numbers.

Far from it. It is a mapping from symbols to integers. The symbols are the Platonic ones. The Unicode/ASCII encoding above represents the same Platonic string as this EBCDIC one:

200 133 147 147 150 107 64 166 150 153 147 132

> Unicode string like this: s = u"NOBODY expects the Spanish Inquisition!" should not be thought of as a bunch of bytes in some encoding,

Encoding is not tied to bytes or even computers. People can speak in code, after all.

Marko
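The two number sequences can be reproduced in Python. The post does not say which EBCDIC variant is meant; `cp037` is used below as an assumption, since it is one of the EBCDIC code pages shipped with the stdlib codecs.

```python
text = "Hello, world"

# Codepoints: the symbol-to-integer mapping, independent of any byte layout.
code_points = [ord(c) for c in text]

# Byte values under ASCII and under an EBCDIC code page (cp037, assumed).
ascii_codes = list(text.encode("ascii"))
ebcdic_codes = list(text.encode("cp037"))

print(code_points)
print(ascii_codes)
print(ebcdic_codes)

# Both byte sequences decode back to the same string.
print(bytes(ebcdic_codes).decode("cp037") == bytes(ascii_codes).decode("ascii"))
```

For this text the ASCII byte values coincide with the codepoints, which is why the two notions are easy to conflate.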
Re: Python 3.2 has some deadly infection
On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote: Chris Angelico : ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. The American Standard Code for Information Interchange [...] is a character-encoding scheme [...] URL: And a similar argument to this is seen on that page's talk page! http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 3:11 AM, Marko Rauhamaa ma...@pacujo.net wrote: Encoding is not tied to bytes or even computers. People can speak in code, after all. Obligatory: http://xkcd.com/257/ ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Marko Rauhamaa ma...@pacujo.net: Far from it. It is a mapping from symbols to integers. The symbols are the Platonic ones. Well, of course, even the symbols are a code. Letters code sounds and digits code numbers. And the sounds and numbers code ideas. Now we are getting close to being truly Platonic. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 3:04 AM, Rustom Mody rustompm...@gmail.com wrote: ASCII was once your one companion, it was all that mattered. ASCII was once a friendly encoding, then your world was shattered. Wishing it were somehow here again, wishing it were somehow near... sometimes it seemed, if you just dreamed, somehow it would be here! Wishing you could use just bytes again, knowing that you never would... dreaming of it won't help you to do all that you dream you could! It's time to stop chasing the phantom and start living in the Raoul world... err, the real world. :) I thought that If only bytes were 21+ bits wide would sound sufficiently nonsensical, that I did not need to explicitly qualify it as a utopian dream! Humour never dies! ChrisA (In case it's not obvious, by the way, everything I said above is a reference to the Phantom of the Opera.) -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 3:13 AM, Rustom Mody rustompm...@gmail.com wrote: On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote: Chris Angelico : ASCII means two things: Firstly, it's a mapping from the letter A to the number 65, from the exclamation mark to 33, from the backslash to 92, and so on. And secondly, it's an encoding of those numbers into the lowest seven bits of a byte, with the high byte left clear. Between those two, you get a means of representing the letter 'A' as the byte 0x41, and one of them is an encoding. The American Standard Code for Information Interchange [...] is a character-encoding scheme [...] URL: And a similar argument to this is seen on that page's talk page! http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F Which proves that Wikipedia is exactly as reliable as a mailing list. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 6/6/14 1:11 PM, Marko Rauhamaa wrote:
> Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
>> On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
>>> Unicode, like ASCII, is a code. Representing text in unicode is encoding.
>>
>> A Unicode string as an abstract data type has no encoding.
>
> Unicode itself is an encoding. See it in action here:
>
> 72 101 108 108 111 44 32 119 111 114 108 100
>
>> It is a Platonic ideal, a pure form like the real numbers.
>
> Far from it. It is a mapping from symbols to integers. The symbols are the Platonic ones. The Unicode/ASCII encoding above represents the same Platonic string as this EBCDIC one:
>
> 200 133 147 147 150 107 64 166 150 153 147 132
>
>> Unicode string like this: s = u"NOBODY expects the Spanish Inquisition!" should not be thought of as a bunch of bytes in some encoding,
>
> Encoding is not tied to bytes or even computers. People can speak in code, after all.

Marko, you are right about the broader English meaning of the word "encoding". The original point here was that Unicode text provides no information about what sequence of bytes is at work.

In the Unicode ecosystem, an encoding is a specification of how the text will be represented in a byte stream. Saying something is Unicode doesn't provide that information. You have to say, UTF8 or UTF16 or UCS2, etc, in order to know how bytes will be involved.

When Ethan said, "a Unicode string, as a data type, has no encoding," he meant (as he explained) that a Unicode string doesn't require or imply any particular mapping to bytes.

I'm sure you understand this, I'm just trying to clarify the different meanings of the word "encoding".

> Marko

-- Ned Batchelder, http://nedbatchelder.com
Re: Python 3.2 has some deadly infection
On Sat, 07 Jun 2014 01:50:50 +1000, Chris Angelico wrote: Yes and no. ASCII means two things: ASCII means: American Standard Code for Information Interchange aka ASA Standard X3.4-1963 into the lowest seven bits of a byte, with the high byte left clear. high BIT left clear. -- Denis McMahon, denismfmcma...@gmail.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, Jun 7, 2014 at 7:18 AM, Denis McMahon denismfmcma...@gmail.com wrote: into the lowest seven bits of a byte, with the high byte left clear. high BIT left clear. That thing. Unless you have bytes inside bytes (byteception?), you'll only have room for one high bit. Some day I'll get my brain and my fingers to agree on everything we do... but that day is not today. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
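The two meanings of ASCII discussed above (a mapping from characters to numbers, and a way of storing those numbers in bytes with the high bit clear) can be checked with a few lines of Python. This is a minimal sketch using the three characters named earlier in the thread; the variable names are made up for the illustration.

```python
# The characters mentioned in the thread: A, exclamation mark, backslash.
ascii_text = "A!\\"

# Meaning 1: the mapping from characters to numbers.
numbers = [ord(ch) for ch in ascii_text]

# Meaning 2: the encoding of those numbers into bytes.
encoded = ascii_text.encode("ascii")

# Every ASCII byte fits in seven bits, so the high bit (0x80) is always clear.
high_bits = [byte & 0x80 for byte in encoded]

# A character outside the ASCII range cannot be encoded at all.
try:
    "\u00e4".encode("ascii")
    non_ascii_rejected = False
except UnicodeEncodeError:
    non_ascii_rejected = True
```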
Re: Python 3.2 has some deadly infection
Gregory Ewing greg.ew...@canterbury.ac.nz: As a result, most unix programs, most of the time, deal with text on stdin and stdout. Well, ok. But even accepting that premise, that text might not be what Python3 considers text. For example, if your program reads in XML, JSON or Python, the parser object might prefer to take it in as bytes and not have it predecoded by sys.stdin. So, it makes sense for them to be text by default. I'm not sure. That could lead to nasty surprises. I've experienced analogous consternations when the sort utility hasn't worked identically for identical input: it is heavily influenced by the (spit, spit) locale. That's why 99.9% of your scripts should prefix sort and grep with LC_ALL=C -- even when the input really is UTF-8. Should I now take it further and prefix all Python programs with LC_ALL=C? Probably not, since UTF-8 might cause sys.stdin to barf. And wherever there's text, there needs to be an encoding. No problem there, only should sys.stdin and sys.stdout carry the decoding/encoding out or should it be left for the program. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, Jun 5, 2014 at 5:16 PM, Marko Rauhamaa ma...@pacujo.net wrote: No problem there, only should sys.stdin and sys.stdout carry the decoding/encoding out or should it be left for the program. The most normal thing to do with the standard streams is to have them produce text, and as much as possible, you shouldn't have to go to great lengths to make that work. If, in Python, I say print("Hello, world!"), I expect that to produce a line of text on the screen, without my code having to encode that to bytes, figure out what sort of newline to add, etc, etc. Even if stdout isn't a tty, chances are you're still working with text. Only an extreme few Unix programs actually manipulate binary standard streams (some, like cat, will pipe binary through unchanged, but even cat assumes text for options like -n); those few should be the ones to have to worry about setting stdin and stdout to be binary. In the same way that we have double-quoted strings being Unicode strings, we should have print() and input() naturally just work with Unicode, which means they should negotiate encodings with the system without the programmer having to lift a finger. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Chris Angelico ros...@gmail.com: If, in Python, I say print("Hello, world!"), I expect that to produce a line of text on the screen, without my code having to encode that to bytes, figure out what sort of newline to add, etc, etc. That example in no way represents the typical Python program (if there is one). Only an extreme few Unix programs actually manipulate binary standard streams That's quite an assumption to make. we should have print() and input() naturally just work with Unicode No problem there. I couldn't imagine using either function for anything serious. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote: Steven D'Aprano wrote: The whole concept of stdin and stdout is based on the idea of having a console to read from and write to. Not really; stdin and stdout are frequently connected to files, or pipes to other processes. The console, if it exists, just happens to be a convenient default value for them. Even on a system without a console, they're still a useful abstraction. If you had kept reading my post, including the bits you cut out *wink*, you'd see that I did raise that same point. Having stdin and stdout trivially generalises to the idea of replacing them with other files, or pipes. But the idea of having standard input and standard output in the first place comes about because they are useful for the console. I gave the example of Mac, which didn't have a command-line interface at all, hence no console, no stdin, no stdout. If a system had no command line interface (hence no consoles), why would you bother with a *standard* input file and output file that are never used? But we were talking about encodings, and whether stdin and stdout should be text or binary by default. Well, one of the design principles behind unix is to make use of plain text wherever possible. What's plain text? *half a wink* It's a serious question. Some people think that good ol' plain text is EBCDIC, like IBM intended. To them, the letter A is synonymous with the byte 0xC1, and there's no need for an encoding (or so they think) because A *is* 0xC1. Of course, people on ASCII systems know better: who needs encodings when it is a universal fact that A *is* 0x41? *wink* Not just for stuff meant to be seen on the screen, but for stuff kept in files as well. As a result, most unix programs, most of the time, deal with text on stdin and stdout. So, it makes sense for them to be text by default. And wherever there's text, there needs to be an encoding. This is true whether a console is involved or not. Agreed. 
-- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, Jun 5, 2014 at 6:05 PM, Marko Rauhamaa ma...@pacujo.net wrote: Chris Angelico ros...@gmail.com: If, in Python, I say print("Hello, world!"), I expect that to produce a line of text on the screen, without my code having to encode that to bytes, figure out what sort of newline to add, etc, etc. That example in no way represents the typical Python program (if there is one). It's simpler than most, but use of print() is certainly quite common. A naive search of .py files in my /usr came up with five thousand instances of ' print(', and given that that search won't necessarily find a Python 2 print statement (and I'm on Debian Wheezy, so Py2 is the system Python), I think that's a fairly respectable figure. Only an extreme few Unix programs actually manipulate binary standard streams That's quite an assumption to make. Okay. Start listing some. You have (de)compression programs like gzip, which primarily work with files but can work with standard streams; some image or movie manipulation programs (eg avconv) can also read from stdin, although again, it's far more common to use files; cat will happily transmit binary untouched, but all its options (at least the ones I can see in my 'man cat') are for working with text. What else do you have? Let's see... grep, sort, less/more, sed, awk, these are all text manipulation programs. All your "give me info about the system" programs (ls, mount, pwd, hostname, date...) print text to stdout. Some also read from stdin, like md5sum and related. Piles and piles of programs that work with text. A small handful that work with binary, and most of them are more commonly used directly with files, not with pipes. The most common case is that it all be text. we should have print() and input() naturally just work with Unicode No problem there. I couldn't imagine using either function for anything serious. 
I don't know about those exact functions, but I do know that there are plenty of Python programs that use the console (take hg as one fairly hefty example). Maybe input() isn't all that heavily used, but certainly print() is a fine function. I can not only imagine using them seriously, I *have used* them, and their equivalents in other languages, seriously. If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Steven D'Aprano st...@pearwood.info: But the idea of having standard input and standard output in the first place comes about because they are useful for the console. I doubt that. Classic programs take input and produce output. Standard input and output are the default input and output. The textbook Pascal programs started: program myprogram(input, output); If a system had no command line interface (hence no consoles), why would you bother with a *standard* input file and output file that are never used? Because programs are supposed to do useful work. They consume input and produce output. That concept is older than computers themselves and is used to define things like computation, algorithm, halting etc. On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote: But we were talking about encodings, and whether stdin and stdout should be text or binary by default. Well, one of the design principles behind unix is to make use of plain text wherever possible. No, one of the design principles behind unix is that all data is bytes: memory, files, devices, sockets, pathnames. Yes, the ASCII-is-good-for-everybody assumption has been there since the beginning, but Python will not be able to hide the fact that there is no text data (in the Python sense). There are only bytes. UTF-8 beautifully gives text a second-class citizenship in unix/linux. It will never be granted first-class citizenship, though. As a result, most unix programs, most of the time, deal with text on stdin and stdout. So, it makes sense for them to be text by default. And wherever there's text, there needs to be an encoding. This is true whether a console is involved or not. Agreed. Disagreed strongly.
tcpdump -s 0 -w - >error.pcap
tar zxf - <python.tar.gz
sha1sum <smile.jpg
base64 -d <a.dat >a.exe
wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -
Unfortunately, the text/binary dichotomy breaks a beautiful principle in Python as well. 
In numerous contexts, any file-like object will be valid. Now there is no file-like object. Instead, you have text-file-like objects and binary-file-like objects, which require special attention since some operate on strings while others operate on bytes. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Chris Angelico ros...@gmail.com: If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? I want the standard streams to consume and produce bytes. I do a lot of system programming and connect processes to each other with socketpairs, pipes and the like. I have dealt with plugin APIs that communicate over stdin and stdout. Python is clearly on a crusade to make *text* a first class system entity. I don't believe that is possible (without casualties) in the linux world. Python text should only exist inside string objects. Marko -- https://mail.python.org/mailman/listinfo/python-list
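For programs that do want the standard streams as bytes, Python 3 exposes the underlying binary streams as `sys.stdin.buffer` and `sys.stdout.buffer`. The sketch below is a minimal illustration of a byte-transparent copy loop; `copy_bytes` is a hypothetical helper named for this example, and in-memory `io.BytesIO` objects stand in for the real streams so the snippet runs anywhere.

```python
import io


def copy_bytes(src, dst, chunk_size=65536):
    """Copy one binary stream to another without decoding anything."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Every possible byte value, including sequences that are not valid UTF-8.
payload = bytes(range(256))
sink = io.BytesIO()
copied = copy_bytes(io.BytesIO(payload), sink)

# In a real filter the same function would be called as:
#     copy_bytes(sys.stdin.buffer, sys.stdout.buffer)
```

Because nothing is decoded, arbitrary binary data passes through unchanged; the text layer only comes into play when a program reads from `sys.stdin` itself rather than from its `.buffer`.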
Re: Python 3.2 has some deadly infection
On Thursday, June 5, 2014 3:11:34 PM UTC+5:30, Marko Rauhamaa wrote: Steven D'Aprano wrote: But the idea of having standard input and standard output in the first place comes about because they are useful for the console. I doubt that. Classic programs take input and produce output. Standard input and output are the default input and output. The textbook Pascal programs started: program myprogram(input, output); If a system had no command line interface (hence no consoles), why would you bother with a *standard* input file and output file that are never used? Because programs are supposed to do useful work. They consume input and produce output. That concept is older than computers themselves and is used to define things like computation, algorithm, halting etc. On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote: But we were talking about encodings, and whether stdin and stdout should be text or binary by default. Well, one of the design principles behind unix is to make use of plain text wherever possible. No, one of the design principles behind unix is that all data is bytes: memory, files, devices, sockets, pathnames. Yes, the ASCII-is-good-for-everybody assumption has been there since the beginning, but Python will not be able to hide the fact that there is no text data (in the Python sense). There are only bytes. UTF-8 beautifully gives text a second-class citizenship in unix/linux. It will never be granted first-class citizenship, though. As a result, most unix programs, most of the time, deal with text on stdin and stdout. So, it makes sense for them to be text by default. And wherever there's text, there needs to be an encoding. This is true whether a console is involved or not. Agreed. Disagreed strongly.
tcpdump -s 0 -w - >error.pcap
tar zxf - <python.tar.gz
sha1sum <smile.jpg
base64 -d <a.dat >a.exe
wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -
Unfortunately, the text/binary dichotomy breaks a beautiful principle in Python as well. 
In numerous contexts, any file-like object will be valid. Now there is no file-like object. Instead, you have text-file-like objects and binary-file-like objects, which require special attention since some operate on strings while others operate on bytes. Pascal is for building pyramids — imposing, breathtaking, static structures built by armies pushing heavy blocks into place. — Alan Perlis Lisp is like a ball of mud. Add more and it's still a ball of mud — it still looks like Lisp. — Guy Steele There are two fundamental outlooks in computer science — structuring and universality. And they pull in opposite directions. Universality happens when a data-structure can hold everything — a universal data structure. Some of the most significant advances in CS come from a universalist vision: - von Neumann machine storing data+code in memory - Turing-tape able to store arbitrary turing machines (∴ universal TM) - Lisp program ≡ Lisp data - Stream of byte can handle/represent everything in Unix — memory, files, devices, sockets, pathnames. However after the allurement of universality is over, the realization dawns that we have a mess — Lisp is a 'mud-ball'. At which point people start needing to make distinctions — code and data, different data-structures, type-systems etc. IOW imposing structure on the mud-ball. Taking a broad view, while structuring trades the power for order, it is universality that adds significant power. Python is not as universal as Lisp — it has no homoiconicity. But it is close enough in that any variable/data-structure can contain any value. What Marko is saying is that by imposing the structuring of unicode on the outside (Unix) world of text=byte, significant power is lost. This is also Armin's crib. How significant that loss is, is yet to be seen… -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Rustom Mody rustompm...@gmail.com: What Marko is saying is that by imposing the structuring of unicode on the outside (Unix) world of text=byte, significant power is lost. Mostly I'm saying Python3 will not be able to hide the fact that linux data consists of bytes. It shouldn't even try. The linux OS outside the Python process talks bytes, not strings. A different OS might have different assumptions. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 17:45:34 +0300, Marko Rauhamaa wrote: Rustom Mody rustompm...@gmail.com: What Marko is saying is that by imposing the structuring of unicode on the outside (Unix) world of text=byte, significant power is lost. Mostly I'm saying Python3 will not be able to hide the fact that linux data consists of bytes. It shouldn't even try. The linux OS outside the Python process talks bytes, not strings. Data on pretty much *all* computers consists of bytes, regardless of the language or operating system. There may be a few esoteric or ancient machines from the Dark Ages that aren't based on bytes, and even fewer that aren't based on bits (ancient Soviet era mainframes, if any of them still survive), but they aren't important. Someday esoteric non-byte machines, perhaps quantum computers, or machines based on DNA, or nano-sized analog computers made of carbon atoms, say, will be important, but this is not that day. For now, bytes rule *everywhere*. Nevertheless, there are important abstractions that are written on top of the bytes layer, and in the Unix and Linux world, the most important abstraction is *text*. In the Unix world, text formats and text processing is much more common in user-space apps than binary processing. Perhaps the definitive explanation and celebration of the Unix way is Eric Raymond's The Art Of Unix Programming: http://www.catb.org/esr/writings/taoup/html/ch05s01.html -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 05/06/2014 15:45, Marko Rauhamaa wrote: Rustom Mody rustompm...@gmail.com: What Marko is saying is that by imposing the structuring of unicode on the outside (Unix) world of text=byte, significant power is lost. Mostly I'm saying Python3 will not be able to hide the fact that linux data consists of bytes. It shouldn't even try. The linux OS outside the Python process talks bytes, not strings. A different OS might have different assumptions. Marko I think I'm in the unix camp as well. I just think that an extra assumption on input output isn't always helpful. In python 3 byte strings are second class which I think is wrong; apparently pressure from influential users is pushing to make byte strings more first class which is a good thing. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 1:37 AM, Robin Becker ro...@reportlab.com wrote: I think I'm in the unix camp as well. I just think that an extra assumption on input output isn't always helpful. In python 3 byte strings are second class which I think is wrong; apparently pressure from influential users is pushing to make byte strings more first class which is a good thing. I wouldn't say they're second-class; it's more that the bytes type was considered to be more like a list of ints than like a Unicode string, and now that there are a few years' worth of real-world usage information to learn from, it's known that some more string-like operations will be extremely helpful. So now they're being added, which I agree is a good thing. Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point. The Py2 str does the first, the Py3 bytes does the second. That one's a bit hard to change, but what I'm not sure of is how significant this is to new-build Py3 code. Obviously it's a barrier to porting, but is it important on its own? However, that's still not really "byte strings are second class". ChrisA -- https://mail.python.org/mailman/listinfo/python-list
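The indexing difference mentioned here is easy to see in Python 3. A short sketch, with variable names invented for the example: indexing a bytes object yields an int, while slicing yields a length-1 bytes object, which is the usual way to get the Python 2 style result.

```python
data = b"abc"

# Indexing gives an integer in Python 3 ...
first_item = data[0]

# ... while a one-element slice gives a bytes object.
first_slice = data[0:1]

# Iterating also produces integers, like iterating over a list of ints.
as_ints = list(data)
```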
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: In the Unix world, text formats and text processing is much more common in user-space apps than binary processing. Perhaps the definitive explanation and celebration of the Unix way is Eric Raymond's The Art Of Unix Programming: http://www.catb.org/esr/writings/taoup/html/ch05s01.html Specifically, this from the opening paragraph: Text streams are a valuable universal format because they're easy for human beings to read, write, and edit without specialized tools. These formats are (or can be designed to be) transparent. He goes on to talk about network protocols, one of the best examples of this. I've idly speculated at times about the possibility of rewriting the Magic: The Gathering Online back-end with a view to making it easier to work with. Among other changes, I'd be wanting to make the client-server communication be plain text (an SMTP-style of protocol), with an external layer of encryption (TLS). This would mean that: 1) Internal testing can be done without TLS, making the communication absolutely transparent, easy to debug, easy to watch, everything. Adding TLS later would have zero impact on the critical code internally - it's just a layer around the outside. 2) Upgrades to crypto can simply follow industry best-practice. (Reminder, to anyone who might have been mad enough to consider this: DO NOT roll your own crypto! Ever! Even if you use a good library for the heavy lifting!) 3) A debug log of what the client has sent and received could be included, even in production, at very low cost. You don't need to decode packets and pretty-print them - you just take the lines of text, maybe adorn or color them according to which were sent/received, and dump them into a display box or log file somewhere. 4) The server is forced to acknowledge that the client might not be the one it expected. 
Not only do you get better security that way, but you could also call this a feature. 5) Therefore, you can debug the system with a simple TELNET or MUD client (okay, most MUD clients don't do SSL, but you can use openssl s_client). As someone who's debugged myriad issues using his trusty MUD client, I consider this to be a *huge* advantage. All it takes is a few simple rules, like: All communication is text, encoded down the wire as UTF-8, and consists of lines (terminated by U+000A) which consist of a word, a U+0020 space, and then parameters to the command. There, that's a rigorous definition that covers everything you'll need of it; compare with what Flash uses, by default: https://en.wikipedia.org/wiki/Action_Message_Format Sure, it might be slightly more compact going down the wire; but what do you really gain? Text wins. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
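The rule stated above (UTF-8 on the wire, lines terminated by U+000A, each line being a word, a U+0020 space, and then parameters) is small enough to sketch in a few lines of Python. The helper names `parse_line` and `format_line` and the sample command `SAY` are hypothetical, chosen only for this illustration; nothing here is taken from any real game protocol.

```python
def format_line(command: str, params: str = "") -> bytes:
    """Build one protocol line: command word, a space, parameters, U+000A."""
    if " " in command or "\n" in command or "\n" in params:
        raise ValueError("command must be one word; no newlines allowed")
    return f"{command} {params}\n".encode("utf-8")


def parse_line(raw: bytes):
    """Decode one UTF-8 protocol line into a (command, parameters) pair."""
    line = raw.decode("utf-8")
    if not line.endswith("\n"):
        raise ValueError("line must be terminated by U+000A")
    command, _, params = line[:-1].partition(" ")
    return command, params


# Round trip with a hypothetical command and non-ASCII parameters.
wire = format_line("SAY", "Hyv\u00e4\u00e4 y\u00f6t\u00e4")
parsed = parse_line(wire)
```

Since every message is a line of readable text, a debug log can simply record the decoded lines as they are sent and received.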
Re: Python 3.2 has some deadly infection
On 05/06/2014 16:50, Chris Angelico wrote: .. I wouldn't say they're second-class; it's more that the bytes type was considered to be more like a list of ints than like a Unicode string, and now that there are a few years' worth of real-world usage information to learn from, it's known that some more string-like operations will be extremely helpful. So now they're being added, which I agree is a good thing. in python 2 str and unicode were much more comparable. On balance I think just reversing them ie str --> bytes and unicode --> str was probably the right thing to do if the default conversions had been turned off. However making bytes a crippled thing was wrong. Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point. The Py2 str does the first, the Py3 bytes does the second. That one's a bit hard to change, but what I'm not sure of is how significant this is to new-build Py3 code. Obviously it's a barrier to porting, but is it important on its own? However, that's still not really "byte strings are second class". .. I dislike the current model, but that's because I had a lot of stuff to convert and probably made a bunch of blunders. The reportlab code is now a mess of hacks to keep it alive for 2.7 & >=3.3; I'm probably never going to be convinced that unicode types are good. Bytes are the underlying concept and should have remained so for simplicity's sake. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 16:37:23 +0100, Robin Becker wrote: In python 3 byte strings are second class which I think is wrong It certainly is wrong. bytes are just as much a first-class built-in type as list, int, float, bool, set, tuple and str. There may be missing functionality (relatively easy to add new functionality), and even poor design choices (like the foolish decision to have bytes display as if they were ASCII-ish strings, a silly mistake that simply reinforces the myth that bytes and ASCII are synonymous). Python 3.4 and 3.5 are in the process of rectifying as many of these mistakes as possible, e.g. adding back % formatting. But a few mistakes in the design of bytes' API no more makes it second-class than the lack of dict.contains_value() method makes dict second-class. By all means ask for better bytes functionality. But don't libel Python by pretending that bytes is anything less than one of the most important and fundamental types in the language. bytes are so important that there are TWO implementations for them, a mutable and immutable version (bytearray and bytes), while text strings only have an immutable version. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
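Two of the points made above can be shown briefly: the immutable/mutable pair `bytes` and `bytearray`, and the `%` formatting for bytes that PEP 461 added in Python 3.5. This is a small sketch that assumes Python 3.5 or later; the header text is an arbitrary example.

```python
immutable = b"spam"
mutable = bytearray(b"spam")

# bytearray supports item assignment ...
mutable[0] = ord("S")

# ... while bytes, like str, does not.
try:
    immutable[0] = ord("S")
    bytes_rejected_assignment = False
except TypeError:
    bytes_rejected_assignment = True

# Percent formatting on bytes (PEP 461, available from Python 3.5).
header = b"Content-Length: %d\r\n" % 42
```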
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 17:17:05 +0100, Robin Becker wrote: Bytes are the underlying concept and should have remained so for simplicity's sake. Bytes are the underlying concept for classes too. Do you think that an opaque unstructured blob of bytes is simpler to use than a class? How would an unstructured blob of bytes be simpler to use than an array of multi-byte characters? Earlier: I dislike the current model, but that's because I had a lot of stuff to convert and probably made a bunch of blunders. The reportlab code is now a mess of hacks to keep it alive for 2.7 & >=3.3; Although I've been critical of many of your statements, I am sympathetic to your pain. There's no doubt that the transition from the old, broken system of bytes masquerading as text can be hard, especially to those who never quite get past the misleading and false paradigm that bytes are ASCII. It may have been that there were better ways to have updated to 3.3; perhaps you were merely unfortunate to have updated too early, and had you waited to 3.4 or 3.5 things would have been better. I don't know. But whatever the situation, and despite our differences of opinion about Unicode, THANK YOU for having updated ReportLab to 3.3. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Nevertheless, there are important abstractions that are written on top of the bytes layer, and in the Unix and Linux world, the most important abstraction is *text*. In the Unix world, text formats and text processing is much more common in user-space apps than binary processing. That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers. It is great that lots of computer-to-computer formats are encoded in ASCII (~ UTF-8). However, nowhere in linux is there a real abstraction layer that processes Python-esque text. Case in point:
$ env | grep UTF
LANG=en_US.UTF-8
$ od -c <<<"Hyvää yötä" # Good night in Finnish
000 H y v 303 244 303 244 y 303 266 t 303 244 \n
017
The od utility is asked to display its input as characters. The locale info gives a hint that all text data is in UTF-8. Yet what comes out is bytes. How about:
$ wc -c <<<"Hyvää yötä"
15
$ tr 'ä' 'a' <<<"Hyvää yötä"
Hyv ya�taa
Grep is smarter:
$ grep v...y <<<"Hyvää yötä"
Hyvää yötä
which is why you should always prefix grep with LC_ALL=C in your scripts (makes it far faster, too). Marko -- https://mail.python.org/mailman/listinfo/python-list
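The numbers in the shell session above can be reproduced from Python, which makes the character/byte difference explicit. The following is a sketch for illustration only; the string is written with escapes so that it does not depend on the source file's own encoding.

```python
# "Hyvää yötä" plus the trailing newline that the shell adds.
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n"
encoded = text.encode("utf-8")

char_count = len(text)      # what a code-point-aware tool counts
byte_count = len(encoded)   # what `wc -c` counts

# The octal byte values that `od -c` prints for the non-ASCII letters.
a_umlaut_octal = [format(b, "o") for b in "\u00e4".encode("utf-8")]
o_umlaut_octal = [format(b, "o") for b in "\u00f6".encode("utf-8")]

# Replacing at the character level, which a byte-oriented `tr` cannot do.
replaced = text.replace("\u00e4", "a")
```

Eleven characters become fifteen bytes because each of the four non-ASCII letters takes two bytes in UTF-8.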
Re: Python 3.2 has some deadly infection
On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote: On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote: In the Unix world, text formats and text processing is much more common in user-space apps than binary processing. Perhaps the definitive explanation and celebration of the Unix way is Eric Raymond's The Art Of Unix Programming: http://www.catb.org/esr/writings/taoup/html/ch05s01.html Specifically, this from the opening paragraph: Text streams are a valuable universal format because they're easy for human beings to read, write, and edit without specialized tools. These formats are (or can be designed to be) transparent. A fact that stops being true when you tie up text with encodings. For two reasons: 1. The function/pair encode/decode mapping between byte-string and text cannot be a bijection because the byte-string set is larger than the text set. This is the error that Armin was hit by 2. Since there is not one but a zillion encodings possible we are not talking of one (possibly universal) data structure but a zillion ones: Text streams are a universal format - which encoding-ed form of text?? -- https://mail.python.org/mailman/listinfo/python-list
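The first point above, that not every byte string decodes to text, can be demonstrated directly. The sketch below uses a byte sequence that is valid Latin-1 but not valid UTF-8. It also shows the `surrogateescape` error handler (PEP 383), which is the mechanism Python 3 itself uses to round-trip undecodable bytes such as file names; bringing it in here is an editorial choice for the example, not something proposed in the message.

```python
# "café" encoded as Latin-1: the final byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9"

# Strict decoding fails: this byte string has no UTF-8 text counterpart.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape smuggles the bad byte through as a lone surrogate ...
recovered = raw.decode("utf-8", errors="surrogateescape")

# ... so that encoding with the same handler restores the original bytes.
round_trip = recovered.encode("utf-8", errors="surrogateescape")
```

The resulting string is not ordinary text (it contains a lone surrogate and cannot be encoded strictly), which is the price paid for making the round trip lossless.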
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 2:17 AM, Robin Becker ro...@reportlab.com wrote: in python 2 str and unicode were much more comparable. On balance I think just reversing them ie str --> bytes and unicode --> str was probably the right thing to do if the default conversions had been turned off. However making bytes a crippled thing was wrong. It's easy to build up functionality after the event. Maybe reportlab will have lots of hacks to support both 2.7 and 3.3, but in a few years you'll be able to say "supports 2.7 and 3.5" and take advantage of percent formatting and whatever else is added. But this is just the way that languages develop; you use them, you find what isn't easy, and you fix it. The nature of stability is that it takes time before you can depend on freshly-written functionality (contrast the extreme instability of running the version from source control - stuff might be fixed at any time, but you have to do all the work yourself to make sure your dependencies line up), but over time, you can depend on improvements making their way out there. Can you point to specific areas in which the bytes type is crippled? Comparing either to the Py2 str or the Py3 str, or to anything else? The Python core devs are listening, as evidenced by PEP 461. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, Jun 5, 2014 at 10:17 AM, Robin Becker ro...@reportlab.com wrote: in python 2 str and unicode were much more comparable. On balance I think just reversing them ie str --> bytes and unicode --> str was probably the right thing to do if the default conversions had been turned off. However making bytes a crippled thing was wrong. How should e.g. bytes.upper() be implemented then? The correct behavior is entirely dependent on the encoding. Python 2 just assumes ASCII, which at best will correctly upper-case some subset of the string and leave the rest unchanged, and at worst could corrupt the string entirely. There are some things that were dropped that should not have been, but my impression is that those are being worked on, for example % formatting in PEP 461. -- https://mail.python.org/mailman/listinfo/python-list
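The question about `bytes.upper()` can be made concrete. In Python 3 the method still exists on bytes but only touches ASCII letters, so on UTF-8 data it upper-cases part of the word and leaves the rest unchanged. The sketch below compares that with upper-casing the text first; the sample word is arbitrary.

```python
word = "hyv\u00e4\u00e4"  # "hyvää"

# Upper-casing the text, then encoding: every letter is handled.
text_upper = word.upper().encode("utf-8")

# Encoding first, then upper-casing the bytes: only ASCII letters change.
bytes_upper = word.encode("utf-8").upper()

# The two results differ in the bytes that represent the non-ASCII letters.
same_result = text_upper == bytes_upper
```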
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa ma...@pacujo.net wrote: That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers. Point of terminology: Linux is the kernel, everything you say below here is talking about particular programs. From what I understand, bash (just another Unix program) treats strings as sequences of codepoints, just as Python does; though its string manipulation is not nearly as rich as Python's, so it's harder to prove. Python is itself a Unix program, so you can do the exact same proofs and demonstrate that Linux is clearly Unicode-aware. It's not Linux you're testing. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 2:54 AM, Rustom Mody rustompm...@gmail.com wrote: On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote: On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote: In the Unix world, text formats and text processing is much more common in user-space apps than binary processing. Perhaps the definitive explanation and celebration of the Unix way is Eric Raymond's The Art Of Unix Programming: http://www.catb.org/esr/writings/taoup/html/ch05s01.html Specifically, this from the opening paragraph: Text streams are a valuable universal format because they're easy for human beings to read, write, and edit without specialized tools. These formats are (or can be designed to be) transparent. A fact that stops being true when you tie up text with encodings. For two reasons: 1. The function/pair encode/decode mapping between byte-string and text cannot be a bijection because the byte-string set is larger than the text set. This is the error that Armin was hit by 2. Since there is not one but a zillion encodings possible we are not talking of one (possibly universal) data structure but a zillion ones: Text streams are a universal format - which encoding-ed form of text?? As soon as you store or transmit ANY form of information, you need to worry about encodings. Ever heard of this thing called network byte order? It's part of taming the wilds of integer encodings. The theory is that the LC environment variables will carry all that crucial out-of-band information about encodings, and while the practice isn't perfect, it does still mean that there is such a thing as a text stream. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
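The remark about network byte order can be made concrete with a short sketch using the standard struct module. The value 0x01020304 is just an illustrative choice: the same integer has two different byte encodings, and reading it back with the wrong one silently yields a different number.

```python
import struct

value = 0x01020304

big = struct.pack('>I', value)      # big-endian, i.e. network byte order
little = struct.pack('<I', value)   # little-endian

# Same integer, two different byte sequences.
print(big.hex(), little.hex())

# Decoding with the right byte order recovers the value ...
right = struct.unpack('>I', big)[0]
# ... decoding with the wrong one gives a different number, with no error.
wrong = struct.unpack('<I', big)[0]
```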
Re: Python 3.2 has some deadly infection
On 6/5/2014 10:45 AM, Marko Rauhamaa wrote: Mostly I'm saying Python3 will not be able to hide the fact that linux data consists of bytes. It shouldn't even try. The linux OS outside the Python process talks bytes, not strings. A text file is a binary file wrapped with a codec to translate to and from a universal text format on input and output. Much of the time, the wrapping is a great user convenience. Since the wrapping is optional, nothing is forced or really hidden. A different OS might have different assumptions. Different OSes *do* have different assumptions. Both MacOSX and current Windows use (UCS-2 or) UTF-16 for text. It seems that unicode strings are better than ascii+??? strings as a universal basis for OS interfacing. For Windows, at least, the interface is much improved in Python 3. I understand that some, but not all, Latin alphabet *nix programmers wish that Python 3 continued to be strongly in their favor. But they are a small minority of the world's programmers, and Python 3 is aimed at everyone on all systems. -- Terry Jan Reedy
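The description of a text file as a binary file wrapped with a translation layer can be sketched with the io module. This is an illustration only: an in-memory io.BytesIO stands in for a real binary file, and UTF-8 is an assumed encoding.

```python
import io

raw = io.BytesIO()   # stands in for a file opened in binary mode
text = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')

# Write text through the wrapper; the wrapper encodes it on the way down.
text.write('\u00f1\u03b2\u0436\n')   # three non-ASCII letters plus a newline
text.flush()                         # push the encoded bytes into `raw`

encoded = raw.getvalue()             # what the binary layer actually holds
print(encoded)
```

The binary layer remains reachable underneath the wrapper (here as `raw`, on the standard streams as the `buffer` attribute), which is the sense in which the wrapping is optional.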
Re: Python 3.2 has some deadly infection
Terry Reedy tjre...@udel.edu: Different OSes *do* have different assumptions. Both MacOSX and current Windows use (UCS-2 or) UTF-16 for text. Linux can use anything for text; UTF-8 has become a de-facto standard. How text is represented is very different from whether text is a fundamental data type. A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go). I have no idea how opaque text files are in Windows or OS-X. For Windows, at least, the interface is much improved in Python 3. Yes, I get the feeling that Python is reaching out to Windows and OS-X and trying to make linux look like them. I understand that some, but not all, Latin alphabet *nix programmers wish that Python 3 continued to be strongly in their favor. But they are a small minority of the world's programmers, and Python 3 is aimed at everyone on all systems. Python allows linux programmers to write native linux programs. Maybe it allows Windows programmers to write native Windows programs. I certainly hope so. I don't want to have to write Windows programs that kinda run on linux. Java suffers from that: no import os in Java. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 6/5/2014 5:53 AM, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? I want the standard streams to consume and produce bytes.

Easy. Read the manual entry for stdxxx. To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc')

To make it easy, use bound methods.

myfilter.p
--
    import sys
    sysin = sys.stdin.buffer.read
    sysout = sys.stdout.buffer.write
    syserr = sys.stderr.buffer.write

    # filter code with calls to sysin, sysout, syserr.
---

The same trick of defining bound methods to save both writing and execution time is also useful for text filters when you use sys.stdin.read, etc, more than once in the text. When you try this, please report the result, either way.

I do a lot of system programming and connect processes to each other with socketpairs, pipes and the like. I have dealt with plugin APIs that communicate over stdin and stdout.

Now you know how to do so on Python 3.

Python is clearly on a crusade to make *text* a first class system entity. I don't believe that is possible (without casualties) in the linux world. Python text should only exist inside string objects.

You are clearly on a crusade to push a falsehood. Why? On Windows and, I believe, Mac, utf-16 encoded text (C widechar type) *is* a 'first class system entity'. The problem Python has with *nix is getting text bytes from the system in an unknown or, worse, wrongly-claimed encoding. The Python developers do their best to cope with the differences and peculiarities of the systems it runs on.

-- Terry Jan Reedy
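The bound-method approach described above could be fleshed out roughly as follows. This is a sketch under assumptions, not code from the thread: the helper name copy_bytes and the chunk size are hypothetical, and the filter simply copies its input unchanged.

```python
import io
import sys

def copy_bytes(src, dst, chunk_size=65536):
    """Copy all bytes from binary stream `src` to binary stream `dst`.

    Returns the number of bytes copied.
    """
    read, write = src.read, dst.write   # bound methods, looked up once
    total = 0
    while True:
        chunk = read(chunk_size)
        if not chunk:                   # b'' signals end of input
            break
        write(chunk)
        total += len(chunk)
    return total

def main():
    # As a filter, operate on the binary layer of the standard streams.
    copy_bytes(sys.stdin.buffer, sys.stdout.buffer)
    sys.stdout.buffer.flush()

# Exercise the helper with in-memory streams instead of real pipes.
demo_out = io.BytesIO()
copied = copy_bytes(io.BytesIO(b'\x00\xff binary data'), demo_out)
```

Because copy_bytes takes the streams as arguments, it can be tested with io.BytesIO objects, while main() is what a command-line filter would call.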
Re: Python 3.2 has some deadly infection
Terry Reedy tjre...@udel.edu: On 6/5/2014 5:53 AM, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? I want the standard streams to consume and produce bytes. Easy. Read the manual entry for stdxxx. To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc') This note from the manual is a bit vague: Note that the streams can be replaced with objects (like io.StringIO) that do not support the buffer attribute or the detach() method Can be replaced by who? By the Python developers? By me? By random library calls? Does it mean the buffer and detach are not guaranteed to stay with the API? Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 6/5/2014 4:21 PM, Marko Rauhamaa wrote: Terry Reedy tjre...@udel.edu: On 6/5/2014 5:53 AM, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? I want the standard streams to consume and produce bytes. Easy. Read the manual entry for stdxxx. To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc') This note from the manual is a bit vague: Note that the streams can be replaced with objects (like io.StringIO) that do not support the buffer attribute or the detach() method Can be replaced by who? By the Python developers? By me? By random library calls?

Fair question. The Python developers will not fiddle with stdxxx for 3rd party code on 3rd party systems. We do sometimes *temporarily* replace the streams with StringIO, either directly or via test.support when testing Python itself or stdlib modules. That is done in Lib/test, and except for testing StringIO, it is only done as a convenience, not a necessity. To test a binary stream filter, you would have to do something else, like read from and write to actual files on disk. Otherwise, you seem unlikely to sabotage yourself, even accidentally.

Random non-stdlib library calls could sabotage you. However, in my opinion, an imported 3rd party module should never modify std streams, with one exception. The exception would be a module whose entire purpose was to put the streams in a known state, as documented, and only if intentionally asked to.

Having said that, bound methods created (first) should work regardless of any subsequent manipulation of sys. Here is an experiment, run from an Idle editor.

    import sys
    sysout = sys.stdout.write
    sys.stdout = None
    sysout('works anyway\n')

    works anyway

(Of course, subsequent attempts to continue interactively fail. But that is not your use case.)
-- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote: On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote: That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers. Point of terminology: Linux is the kernel, everything you say below here is talking about particular programs. If it helps try the following substitution: s/Linux/Pretty much all the distros that use Linux for their OS kernel/ BTW the only (other) guy I know who insistently makes that distinction is Richard Stallman. Are you an emacs user by any chance wink? From what I understand, bash (just another Unix program) treats strings as sequences of codepoints, just as Python does; though its string manipulation is not nearly as rich as Python's, so it's harder to prove. Python is itself a Unix program, so you can do the exact same proofs and demonstrate that Linux is clearly Unicode-aware. It's not Linux you're testing. In these 'other programs' is it permissible to include the kernel itself? And then ask how Linux (in your and Stallman's sense) differs from Windows in how the filesystem handles things like filenames? -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody rustompm...@gmail.com wrote: On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote: On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote: That linux text is not the same thing as Python's text. Conceptually, Python text is a sequence of 32-bit integers. Linux text is a sequence of 8-bit integers. Point of terminology: Linux is the kernel, everything you say below here is talking about particular programs. If it helps try the following substitution: s/Linux/Pretty much all the distros that use Linux for their OS kernel/ You could look at the Debian Project, which is a full environment with everything you're talking about. And everything you say would be equally true of Debian Linux and Debian kfreebsd. :) BTW the only (other) guy I know who insistently makes that distinction is Richard Stallman. Are you an emacs user by any chance wink? Nope! Just a terminology nerd. :) From what I understand, bash (just another Unix program) treats strings as sequences of codepoints, just as Python does; though its string manipulation is not nearly as rich as Python's, so it's harder to prove. Python is itself a Unix program, so you can do the exact same proofs and demonstrate that Linux is clearly Unicode-aware. It's not Linux you're testing. In these 'other programs' is it permissible to include the kernel itself? And then ask how Linux (in your and Stallman's sense) differs from Windows in how the filesystem handles things like filenames? What are you testing of the kernel? Most of the kernel doesn't actually work with text at all - it works with integers, buffers of memory (which could be seen as streams of bytes, but might be almost anything), process tables, open file handles... but not usually text. To you, EAGAIN might be a bit of text, but to the Linux kernel, it's an integer (11 decimal, if I recall correctly). Is that some fancy new form of encoding? 
:) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 21:30:11 +0300, Marko Rauhamaa wrote: Terry Reedy tjre...@udel.edu: Different OSes *do* have different assumptions. Both MacOSX and current Windows use (UCS-2 or) UTF-16 for text. Linux can use anything for text; UTF-8 has become a de-facto standard. How text is represented is very different from whether text is a fundamental data type. A fundamental text file is such that ordinary operating system facilities can't see inside the black box (that is, they are *not* encoded as far as the applications go). Wait, are they black-boxes to the *operating system* or to *applications*? They aren't the same thing. In any case, I reject your premise. ALL data types are constructed on top of bytes, and so long as you allow applications *any way* to coerce data types to different data types, you allow them to see inside the black box. I can extract the four bytes from a C long integer, but that doesn't mean that C longs aren't fundamental data types in Unix/Linux. I have no idea how opaque text files are in Windows or OS-X. Exactly as opaque as they are in Unix, which is to say not at all. Just open the file in binary mode, and voilà you see the underlying bytes. All you're doing is pointing out that, in modern electronic computers, the fundamental data structure which underlies all others (the indivisible protons and neutrons, so to speak, only there are 256 of them rather than 2) is the byte. We know this, and don't dispute it. (Like protons and neutrons, we can see inside bytes to the quark-like bits that make up bytes. Like quarks, bits do not exist in isolation, but only inside bytes.) For Windows, at least, the interface is much improved in Python 3. Yes, I get the feeling that Python is reaching out to Windows and OS-X and trying to make linux look like them. Unicode support in OS-X is (I have been assured) very good, probably better than Linux. Apple has very high standards when it comes to their apps, and provides rich Unicode-aware APIs.
But Linux Unicode support is much better than Windows. Unicode support in Windows is crippled by continued reliance on legacy code pages, and by the assumption deep inside the Windows APIs that Unicode means 16 bit characters. See, for example, the amount of space spent on fixing Windows Unicode handling here: http://www.utf8everywhere.org/ -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Thu, 05 Jun 2014 23:21:35 +0300, Marko Rauhamaa wrote: Terry Reedy tjre...@udel.edu: On 6/5/2014 5:53 AM, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: If the standard streams are so crucial, why are their most obvious interfaces insignificant to you? I want the standard streams to consume and produce bytes. Easy. Read the manual entry for stdxxx. To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc') This note from the manual is a bit vague: Note that the streams can be replaced with objects (like io.StringIO) that do not support the buffer attribute or the detach() method Can be replaced by who? By the Python developers? By me? By random library calls? By you. sys.stdout and friends are writable. Any code you call may have replaced them with another file-like object, and you should honour that. The API could have/should have been a little more friendly, but it's conceptually simple: * Does sys.stdout have a buffer attribute? Then write raw bytes to the buffer. * If not, then write raw bytes to sys.stdout. * If either fails, then somebody has replaced stdout with something weird, and they deserve whatever horrible fate their damn fool move causes. It's not your responsibility to try to keep your application running under bizarre circumstances. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
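The fallback logic listed above (use the buffer attribute if present, otherwise write to the stream itself, otherwise let it fail) might be written roughly like this. The function name write_raw is hypothetical, and this is a sketch of the described steps rather than anything from the standard library.

```python
import io
import sys

def write_raw(data, stream=None):
    """Write bytes to `stream.buffer` if it exists, else to `stream` itself.

    If neither accepts bytes, the resulting exception is left to propagate.
    """
    if stream is None:
        stream = sys.stdout
    target = getattr(stream, 'buffer', stream)
    target.write(data)

# A text stream with a binary layer underneath: bytes go to the buffer.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
write_raw(b'abc', wrapped)

# A plain binary stream with no `buffer` attribute: bytes go straight in.
plain = io.BytesIO()
write_raw(b'xyz', plain)

# A replacement such as io.StringIO has no buffer and rejects bytes.
try:
    write_raw(b'oops', io.StringIO())
    string_io_failed = False
except TypeError:
    string_io_failed = True
```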
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: In any case, I reject your premise. ALL data types are constructed on top of bytes, Only in a very dull sense. and so long as you allow applications *any way* to coerce data types to different data types, you allow them to see inside the black box. I can't see the bytes inside Python objects, including strings, and that's how it is supposed to be. Similarly, I can't (easily) see how files are laid out on hard disks. That's a true abstraction. Nothing in linux presents data, though, except through bytes. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Can be replaced by who? By the Python developers? By me? By random library calls? By you. sys.stdout and friends are writable. Any code you call may have replaced them with another file-like object, and you should honour that. I can of course overwrite even sys and os and open and all. That hardly merits mentioning in the API documentation. What I'm afraid of is that the Python developers are reserving the right to remove the buffer and detach attributes from the standard streams in a future version. That would be terrible. If it means some other module is allowed to commandeer the standard streams, that would be bad as well. Worst of all, I don't know why the caveat had to be there. Or is it maybe because some python command line options could cause buffer and detach not to be there? That would explain the caveat, but still would be kinda sucky. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 9:30 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Can be replaced by who? By the Python developers? By me? By random library calls? By you. sys.stdout and friends are writable. Any code you call may have replaced them with another file-like object, and you should honour that. I can of course overwrite even sys and os and open and all. That hardly merits mentioning in the API documentation. What I'm afraid of is that the Python developers are reserving the right to remove the buffer and detach attributes from the standard streams in a future version. That would be terrible. If it means some other module is allowed to commandeer the standard streams, that would be bad as well. Worst of all, I don't know why the caveat had to be there. Or is it maybe because some python command line options could cause buffer and detach not to be there? That would explain the caveat, but still would be kinda sucky. It's more that replacing sys.std* is considered reasonably normal (unlike, say, replacing sys.float_info, which would be a weird thing to do); and you could replace them with something that doesn't have those attributes. If you're running a top-level script and you never import anything that changes the streams, you should be able to depend on those always being there. ChrisA
Re: Python 3.2 has some deadly infection
On 6/5/2014 7:30 PM, Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Can be replaced by who? By the Python developers? By me? By random library calls? By you. sys.stdout and friends are writable. Any code you call may have replaced them with another file-like object, and you should honour that. I can of course overwrite even sys and os and open and all. That hardly merits mentioning in the API documentation. What I'm afraid of is that the Python developers are reserving the right to remove the buffer and detach attributes from the standard streams in a future version.

No, not at all.

That would be terrible.

Agreed.

If it means some other module is allowed to commandeer the standard streams, that would be bad as well.

I think that, for the most part, library modules should either open a file given a filename from outside or read from and write to open files handed to them from outside, but not hard-code the std streams. The module doc should say if the file (name or object) must be text or in particular binary.

The warning is also a hint as to how to solve a problem, such as testing a binary filter. Assume the module reads from and writes to .buffer and has a main function. One approach, untested:

    import sys, io, unittest
    from mod import main

    class Binstd:
        def __init__(self):
            # each stand-in stream gets its own in-memory binary buffer
            self.buffer = io.BytesIO()

    sys.stdin = Binstd()
    sys.stdout = Binstd()

    sys.stdin.buffer.write(b'test data')
    sys.stdin.buffer.seek(0)
    main()
    out = sys.stdout.buffer.getvalue()
    # test that out is as expected for the input
    # seek to 0 and truncate for more tests

Worst of all, I don't know why the caveat had to be there.

Because the streams can be replaced for a variety of good reasons, as above.

Or is it maybe because some python command line options could cause buffer and detach not to be there? That would explain the caveat, but still would be kinda sucky.

The doc set documents the Python command line options, as well as any that are CPython specific.
It is possible that some implementation could add one to open stdxyz in binary mode. CPython does not really need that. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Friday, June 6, 2014 4:22:22 AM UTC+5:30, Chris Angelico wrote: On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody wrote: And then ask how Linux (in your and Stallman's sense) differs from Windows in how the filesystem handles things like filenames? What are you testing of the kernel? Most of the kernel doesn't actually work with text at all - it works with integers, buffers of memory (which could be seen as streams of bytes, but might be almost anything), process tables, open file handles... but not usually text. To you, EAGAIN might be a bit of text, but to the Linux kernel, it's an integer (11 decimal, if I recall correctly). Is that some fancy new form of encoding? :) | Thanks to the properties of UTF-8 encoding, the Linux kernel, the | innermost and lowest-level part of the operating system, can | handle Unicode filenames without even having the user tell it | that UTF-8 is to be used. All character strings, including | filenames, are treated by the kernel in such a way that THEY | APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and | does not need to know whether a pair of consecutive bytes should | logically be treated as two characters or a single one. The only | risk of the kernel being fooled would be, for example, for a | filename to contain a multibyte Unicode character encoded in such | a way that one of the bytes used to represent it was a slash or | some other character that has a special meaning in file | names. Fortunately, as we noted, UTF-8 never uses ASCII | characters for encoding multibyte characters, so neither the | slash nor any other special character can appear as part of one | and therefore there is no risk associated with using Unicode in | filenames. | | Filesystems found on Microsoft Windows machines (NTFS and FAT) | are different in that THEY STORE FILENAMES ON DISK IN SOME | PARTICULAR ENCODING. The kernel must translate this encoding to | the system encoding, which will be UTF-8 in our case. 
| | If you have Windows partitions on your system, you will have to | take care that they are mounted with correct options. For FAT and | ISO9660 (used by CD-ROMs) partitions, option utf8 makes the | system translate the filesystem's character encoding to | UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should | also work). [Emphases mine] From: http://michal.kosmulski.org/computing/articles/linux-unicode.html -- https://mail.python.org/mailman/listinfo/python-list
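The property the quoted article relies on, that UTF-8 never uses ASCII byte values inside a multibyte sequence, can be checked exhaustively in Python 3. The function name below is a hypothetical helper for this sketch, and the UTF-16 comparison is an added illustration that is not part of the quoted text.

```python
def multibyte_utf8_avoids_ascii():
    """Return True if no non-ASCII code point encodes to any byte below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue   # surrogate code points cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode('utf-8')):
            return False
    return True

# So a byte-oriented path parser never sees a stray '/' (0x2F) inside a
# UTF-8 encoded character.
japanese = '\u65e5\u672c\u8a9e'
slash_in_utf8 = b'/' in japanese.encode('utf-8')

# By contrast, UTF-16 can contain the byte 0x2F inside a non-ASCII character:
# U+012F in little-endian UTF-16 is the two bytes 0x2F 0x01.
slash_in_utf16 = b'/' in '\u012f'.encode('utf-16-le')
```

The exhaustive check loops over roughly a million code points, so it takes a moment to run.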
Re: Python 3.2 has some deadly infection
On Fri, Jun 6, 2014 at 1:11 PM, Rustom Mody rustompm...@gmail.com wrote: All character strings, including | filenames, are treated by the kernel in such a way that THEY | APPEAR TO IT ONLY AS STRINGS OF BYTES. Yep, the real issue here is file systems, not the kernel. But yes, this is one of the very few places where the kernel deals with a string - and because of the hairiness of having to handle myriad file systems in a single path (imagine multiple levels of remote mounts - I've had a case where I mount via sshfs a tree that includes a Samba mount point, and you can go a lot deeper than that), the only thing it can do is pass the bytes on unchanged. Which means, in reality, the kernel doesn't actually do *anything* with the string, it just passes it right along to the file system. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
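Passing filename bytes through unchanged is also possible from Python 3 text strings, by way of the surrogateescape error handler (PEP 383), which os.fsdecode() and os.fsencode() use on POSIX systems. The sketch below applies the handler explicitly so that it does not depend on the platform; the file name is a made-up example.

```python
# A file name in Latin-1 bytes, which is not valid UTF-8.
raw_name = b'caf\xe9.txt'

# A strict decode fails on the stray byte.
try:
    raw_name.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, the undecodable byte 0xE9 is smuggled through
# as the lone surrogate U+DCE9 ...
as_text = raw_name.decode('utf-8', 'surrogateescape')

# ... and encoding with the same handler restores the original bytes exactly.
round_trip = as_text.encode('utf-8', 'surrogateescape')
```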
Re: Python 3.2 has some deadly infection
On Friday, June 6, 2014 8:50:57 AM UTC+5:30, Chris Angelico wrote: kernel doesn't actually do *anything* with the string, it just passes it right along to the file system. Which is what Marko (and others like Armin) are asking of python (treated as a processing 'kernel'): I know what I am doing with my bytes -- please channel/funnel them around as requested without being unnecessarily and unrequestedly 'intelligent' -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 06/05/2014 04:30 PM, Marko Rauhamaa wrote: What I'm afraid of is that the Python developers are reserving the right to remove the buffer and detach attributes from the standard streams in a future version. Being afraid is silly. If you have a question, ask it. -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Tue, 03 Jun 2014 15:18:19 +0100, Robin Becker wrote: The problem is that casual readers like Robin sometimes jump from 'In Python 3, it can be hard to do something one really ought not to do' to 'Binary I/O is hard in Python 3' -- which it is not. I'm fairly casual and I did understand that the rant was a bit over the top. For fairly practical reasons I have always regarded the std streams as allowing binary data and always objected to having to open files in python with a 't' or 'b' mode to cope with line ending issues. Isn't it a bit old fashioned to think everything is connected to a console? The whole concept of stdin and stdout is based on the idea of having a console to read from and write to. Otherwise, what would be the point? Classic Mac (pre OS X) had no command line interface, and nothing even remotely like stdin and stdout. But once you have a console, stdin, stdout, and stderr become useful. And once you have them, then you can extend the concept using redirection and pipes. But fundamentally, stdin and stdout are about consoles. I think the idea that we only give meaning to binary data using encodings is a bit limiting. A zip or gif file has structure, but I don't think it's reasonable to regard such a file as having an encoding in the python unicode sense. In the Unicode sense? Of course not, that would be silly. The concept of encodings is bigger than just text, and in that sense zip compression is an encoding which encodes non-random data into a different format which generally takes up less space. -- Steven D'Aprano http://import-that.dreamwidth.org/
Re: Python 3.2 has some deadly infection
Steven D'Aprano wrote: The whole concept of stdin and stdout is based on the idea of having a console to read from and write to. Not really; stdin and stdout are frequently connected to files, or pipes to other processes. The console, if it exists, just happens to be a convenient default value for them. Even on a system without a console, they're still a useful abstraction. But we were talking about encodings, and whether stdin and stdout should be text or binary by default. Well, one of the design principles behind unix is to make use of plain text wherever possible. Not just for stuff meant to be seen on the screen, but for stuff kept in files as well. As a result, most unix programs, most of the time, deal with text on stdin and stdout. So, it makes sense for them to be text by default. And wherever there's text, there needs to be an encoding. This is true whether a console is involved or not. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: On Tue, 03 Jun 2014 15:18:19 +0100, Robin Becker wrote: Isn't it a bit old fashioned to think everything is connected to a console? The whole concept of stdin and stdout is based on the idea of having a console to read from and write to. Otherwise, what would be the point? Classic Mac (pre OS X) had no command line interface, and nothing even remotely like stdin and stdout. But once you have a console, stdin, stdout, and stderr become useful. And once you have them, then you can extend the concept using redirection and pipes. But fundamentally, stdin and stdout are about consoles. We can consider the pipe abstraction to be fundamental. Decades of usage prove the usefulness of a pipeline of processes, e.g., tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q See http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/ Whether or not a pipe is connected to a tty is a small detail. stdin/stdout is about pipes, not consoles. -- akira
Re: Python 3.2 has some deadly infection
On 6/3/2014 1:16 AM, Gregory Ewing wrote: Terry Reedy wrote: The issue Armin ran into is this. He wrote a library module that makes sure the streams are binary. Seems to me he made a mistake right there. A library should *not* be making global changes like that. It can obtain binary streams from stdin and stdout for its own use, but it shouldn't stuff them back into sys.stdin and sys.stdout. If he had trouble because another library did that, then that library is broken, not Python. I agree. The example in Armin's blog rant was an application, an empty unix filter (ie, simplified cat clone). For that example, the complex code he posted to show how awful Python 3 is, is unneeded. When I asked why he did not directly use the fix in the doc, without the scaffolding, he switched to the 'library' module explanation. The problem is that casual readers like Robin sometimes jump from 'In Python 3, it can be hard to do something one really ought not to do' to 'Binary I/O is hard in Python 3' -- which it is not. -- Terry Jan Reedy
Re: Python 3.2 has some deadly infection
The problem is that casual readers like Robin sometimes jump from 'In Python 3, it can be hard to do something one really ought not to do' to 'Binary I/O is hard in Python 3' -- which it is not. I'm fairly casual and I did understand that the rant was a bit over the top. For fairly practical reasons I have always regarded the std streams as allowing binary data and always objected to having to open files in python with a 't' or 'b' mode to cope with line ending issues. Isn't it a bit old fashioned to think everything is connected to a console? I think the idea that we only give meaning to binary data using encodings is a bit limiting. A zip or gif file has structure, but I don't think it's reasonable to regard such a file as having an encoding in the python unicode sense. -- Robin Becker
Re: Python 3.2 has some deadly infection
On Wed, Jun 4, 2014 at 12:18 AM, Robin Becker ro...@reportlab.com wrote: I think the idea that we only give meaning to binary data using encodings is a bit limiting. A zip or gif file has structure, but I don't think it's reasonable to regard such a file as having an encoding in the python unicode sense. Of course it doesn't. Those are binary files. Ultimately, every file is binary; but since the vast majority of them actually contain text, in one of a handful of common encodings, it's nice to have an easy way to open a text file. You could argue that rb should be the default, rather than rt, but that's a relatively minor point. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Mon, 02 Jun 2014 12:10:48 +0100, Robin Becker wrote: there seems to be an implicit assumption in python land that encoded strings are the norm. On virtually every computer I encounter that assumption is wrong. The vast majority of bytes in most computers is not something that can be easily printed out for humans to read. I suppose some clever pythonista can figure out an encoding to read my .o / .so etc files, but they are practically meaningless to a unicode program today. Same goes for most image formats and media files. Browsers routinely encounter mis/un-encoded pages. If you include image, video and sound files, you are probably correct that most content of files is binary. Outside of those three kinds of files, I would expect that *by far* the single largest kind of file is text. Some text is wrapped in a binary layer, e.g. .doc, .odt, etc. but an awful lot of it is good old human readable text, including web pages (html) and XML. Every programming language I know of defaults to opening files in text mode rather than binary mode. There may be exceptions, but reading and writing text is ubiquitous while writing .o and .so files is not. In python I would have preferred for bytes to remain the default io mechanism, at least that would allow me to decide if I need any decoding. That implies that you're opening files in binary mode by default. It also implies that even something as trivial as writing the string Hello World to a file (stdout is a file) is impossible until you've learned about encodings and know which encoding you need. I really don't think that's a good plan, for any language, but especially a language like Python which is intended for beginners as well as experts. The Python 2 approach, where stdout in binary but tries really hard to pretend to be a superset of ASCII, is simply broken. It works well for trivial examples, while breaking in surprising and hard-to-diagnose ways in others. 
It violates the Zen, errors should not be ignored unless explicitly silenced, instead silently failing and giving moji-bake:

[steve@ando ~]$ python2.7 -c "import sys; sys.stdout.write(u'ñβж\n')"
ñβж

Changing to print doesn't help:

[steve@ando ~]$ python2.7 -c "print u'ñβж'"
ñβж

Python 3 works correctly, whether you use print or sys.stdout:

[steve@ando ~]$ python3.3 -c "import sys; sys.stdout.write(u'ñβж\n')"
ñβж

(although I haven't tested it on Windows). -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
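As a rough illustration of what the Python 3 text layer does on output, the sketch below wraps an in-memory byte stream in io.TextIOWrapper, which is the class Python 3 uses for text-mode streams. The UTF-8 encoding is chosen explicitly here as an assumption; the real sys.stdout takes its encoding from the environment.

```python
import io

raw = io.BytesIO()  # stands in for the binary layer under stdout
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

out.write("ñβж\n")  # str goes in ...
out.flush()
written = raw.getvalue()  # ... encoded bytes come out underneath

# Passing bytes to the text layer is an error rather than a silent guess.
try:
    out.write(b"raw bytes")
    rejected = False
except TypeError:
    rejected = True
```

The failure is loud and immediate, which is the behaviour the Zen quotation above asks for.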
Re: Python 3.2 has some deadly infection
On Wed, Jun 4, 2014 at 2:34 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Outside of those three kinds of files, I would expect that *by far* the single largest kind of file is text. Some text is wrapped in a binary layer, e.g. .doc, .odt, etc. but an awful lot of it is good old human readable text, including web pages (html) and XML. In terms of file I/O in Python, text wrapped in a binary layer has to be treated as binary, not text. There's no difference between a JPEG file that has some textual EXIF information and an ODT file that's a whole lot of zipped up text; both of them have to be read as binary, then unpacked according to the container's specs, and then the text portion decoded according to an encoding like UTF-8. But you're quite right that a large proportion of files out there really are text. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
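The "read as binary, unpack the container, then decode the text" sequence can be sketched with the standard zipfile module. The member name content.txt and the sample text are invented for the example; a real ODT file has its own internal layout, which is not modelled here.

```python
import io
import zipfile

# Build a small zip container in memory holding UTF-8 encoded text.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.txt", "日本語のテキスト".encode("utf-8"))
container_bytes = buffer.getvalue()

# Step 1: the container itself is handled as binary data.
# Step 2: unpack according to the container format.
with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
    member_bytes = archive.read("content.txt")

# Step 3: only now decode the text portion with its encoding.
text = member_bytes.decode("utf-8")
```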
Re: Python 3.2 has some deadly infection
On 6/3/2014 10:18 AM, Robin Becker wrote: I think the idea that we only give meaning to binary data using encodings is a bit limiting. On the contrary, it is liberating. The fact that bits have no meaning other than 'a choice between two alternatives' means

1. any binary choice - 0/1, -/+, false/true, no/yes, closed/open, male/female, sad/happy, evil/good, low/high, and so on ad infinitum, can be encoded into a bit. Since any such pair could have been reversed, the mapping between bit states and the pair is arbitrary, and constitutes an encoding.

2. any discrete or digitized information that constitutes a choice between multiple alternatives can be encoded into a sequence of bits.

This crucial discovery is the basis of Shannon's 1948 paper and of the information age that started about then. A zip or gif file has structure, but I don't think it's reasonable to regard such a file as having an encoding in the python unicode sense. I am not quite sure what you are denying. Color encodings are encodings as much as character encodings, even if they encode different information. Both encode sensory experience and conceptual correlates into a sequence of bits usually organized for convenience into a sequence of bytes or other chunks. There is another similarity. Text files often have at least two levels of encoding. First is the character encoding; that is all unicode handles. Then there is the text structure encoding, which is sometimes called the 'file format'. Most text files are at least structured into 'lines'. For this, they use encoded line endings, and there have been multiple choices for this and at least 2 still in common use (which is a nuisance). Similarly, a pixel (bitmap!) image file must encode the color of each pixel and a higher-level structuring of pixels into a 2D array of rows of lines. Just as with text, there have been and still are multiple encodings at both levels. Also, similarly, the receiver of an image must know what encoding the sender used. 
Vector graphics is a different way of encoding certain types of images, and again there are multiple ways to encode the information into bits. The encoding hassle here is similar to that for text. One of the frustrations of tk is that it natively uses just one old dialect of postscript (.ps) to output screen images. One has to find and install an extension to a modern Scalable Vector Graphics (.svg) encoding. Because Python is programmed with lines of text, it must come with minimal text decoding. If Python were programmed with drawings, it would come with one or more drawing decoders and a drawing equivalent of a lexer. It might even have a special 'rd' (read drawing) mode for open. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
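A small sketch of the idea that the same bytes carry different meanings under different encodings. The six byte values and the three interpretations (ASCII text, RGB pixels, little-endian 16-bit integers) are arbitrary picks for illustration, not anything from the discussion itself.

```python
import struct

data = bytes([72, 105, 33, 0, 128, 255])

# Interpretation 1: the first three bytes as ASCII characters.
as_text = data[:3].decode("ascii")

# Interpretation 2: consecutive triples as (red, green, blue) pixels.
as_pixels = [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]

# Interpretation 3: three unsigned 16-bit integers, little-endian.
as_uint16 = struct.unpack("<3H", data)
```

None of the three readings is more "true" than the others; the receiver has to be told which one the sender intended.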
Re: Python 3.2 has some deadly infection
Tim Delaney timothy.c.delaney at gmail.com writes: I also should have been more clear that *in the particular situation I was talking about* iso-latin-1 as default would be the right thing to do, not in the general case. Quite often we won't know the correct encoding until we've executed a command via ssh - iso-latin-1 will allow us to extract the info we need (which will generally be 7-bit ASCII) without the possibility of an invalid encoding. Sure we may get mojibake, but that's better than the alternative when we don't yet know the correct encoding. Latin-1 is one of those legacy encodings which needs to die, not to be entrenched as the default. My terminal uses UTF-8 by default (as it should), and if I use the terminal to input δжç, Python ought to see what I input, not Latin-1 moji-bake. For some purposes, there needs to be a way to treat an arbitrary stream of bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a convenient way to do that. For that purpose, Python3 has the bytes() type. Read the data as is, then decode it to a string once you figured out its encoding. Wolfgang -- https://mail.python.org/mailman/listinfo/python-list
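A minimal sketch of "read the data as is, then decode once the encoding is known", loosely modelled on the situation described earlier in the thread where a remote call reports the encoding in plain ASCII. The function name, the reply format and the sample payload are all hypothetical.

```python
def decode_when_known(encoding_reply: bytes, payload: bytes) -> str:
    """Decode payload using the encoding named in an ASCII-only reply."""
    # The reply is assumed to be pure ASCII, so decoding it cannot go wrong
    # in an encoding-dependent way.
    encoding = encoding_reply.decode("ascii").strip()
    return payload.decode(encoding)


# Both values are kept as bytes until the encoding has been established.
encoding_reply = b"ISO-8859-7\n"             # hypothetical remote answer
payload = "αβγ".encode("iso-8859-7")         # hypothetical remote data
text = decode_when_known(encoding_reply, payload)
```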
Re: Python 3.2 has some deadly infection
On 2 June 2014 17:45, Wolfgang Maier wolfgang.ma...@biologie.uni-freiburg.de wrote: Tim Delaney timothy.c.delaney at gmail.com writes: For some purposes, there needs to be a way to treat an arbitrary stream of bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a convenient way to do that. For that purpose, Python3 has the bytes() type. Read the data as is, then decode it to a string once you figured out its encoding. I know that, you know that. Convincing other people of that is the difficulty. I probably should have mentioned it, but in my case it's not even Python (Java). It's exactly the same principle - an assumption was made that has become entrenched due to the fear of breakage. If they'd been forced to think about encodings up-front, it shouldn't have been an issue, which was the point I was trying to make. In Java, it's much worse. At least with Python you can perform string-like operations on bytes. In Java you have to convert it to characters before you can really do anything with it, so people just use the default encoding all the time - especially if they want the convenience of line-by-line reading using BufferedReader ... Tim Delaney -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Mon, Jun 2, 2014 at 7:02 PM, Tim Delaney timothy.c.dela...@gmail.com wrote: In Java, it's much worse. At least with Python you can perform string-like operations on bytes. In Java you have to convert it to characters before you can really do anything with it, so people just use the default encoding all the time - especially if they want the convenience of line-by-line reading using BufferedReader ... What exactly is line-by-line reading with bytes? As I understand it, lines are defined by characters. If you mean reading a stream of bytes and dividing it on 0x0A, then surely you can do that, but that assumes an ASCII-compatible encoding. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
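The ASCII-compatibility assumption behind "divide on 0x0A" can be shown with a short sketch. The sample text is arbitrary; UTF-16-LE is used as an example of an encoding that is not ASCII-compatible.

```python
text = "one\ntwo\n"

# In UTF-8 (ASCII-compatible) the byte 0x0A only ever means a newline.
utf8_lines = text.encode("utf-8").split(b"\n")

# In UTF-16-LE every character is two bytes, so splitting on the single
# byte 0x0A cuts a code unit in half.
utf16_lines = text.encode("utf-16-le").split(b"\n")

try:
    utf16_lines[1].decode("utf-16-le")
    second_chunk_ok = True
except UnicodeDecodeError:
    second_chunk_ok = False
```

The second UTF-16 chunk has an odd number of bytes and cannot be decoded, so byte-level line splitting only works when the encoding is known to leave 0x0A alone.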
Re: Python 3.2 has some deadly infection
I probably should have mentioned it, but in my case it's not even Python (Java). It's exactly the same principle - an assumption was made that has become entrenched due to the fear of breakage. If they'd been forced to think about encodings up-front, it shouldn't have been an issue, which was the point I was trying to make. there seems to be an implicit assumption in python land that encoded strings are the norm. On virtually every computer I encounter that assumption is wrong. The vast majority of bytes in most computers is not something that can be easily printed out for humans to read. I suppose some clever pythonista can figure out an encoding to read my .o / .so etc files, but they are practically meaningless to a unicode program today. Same goes for most image formats and media files. Browsers routinely encounter mis/un-encoded pages. In Java, it's much worse. At least with Python you can perform string-like operations on bytes. In Java you have to convert it to characters before you can really do anything with it, so people just use the default encoding all the time - especially if they want the convenience of line-by-line reading using BufferedReader ... .. In python I would have preferred for bytes to remain the default io mechanism, at least that would allow me to decide if I need any decoding. As the cat example http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ showed these extra assumptions are sometimes really in the way. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 6/2/2014 7:10 AM, Robin Becker wrote: there seems to be an implicit assumption in python land that encoded strings are the norm. I don't know why you say that. To have a stream of bytes interpreted as characters, open in text mode and give the encoding. Otherwise, open in binary mode and apply whatever encoding you want. Image programs like Pil or Pillow assume that bytes have image encodings. Same idea. On virtually every computer I encounter that assumption is wrong. Except for the std streams (see below), it is also not part of Python. I will just point out that bytes are given meaning by encoding meaning into them. Unicode attempts to reduce the hundreds of text encodings to just a few, and mostly to just one for external storage and transmission. In python I would have preferred for bytes to remain the default io Do you really think that defaulting the open mode to 'rb' rather than 'rt' would be a better choice for newbies? mechanism, at least that would allow me to decide if I need any decoding. Assuming that 'rb' is actually needed more than 'rt' for you in particular, is it really such a burden to give a mode more often than not? As the cat example http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ showed these extra assumptions are sometimes really in the way. This example is *only* about the *pre-opened* stdxyz streams. Python uses these to read characters from the keyboard and print characters to the screen in input, print, and the interactive interpreter. So they are open in text mode (which wraps binary read and write). The developers, knowing that people can and do write batch mode programs that avoid input and print, gave a documented way to convert the streams back to binary. (See the sys doc.) The issue Armin ran into is this. He wrote a library module that makes sure the streams are binary. Someone else does the same. A program imports both modules, in either order. 
The conversion method referenced above raises an exception if one attempts to convert an already converted stream. Much of the extra code Armin published detects whether the stream is already binary or needs conversion. The obvious solution is to enhance the conversion method so that one may say 'convert if needed, otherwise just pass'. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
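One way the 'convert if needed, otherwise just pass' behaviour might look is sketched below. The message does not name the conversion method, so this sketch relies on the buffer attribute that Python 3 text streams expose; the helper name binary_view is made up. It hands back a binary stream for the caller's own use and leaves the original stream object where it was.

```python
import io


def binary_view(stream):
    """Return the binary layer of a text stream, or the stream itself
    when it is already binary. Safe to call more than once."""
    buffer = getattr(stream, "buffer", None)
    if buffer is None:
        return stream   # already binary: nothing to convert
    stream.flush()      # push pending text through before using the raw layer
    return buffer


# In-memory stand-ins for a text-mode standard stream and its binary layer.
raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="utf-8")

first = binary_view(text_stream)
second = binary_view(first)  # calling it on the result is harmless
```

Two libraries that both call such a helper do not interfere with each other, because neither replaces sys.stdin or sys.stdout.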
Re: Python 3.2 has some deadly infection
Terry Reedy wrote: The issue Armin ran into is this. He wrote a library module that makes sure the streams are binary. Seems to me he made a mistake right there. A library should *not* be making global changes like that. It can obtain binary streams from stdin and stdout for its own use, but it shouldn't stuff them back into sys.stdin and sys.stdout. If he had trouble because another library did that, then that library is broken, not Python. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 1 June 2014 12:26, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: with cross-platform behavior preferred over system-dependent one -- It's not clear how cross-platform behaviour has anything to do with the Internet age. Python has preferred cross-platform behaviour forever, except for those features and modules which are explicitly intended to be interfaces to system-dependent features. (E.g. a lot of functions in the os module are thin wrappers around OS features. Hence the name of the module.) There is the behaviour of defaulting input and output to the system encoding. I personally think we would all be better off if Python (and Java, and many other languages) defaulted to UTF-8. This hopefully would eventually have the effect of producers changing to output UTF-8 by default, and consumers learning to manually specify an encoding when it's not UTF-8 (due to invalid codepoints). I'm currently working on a product that interacts with lots of other products. These other products can be using any encoding - but most of the functions that interact with I/O assume the system default encoding of the machine that is collecting the data. The product has been in production for nearly a decade, so there's a lot of pushback against changes deep in the code for fear that it will break working systems. The fact that they are working largely by accident appears to escape them ... FWIW, changing to use iso-latin-1 by default would be the most sensible option (effectively treating everything as bytes), with the option for another encoding if/when more information is known (e.g. there's often a call to return the encoding, and the output of that call is guaranteed to be ASCII). Tim Delaney -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote: On 1 June 2014 12:26, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: with cross-platform behavior preferred over system-dependent one -- It's not clear how cross-platform behaviour has anything to do with the Internet age. Python has preferred cross-platform behaviour forever, except for those features and modules which are explicitly intended to be interfaces to system-dependent features. (E.g. a lot of functions in the os module are thin wrappers around OS features. Hence the name of the module.) There is the behaviour of defaulting input and output to the system encoding. That's a tricky one, but I think on balance that is a case where defaulting to the system encoding is the right thing to do. Input and output occur on the local system you are running on, which by definition isn't cross-platform. (Non-local I/O is possible, but requires work -- it doesn't just happen.) I personally think we would all be better off if Python (and Java, and many other languages) defaulted to UTF-8. This hopefully would eventually have the effect of producers changing to output UTF-8 by default, and consumers learning to manually specify an encoding when it's not UTF-8 (due to invalid codepoints). UTF-8 everywhere should be our ultimate aim. Then we can forget about legacy encodings except when digging out ancient documents from archived floppy disks :-) I'm currently working on a product that interacts with lots of other products. These other products can be using any encoding - but most of the functions that interact with I/O assume the system default encoding of the machine that is collecting the data. The product has been in production for nearly a decade, so there's a lot of pushback against changes deep in the code for fear that it will break working systems. The fact that they are working largely by accident appears to escape them ... 
FWIW, changing to use iso-latin-1 by default would be the most sensible option (effectively treating everything as bytes), with the option for another encoding if/when more information is known (e.g. there's often a call to return the encoding, and the output of that call is guaranteed to be ASCII). Python 2 does what you suggest, and it is *broken*. Python 2.7 creates moji-bake, while Python 3 gets it right:

[steve@ando ~]$ python2.7 -c "print u'δжç'"
δжç
[steve@ando ~]$ python3.3 -c "print(u'δжç')"
δжç

Latin-1 is one of those legacy encodings which needs to die, not to be entrenched as the default. My terminal uses UTF-8 by default (as it should), and if I use the terminal to input δжç, Python ought to see what I input, not Latin-1 moji-bake. If I were to use Windows with a legacy code page, then I couldn't even enter δжç on the command line since none of the legacy encodings support that set of characters at the same time. I don't know exactly what I would get if I tried (say, by copying and pasting text from a Unicode-aware application), but I'd see that it was weird *in the shell* before it even reaches Python. On the other hand, if I were to input something supported by the legacy encoding, let's say I entered αβγ while using ISO-8859-7 (Greek), then Python ought to see αβγ and not moji-bake:

py> b = 'αβγ'.encode('iso-8859-7')  # what the shell generates
py> b.decode('latin-1')  # what Python interprets those bytes as
'áâã'

Defaulting to the system encoding means that Python input and output just works, to the degree that input and output on your system just works. If your system is crippled by the use of a legacy encoding, then Python will at least be *no worse* than your system. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On 2 June 2014 11:14, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote: I'm currently working on a product that interacts with lots of other products. These other products can be using any encoding - but most of the functions that interact with I/O assume the system default encoding of the machine that is collecting the data. The product has been in production for nearly a decade, so there's a lot of pushback against changes deep in the code for fear that it will break working systems. The fact that they are working largely by accident appears to escape them ... FWIW, changing to use iso-latin-1 by default would be the most sensible option (effectively treating everything as bytes), with the option for another encoding if/when more information is known (e.g. there's often a call to return the encoding, and the output of that call is guaranteed to be ASCII). Python 2 does what you suggest, and it is *broken*. Python 2.7 creates moji-bake, while Python 3 gets it right: The purpose of my example was to show a case where no thought was put into encodings - the assumption was that the system encoding and the remote system encoding would be the same. This is most definitely not the case a lot of the time. I also should have been more clear that *in the particular situation I was talking about* iso-latin-1 as default would be the right thing to do, not in the general case. Quite often we won't know the correct encoding until we've executed a command via ssh - iso-latin-1 will allow us to extract the info we need (which will generally be 7-bit ASCII) without the possibility of an invalid encoding. Sure we may get mojibake, but that's better than the alternative when we don't yet know the correct encoding. Latin-1 is one of those legacy encodings which needs to die, not to be entrenched as the default. 
My terminal uses UTF-8 by default (as it should), and if I use the terminal to input δжç, Python ought to see what I input, not Latin-1 moji-bake. For some purposes, there needs to be a way to treat an arbitrary stream of bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a convenient way to do that. It's not the only way, but settling on it and being consistent is better than not having a way. Tim Delaney -- https://mail.python.org/mailman/listinfo/python-list
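Both sides of the Latin-1 argument can be seen in a few lines. The sketch below first shows why Latin-1 is convenient as a pass-through (every byte value maps to a character and back without loss), then repeats the Greek example from earlier in the thread to show the mojibake it produces when the data was really in another encoding.

```python
# Every one of the 256 byte values decodes under Latin-1 and survives a
# round trip, so no input can ever raise a decoding error.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# The price: bytes written in another encoding are silently misread ...
greek = "αβγ".encode("iso-8859-7")
misread = greek.decode("latin-1")

# ... although the original bytes can be recovered later, once the real
# encoding is known, because nothing was lost along the way.
recovered = misread.encode("latin-1").decode("iso-8859-7")
```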
Re: Python 3.2 has some deadly infection
On Monday, June 2, 2014 7:53:05 AM UTC+5:30, Tim Delaney wrote: On 2 June 2014 11:14, Steven D'Aprano steve+comp@pearwood.info wrote: Latin-1 is one of those legacy encodings which needs to die, not to be entrenched as the default. My terminal uses UTF-8 by default (as it should), and if I use the terminal to input δжç, Python ought to see what I input, not Latin-1 moji-bake. For some purposes, there needs to be a way to treat an arbitrary stream of bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a convenient way to do that. It's not the only way, but settling on it and being consistent is better than not having a way. Here is a quote from the oracle docs: http://docs.oracle.com/cd/E23824_01/html/E26033/glmbx.html#glmar | The C locale, also known as the POSIX locale, is the POSIX system | default locale for all POSIX-compliant systems. In more layman language | ASCII also known as the 'Unix locale' is the default for all *nix | compliant systems which is a key aspect of what Ive called 'The UNIX Assumption' : http://blog.languager.org/2014/04/unicode-and-unix-assumption.html -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
Mark Lawrence breamore...@yahoo.co.uk: Some interesting comments here http://techtonik.rainforce.org/2014/05/python-32-has-some-deadly-infection.html so I'm simply asking for other opinions. I read the article, but unfortunately I failed to see interesting comments or opinions. There was some graphic, but it didn't say anything to me, and the article didn't really seem to be making any argument apart from the disappointed tone. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Python 3.2 has some deadly infection
On Sat, 31 May 2014 17:10:20 +0100, Mark Lawrence wrote: Some interesting comments here http://techtonik.rainforce.org/2014/05/python-32-has-some-deadly-infection.html so I'm simply asking for other opinions. Oh, Anatoly Techtonik. He's quite notorious on python-dev for wanting to impose his wild and sometimes wacky processes on the entire community. Specific examples aren't coming to mind, and I'm too lazy to search the archives, so I'll just make one up to give you an idea of the flavour of his requests: Twitter is the only way that developers can effectively communicate. We must shut down all the mailing lists and the bug tracker and move all communication immediately to Twitter. And by we I mean you. [Not an actual quote.] I've come to the conclusion that he occasionally has a point to his posts, but only at random by virtue of the scatter-gun technique. He's obviously widely read, but not deeply, and so he fires off a lot of ill-thought out but superficially attractive proposals. Just by chance a few of them end up being interesting, not *interesting enough* for somebody else to do the work. At this point the ideas languish, because he refuses to sign a contributor agreement so the Python core developers cannot accept anything from him. This blog post is a strong opinion about Python, but it isn't clear what that opinion *actually is*. His post is rambling and unfocused and incoherent (art is the future). He rails against having to write PEPs, and decries the lack of stats, summaries, analysis and comparison, utterly missing the point that the purpose of the PEP process is to provide those stats, summaries, analysis and comparison. Reading between the lines, I think what he means, deep down, is that *somebody else* ought to gather those stats and do the analysis to support his ideas, and not expect him to write the PEP. He makes at least one factually wrong claim: I thought that C/C++ must die, because really all major security problems are because of it. 
[actual quote] He's talking about buffer overflows. Buffer overflows have never been responsible for all major security problems. Even allowing for a little hyperbole, buffer overflows have not been responsible for the majority of major security problems for a very long time. It is not 1992 any more, and today the single largest source of security bugs are code injection attacks. In Python terms that mostly means failure to sanitize SQL queries and the use of eval() and exec() on untrusted data. http://cwe.mitre.org/top25/ Three of the top four software errors are forms of code injection: SQL injection, OS command injection, cross-site scripting. The classic C buffer overflow comes in at number 3, so it's not an inconsiderable cause of security vulnerabilities even today, but it is not even close to the only such cause. See also http://www.sans.org/top25-software-errors/ Back to the blog post... it's 2014, Python 3.3 and 3.4 have come out, why is he talking about 3.2? It's interesting that he starts off by stating his graph is meaningless: They don't measure anything - just show some lines that correlate to each other. then immediately tries to draw a conclusion from those lines: It looks like the peak of Python was on February 2011, and since then there was a significant drop. I've written about the difficulty of measuring language popularity in any meaningful way: http://import-that.dreamwidth.org/1388.html http://import-that.dreamwidth.org/2873.html Anatoly has picked the TIOBE Index, but I don't know that this is the best measure of language popularity. According to it, Python is more popular than Javascript. I love Python, but really, more popular than Javascript? That feels wrong to me. In any case, I think that a better explanation for the observed dip in Feb 2011 is not that Python 3.2 is infected (infected by what?) but *regression to the mean*. 
Regression to the mean is a statistical phenomenon which basically says that all else being equal, an extreme value is likely to be followed by a less extreme (closer to the average) value. Language popularity, as measured by TIOBE, is at least in part random. (Look at how wiggly the lines are. The wiggles represent random variation.) If by chance a language gets a spike in interest one month, it is less likely to Because TIOBE's results contain so much random noise, they really ought to smooth them out by averaging the scores over a three month window, and show trend lines. They don't, I believe, because random hiccoughs in the data provide interest: Last month, Java was overthrown from its #1 ranking by C. This month it has fought its way back to #1 again! Tune in next month to see if C can repeat its stunning victory!!! I think that long term trend lines would be much less exciting but much more informative. Eyeballing the graph, it seems to me that Java and C++ are
Re: Python 3.2 has some deadly infection
On Sun, Jun 1, 2014 at 12:26 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: TL;DR: Anatoly's blog post is long on disappointment and short on actual content. It feels to me that we could summarise his post as: I don't know what I want, I won't recognise it even if I saw it, but Python 3 isn't it. I blame others for not living up to my expectations for features I cannot describe and were never promised. I think that summary is accurate. When Mark posted this last night (okay, it was last night for me, probably not for most of you), I tried to read the post and figure out what he was actually saying... and failed. Gave up on it and moved on. Got better things to do with my life... like, I dunno, actually writing code, which seems to be something that people who whine in blog posts don't do. ChrisA -- https://mail.python.org/mailman/listinfo/python-list