Re: Python 3.2 has some deadly infection

2014-06-07 Thread rurpy
On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
[...]
 But Linux Unicode support is much better than Windows. Unicode support in 
 Windows is crippled by continued reliance on legacy code pages, and by 
 the assumption deep inside the Windows APIs that Unicode means 16 bit 
 characters. See, for example, the amount of space spent on fixing 
 Windows Unicode handling here:
 
 http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it 
has some problems that raise doubts in my mind about taking everything 
the author says at face value.

For example

  Q: Why would the Asians give up on UTF-16 encoding, which saves 
  them 50% the memory per character?
  [...] in fact UTF-8 is used just as often in those [Asian] countries. 

That is not my experience, at least for Japan.  See my comments in 
  https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files 
found by Google.

He then gives a table with the size of utf8 and utf16 encoded contents
(ie stripped of html stuff) of an unnamed Japanese wikipedia page to 
show that even without a lot of (html-mandated) ascii, the space savings 
are not very much compared to the theoretical 50% savings he stated:

   Dense text (Δ UTF-8)
   UTF-8   ... 222 KB (0%)
   UTF-16  ... 176 KB (−21%)

Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.

I did the same test using
  http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes, the utf-16 encoded version of
same was 117756.  That gives (using the (utf8-utf16)/utf16 metric he used 
to claim 50% idealized savings) 40% which is quite a bit closer to the 
idealized 50% than his 21%.
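
The two percentages differ only in which size is used as the denominator. A minimal sketch of the arithmetic, assuming text made purely of BMP kanji (3 bytes each in UTF-8, 2 in UTF-16) and reusing the file sizes quoted above; the helper name `savings` is made up for illustration:

```python
def savings(utf8_size, utf16_size):
    """Return the saving relative to UTF-8 and relative to UTF-16, in percent."""
    diff = utf8_size - utf16_size
    return 100 * diff / utf8_size, 100 * diff / utf16_size

# Idealized case: 1000 Japanese characters and no ASCII at all.
text = "織田信長" * 250
ideal_utf8 = len(text.encode("utf-8"))       # 3 bytes per character
ideal_utf16 = len(text.encode("utf-16-le"))  # 2 bytes per character, no BOM

print(savings(ideal_utf8, ideal_utf16))  # roughly (33.3, 50.0)
print(savings(164946, 117756))           # roughly (28.6, 40.1)
```

So the "50%" figure only appears when dividing by the UTF-16 size, and the stripped page comes out at about 40% under that same metric.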

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy.  IOW, just because it's on the internet doesn't 
mean it's true.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Akira Li
Marko Rauhamaa ma...@pacujo.net writes:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Nevertheless, there are important abstractions that are written on top
 of the bytes layer, and in the Unix and Linux world, the most
 important abstraction is *text*. In the Unix world, text formats and
 text processing is much more common in user-space apps than binary
 processing.

 That linux text is not the same thing as Python's text. Conceptually,
 Python text is a sequence of 32-bit integers. Linux text is a sequence
 of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_. It is
correct that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF

It says *nothing* about how Unicode strings are represented
*internally* in Python. It may vary from version to version, build
options and even may depend on the content of a string at runtime.

In the past, narrow builds might break the abstraction in some cases;
that is why Linux distributions used wide Python builds.
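
A small sketch of the "sequence of codepoints" view, assuming Python 3.3 or later (where the narrow/wide distinction no longer exists); the sample string is arbitrary:

```python
import sys

s = "A\U0001F3A7"                    # LATIN CAPITAL LETTER A + HEADPHONE
codepoints = [ord(ch) for ch in s]   # the abstract sequence of integers

print([hex(cp) for cp in codepoints])  # ['0x41', '0x1f3a7']
print(len(s))                # 2 here; an old narrow build would have said 3
print(hex(sys.maxunicode))   # 0x10ffff
```

Nothing in this output reveals how the interpreter stores the string internally.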


_A Unicode codepoint is not a Python concept_. There is the Unicode
standard http://unicode.org, though instead of following the
self-referential web of definitions, I find it easier to learn from
examples such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then you
may get garbage i.e., you can't assume that a character is a byte:

  $ echo Hyvää yötä | perl -pe's/.\K/ /g'
  H y v a � � � �   y � � t � �

In general, you can't assume that a character is a Unicode codepoint:

  $ echo Hyvää yötä | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä   y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful
in this case:

  $ echo Hyvää yötä | perl -C -pe's/\X\K/ /g'
  H y v ä ä   y ö t ä

The \X pattern is supported by the third-party `regex` module in Python,
i.e., you can't even iterate over characters (as they are seen by a user)
in Python using only the stdlib. The \w+ pattern is also broken for Unicode
text http://bugs.python.org/issue1693050 (it is fixed in the `regex`
module), i.e., you can't select a word in Unicode text using only the stdlib.

\X alone is not enough in some cases, e.g., “ch” may be considered a
grapheme cluster in Slovak, for processes such as collation [1]
(sorting order). The `PyICU` module might be useful here.
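
The three perl runs above can be mirrored in Python with only the stdlib, up to a point. This is a rough sketch: `rough_clusters` is a made-up helper that merely glues combining marks onto the preceding character, which is an approximation and not the full UAX #29 algorithm behind \X. The input is assumed to be the same mixed string, with the first "ä" decomposed:

```python
import unicodedata

# First "ä" is 'a' + COMBINING DIAERESIS, the others are precomposed.
text = "Hyva\u0308\u00e4 y\u00f6t\u00e4"

def rough_clusters(s):
    """Group each combining mark with the character before it (approximation)."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

n_bytes = len(text.encode("utf-8"))                # 15
n_codepoints = len(text)                           # 11
n_clusters = len(rough_clusters(text))             # 10
n_nfc = len(unicodedata.normalize("NFC", text))    # 10

print(n_bytes, n_codepoints, n_clusters, n_nfc)
```

For this particular string NFC normalization happens to give one codepoint per user-perceived character, but that does not hold in general.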

Knowing about Unicode normalization forms (NFC, NFKD, etc.)
http://unicode.org/reports/tr15/, Unicode text segmentation [1] and the
Unicode collation algorithm http://www.unicode.org/reports/tr10/ is also
useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/


--
akira



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Robin Becker

On 05/06/2014 18:16, Ian Kelly wrote:
.


How should e.g. bytes.upper() be implemented then?  The correct
behavior is entirely dependent on the encoding.  Python 2 just assumes
ASCII, which at best will correctly upper-case some subset of the
string and leave the rest unchanged, and at worst could corrupt the
string entirely.  There are some things that were dropped that should
not have been, but my impression is that those are being worked on,
for example % formatting in PEP 461.

bytes.upper should have done exactly what str.upper in python 2 did; that way we 
could have at least continued to do the wrong thing :)
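
For reference, Python 3 does provide an ASCII-only bytes.upper(). A short sketch of how it differs from str.upper(); the sample word is arbitrary:

```python
word = "caf\u00e9"                  # "café" with a precomposed é

as_utf8 = word.encode("utf-8")      # b'caf\xc3\xa9'
as_latin1 = word.encode("latin-1")  # b'caf\xe9'

# bytes.upper() only touches the ASCII letters a-z, whatever the encoding.
print(as_utf8.upper())     # b'CAF\xc3\xa9'
print(as_latin1.upper())   # b'CAF\xe9'

# str.upper() works on code points, so the accented letter is handled too.
print(word.upper())        # CAFÉ
```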

--
Robin Becker



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Steven D'Aprano
On Fri, 06 Jun 2014 02:21:54 +0300, Marko Rauhamaa wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 In any case, I reject your premise. ALL data types are constructed on
 top of bytes,
 
 Only in a very dull sense.

I agree with you that this is a very dull, unimportant sense. And I think 
its dullness applies equally to the situation you somehow think is 
meaningfully exciting: Text is made of bytes! If you squint, you can see 
those bytes! Therefore text is not a first class data type!!!

To which my answer is, yes text is made of bytes, yes, you can expose 
those bytes, and no your conclusion doesn't follow.

 
 and so long as you allow applications *any way* to coerce data types to
 different data types, you allow them to see inside the black box.
 
 I can't see the bytes inside Python objects, including strings, and
 that's how it is supposed to be.

That's because Python the language doesn't allow you to coerce types to 
other types, except possibly through its interface to the underlying C 
implementation, ctypes. But Python allows you to write extensions in C, 
and that gives you the full power to take any data structure and turn it 
into any other data structure. Even bytes.


 Similarly, I can't (easily) see how files are laid out on hard disks.
 That's a true abstraction. Nothing in linux presents data, though,
 except through bytes.

Incorrect. Linux presents data as text all the time. Look at the prompt: 
it's treated as text, not numbers. You type commands using a text 
interface. The commands are made of words like ls, dd and ps, not numbers 
like 0x6C73, 0x6464 and 0x7073. Applications like grep are based on 
line-based files, and a line is a text concept, not a byte concept.
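
A quick sketch of the correspondence between those numbers and the command names, assuming big-endian byte order and ASCII:

```python
numbers = (0x6C73, 0x6464, 0x7073)

# Each 16-bit number is two bytes; read as ASCII, the bytes spell a command.
names = [n.to_bytes(2, "big").decode("ascii") for n in numbers]
print(names)   # ['ls', 'dd', 'ps']
```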

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC


The assumption of *text* is so strong in the echo application that by 
default you cannot enter numeric escapes at all. Without the -e switch, 
echo assumes that numeric escapes represent themselves as character 
literals:

[steve@ando ~]$ echo '\x41\x42\x43'
\x41\x42\x43



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Incorrect. Linux presents data as text all the time. Look at the prompt: 
 it's treated as text, not numbers.

Of course there is a textual human interface. However, from the point of
view of virtually every OS component, it's bytes.


 Consider:

 [steve@ando ~]$ echo -e '\x41\x42\x43'
 ABC

echo doesn't know it's emitting text. It would be perfectly happy to
emit binary gibberish. The output goes to the pty which doesn't care
about the textual interpretation, either. Finally, the terminal
(emulation program) translates the incoming bytes to textual glyphs to
the best of its capabilities.

Anyway, what interests me mostly is that I routinely build programs and
systems that talk to each other over files, pipes, sockets and devices.
I really need to micromanage that data. I'm fine with encoding text if
that's the suitable interpretation. I just think Python is overreaching
by making the text interpretation the default for the standard streams
and files and guessing the correct encoding.

Note that subprocess.Popen() wisely assumes binary pipes. Unfortunately
the subprocess might be a python program that opens the standard streams
in the text mode...
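
A sketch of that situation, assuming Python 3.5+ for subprocess.run(). The parent receives bytes from the pipe, while the child is a Python program writing text; PYTHONIOENCODING is used here to pin the child's stream encoding instead of letting it guess from the locale:

```python
import os
import subprocess
import sys

code = "print('y\\xf6t\\xe4')"   # the child prints the text 'yötä'
env = dict(os.environ, PYTHONIOENCODING="utf-8")

proc = subprocess.run([sys.executable, "-c", code],
                      stdout=subprocess.PIPE, env=env)

raw = proc.stdout                    # bytes: the pipe knows nothing about text
text = raw.decode("utf-8").strip()   # decoding is the parent's explicit choice
print(raw, text)
```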


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Ethan Furman

On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:


How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).


Of course they are.  It may be an ASCII-encoding of some flavor or 
other, or something really (to me) strange -- but an encoding is most 
assuredly in effect.


ASCII is *not* the state of "this string has no encoding" -- that would 
be Unicode; a Unicode string, as a data type, has no encoding.  To 
transport it, store it, etc., it must (usually?) be encoded into 
something -- utf-8, ASCII, turkish, or whatever subset is agreed upon 
and will hopefully contain all the Unicode characters needed for the 
string to be properly represented.


The realization that ASCII was, in fact, an encoding was a big paradigm 
shift for me, but a necessary one.


--
~Ethan~



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Ethan Furman et...@stoneleaf.us:

 On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
 A fundamental text file is such that ordinary operating system
 facilities can't see inside the black box (that is, they are *not*
 encoded as far as the applications go).

 Of course they are.

How would you know?

 It may be an ASCII-encoding of some flavor or other, or something
 really (to me) strange -- but an encoding is most assuredly in effect.

Outside metaphysics, that statement is only meaningful if you have
access to the encoding.

 ASCII is *not* the state of "this string has no encoding" -- that
 would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Ethan Furman

On 06/05/2014 09:32 AM, Steven D'Aprano wrote:


But whatever the situation, and despite our differences of opinion about
Unicode, THANK YOU for having updated ReportLab to 3.3.


+1000

--
~Ethan~



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Michael Torrie
On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
 Ethan Furman et...@stoneleaf.us:
 ASCII is *not* the state of "this string has no encoding" -- that
 would be Unicode; a Unicode string, as a data type, has no encoding.
 
 Huh?

It's this very fact that trips up JMF in his rants about FSR.  Thank you
to Ethan for putting it so succinctly.

What part of his statement are you saying "Huh?" about?



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Fri, Jun 6, 2014 at 11:24 PM, Ethan Furman et...@stoneleaf.us wrote:
 On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:


 How text is represented is very different from whether text is a
 fundamental data type. A fundamental text file is such that ordinary
 operating system facilities can't see inside the black box (that is,
 they are *not* encoded as far as the applications go).

 Of course they are.  It may be an ASCII-encoding of some flavor or other, or
 something really (to me) strange -- but an encoding is most assuredly in
 effect.

Allow me to explain what I think Marko's getting at here.

In most file systems, a file exists on the disk as a set of sectors of
data, plus some metadata including the file's actual size. When you
ask the OS to read you that file, it goes to the disk, reads those
sectors, truncates the data to the real size, and gives you those
bytes.

It's possible to mount a file as a directory, in which case the
physical representation is very different, but the file still appears
the same. In that case, the OS goes reading some part of the file,
maybe decompresses it, and gives it to you. Same difference. These
files still contain bytes.

A fundamental text file would be one where, instead of reading and
writing bytes, you read and write Unicode text. Since the hard disk
still works with sectors and bytes, it'll still be stored as such, but
that's an implementation detail; and you could format your disk UTF-8
or UTF-16 or FSR or anything you like, and the only difference you'd
see is performance.

This could certainly be done, in theory. I don't know how well it'd
fit with any of the popular OSes of today, but it could be done. And
these files would not have an encoding; their on-platter
representations would, but that's purely implementation - the text
that you wrote out and the text that you read in are the same text,
and there's been no encoding visible.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Michael Torrie torr...@gmail.com:

 On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
 Ethan Furman et...@stoneleaf.us:
 ASCII is *not* the state of "this string has no encoding" -- that
 would be Unicode; a Unicode string, as a data type, has no encoding.
 
 Huh?

 [...]

 What part of his statement are you saying "Huh?" about?

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Michael Torrie torr...@gmail.com:

 On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
 Ethan Furman et...@stoneleaf.us:
 ASCII is *not* the state of "this string has no encoding" -- that
 would be Unicode; a Unicode string, as a data type, has no encoding.

 Huh?

 [...]

 What part of his statement are you saying "Huh?" about?

 Unicode, like ASCII, is a code. Representing text in unicode is
 encoding.

Yes and no. ASCII means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

Unicode, on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like  if you have the right font, is number 1F44C in hex (128076
decimal). This is the Universal Character Set or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit
ASCII text and use the high bit to mean something else, like parity or
"this is the end of a word" or something, but even then, you follow
the same convention of packing a number into the low seven bits of a
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

You can't represent text in Unicode in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as
bytes, or something more concrete (you could, for instance, use a
Python list of Python integers; I can't say that it would be in any
way more efficient than alternatives, but it would be plausible); and
that's the encoding.
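
A minimal sketch of the two steps, using the characters mentioned above. The choice of big-endian variants (no byte order mark) is just to keep the output short:

```python
# Step 1: the character set maps characters to numbers (code points).
print(ord("A"), ord("!"), ord("\\"))          # 65 33 92
ok_hand = "\U0001F44C"                         # OK HAND SIGN
print(hex(ord(ok_hand)), ord(ok_hand))         # 0x1f44c 128076

# Step 2: an encoding maps those numbers to bytes, and there are several.
print("A".encode("ascii").hex())               # 41
print(ok_hand.encode("utf-8").hex())           # f09f918c
print(ok_hand.encode("utf-16-be").hex())       # d83ddc4c
print(ok_hand.encode("utf-32-be").hex())       # 0001f44c

# The "list of Python integers" representation is also possible.
as_ints = [ord(ch) for ch in "A!\\"]
print(as_ints)                                 # [65, 33, 92]
```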

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Steven D'Aprano
On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

 Michael Torrie torr...@gmail.com:
 
 On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
 Ethan Furman et...@stoneleaf.us:
 ASCII is *not* the state of "this string has no encoding" -- that
 would be Unicode; a Unicode string, as a data type, has no encoding.
 
 Huh?

 [...]

 What part of his statement are you saying "Huh?" about?
 
 Unicode, like ASCII, is a code. Representing text in unicode is
 encoding.

A Unicode string as an abstract data type has no encoding. It is a 
Platonic ideal, a pure form like the real numbers. There are no bytes, no 
bits, just code points. That is what Ethan means. A Unicode string like 
this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding, but as an 
array of code points. Eventually the abstraction will leak, all 
abstractions do, but not for a very long time.


-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Rustom Mody
On Friday, June 6, 2014 9:27:51 PM UTC+5:30, Steven D'Aprano wrote:
 On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

  Michael Torri:
  On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
  Ethan Furman :
  ASCII is *not* the state of "this string has no encoding" -- that
  would be Unicode; a Unicode string, as a data type, has no encoding.
  Huh?
  [...]
  What part of his statement are you saying "Huh?" about?
  Unicode, like ASCII, is a code. Representing text in unicode is
  encoding.

 A Unicode string as an abstract data type has no encoding. It is a 
 Platonic ideal, a pure form like the real numbers. There are no bytes, no 
 bits, just code points. That is what Ethan means. A Unicode string like 
 this:

 s = u"NOBODY expects the Spanish Inquisition!"

 should not be thought of as a bunch of bytes in some encoding, but as an 
 array of code points. Eventually the abstraction will leak, all 
 abstractions do, but not for a very long time.

"Should not be thought of": yes, that's the Python 3 world view.
Not even the Python 2 world view.
And very far from the classic Unix world view.

As Ned Batchelder says in Unipain: http://nedbatchelder.com/text/unipain.html :
Programmers should use the 'unicode sandwich' to avoid 'unipain':

Bytes on the outside, Unicode on the inside, encode/decode at the edges.
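
A minimal sketch of the sandwich, with in-memory byte streams standing in for files or sockets. The function name `shout` and the UTF-8 default are assumptions made for illustration:

```python
import io

def shout(stream_in, stream_out, encoding="utf-8"):
    """Read bytes, work on str, write bytes."""
    text = stream_in.read().decode(encoding)    # decode at the input edge
    result = text.upper()                       # Unicode on the inside
    stream_out.write(result.encode(encoding))   # encode at the output edge

source = io.BytesIO("hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8"))
sink = io.BytesIO()
shout(source, sink)
print(sink.getvalue())   # the upper-cased text, as UTF-8 bytes
```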

The discussion here is precisely about these edges.

Combine that with Chris':

 Yes and no. ASCII means two things: Firstly, it's a mapping from the
 letter A to the number 65, from the exclamation mark to 33, from the
 backslash to 92, and so on. And secondly, it's an encoding of those
 numbers into the lowest seven bits of a byte, with the high bit left
 clear. Between those two, you get a means of representing the letter
 'A' as the byte 0x41, and one of them is an encoding.

and the situation appears quite the opposite of Ethan's description:

In the 'old world' ASCII was both mapping and encoding and so there was 
never a justification to distinguish encoding from codepoint.

It is unicode that demands these distinctions.

If we could magically go to a world where the number of bits in a byte was 32
all this headache would go away. [Actually just 21 is enough!]
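
The "21 is enough" figure can be checked directly; a tiny sketch assuming Python 3.3+, where sys.maxunicode is always 0x10FFFF:

```python
import sys

bits_needed = sys.maxunicode.bit_length()
print(hex(sys.maxunicode), bits_needed)   # 0x10ffff 21
```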


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody rustompm...@gmail.com wrote:
 Combine that with Chris':

 Yes and no. ASCII means two things: Firstly, it's a mapping from the
 letter A to the number 65, from the exclamation mark to 33, from the
 backslash to 92, and so on. And secondly, it's an encoding of those
 numbers into the lowest seven bits of a byte, with the high bit left
 clear. Between those two, you get a means of representing the letter
 'A' as the byte 0x41, and one of them is an encoding.

 and the situation appears quite the opposite of Ethan's description:

 In the 'old world' ASCII was both mapping and encoding and so there was
 never a justification to distinguish encoding from codepoint.

 It is unicode that demands these distinctions.

 If we could magically go to a world where the number of bits in a byte was 32
 all this headache would go away. [Actually just 21 is enough!]

An ASCII mentality lets you be sloppy. That doesn't mean the
distinction doesn't exist. When I first started programming in C, int
was *always* 16 bits long and *always* little-endian (because I used
only one compiler). I could pretend that those bits in memory actually
were that integer, that there were no other ways that integer could be
encoded. That doesn't mean that encodings weren't important. And as
soon as I started working on a 32-bit OS/2 system, and my ints became
bigger, I had to concern myself with that. Even more so when I got
into networking, and byte order became important to me. And of course,
these days I work with integers that are encoded in all sorts of
different ways (a Python integer isn't just a puddle of bytes in
memory), and I generally let someone else take care of the details,
but the encodings are still there.
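
A small sketch of the same integer under different encodings; the value 1000 is arbitrary, and the struct formats are standard little-endian and network-order 32-bit ints:

```python
import struct

n = 1000   # 0x03E8

little16 = n.to_bytes(2, "little")   # 16-bit, little-endian
big16 = n.to_bytes(2, "big")         # 16-bit, big-endian
little32 = struct.pack("<i", n)      # 32-bit, little-endian
network32 = struct.pack("!i", n)     # 32-bit, network (big-endian) order

print(little16.hex(), big16.hex())       # e803 03e8
print(little32.hex(), network32.hex())   # e8030000 000003e8
```

Same integer every time; only the byte-level representation changes.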

ASCII was once your one companion, it was all that mattered. ASCII was
once a friendly encoding, then your world was shattered. Wishing it
were somehow here again, wishing it were somehow near... sometimes it
seemed, if you just dreamed, somehow it would be here! Wishing you
could use just bytes again, knowing that you never would... dreaming
of it won't help you to do all that you dream you could!

It's time to stop chasing the phantom and start living in the Raoul
world... err, the real world. :)

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com:

 ASCII means two things: Firstly, it's a mapping from the letter A to
 the number 65, from the exclamation mark to 33, from the backslash to
 92, and so on. And secondly, it's an encoding of those numbers into
 the lowest seven bits of a byte, with the high bit left clear.
 Between those two, you get a means of representing the letter 'A' as
 the byte 0x41, and one of them is an encoding.

   The American Standard Code for Information Interchange [...] is a
   character-encoding scheme [...] URL:
   http://en.wikipedia.org/wiki/ASCII

 Unicode, on the other hand, is only the first part. It maps all the
 same characters to the same numbers that ASCII does, and then adds a
 few more... a few followed by a few, followed by... okay, quite a lot
 more. Unicode specifies that the character OK HAND SIGN, which looks
 like  if you have the right font, is number 1F44C in hex (128076
 decimal). This is the Universal Character Set or UCS.

   Unicode is a computing industry standard for the consistent encoding,
   representation and handling of text [...] URL:
   http://en.wikipedia.org/wiki/Unicode

Each standard assigns numbers to letters and other symbols. In a word,
each is a code. That's what their names say, too.


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Rustom Mody
On Friday, June 6, 2014 10:18:41 PM UTC+5:30, Chris Angelico wrote:
 On Sat, Jun 7, 2014 at 2:21 AM, Rustom Mody  wrote:
  Combine that with Chris':
  Yes and no. ASCII means two things: Firstly, it's a mapping from the
  letter A to the number 65, from the exclamation mark to 33, from the
  backslash to 92, and so on. And secondly, it's an encoding of those
  numbers into the lowest seven bits of a byte, with the high bit left
  clear. Between those two, you get a means of representing the letter
  'A' as the byte 0x41, and one of them is an encoding.
  and the situation appears quite the opposite of Ethan's description:
  In the 'old world' ASCII was both mapping and encoding and so there was
  never a justification to distinguish encoding from codepoint.
  It is unicode that demands these distinctions.
  If we could magically go to a world where the number of bits in a byte was 32
  all this headache would go away. [Actually just 21 is enough!]

 An ASCII mentality lets you be sloppy. That doesn't mean the
 distinction doesn't exist. When I first started programming in C, int
 was *always* 16 bits long and *always* little-endian (because I used
 only one compiler). I could pretend that those bits in memory actually
 were that integer, that there were no other ways that integer could be
 encoded. That doesn't mean that encodings weren't important. And as
 soon as I started working on a 32-bit OS/2 system, and my ints became
 bigger, I had to concern myself with that. Even more so when I got
 into networking, and byte order became important to me. And of course,
 these days I work with integers that are encoded in all sorts of
 different ways (a Python integer isn't just a puddle of bytes in
 memory), and I generally let someone else take care of the details,
 but the encodings are still there.

 ASCII was once your one companion, it was all that mattered. ASCII was
 once a friendly encoding, then your world was shattered. Wishing it
 were somehow here again, wishing it were somehow near... sometimes it
 seemed, if you just dreamed, somehow it would be here! Wishing you
 could use just bytes again, knowing that you never would... dreaming
 of it won't help you to do all that you dream you could!

 It's time to stop chasing the phantom and start living in the Raoul
 world... err, the real world. :)

I thought that "If only bytes were 21+ bits wide" would sound sufficiently 
nonsensical that I did not need to explicitly qualify it as a utopian dream!


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
 Unicode, like ASCII, is a code. Representing text in unicode is
 encoding.

 A Unicode string as an abstract data type has no encoding.

Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100

 It is a Platonic ideal, a pure form like the real numbers.

Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same Platonic string
as this EBCDIC one:

200 133 147 147 150 107 64 166 150 153 147 132
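
Number sequences like these can be reproduced in Python. This is a sketch that assumes the string is "Hello, world" and that code page cp037 stands in for "EBCDIC"; other EBCDIC code pages differ in some positions:

```python
text = "Hello, world"

unicode_numbers = [ord(ch) for ch in text]    # code points
ebcdic_numbers = list(text.encode("cp037"))   # byte values in one EBCDIC code page

print(unicode_numbers)
print(ebcdic_numbers)

# Both number sequences decode back to the same string.
same = bytes(ebcdic_numbers).decode("cp037") == "".join(map(chr, unicode_numbers))
print(same)   # True
```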

 Unicode string like this:

 s = u"NOBODY expects the Spanish Inquisition!"

 should not be thought of as a bunch of bytes in some encoding,

Encoding is not tied to bytes or even computers. People can speak in
code, after all.


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Rustom Mody
On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
 Chris Angelico :

  ASCII means two things: Firstly, it's a mapping from the letter A to
  the number 65, from the exclamation mark to 33, from the backslash to
  92, and so on. And secondly, it's an encoding of those numbers into
  the lowest seven bits of a byte, with the high bit left clear.
  Between those two, you get a means of representing the letter 'A' as
  the byte 0x41, and one of them is an encoding.

The American Standard Code for Information Interchange [...] is a
character-encoding scheme [...] URL:

And a similar argument to this is seen on that page's talk page!
http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 3:11 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Encoding is not tied to bytes or even computers. People can speak in
 code, after all.

Obligatory: http://xkcd.com/257/

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Marko Rauhamaa
Marko Rauhamaa ma...@pacujo.net:

 Far from it. It is a mapping from symbols to integers. The symbols are
 the Platonic ones.

Well, of course, even the symbols are a code. Letters code sounds and
digits code numbers.

And the sounds and numbers code ideas. Now we are getting close to being
truly Platonic.


Marko


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 3:04 AM, Rustom Mody rustompm...@gmail.com wrote:
 ASCII was once your one companion, it was all that mattered. ASCII was
 once a friendly encoding, then your world was shattered. Wishing it
 were somehow here again, wishing it were somehow near... sometimes it
 seemed, if you just dreamed, somehow it would be here! Wishing you
 could use just bytes again, knowing that you never would... dreaming
 of it won't help you to do all that you dream you could!

 It's time to stop chasing the phantom and start living in the Raoul
 world... err, the real world. :)

 I thought that "If only bytes were 21+ bits wide" would sound sufficiently
 nonsensical that I did not need to explicitly qualify it as a utopian dream!

Humour never dies!

ChrisA
(In case it's not obvious, by the way, everything I said above is a
reference to the Phantom of the Opera.)


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 3:13 AM, Rustom Mody rustompm...@gmail.com wrote:
 On Friday, June 6, 2014 10:32:47 PM UTC+5:30, Marko Rauhamaa wrote:
 Chris Angelico :

  ASCII means two things: Firstly, it's a mapping from the letter A to
  the number 65, from the exclamation mark to 33, from the backslash to
  92, and so on. And secondly, it's an encoding of those numbers into
  the lowest seven bits of a byte, with the high bit left clear.
  Between those two, you get a means of representing the letter 'A' as
  the byte 0x41, and one of them is an encoding.

The American Standard Code for Information Interchange [...] is a
character-encoding scheme [...] URL:

 And a similar argument to this is seen on that page's talk page!
 http://en.wikipedia.org/wiki/Talk:ASCII#Character_set_vs._Character_encoding.3F

Which proves that Wikipedia is exactly as reliable as a mailing list.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Ned Batchelder

On 6/6/14 1:11 PM, Marko Rauhamaa wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:


On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.


A Unicode string as an abstract data type has no encoding.


Unicode itself is an encoding. See it in action here:

 72 101 108 108 111 44 32 119 111 114 108 100


It is a Platonic ideal, a pure form like the real numbers.


Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same Platonic string
as this EBCDIC one:

 200 133 147 147 150 107 64 166 150 153 147 132


Unicode string like this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding,


Encoding is not tied to bytes or even computers. People can speak in
code, after all.




Marko, you are right about the broader English meaning of the word
"encoding".  The original point here was that Unicode text provides no
information about what sequence of bytes is at work.


In the Unicode ecosystem, an encoding is a specification of how the text
will be represented in a byte stream.  Saying something is "Unicode"
doesn't provide that information.  You have to say "UTF-8" or "UTF-16" or
"UCS-2", etc., in order to know how bytes will be involved.


When Ethan said, "a Unicode string, as a data type, has no encoding", he
meant (as he explained) that a Unicode string doesn't require or imply
any particular mapping to bytes.


I'm sure you understand this; I'm just trying to clarify the different
meanings of the word "encoding".
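
To make the distinction concrete, here is a minimal Python 3 sketch. It is
not from the thread, and the sample string is my own choice. It shows one
abstract string being mapped to three different byte sequences; until an
encoding is named, there are no bytes to talk about.

```python
# One abstract string: four code points, no bytes implied yet.
text = "caf\u00e9"  # illustrative sample, not taken from the thread
code_points = [ord(ch) for ch in text]

# Naming an encoding is what finally produces bytes.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
as_utf32 = text.encode("utf-32-le")

for name, data in [("utf-8", as_utf8), ("utf-16-le", as_utf16), ("utf-32-le", as_utf32)]:
    print(name, len(data), data.hex())
```

Decoding each byte sequence with the matching codec gives back the same
string, which is the sense in which the string itself "has no encoding".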




Marko




--
Ned Batchelder, http://nedbatchelder.com



Re: Python 3.2 has some deadly infection

2014-06-06 Thread Denis McMahon
On Sat, 07 Jun 2014 01:50:50 +1000, Chris Angelico wrote:

 Yes and no. ASCII means two things:

ASCII means: American Standard Code for Information Interchange aka ASA 
Standard X3.4-1963

 into the lowest seven bits of a byte, with the high byte left clear.

high BIT left clear.

-- 
Denis McMahon, denismfmcma...@gmail.com


Re: Python 3.2 has some deadly infection

2014-06-06 Thread Chris Angelico
On Sat, Jun 7, 2014 at 7:18 AM, Denis McMahon denismfmcma...@gmail.com wrote:
 into the lowest seven bits of a byte, with the high byte left clear.

 high BIT left clear.

That thing. Unless you have bytes inside bytes (byteception?), you'll
only have room for one high bit. Some day I'll get my brain and my
fingers to agree on everything we do... but that day is not today.
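
For anyone who wants to check the "high bit left clear" claim, a small
sketch in Python 3 (my own illustration, not code from the thread):

```python
# The character-to-number mapping mentioned earlier in the thread.
mapping = {ch: ord(ch) for ch in "A!\\"}

# The encoding step: each number stored in the low seven bits of a byte.
data = "Hello, world!".encode("ascii")
high_bit_clear = all(byte & 0x80 == 0 for byte in data)

# Anything outside the 7-bit range cannot be encoded as ASCII at all.
try:
    "\u00e4".encode("ascii")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False

print(mapping, high_bit_clear, non_ascii_ok)
```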

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Gregory Ewing greg.ew...@canterbury.ac.nz:

 As a result, most unix programs, most of the time, deal
 with text on stdin and stdout.

Well, ok. But even accepting that premise, that text might not be what
Python3 considers text.

For example, if your program reads in XML, JSON or Python, the parser
object might prefer to take it in as bytes and not have it predecoded by
sys.stdin.

 So, it makes sense for them to be text by default.

I'm not sure. That could lead to nasty surprises.

I've experienced analogous consternations when the sort utility hasn't
worked identically for identical input: it is heavily influenced by the
(spit, spit) locale. That's why 99.9% of your scripts should prefix
sort and grep with LC_ALL=C -- even when the input really is UTF-8.

Should I now take it further and prefix all Python programs with
LC_ALL=C? Probably not, since UTF-8 might cause sys.stdin to barf.

 And wherever there's text, there needs to be an encoding.

No problem there; the only question is whether sys.stdin and sys.stdout
should carry out the decoding/encoding or whether that should be left to
the program.
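
As a rough illustration of a parser that takes bytes and does its own
decoding, here is a sketch using the standard json module, which accepts
bytes directly in Python 3.6 and later. The helper name load_json_bytes
and the sample document are hypothetical; in a real program the stream
would be sys.stdin.buffer rather than an in-memory stand-in.

```python
import io
import json

def load_json_bytes(stream):
    """Read a whole binary stream and let the JSON parser do the decoding."""
    return json.loads(stream.read())

# Stand-in for sys.stdin.buffer so the sketch runs without piped input.
fake_stdin = io.BytesIO('{"greeting": "Hyv\u00e4\u00e4 y\u00f6t\u00e4"}'.encode("utf-8"))
doc = load_json_bytes(fake_stdin)
print(doc["greeting"])
```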


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Thu, Jun 5, 2014 at 5:16 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 No problem there; the only question is whether sys.stdin and sys.stdout
 should carry out the decoding/encoding or whether that should be left to
 the program.

The most normal thing to do with the standard streams is to have them
produce text, and as much as possible, you shouldn't have to go to
great lengths to make that work. If, in Python, I say print("Hello,
world!"), I expect that to produce a line of text on the screen,
without my code having to encode that to bytes, figure out what sort
of newline to add, etc, etc.

Even if stdout isn't a tty, chances are you're still working with
text. Only an extreme few Unix programs actually manipulate binary
standard streams (some, like cat, will pipe binary through unchanged,
but even cat assumes text for options like -n); those few should be
the ones to have to worry about setting stdin and stdout to be binary.
In the same way that we have double-quoted strings being Unicode
strings, we should have print() and input() naturally just work with
Unicode, which means they should negotiate encodings with the system
without the programmer having to lift a finger.
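
A sketch of what that negotiation amounts to, using an in-memory byte
buffer as a stand-in for the real stdout. The encoding and newline values
are assumptions picked for the demonstration; on a real system Python
chooses them from the environment.

```python
import io

raw = io.BytesIO()  # plays the role of the underlying byte stream
# A text layer over the bytes, configured the way sys.stdout would be
# on a hypothetical system that wants UTF-8 and CRLF line endings.
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")

print("Hyv\u00e4\u00e4 y\u00f6t\u00e4", file=out)  # caller deals only in text
out.flush()

written = raw.getvalue()
print(written)
```

The caller never mentions bytes or line endings; the text layer handles
both.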

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com:

 If, in Python, I say print("Hello, world!"), I expect that to produce
 a line of text on the screen, without my code having to encode that to
 bytes, figure out what sort of newline to add, etc, etc.

That example in no way represents the typical Python program (if there
is one).

 Only an extreme few Unix programs actually manipulate binary standard
 streams

That's quite an assumption to make.

 we should have print() and input() naturally just work with Unicode

No problem there. I couldn't imagine using either function for anything
serious.


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote:

 Steven D'Aprano wrote:
 The whole concept of stdin and stdout is based on the idea of having a
 console to read from and write to.
 
 Not really; stdin and stdout are frequently connected to files, or pipes
 to other processes. The console, if it exists, just happens to be a
 convenient default value for them. Even on a system without a console,
 they're still a useful abstraction.

If you had kept reading my post, including the bits you cut out *wink*, 
you'd see that I did raise that same point. Having stdin and stdout 
trivially generalises to the idea of replacing them with other files, or 
pipes. But the idea of having standard input and standard output in the 
first place comes about because they are useful for the console. I gave 
the example of Mac, which didn't have a command-line interface at all, 
hence no console, no stdin, no stdout.

If a system had no command line interface (hence no consoles), why would 
you bother with a *standard* input file and output file that are never 
used?


 But we were talking about encodings, and whether stdin and stdout should
 be text or binary by default. Well, one of the design principles behind
 unix is to make use of plain text wherever possible. 

What's plain text? *half a wink*

It's a serious question. Some people think that good ol' plain text is 
EBCDIC, like IBM intended. To them, the letter A is synonymous with the 
byte 0xC1, and there's no need for an encoding (or so they think) because 
A *is* 0xC1.

Of course, people on ASCII systems know better: who needs encodings when 
it is a universal fact that A *is* 0x41?

*wink*
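
For the record, both camps can be checked from Python. This is a small
sketch of my own; cp037 is one of the EBCDIC code pages shipped with
Python's codecs, chosen here as an assumption about which EBCDIC variant
is meant.

```python
letter = "A"

ascii_byte = letter.encode("ascii")   # the ASCII camp's answer
ebcdic_byte = letter.encode("cp037")  # one EBCDIC code page's answer

print(ascii_byte.hex(), ebcdic_byte.hex())

# Same letter either way, once each byte is decoded with its own codec.
same_letter = ascii_byte.decode("ascii") == ebcdic_byte.decode("cp037")
```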


 Not just for stuff
 meant to be seen on the screen, but for stuff kept in files as well.
 
 As a result, most unix programs, most of the time, deal with text on
 stdin and stdout. So, it makes sense for them to be text by default. And
 wherever there's text, there needs to be an encoding. This is true
 whether a console is involved or not.


Agreed.




-- 
Steven


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Thu, Jun 5, 2014 at 6:05 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 If, in Python, I say print("Hello, world!"), I expect that to produce
 a line of text on the screen, without my code having to encode that to
 bytes, figure out what sort of newline to add, etc, etc.

 That example in no way represents the typical Python program (if there
 is one).

It's simpler than most, but use of print() is certainly quite common.
A naive search of .py files in my /usr came up with five thousand
instances of ' print(', and given that that search won't necessarily
find a Python 2 print statement (and I'm on Debian Wheezy, so Py2 is
the system Python), I think that's a fairly respectable figure.

 Only an extreme few Unix programs actually manipulate binary standard
 streams

 That's quite an assumption to make.

Okay. Start listing some. You have (de)compression programs like gzip,
which primarily work with files but can work with standard streams;
some image or movie manipulation programs (eg avconv) can also read
from stdin, although again, it's far more common to use files; cat
will happily transmit binary untouched, but all its options (at least
the ones I can see in my 'man cat') are for working with text.

What else do you have? Let's see... grep, sort, less/more, sed, awk,
these are all text manipulation programs. All your "give me info about
the system" programs (ls, mount, pwd, hostname, date...) print
text to stdout. Some also read from stdin, like md5sum and related.

Piles and piles of programs that work with text. A small handful that
work with binary, and most of them are more commonly used directly
with files, not with pipes. The most common case is that it all be
text.

 we should have print() and input() naturally just work with Unicode

 No problem there. I couldn't imagine using either function for anything
 serious.

I don't know about those exact functions, but I do know that there are
plenty of Python programs that use the console (take hg as one fairly
hefty example). Maybe input() isn't all that heavily used, but
certainly print() is a fine function. I can not only imagine using
them seriously, I *have used* them, and their equivalents in other
languages, seriously.

If the standard streams are so crucial, why are their most obvious
interfaces insignificant to you?

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Steven D'Aprano st...@pearwood.info:

 But the idea of having standard input and standard output in the first
 place comes about because they are useful for the console.

I doubt that. Classic programs take input and produce output. Standard
input and output are the default input and output. The textbook Pascal
programs started:

   program myprogram(input, output);

 If a system had no command line interface (hence no consoles), why
 would you bother with a *standard* input file and output file that are
 never used?

Because programs are supposed to do useful work. They consume input and
produce output. That concept is older than computers themselves and is
used to define things like computation, algorithm, halting etc.

 On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote:
 But we were talking about encodings, and whether stdin and stdout
 should be text or binary by default. Well, one of the design
 principles behind unix is to make use of plain text wherever
 possible.

No, one of the design principles behind unix is that all data is bytes:
memory, files, devices, sockets, pathnames. Yes, the
ASCII-is-good-for-everybody assumption has been there since the
beginning, but Python will not be able to hide the fact that there is no
text data (in the Python sense). There are only bytes.

UTF-8 beautifully gives text a second-class citizenship in unix/linux.
It will never be granted first-class citizenship, though.

 As a result, most unix programs, most of the time, deal with text on
 stdin and stdout. So, it makes sense for them to be text by default.
 And wherever there's text, there needs to be an encoding. This is
 true whether a console is involved or not.

 Agreed.

Disagreed strongly.

   tcpdump -s 0 -w - >error.pcap
   tar zxf - <python.tar.gz
   sha1sum <smile.jpg
   base64 -d <a.dat >a.exe
   wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -

Unfortunately, the text/binary dichotomy breaks a beautiful principle in
Python as well. In numerous contexts, any file-like object will be
valid. Now there is no file-like object. Instead, you have
text-file-like objects and binary-file-like objects, which require
special attention since some operate on strings while others operate on
bytes.
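
A minimal sketch of the split being described, with a hypothetical
function shout() that expects "any file-like object":

```python
import io

def shout(stream):
    """Write a greeting to a 'file-like' object; only text streams accept it."""
    stream.write("HELLO\n")

text_stream = io.StringIO()
shout(text_stream)  # fine: a text stream takes str

binary_stream = io.BytesIO()
try:
    shout(binary_stream)  # a binary stream refuses str
    binary_ok = True
except TypeError:
    binary_ok = False

print(repr(text_stream.getvalue()), binary_ok)
```

Code that wants to accept both kinds has to know which one it was given
and encode or decode accordingly.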


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com:

 If the standard streams are so crucial, why are their most obvious
 interfaces insignificant to you?

I want the standard streams to consume and produce bytes. I do a lot of
system programming and connect processes to each other with socketpairs,
pipes and the like. I have dealt with plugin APIs that communicate over
stdin and stdout.

Python is clearly on a crusade to make *text* a first class system
entity. I don't believe that is possible (without casualties) in the
linux world. Python text should only exist inside string objects.
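
For completeness, Python 3 does expose the byte-level streams. The sketch
below is my own, with a hypothetical helper named pump(); it copies bytes
without decoding them. In a real filter the call would be
pump(sys.stdin.buffer, sys.stdout.buffer); in-memory buffers are used here
so the example runs on its own.

```python
import io
import shutil

def pump(src, dst, chunk_size=64 * 1024):
    """Copy one binary stream to another without decoding anything."""
    shutil.copyfileobj(src, dst, chunk_size)

# Payload that is deliberately not valid UTF-8.
payload = b"\x89PNG\r\n\x1a\n\xff\x00 arbitrary bytes"
source, sink = io.BytesIO(payload), io.BytesIO()
pump(source, sink)

copied = sink.getvalue()
```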


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Rustom Mody
On Thursday, June 5, 2014 3:11:34 PM UTC+5:30, Marko Rauhamaa wrote:
 Steven D'Aprano wrote:

  But the idea of having standard input and standard output in the first
  place comes about because they are useful for the console.

 I doubt that. Classic programs take input and produce output. Standard
 input and output are the default input and output. The textbook Pascal
 programs started:

program myprogram(input, output);

  If a system had no command line interface (hence no consoles), why
  would you bother with a *standard* input file and output file that are
  never used?

 Because programs are supposed to do useful work. They consume input and
 produce output. That concept is older than computers themselves and is
 used to define things like computation, algorithm, halting etc.

  On Thu, 05 Jun 2014 14:01:50 +1200, Gregory Ewing wrote:
  But we were talking about encodings, and whether stdin and stdout
  should be text or binary by default. Well, one of the design
  principles behind unix is to make use of plain text wherever
  possible.

 No, one of the design principles behind unix is that all data is bytes:
 memory, files, devices, sockets, pathnames. Yes, the
 ASCII-is-good-for-everybody assumption has been there since the
 beginning, but Python will not be able to hide the fact that there is no
 text data (in the Python sense). There are only bytes.

 UTF-8 beautifully gives text a second-class citizenship in unix/linux.
 It will never be granted first-class citizenship, though.

  As a result, most unix programs, most of the time, deal with text on
  stdin and stdout. So, it makes sense for them to be text by default.
  And wherever there's text, there needs to be an encoding. This is
  true whether a console is involved or not.
  Agreed.

 Disagreed strongly.

tcpdump -s 0 -w - >error.pcap
tar zxf - <python.tar.gz
sha1sum <smile.jpg
base64 -d <a.dat >a.exe
wget ftp://micorsops.com/something.avi -O - | mplayer -cache 8192 -

 Unfortunately, the text/binary dichotomy breaks a beautiful principle in
 Python as well. In numerous contexts, any file-like object will be
 valid. Now there is no file-like object. Instead, you have
 text-file-like objects and binary-file-like objects, which require
 special attention since some operate on strings while others operate on
 bytes.


Pascal is for building pyramids — imposing, breathtaking, static
structures built by armies pushing heavy blocks into place. — Alan Perlis

Lisp is like a ball of mud. Add more and it's still a ball of mud
— it still looks like Lisp. — Guy Steele

There are two fundamental outlooks in computer science —
structuring and universality. And they pull in opposite
directions.

Universality happens when a data-structure can hold everything —
a universal data structure.

Some of the most significant advances in CS come from a universalist vision:

- von Neumann machine storing data+code in memory
- Turing-tape able to store arbitrary Turing machines (∴ universal TM)
- Lisp program ≡ Lisp data
- A stream of bytes can handle/represent everything in Unix: memory, files,
  devices, sockets, pathnames.

However after the allurement of universality is over, the
realization dawns that we have a mess — Lisp is a 'mud-ball'. At
which point people start needing to make distinctions — code and
data, different data-structures, type-systems etc. IOW imposing
structure on the mud-ball.

Taking a broad view, while structuring trades the power for
order, it is universality that adds significant power.

Python is not as universal as Lisp — it has no homoiconicity.
But it is close enough in that any variable/data-structure can
contain any value.

What Marko is saying is that by imposing the structuring of
unicode on the outside (Unix) world of text=byte, significant power is lost.

This is also Armin's crib.

How significant that loss is, is yet to be seen…



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Rustom Mody rustompm...@gmail.com:

 What Marko is saying is that by imposing the structuring of unicode on
 the outside (Unix) world of text=byte, significant power is lost.

Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

A different OS might have different assumptions.


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 17:45:34 +0300, Marko Rauhamaa wrote:

 Rustom Mody rustompm...@gmail.com:
 
 What Marko is saying is that by imposing the structuring of unicode on
 the outside (Unix) world of text=byte, significant power is lost.
 
 Mostly I'm saying Python3 will not be able to hide the fact that linux
 data consists of bytes. It shouldn't even try. The linux OS outside the
 Python process talks bytes, not strings.

Data on pretty much *all* computers consists of bytes, regardless of the 
language or operating system. There may be a few esoteric or ancient 
machines from the Dark Ages that aren't based on bytes, and even fewer 
that aren't based on bits (ancient Soviet era mainframes, if any of them 
still survive), but they aren't important. Someday esoteric non-byte 
machines, perhaps quantum computers, or machines based on DNA, or nano-
sized analog computers made of carbon atoms, say, will be important, but 
this is not that day. For now, bytes rule *everywhere*.

Nevertheless, there are important abstractions that are written on top of 
the bytes layer, and in the Unix and Linux world, the most important 
abstraction is *text*. In the Unix world, text formats and text 
processing are much more common in user-space apps than binary processing. 
Perhaps the definitive explanation and celebration of the Unix way is 
Eric Raymond's The Art Of Unix Programming:

http://www.catb.org/esr/writings/taoup/html/ch05s01.html




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Robin Becker

On 05/06/2014 15:45, Marko Rauhamaa wrote:

Rustom Mody rustompm...@gmail.com:


What Marko is saying is that by imposing the structuring of unicode on
the outside (Unix) world of text=byte, significant power is lost.


Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.

A different OS might have different assumptions.


Marko

I think I'm in the unix camp as well. I just think that an extra assumption on 
input output isn't always helpful. In python 3 byte strings are second class 
which I think is wrong; apparently pressure from influential users is pushing to 
make byte strings more first class which is a good thing.

--
Robin Becker



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 1:37 AM, Robin Becker ro...@reportlab.com wrote:
 I think I'm in the unix camp as well. I just think that an extra assumption
 on input output isn't always helpful. In python 3 byte strings are second
 class which I think is wrong; apparently pressure from influential users is
 pushing to make byte strings more first class which is a good thing.

I wouldn't say they're second-class; it's more that the bytes type was
considered to be more like a list of ints than like a Unicode string,
and now that there are a few years' worth of real-world usage
information to learn from, it's known that some more string-like
operations will be extremely helpful. So now they're being added,
which I agree is a good thing.

Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
The Py2 str does the first, the Py3 bytes does the second. That one's
a bit hard to change, but what I'm not sure of is how significant this
is to new-build Py3 code. Obviously it's a barrier to porting, but is
it important on its own? However, that's still not really "byte
strings are second class".
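
The Py3 behaviour in question, as a short sketch of my own:

```python
data = b"a"

first = data[0]          # indexing yields an int in Python 3
first_slice = data[0:1]  # slicing yields a length-1 bytes object

# Two ways to get back from the int to a one-byte bytes object.
rebuilt = bytes([first])

print(first, first_slice, rebuilt)
```

Code that has to run on both 2.7 and 3.x tends to use the slice form,
since it gives a length-1 byte string on both.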

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 In the Unix world, text formats and text
 processing are much more common in user-space apps than binary processing.
 Perhaps the definitive explanation and celebration of the Unix way is
 Eric Raymond's The Art Of Unix Programming:

 http://www.catb.org/esr/writings/taoup/html/ch05s01.html

Specifically, this from the opening paragraph:

Text streams are a valuable universal format because they're easy for
human beings to read, write, and edit without specialized tools. These
formats are (or can be designed to be) transparent.


He goes on to talk about network protocols, one of the best examples
of this. I've idly speculated at times about the possibility of
rewriting the Magic: The Gathering Online back-end with a view to
making it easier to work with. Among other changes, I'd be wanting to
make the client-server communication be plain text (an SMTP-style of
protocol), with an external layer of encryption (TLS). This would mean
that:

1) Internal testing can be done without TLS, making the communication
absolutely transparent, easy to debug, easy to watch, everything.
Adding TLS later would have zero impact on the critical code
internally - it's just a layer around the outside.
2) Upgrades to crypto can simply follow industry best-practice.
(Reminder, to anyone who might have been mad enough to consider this:
DO NOT roll your own crypto! Ever! Even if you use a good library for
the heavy lifting!)
3) A debug log of what the client has sent and received could be
included, even in production, at very low cost. You don't need to
decode packets and pretty-print them - you just take the lines of
text, maybe adorn or color them according to which were sent/received,
and dump them into a display box or log file somewhere.
4) The server is forced to acknowledge that the client might not be
the one it expected. Not only do you get better security that way, but
you could also call this a feature.
5) Therefore, you can debug the system with a simple TELNET or MUD
client (okay, most MUD clients don't do SSL, but you can use openssl
s_client). As someone who's debugged myriad issues using his trusty
MUD client, I consider this to be a *huge* advantage.

All it takes is a few simple rules, like: "All communication is text,
encoded down the wire as UTF-8, and consists of lines (terminated by
U+000A) which consist of a word, a U+0020 space, and then parameters
to the command." There, that's a rigorous definition that covers
everything you'll need of it; compare with what Flash uses, by
default:

https://en.wikipedia.org/wiki/Action_Message_Format

Sure, it might be slightly more compact going down the wire; but what
do you really gain?

Text wins.
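
To show how little code such a rule set needs, here is a minimal sketch
in Python 3. The function names parse_line and format_line and the "say"
and "quit" commands are hypothetical, and treating a bare word with no
parameters as valid is my own assumption rather than part of the stated
rules.

```python
def parse_line(raw: bytes):
    """Split one wire line into (command, parameters)."""
    if not raw.endswith(b"\n"):
        raise ValueError("line must be terminated by U+000A")
    text = raw[:-1].decode("utf-8")  # all communication is UTF-8 text
    command, _, params = text.partition(" ")  # word, U+0020, parameters
    return command, params

def format_line(command: str, params: str = "") -> bytes:
    """Build one wire line from a command word and optional parameters."""
    line = f"{command} {params}" if params else command
    return (line + "\n").encode("utf-8")

wire = format_line("say", "Hyv\u00e4\u00e4 y\u00f6t\u00e4")
print(wire, parse_line(wire))
```

Because every line is readable text, the same bytes can be typed by hand
into a TELNET or MUD client or dumped straight into a debug log.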

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Robin Becker

On 05/06/2014 16:50, Chris Angelico wrote:
..


I wouldn't say they're second-class; it's more that the bytes type was
considered to be more like a list of ints than like a Unicode string,
and now that there are a few years' worth of real-world usage
information to learn from, it's known that some more string-like
operations will be extremely helpful. So now they're being added,
which I agree is a good thing.


In Python 2, str and unicode were much more comparable. On balance I think just 
reversing them, i.e. str -> bytes and unicode -> str, was probably the right thing 
to do if the default conversions had been turned off. However, making bytes a 
crippled thing was wrong.





Whether b"a"[0] should be b'a' or ord(b'a') is another sticking point.
The Py2 str does the first, the Py3 bytes does the second. That one's
a bit hard to change, but what I'm not sure of is how significant this
is to new-build Py3 code. Obviously it's a barrier to porting, but is
it important on its own? However, that's still not really "byte
strings are second class".

..
I dislike the current model, but that's because I had a lot of stuff to convert 
and probably made a bunch of blunders. The reportlab code is now a mess of hacks 
to keep it alive for 2.7 and >=3.3; I'm probably never going to be convinced that 
unicode types are good. Bytes are the underlying concept and should have remained 
so for simplicity's sake.

--
Robin Becker



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 16:37:23 +0100, Robin Becker wrote:

 In python 3 byte strings
 are second class which I think is wrong

It certainly is wrong. bytes are just as much a first-class built-in type 
as list, int, float, bool, set, tuple and str.

There may be missing functionality (relatively easy to add new 
functionality), and even poor design choices (like the foolish decision 
to have bytes display as if they were ASCII-ish strings, a silly mistake 
that simply reinforces the myth that bytes and ASCII are synonymous). 
Python 3.4 and 3.5 are in the process of rectifying as many of these 
mistakes as possible, e.g. adding back % formatting. But a few mistakes 
in the design of bytes' API no more makes it second-class than the lack 
of dict.contains_value() method makes dict second-class.

By all means ask for better bytes functionality. But don't libel Python 
by pretending that bytes is anything less than one of the most important 
and fundamental types in the language. bytes are so important that there 
are TWO implementations for them, a mutable and immutable version 
(bytearray and bytes), while text strings only have an immutable version.
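
A short sketch of the two byte types side by side (my own illustration;
the % formatting line assumes Python 3.5 or later, where PEP 461 landed):

```python
frozen = b"spam"          # immutable
buf = bytearray(frozen)   # mutable copy of the same bytes

buf[0] = ord("S")         # in-place mutation works on bytearray

try:
    frozen[0] = ord("S")  # bytes objects refuse item assignment
    mutated = True
except TypeError:
    mutated = False

# Percent formatting on bytes, restored in Python 3.5.
header = b"Content-Length: %d\r\n" % 42

print(buf, mutated, header)
```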



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 17:17:05 +0100, Robin Becker wrote:

 Bytes are the underlying
 concept and should have remained so for simplicity's sake.

Bytes are the underlying concept for classes too. Do you think that an 
opaque unstructured blob of bytes is simpler to use than a class? How 
would an unstructured blob of bytes be simpler to use than an array of 
multi-byte characters?

Earlier:

 I dislike the current model, but that's because I had a lot of stuff to
 convert and probably made a bunch of blunders. The reportlab code is
 now a mess of hacks to keep it alive for 2.7 and >=3.3;

Although I've been critical of many of your statements, I am sympathetic 
to your pain. There's no doubt that the transition from the old, 
broken system of bytes masquerading as text can be hard, especially to 
those who never quite get past the misleading and false paradigm that 
bytes are ASCII. It may have been that there were better ways to have 
updated to 3.3; perhaps you were merely unfortunate to have updated too 
early, and had you waited to 3.4 or 3.5 things would have been better. I 
don't know.

But whatever the situation, and despite our differences of opinion about 
Unicode, THANK YOU for having updated ReportLab to 3.3.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Nevertheless, there are important abstractions that are written on top
 of the bytes layer, and in the Unix and Linux world, the most
 important abstraction is *text*. In the Unix world, text formats and
 text processing are much more common in user-space apps than binary
 processing.

That linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   LANG=en_US.UTF-8
   $ od -c <<<"Hyvää yötä" # "Good night" in Finnish
   000   H   y   v 303 244 303 244   y 303 266   t 303 244  \n
   017

The od utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.

How about:

   $ wc -c <<<"Hyvää yötä"
   15
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyv ya�taa

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

which is why you should always prefix grep with LC_ALL=C in your
scripts (makes it far faster, too).
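
For comparison, the same string seen from Python 3, where the character
view and the byte view are kept apart. This is a sketch of my own; the
trailing newline is included to match what the shell commands above were
fed.

```python
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n"  # "Good night" in Finnish, plus newline
data = text.encode("utf-8")

char_count = len(text)  # counts code points
byte_count = len(data)  # counts bytes, like wc -c

# The non-ASCII bytes in octal, the notation od -c uses for them.
octal = [format(b, "o") for b in data if b >= 0x80]

# Replacing a character operates on code points, not on bytes.
replaced = text.replace("\u00e4", "a")

print(char_count, byte_count, octal, repr(replaced))
```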


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Rustom Mody
On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
 On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
  In the Unix world, text formats and text
  processing are much more common in user-space apps than binary processing.
  Perhaps the definitive explanation and celebration of the Unix way is
  Eric Raymond's The Art Of Unix Programming:
  http://www.catb.org/esr/writings/taoup/html/ch05s01.html

 Specifically, this from the opening paragraph:
 
 Text streams are a valuable universal format because they're easy for
 human beings to read, write, and edit without specialized tools. These
 formats are (or can be designed to be) transparent.
 

A fact that stops being true when you tie up text with encodings.
For two reasons:

1. The function/pair encode/decode mapping between byte-string and text 
   cannot be a bijection because the byte-string set is larger than the text
   set.  This is the error that Armin was hit by

2. Since there is not one but a zillion encodings possible we are not
   talking of one (possibly universal) data structure but a zillion
   ones: "Text streams are a universal format" - which encoded
   form of text??
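
Both points can be seen in a few lines of Python 3. This is a sketch of my
own, not code from the thread. The surrogateescape error handler used at
the end comes from PEP 383 and is Python's own workaround for bytes that
do not decode; whether it answers the objection is a separate question.

```python
raw = b"caf\xe9"  # valid Latin-1, but not valid UTF-8

# Point 1: not every byte string is the UTF-8 encoding of some text.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Point 2: the same bytes mean different text under different encodings.
as_latin1 = raw.decode("latin-1")
as_cp037 = raw.decode("cp037")

# PEP 383: smuggle undecodable bytes through str and back again.
smuggled = raw.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(as_latin1), repr(as_cp037), round_trip == raw)
```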


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 2:17 AM, Robin Becker ro...@reportlab.com wrote:
 in python 2 str and unicode were much more comparable. On balance I think
 just reversing them ie str -- bytes and unicode -- str was probably the
 right thing to do if the default conversions had been turned off. However
 making bytes a crippled thing was wrong.

It's easy to build up functionality after the event. Maybe reportlab
will have lots of hacks to support both 2.7 and 3.3, but in a few
years you'll be able to say "supports 2.7 and 3.5" and take advantage
of percent formatting and whatever else is added. But this is just the
way that languages develop; you use them, you find what isn't easy,
and you fix it. The nature of stability is that it takes time before
you can depend on freshly-written functionality (contrast the extreme
instability of running the version from source control - stuff might
be fixed at any time, but you have to do all the work yourself to make
sure your dependencies line up), but over time, you can depend on
improvements making their way out there.

Can you point to specific areas in which the bytes type is crippled?
Comparing either to the Py2 str or the Py3 str, or to anything else?
The Python core devs are listening, as evidenced by PEP 461.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Ian Kelly
On Thu, Jun 5, 2014 at 10:17 AM, Robin Becker ro...@reportlab.com wrote:
 in python 2 str and unicode were much more comparable. On balance I think
 just reversing them ie str -- bytes and unicode -- str was probably the
 right thing to do if the default conversions had been turned off. However
 making bytes a crippled thing was wrong.

How should e.g. bytes.upper() be implemented then?  The correct
behavior is entirely dependent on the encoding.  Python 2 just assumes
ASCII, which at best will correctly upper-case some subset of the
string and leave the rest unchanged, and at worst could corrupt the
string entirely.  There are some things that were dropped that should
not have been, but my impression is that those are being worked on,
for example % formatting in PEP 461.
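A small sketch of the problem, assuming Python 3, where bytes.upper() only touches ASCII letters. The sample word is an assumption chosen for illustration. The same text encoded two ways gives two different byte strings, and neither has its non-ASCII letters upper-cased, while the str method handles them.

```python
word = "hyv\u00e4\u00e4"

utf8 = word.encode("utf-8")
latin1 = word.encode("latin-1")

# bytes.upper() converts ASCII lowercase letters only; other bytes pass through.
upper_utf8 = utf8.upper()
upper_latin1 = latin1.upper()

# str.upper() knows the text is text and upper-cases U+00E4 to U+00C4.
upper_text = word.upper()
```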


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 That linux text is not the same thing as Python's text. Conceptually,
 Python text is a sequence of 32-bit integers. Linux text is a sequence
 of 8-bit integers.

Point of terminology: Linux is the kernel, everything you say below
here is talking about particular programs. From what I understand,
bash (just another Unix program) treats strings as sequences of
codepoints, just as Python does; though its string manipulation is not
nearly as rich as Python's, so it's harder to prove. Python is itself
a Unix program, so you can do the exact same proofs and demonstrate
that Linux is clearly Unicode-aware. It's not Linux you're testing.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 2:54 AM, Rustom Mody rustompm...@gmail.com wrote:
 On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
 On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
  In the Unix world, text formats and text
  processing is much more common in user-space apps than binary processing.
  Perhaps the definitive explanation and celebration of the Unix way is
  Eric Raymond's The Art Of Unix Programming:
  http://www.catb.org/esr/writings/taoup/html/ch05s01.html

 Specifically, this from the opening paragraph:
 
 Text streams are a valuable universal format because they're easy for
 human beings to read, write, and edit without specialized tools. These
 formats are (or can be designed to be) transparent.
 

 A fact that stops being true when you tie up text with encodings.
 For two reasons:

 1. The function/pair encode/decode mapping between byte-string and text
    cannot be a bijection because the byte-string set is larger than the text
    set.  This is the error that Armin was hit by

 2. Since there is not one but a zillion encodings possible we are not
    talking of one (possibly universal) data structure but a zillion
    ones: Text streams are a universal format - which encoding-ed
    form of text??

As soon as you store or transmit ANY form of information, you need to
worry about encodings. Ever heard of this thing called network byte
order? It's part of taming the wilds of integer encodings. The theory
is that the LC environment variables will carry all that crucial
out-of-band information about encodings, and while the practice isn't
perfect, it does still mean that there is such a thing as a text
stream.
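The integer analogy can be sketched with the standard struct module; the value below is an arbitrary example. The same 32-bit integer has different byte representations depending on byte order, and reading the bytes with the wrong assumption yields a different number, much like decoding text with the wrong codec.

```python
import struct

value = 0x0A0B0C0D

network = struct.pack("!I", value)   # network (big-endian) byte order
little = struct.pack("<I", value)    # little-endian byte order

# Decoding with the matching convention recovers the value...
decoded = struct.unpack("!I", network)[0]
# ...while the wrong convention silently gives a different integer.
misread = struct.unpack("<I", network)[0]
```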

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Terry Reedy

On 6/5/2014 10:45 AM, Marko Rauhamaa wrote:


Mostly I'm saying Python3 will not be able to hide the fact that linux
data consists of bytes. It shouldn't even try. The linux OS outside the
Python process talks bytes, not strings.


A text file is a binary file wrapped with a codec to translate to and 
from a universal text format on input and output.  Much of the time, the 
wrapping is a great user convenience. Since the wrapping is optional, 
nothing is forced or really hidden.



A different OS might have different assumptions.


Different OSes *do* have different assumptions. Both MacOSX and current 
Windows use (UCS-2 or) UTF-16 for text. It seems that unicode strings 
are better than ascii+??? strings as a universal basis for OS 
interfacing.  For Windows, at least, the interface is much improved in 
Python 3.


I understand that some, but not all, Latin alphabet *nix programmers 
wish that Python 3 continued to be strongly in their favor. But they are 
a small minority of the world's programmers, and Python 3 is aimed at 
everyone on all systems.


--
Terry Jan Reedy




Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Terry Reedy tjre...@udel.edu:

 Different OSes *do* have different assumptions. Both MacOSX and
 current Windows use (UCS-2 or) UTF-16 for text.

Linux can use anything for text; UTF-8 has become a de-facto standard.

How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).

I have no idea how opaque text files are in Windows or OS-X.

 For Windows, at least, the interface is much improved in Python 3.

Yes, I get the feeling that Python is reaching out to Windows and OS-X
and trying to make linux look like them.

 I understand that some, but not all, Latin alphabet *nix programmers
 wish that Python 3 continued to be strongly in their favor. But they
 are a small minority of the world's programmers, and Python 3 is aimed
 at everyone on all systems.

Python allows linux programmers to write native linux programs. Maybe it
allows Windows programmers to write native Windows programs. I certainly
hope so.

I don't want to have to write Windows programs that kinda run on linux.
Java suffers from that: no import os in Java.


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Terry Reedy

On 6/5/2014 5:53 AM, Marko Rauhamaa wrote:

Chris Angelico ros...@gmail.com:


If the standard streams are so crucial, why are their most obvious
interfaces insignificant to you?


I want the standard streams to consume and produce bytes.


Easy. Read the manual entry for stdxxx. To write or read binary data 
from/to the standard streams, use the underlying binary buffer object. 
For example, to write bytes to stdout, use 
sys.stdout.buffer.write(b'abc'). To make it easy, use bound methods.


myfilter.p
--
import sys
sysin = sys.stdin.buffer.read
sysout = sys.stdout.buffer.write
syserr = sys.stderr.buffer.write

filter code with calls to sysin, sysout, syserr.
---

The same trick of defining bound methods to save both writing and 
execution time is also useful for text filters when you use 
sys.stdin.read, etc, more than once in the text.
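A slightly fuller sketch of such a filter, assuming Python 3. The names run_filter and main and the upper-casing transform are hypothetical, chosen only for illustration. Passing the bound read and write methods in as arguments lets the same function run against the real standard streams or against in-memory buffers.

```python
import io
import sys


def run_filter(read, write, transform):
    """Read all input bytes, transform them, and write the result."""
    write(transform(read()))


def main():
    # Bind the binary-layer methods once, as suggested above.
    sysin = sys.stdin.buffer.read
    sysout = sys.stdout.buffer.write
    run_filter(sysin, sysout, bytes.upper)


# Exercising the filter without touching the real streams:
source = io.BytesIO(b"abc\n")
sink = io.BytesIO()
run_filter(source.read, sink.write, bytes.upper)
result = sink.getvalue()
```

In a real script, main() would be called under the usual `if __name__ == "__main__":` guard.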


When you try this, please report the result, either way.

 I do a lot of  system programming and connect processes to each other
 with socketpairs, pipes and the like. I have dealt with plugin APIs
 that communicate over stdin and stdout.

Now you know how to do so on Python 3.


Python is clearly on a crusade to make *text* a first class system
entity. I don't believe that is possible (without casualties) in the
linux world. Python text should only exist inside string objects.


You are clearly on a crusade to push a falsehood. Why?

On Windows and, I believe, Mac, utf-16 encoded text (C widechar type) 
*is* a 'first class system entity'. The problem Python has with *nix is 
getting text bytes from the system in an unknown or worse, 
wrongly-claimed encoding. The Python developers do their best to cope 
with the differences and peculiarities of the systems it runs on.


--
Terry Jan Reedy



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Terry Reedy tjre...@udel.edu:

 On 6/5/2014 5:53 AM, Marko Rauhamaa wrote:
 Chris Angelico ros...@gmail.com:

 If the standard streams are so crucial, why are their most obvious
 interfaces insignificant to you?

 I want the standard streams to consume and produce bytes.

 Easy. Read the manual entry for stdxxx. To write or read binary data
 from/to the standard streams, use the underlying binary buffer object.
 For example, to write bytes to stdout, use
 sys.stdout.buffer.write(b'abc')

This note from the manual is a bit vague:

   Note that the streams can be replaced with objects (like io.StringIO)
   that do not support the buffer attribute or the detach() method

Can be replaced by who? By the Python developers? By me? By random
library calls?

Does it mean the buffer and detach are not guaranteed to stay with the
API?


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Terry Reedy

On 6/5/2014 4:21 PM, Marko Rauhamaa wrote:

Terry Reedy tjre...@udel.edu:


On 6/5/2014 5:53 AM, Marko Rauhamaa wrote:

Chris Angelico ros...@gmail.com:


If the standard streams are so crucial, why are their most obvious
interfaces insignificant to you?


I want the standard streams to consume and produce bytes.


Easy. Read the manual entry for stdxxx. To write or read binary data
from/to the standard streams, use the underlying binary buffer object.
For example, to write bytes to stdout, use
sys.stdout.buffer.write(b'abc')


This note from the manual is a bit vague:

Note that the streams can be replaced with objects (like io.StringIO)
that do not support the buffer attribute or the detach() method

Can be replaced by who? By the Python developers? By me? By random
library calls?


Fair question. The Python developers will not fiddle with stdxxx for 3rd 
party code on 3rd party systems. We do sometimes *temporarily* replace 
the streams with StringIO, either directly or via test.support when 
testing Python itself or stdlib modules. That is done in Lib/test, and 
except for testing StringIO, it is only done as a convenience, not a 
necessity.


To test a binary stream filter, you would have to do something else, 
like read from and write to actual files on disk. Otherwise, you seem 
unlikely to sabotage yourself, even accidentally.


Random non-stdlib library calls could sabotage you. However, in my 
opinion, an imported 3rd party module should never modify std streams, 
with one exception. The exception would be a module whose entire purpose 
was to put the streams in a known state, as documented, and only if 
intentionally asked to.


Having said that, bound methods created (first) should work regardless 
of any subsequent manipulation of sys. Here is an experiment, run from 
an Idle editor.


import sys
sysout = sys.stdout.write
sys.stdout = None
sysout('works anyway\n')

works anyway

(Of course, subsequent attempts to continue interactively fail. But that 
is not your use case.)


--
Terry Jan Reedy



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Rustom Mody
On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
 On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
  That linux text is not the same thing as Python's text. Conceptually,
  Python text is a sequence of 32-bit integers. Linux text is a sequence
  of 8-bit integers.

 Point of terminology: Linux is the kernel, everything you say below
 here is talking about particular programs.

If it helps try the following substitution:

s/Linux/Pretty much all the distros that use Linux for their OS kernel/

BTW the only (other) guy I know who insistently makes that distinction is
Richard Stallman.

Are you an emacs user by any chance wink?


 From what I understand,
 bash (just another Unix program) treats strings as sequences of
 codepoints, just as Python does; though its string manipulation is not
 nearly as rich as Python's, so it's harder to prove. Python is itself
 a Unix program, so you can do the exact same proofs and demonstrate
 that Linux is clearly Unicode-aware. It's not Linux you're testing.

In these 'other programs' is it permissible to include the kernel
itself?
And then ask how Linux (in your and Stallman's sense) differs from
Windows in how the filesystem handles things like filenames?


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody rustompm...@gmail.com wrote:
 On Thursday, June 5, 2014 10:58:43 PM UTC+5:30, Chris Angelico wrote:
 On Fri, Jun 6, 2014 at 2:52 AM, Marko Rauhamaa wrote:
  That linux text is not the same thing as Python's text. Conceptually,
  Python text is a sequence of 32-bit integers. Linux text is a sequence
  of 8-bit integers.

 Point of terminology: Linux is the kernel, everything you say below
 here is talking about particular programs.

 If it helps try the following substitution:

 s/Linux/Pretty much all the distros that use Linux for their OS kernel/

You could look at the Debian Project, which is a full environment with
everything you're talking about. And everything you say would be
equally true of Debian Linux and Debian kfreebsd. :)

 BTW the only (other) guy I know who insistently makes that distinction is
 Richard Stallman.

 Are you an emacs user by any chance wink?

Nope! Just a terminology nerd. :)

 From what I understand,
 bash (just another Unix program) treats strings as sequences of
 codepoints, just as Python does; though its string manipulation is not
 nearly as rich as Python's, so it's harder to prove. Python is itself
 a Unix program, so you can do the exact same proofs and demonstrate
 that Linux is clearly Unicode-aware. It's not Linux you're testing.

 In these 'other programs' is it permissible to include the kernel
 itself?
 And then ask how Linux (in your and Stallman's sense) differs from
 Windows in how the filesystem handles things like filenames?

What are you testing of the kernel? Most of the kernel doesn't
actually work with text at all - it works with integers, buffers of
memory (which could be seen as streams of bytes, but might be almost
anything), process tables, open file handles... but not usually text.
To you, EAGAIN might be a bit of text, but to the Linux kernel, it's
an integer (11 decimal, if I recall correctly). Is that some fancy new
form of encoding? :)
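The name-to-integer mapping can be inspected from Python with the standard errno and os modules. This is only a sketch; the numeric value is platform-dependent (11 on Linux), so it is printed rather than assumed.

```python
import errno
import os

code = errno.EAGAIN              # the integer the kernel actually returns
name = errno.errorcode[code]     # a symbolic name registered for that integer
message = os.strerror(code)      # human-readable text supplied by the C library

print(code, name, message)
```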

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 21:30:11 +0300, Marko Rauhamaa wrote:

 Terry Reedy tjre...@udel.edu:
 
 Different OSes *do* have different assumptions. Both MacOSX and current
 Windows use (UCS-2 or) UTF-16 for text.
 
 Linux can use anything for text; UTF-8 has become a de-facto standard.
 
 How text is represented is very different from whether text is a
 fundamental data type. A fundamental text file is such that ordinary
 operating system facilities can't see inside the black box (that is,
 they are *not* encoded as far as the applications go).

Wait, are they black-boxes to the *operating system* or to 
*applications*? They aren't the same thing.

In any case, I reject your premise. ALL data types are constructed on top 
of bytes, and so long as you allow applications *any way* to coerce data 
types to different data types, you allow them to see inside the black 
box. I can extract the four bytes from a C long integer, but that 
doesn't mean that C longs aren't fundamental data types in Unix/Linux.


 I have no idea how opaque text files are in Windows or OS-X.

Exactly as opaque as they are in Unix, which is to say not at all. Just 
open the file in binary mode, and voilà you see the underlying bytes.

All you're doing is pointing out that, in modern electronic computers, 
the fundamental data structure which underlies all others (the 
indivisible protons and neutrons, so to speak, only there are 256 of them 
rather than 2) is the byte. We know this, and don't dispute it.

(Like protons and neutrons, we can see inside bytes to the quark-like 
bits that make up bytes. Like quarks, bits do not exist in isolation, but 
only inside bytes.)



 For Windows, at least, the interface is much improved in Python 3.
 
 Yes, I get the feeling that Python is reaching out to Windows and OS-X
 and trying to make linux look like them.

Unicode support in OS-X is (I have been assured) very good, probably 
better than Linux. Apple has very high standards when it comes to their 
apps, and provides rich Unicode-aware APIs.

But Linux Unicode support is much better than Windows. Unicode support in 
Windows is crippled by continued reliance on legacy code pages, and by 
the assumption deep inside the Windows APIs that Unicode means 16 bit 
characters. See, for example, the amount of space spent on fixing 
Windows Unicode handling here:

http://www.utf8everywhere.org/



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Steven D'Aprano
On Thu, 05 Jun 2014 23:21:35 +0300, Marko Rauhamaa wrote:

 Terry Reedy tjre...@udel.edu:
 
 On 6/5/2014 5:53 AM, Marko Rauhamaa wrote:
 Chris Angelico ros...@gmail.com:

 If the standard streams are so crucial, why are their most obvious
 interfaces insignificant to you?

 I want the standard streams to consume and produce bytes.

 Easy. Read the manual entry for stdxxx. To write or read binary data
 from/to the standard streams, use the underlying binary buffer object.
 For example, to write bytes to stdout, use
 sys.stdout.buffer.write(b'abc')
 
 This note from the manual is a bit vague:
 
Note that the streams can be replaced with objects (like io.StringIO)
that do not support the buffer attribute or the detach() method
 
 Can be replaced by who? By the Python developers? By me? By random
 library calls?

By you. sys.stdout and friends are writable. Any code you call may have 
replaced them with another file-like object, and you should honour that.

The API could have/should have been a little more friendly, but it's 
conceptually simple:

* Does sys.stdout have a buffer attribute? Then write raw bytes to
  the buffer.

* If not, then write raw bytes to sys.stdout.

* If either fails, then somebody has replaced stdout with something
  weird, and they deserve whatever horrible fate their damn fool
  move causes. It's not your responsibility to try to keep your
  application running under bizarre circumstances.
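The steps above can be sketched as a small helper, assuming Python 3. The name write_bytes is hypothetical and not part of any library. It prefers the stream's buffer attribute when there is one and otherwise writes to the stream directly, letting any resulting error propagate.

```python
import io


def write_bytes(stream, data):
    """Write raw bytes to stream.buffer if present, else to stream itself."""
    target = getattr(stream, "buffer", stream)
    target.write(data)


# A text stream layered over a binary buffer, like the real sys.stdout.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
write_bytes(text_stream, b"abc")

# A plain binary replacement with no buffer attribute.
binary_stream = io.BytesIO()
write_bytes(binary_stream, b"abc")
```

With an io.StringIO replacement, which has no buffer and accepts only str, the helper raises TypeError; that is the third case in the list.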



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 In any case, I reject your premise. ALL data types are constructed on
 top of bytes,

Only in a very dull sense.

 and so long as you allow applications *any way* to coerce data types
 to different data types, you allow them to see inside the black box.

I can't see the bytes inside Python objects, including strings, and
that's how it is supposed to be.

Similarly, I can't (easily) see how files are laid out on hard disks.
That's a true abstraction. Nothing in linux presents data, though,
except through bytes.


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Can be replaced by who? By the Python developers? By me? By random
 library calls?

 By you. sys.stdout and friends are writable. Any code you call may
 have replaced them with another file-like object, and you should
 honour that.

I can of course overwrite even sys and os and open and all. That hardly
merits mentioning in the API documentation.

What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version. That would be terrible.

If it means some other module is allowed to commandeer the standard
streams, that would be bad as well.

Worst of all, I don't know why the caveat had to be there.

Or is it maybe because some python command line options could cause
buffer and detach not to be there? That would explain the caveat, but
still would be kinda sucky.


Marko


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 9:30 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Can be replaced by who? By the Python developers? By me? By random
 library calls?

 By you. sys.stdout and friends are writable. Any code you call may
 have replaced them with another file-like object, and you should
 honour that.

 I can of course overwrite even sys and os and open and all. That hardly
 merits mentioning in the API documentation.

 What I'm afraid of is that the Python developers are reserving the right
 to remove the buffer and detach attributes from the standard streams in
 a future version. That would be terrible.

 If it means some other module is allowed to commandeer the standard
 streams, that would be bad as well.

 Worst of all, I don't know why the caveat had to be there.

 Or is it maybe because some python command line options could cause
 buffer and detach not to be there? That would explain the caveat, but
 still would be kinda sucky.

It's more that replacing sys.std* is considered reasonably normal
(unlike, say, replacing sys.float_info, which would be a weird thing
to do); and you could replace them with something that doesn't have
those attributes. If you're running a top-level script and you never
import anything that changes the streams, you should be able to depend
on those always being there.
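A minimal sketch of such a replacement, assuming Python 3.4 or later for contextlib.redirect_stdout. While the redirection is active, sys.stdout is an io.StringIO, which has no buffer attribute; this is the situation the documentation note describes.

```python
import contextlib
import io
import sys

capture = io.StringIO()

with contextlib.redirect_stdout(capture):
    # Inside the block sys.stdout is the StringIO object.
    has_buffer = hasattr(sys.stdout, "buffer")
    print("captured")

captured_text = capture.getvalue()
```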

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Terry Reedy

On 6/5/2014 7:30 PM, Marko Rauhamaa wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:


Can be replaced by who? By the Python developers? By me? By random
library calls?


By you. sys.stdout and friends are writable. Any code you call may
have replaced them with another file-like object, and you should
honour that.


I can of course overwrite even sys and os and open and all. That hardly
merits mentioning in the API documentation.

What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version.


No, not at all.


That would be terrible.


Agreed.


If it means some other module is allowed to commandeer the standard
streams, that would be bad as well.


I think that, for the most part, library modules should either open a 
file given a filename from outside or read from and write to open files 
handed to them from outside, but not hard-code the std streams. The 
module doc should say if the file (name or object) must be text or in 
particular binary.


The warning is also a hint as to how to solve a problem, such as testing 
a binary filter. Assume the module reads from and writes to .buffer and 
has a main function. One approach, untested:


import sys, io, unittest
from mod import main

class Binstd:
    def __init__(self):
        self.buffer = io.BytesIO()

sys.stdin = Binstd()
sys.stdout = Binstd()

sys.stdin.buffer.write(b'test data')
sys.stdin.buffer.seek(0)
main()
out = sys.stdout.buffer.getvalue()
# test that out is as expected for the input
# seek to 0 and truncate for more tests


Worst of all, I don't know why the caveat had to be there.


Because the streams can be replaced for a variety of good reasons, as above.


Or is it maybe because some python command line options could cause
buffer and detach not to be there? That would explain the caveat, but
still would be kinda sucky.


The doc set documents the Python command line options, as well as any that
are CPython specific.  It is possible that some implementation could add
one to open stdxyz in binary mode. CPython does not really need that.


--
Terry Jan Reedy



Re: Python 3.2 has some deadly infection

2014-06-05 Thread Rustom Mody
On Friday, June 6, 2014 4:22:22 AM UTC+5:30, Chris Angelico wrote:
 On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody  wrote:
  And then ask how Linux (in your and Stallman's sense) differs from
  Windows in how the filesystem handles things like filenames?

 What are you testing of the kernel? Most of the kernel doesn't
 actually work with text at all - it works with integers, buffers of
 memory (which could be seen as streams of bytes, but might be almost
 anything), process tables, open file handles... but not usually text.
 To you, EAGAIN might be a bit of text, but to the Linux kernel, it's
 an integer (11 decimal, if I recall correctly). Is that some fancy new
 form of encoding? :)


| Thanks to the properties of UTF-8 encoding, the Linux kernel, the
| innermost and lowest-level part of the operating system, can
| handle Unicode filenames without even having the user tell it
| that UTF-8 is to be used. All character strings, including
| filenames, are treated by the kernel in such a way that THEY
| APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and
| does not need to know whether a pair of consecutive bytes should
| logically be treated as two characters or a single one. The only
| risk of the kernel being fooled would be, for example, for a
| filename to contain a multibyte Unicode character encoded in such
| a way that one of the bytes used to represent it was a slash or
| some other character that has a special meaning in file
| names. Fortunately, as we noted, UTF-8 never uses ASCII
| characters for encoding multibyte characters, so neither the
| slash nor any other special character can appear as part of one
| and therefore there is no risk associated with using Unicode in
| filenames.
|  
| Filesystems found on Microsoft Windows machines (NTFS and FAT)
| are different in that THEY STORE FILENAMES ON DISK IN SOME
| PARTICULAR ENCODING. The kernel must translate this encoding to
| the system encoding, which will be UTF-8 in our case.
|  
| If you have Windows partitions on your system, you will have to
| take care that they are mounted with correct options. For FAT and
| ISO9660 (used by CD-ROMs) partitions, option utf8 makes the
| system translate the filesystem's character encoding to
| UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should
| also work).

[Emphases mine]

From: http://michal.kosmulski.org/computing/articles/linux-unicode.html
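The property the quoted article relies on can be checked with a short sketch. The sample name below is an assumption chosen for illustration. In UTF-8 every byte of a multibyte sequence has its high bit set, so the byte 0x2F can only ever be a genuine slash; UTF-16 offers no such guarantee.

```python
# Assumed sample name: Finnish text, a real path separator, and two kanji.
name = "Hyv\u00e4\u00e4 y\u00f6t\u00e4/\u7e54\u7530"
encoded = name.encode("utf-8")

# The only 0x2F byte in the encoded name is the real "/" separator.
slash_positions = [i for i, b in enumerate(encoded) if b == 0x2F]

# Collect the bytes of every non-ASCII character and check their high bit.
multibyte_bytes = [b for ch in name if ord(ch) > 0x7F for b in ch.encode("utf-8")]
all_high = all(b >= 0x80 for b in multibyte_bytes)

# U+012F encoded as UTF-16-LE contains the byte 0x2F; as UTF-8 it does not.
utf16_has_slash = 0x2F in "\u012f".encode("utf-16-le")
utf8_has_slash = 0x2F in "\u012f".encode("utf-8")
```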


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 1:11 PM, Rustom Mody rustompm...@gmail.com wrote:
 All character strings, including
 | filenames, are treated by the kernel in such a way that THEY
 | APPEAR TO IT ONLY AS STRINGS OF BYTES.

Yep, the real issue here is file systems, not the kernel. But yes,
this is one of the very few places where the kernel deals with a
string - and because of the hairiness of having to handle myriad file
systems in a single path (imagine multiple levels of remote mounts -
I've had a case where I mount via sshfs a tree that includes a Samba
mount point, and you can go a lot deeper than that), the only thing it
can do is pass the bytes on unchanged. Which means, in reality, the
kernel doesn't actually do *anything* with the string, it just passes
it right along to the file system.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Rustom Mody
On Friday, June 6, 2014 8:50:57 AM UTC+5:30, Chris Angelico wrote:
 kernel doesn't actually do *anything* with the string, it just passes
 it right along to the file system.

Which is what Marko (and others like Armin) are asking of python
(treated as a processing 'kernel'):

I know what I am doing with my bytes -- please channel/funnel them
around as requested without being unnecessarily and unrequestedly
'intelligent'


Re: Python 3.2 has some deadly infection

2014-06-05 Thread Ethan Furman

On 06/05/2014 04:30 PM, Marko Rauhamaa wrote:


What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version.


Being afraid is silly.  If you have a question, ask it.

--
~Ethan~



Re: Python 3.2 has some deadly infection

2014-06-04 Thread Steven D'Aprano
On Tue, 03 Jun 2014 15:18:19 +0100, Robin Becker wrote:

 
 The problem is that casual readers like Robin sometimes jump from 'In
 Python 3, it can be hard to do something one really ought not to do' to
 'Binary I/O is hard in Python 3' -- which it is not.

 I'm fairly casual and I did understand that the rant was a bit over the
 top. For fairly practical reasons I have always regarded the std streams
 as allowing binary data, and always objected to having to open files in
 python with a 't' or 'b' mode to cope with line ending issues.
 
 Isn't it a bit old fashioned to think everything is connected to a
 console?

The whole concept of stdin and stdout is based on the idea of having a 
console to read from and write to. Otherwise, what would be the point? 
Classic Mac (pre OS X) had no command line interface, and nothing 
even remotely like stdin and stdout. But once you have a console, stdin, 
stdout, and stderr become useful. And once you have them, then you can 
extend the concept using redirection and pipes. But fundamentally, stdin 
and stdout are about consoles.


 I think the idea that we only give meaning to binary data using
 encodings is a bit limiting. A zip or gif file has structure, but I
 don't think it's reasonable to regard such a file as having an encoding
 in the python unicode sense.

In the Unicode sense? Of course not, that would be silly.

The concept of encodings is bigger than just text, and in that sense zip 
compression is an encoding which encodes non-random data into a different 
format which generally takes up less space.
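A minimal sketch of that broader sense of "encoding", assuming Python 3 and using the standard zlib module (DEFLATE, the compression method zip archives commonly use) as a stand-in for a full zip file. The sample text is made up for illustration.

```python
import zlib

# Repetitive (non-random) data compresses well.
text = "stdin and stdout are about consoles. " * 50

raw = text.encode("utf-8")    # first encoding: characters -> bytes
packed = zlib.compress(raw)   # second encoding: bytes -> fewer bytes

print(len(raw), len(packed))

# Both layers are reversible, provided they are undone in the opposite order.
restored = zlib.decompress(packed).decode("utf-8")
```

The two layers are independent: the compression layer neither knows nor cares that the bytes it was given happen to be UTF-8 text.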



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-04 Thread Gregory Ewing

Steven D'Aprano wrote:
The whole concept of stdin and stdout is based on the idea of having a 
console to read from and write to.


Not really; stdin and stdout are frequently connected to
files, or pipes to other processes. The console, if it
exists, just happens to be a convenient default value for
them. Even on a system without a console, they're still
a useful abstraction.

But we were talking about encodings, and whether stdin
and stdout should be text or binary by default. Well,
one of the design principles behind unix is to make use
of plain text wherever possible. Not just for stuff
meant to be seen on the screen, but for stuff kept in
files as well.

As a result, most unix programs, most of the time, deal
with text on stdin and stdout. So, it makes sense for
them to be text by default. And wherever there's text,
there needs to be an encoding. This is true whether
a console is involved or not.
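As a rough illustration of that last point, assuming Python 3: a text stream carries an encoding whether or not a console is behind it. The `describe` helper below is hypothetical, written only for this example.

```python
import io
import sys

def describe(stream):
    """Return (is_text, encoding, is_console) for a stream (hypothetical helper)."""
    is_text = isinstance(stream, io.TextIOBase)
    encoding = getattr(stream, "encoding", None)
    try:
        is_console = stream.isatty()
    except (AttributeError, ValueError):
        is_console = False
    return is_text, encoding, is_console

# A text stream layered over an in-memory byte buffer: no console anywhere,
# yet it still needs (and has) an encoding.
pipe_like = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
print(describe(pipe_like))

# The real stdout: the console flag depends on how the script is run
# (interactively, redirected to a file, or piped to another process).
print(describe(sys.stdout))
```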

--
Greg


Re: Python 3.2 has some deadly infection

2014-06-04 Thread Akira Li
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 On Tue, 03 Jun 2014 15:18:19 +0100, Robin Becker wrote:

 Isn't it a bit old fashioned to think everything is connected to a
 console?

 The whole concept of stdin and stdout is based on the idea of having a 
 console to read from and write to. Otherwise, what would be the point? 
 Classic Mac (pre OS X) had no command line interface, and nothing 
 even remotely like stdin and stdout. But once you have a console, stdin, 
 stdout, and stderr become useful. And once you have them, then you can 
 extend the concept using redirection and pipes. But fundamentally, stdin 
 and stdout are about consoles.

We can consider the pipe abstraction to be fundamental. Decades of usage
prove the usefulness of a pipeline of processes, e.g.,

  tr -cs A-Za-z '\n' |
  tr A-Z a-z |
  sort |
  uniq -c |
  sort -rn |
  sed ${1}q

See http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

Whether or not a pipe is connected to a tty is a small
detail. stdin/stdout is about pipes, not consoles.
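For comparison, here is a rough Python 3 sketch of what that shell pipeline computes: the n most frequent words in a text. The function name `top_words` is made up for this example, and ties may come out in a different order than `sort -rn` would give.

```python
import re
from collections import Counter

def top_words(text, n):
    """Return the n most frequent words as (word, count) pairs."""
    words = re.findall(r"[A-Za-z]+", text)        # tr -cs A-Za-z '\n'
    counts = Counter(w.lower() for w in words)    # tr A-Z a-z | sort | uniq -c
    return counts.most_common(n)                  # sort -rn | sed ${1}q

print(top_words("The cat and the hat and THE bat", 2))
```

To use it as a filter in the same spirit, one could feed it `sys.stdin.read()`; it then works identically whether stdin is a console, a file, or a pipe.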


--
akira



Re: Python 3.2 has some deadly infection

2014-06-03 Thread Terry Reedy

On 6/3/2014 1:16 AM, Gregory Ewing wrote:

Terry Reedy wrote:

The issue Armin ran into is this. He wrote a library module that makes
sure the streams are binary.


Seems to me he made a mistake right there. A library should
*not* be making global changes like that. It can obtain
binary streams from stdin and stdout for its own use, but
it shouldn't stuff them back into sys.stdin and sys.stdout.

If he had trouble because another library did that, then
that library is broken, not Python.


I agree. The example in Armin's blog rant was an application, an empty 
unix filter (i.e., a simplified cat clone). For that example, the complex 
code he posted to show how awful Python 3 is, is unneeded. When I asked 
why he did not directly use the fix in the doc, without the 
scaffolding, he switched to the 'library' module explanation.


The problem is that casual readers like Robin sometimes jump from 'In 
Python 3, it can be hard to do something one really ought not to do' to 
'Binary I/O is hard in Python 3' -- which it is not.


--
Terry Jan Reedy



Re: Python 3.2 has some deadly infection

2014-06-03 Thread Robin Becker



The problem is that casual readers like Robin sometimes jump from 'In Python 3,
it can be hard to do something one really ought not to do' to 'Binary I/O is
hard in Python 3' -- which it is not.

I'm fairly casual and I did understand that the rant was a bit over the top for 
fairly practical reasons I have always regarded the std streams as allowing 
binary data and always objected to having to open files in python with a 't' or 
'b' mode to cope with line ending issues.


Isn't it a bit old fashioned to think everything is connected to a console?

I think the idea that we only give meaning to binary data using encodings is a 
bit limiting. A zip or gif file has structure, but I don't think it's reasonable 
to regard such a file as having an encoding in the python unicode sense.

--
Robin Becker



Re: Python 3.2 has some deadly infection

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 12:18 AM, Robin Becker ro...@reportlab.com wrote:
 I think the idea that we only give meaning to binary data using encodings is
 a bit limiting. A zip or gif file has structure, but I don't think it's
 reasonable to regard such a file as having an encoding in the python unicode
 sense.

Of course it doesn't. Those are binary files. Ultimately, every file
is binary; but since the vast majority of them actually contain text,
in one of a handful of common encodings, it's nice to have an easy way
to open a text file. You could argue that rb should be the default,
rather than rt, but that's a relatively minor point.
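A small sketch of the difference between the two modes, assuming Python 3. The temporary file and its contents are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write known bytes so the example is independent of the platform default.
with open(path, "wb") as f:
    f.write("naïve\n".encode("utf-8"))

# 'rb': the file is handed back as raw bytes, no interpretation at all.
with open(path, "rb") as f:
    raw = f.read()

# 'rt' (the default mode): bytes are decoded to str using the given encoding.
with open(path, "rt", encoding="utf-8") as f:
    text = f.read()

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The byte count and the character count differ because 'ï' occupies two bytes in UTF-8, which is exactly the kind of detail text mode hides.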

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-03 Thread Steven D'Aprano
On Mon, 02 Jun 2014 12:10:48 +0100, Robin Becker wrote:

 there seems to be an implicit assumption in python land that encoded
 strings are the norm. On virtually every computer I encounter that
 assumption is wrong. The vast majority of bytes in most computers is not
 something that can be easily printed out for humans to read. I suppose
 some clever pythonista can figure out an encoding to read my .o / .so
 etc  files, but they are practically meaningless to a unicode program
 today. Same goes for most image formats and media files. Browsers
 routinely encounter mis/un-encoded pages.

If you include image, video and sound files, you are probably correct 
that most content of files is binary.

Outside of those three kinds of files, I would expect that *by far* the 
single largest kind of file is text. Some text is wrapped in a binary 
layer, e.g. .doc, .odt, etc. but an awful lot of it is good old human 
readable text, including web pages (html) and XML.

Every programming language I know of defaults to opening files in text 
mode rather than binary mode. There may be exceptions, but reading and 
writing text is ubiquitous while writing .o and .so files is not.


 In python I would have preferred for bytes to remain the default io
 mechanism, at least that would allow me to decide if I need any
 decoding.

That implies that you're opening files in binary mode by default. It also 
implies that even something as trivial as writing the string "Hello 
World" to a file (stdout is a file) is impossible until you've learned 
about encodings and know which encoding you need. I really don't think 
that's a good plan, for any language, but especially a language like 
Python which is intended for beginners as well as experts.

The Python 2 approach, where stdout is binary but tries really hard to 
pretend to be a superset of ASCII, is simply broken. It works well for 
trivial examples, while breaking in surprising and hard-to-diagnose ways 
in others. It violates the Zen ("Errors should never pass silently. 
Unless explicitly silenced."), instead silently failing and giving moji-bake:

[steve@ando ~]$ python2.7 -c "import sys; sys.stdout.write(u'ñβж\n')"
ñβж

Changing to print doesn't help:

[steve@ando ~]$ python2.7 -c "print u'ñβж'"
ñβж


Python 3 works correctly, whether you use print or sys.stdout:

[steve@ando ~]$ python3.3 -c "import sys; sys.stdout.write(u'ñβж\n')"
ñβж

(although I haven't tested it on Windows).





-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 2:34 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Outside of those three kinds of files, I would expect that *by far* the
 single largest kind of file is text. Some text is wrapped in a binary
 layer, e.g. .doc, .odt, etc. but an awful lot of it is good old human
 readable text, including web pages (html) and XML.

In terms of file I/O in Python, text wrapped in a binary layer has to
be treated as binary, not text. There's no difference between a JPEG
file that has some textual EXIF information and an ODT file that's a
whole lot of zipped up text; both of them have to be read as binary,
then unpacked according to the container's specs, and then the text
portion decoded according to an encoding like UTF-8.
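A minimal sketch of those three steps, assuming Python 3. A small zip archive built in memory stands in for a real ODT file; the member name `content.xml` and its contents are illustrative only.

```python
import io
import zipfile

# Build a stand-in container in memory (a real ODT would be read from disk in 'rb' mode).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml", "<p>ñβж</p>".encode("utf-8"))
container_bytes = buf.getvalue()

# Step 1: treat the whole file as binary.
# Step 2: unpack it according to the container's format.
with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
    payload = zf.read("content.xml")   # still bytes at this point

# Step 3: only now decode the text portion with its character encoding.
text = payload.decode("utf-8")
print(text)
```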

But you're quite right that a large proportion of files out there
really are text.

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-03 Thread Terry Reedy

On 6/3/2014 10:18 AM, Robin Becker wrote:


I think the idea that we only give meaning to binary data using
encodings is a bit limiting.


On the contrary, it is liberating. The fact that bits have no meaning 
other than 'a choice between two alternatives' means
1. any binary choice - 0/1, -/+, false/true, no/yes, closed/open, 
male/female, sad/happy, evil/good, low/high, and so on ad infinitum, can 
be encoded into a bit. Since any such pair could have been reversed, the 
mapping between bit states and the pair is arbitrary, and constitutes an 
encoding.
2. any discrete or digitized information that constitutes a choice 
between multiple alternatives can be encoded into a sequence of bits.


This crucial discovery is the basis of Shannon's 1948 paper and of the 
information age that started about then.



A zip or gif file has structure, but I don't think it's reasonable to
regard such a file as having an encoding in the python unicode sense.

I am not quite sure what you are denying. Color encodings are encodings 
as much as character encodings, even if they encode different 
information. Both encode sensory experience and conceptual correlates 
into sequences of bits, usually organized for convenience into a 
sequence of bytes or other chunks.


There is another similarity. Text files often have at least two levels 
of encoding. First is the character encoding; that is all unicode 
handles. Then there is the text structure encoding, which is sometimes 
called the 'file format'. Most text files are at least structured into 
'lines'. For this, they use encoded line endings, and there have been 
multiple choices for this and at least 2 still in common use (which is a 
nuisance).
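The two levels can be sketched separately, assuming Python 3. The sample bytes deliberately mix both common line-ending conventions.

```python
# Bytes as they might arrive from a file: UTF-8 characters, mixed line endings.
raw = "première\r\nseconde\ntroisième".encode("utf-8")

# Level 1: character encoding (bytes -> characters).
text = raw.decode("utf-8")

# Level 2: text structure (characters -> lines); splitlines() accepts
# both the '\r\n' and the '\n' convention.
lines = text.splitlines()

print(lines)
```

Opening a file in text mode performs both steps at once, which is convenient but also hides the fact that they are separate decisions.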


Similarly, a pixel (bitmap!) image file must encode the color of each 
pixel and a higher-level structuring of pixels into a 2D array of rows 
of lines. Just as with text, there have been and still are multiple 
encodings at both levels. Also, similarly, the receiver of an image must 
know what encoding the sender used.


Vector graphics is a different way of encoding certain types of images, 
and again there are multiple ways to encode the information into bits. 
The encoding hassle here is similar to that for text. One of the 
frustrations of tk is that it natively uses just one old dialect of 
postscript (.ps) to output screen images. One has to find and install an 
extension to a modern Scalable Vector Graphics (.svg) encoding.


Because Python is programmed with lines of text, it must come with 
minimal text decoding. If Python were programmed with drawings, it would 
come with one or more drawing decoders and a drawing equivalent of a 
lexer. It might even have special 'rd' (read drawing) mode for open.


--
Terry Jan Reedy




Re: Python 3.2 has some deadly infection

2014-06-02 Thread Wolfgang Maier
Tim Delaney timothy.c.delaney at gmail.com writes:

 
 I also should have been more clear that *in the particular situation I was
talking about* iso-latin-1 as default would be the right thing to do, not in
the general case. Quite often we won't know the correct encoding until we've
executed a command via ssh - iso-latin-1 will allow us to extract the info
we need (which will generally be 7-bit ASCII) without the possibility of an
invalid encoding. Sure we may get mojibake, but that's better than the
alternative when we don't yet know the correct encoding.
  
 Latin-1 is one of those legacy encodings which needs to die, not to be
 entrenched as the default. My terminal uses UTF-8 by default (as it
 should), and if I use the terminal to input δжç, Python ought to see what
 I input, not Latin-1 moji-bake.
 
 
 For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that.
 

For that purpose, Python3 has the bytes() type. Read the data as is, then
decode it to a string once you figured out its encoding.
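A short sketch of that workflow, assuming Python 3. The byte values and the name `reported_encoding` are made up for illustration; in practice the encoding would come from whatever the remote system reports.

```python
# Bytes as received; at this point the encoding is not yet known.
raw = b"\xe1\xe2\xe3"

# Guessing can fail loudly, which is better than silently producing mojibake.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Later the encoding becomes known (an assumption for this example)...
reported_encoding = "iso-8859-7"

# ...and only then are the bytes turned into a string.
text = raw.decode(reported_encoding)
print(utf8_ok, text)
```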

Wolfgang





Re: Python 3.2 has some deadly infection

2014-06-02 Thread Tim Delaney
On 2 June 2014 17:45, Wolfgang Maier 
wolfgang.ma...@biologie.uni-freiburg.de wrote:

 Tim Delaney timothy.c.delaney at gmail.com writes:

  For some purposes, there needs to be a way to treat an arbitrary stream
 of
 bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
 convenient way to do that.
 

 For that purpose, Python3 has the bytes() type. Read the data as is, then
 decode it to a string once you figured out its encoding.


I know that, you know that. Convincing other people of that is the
difficulty.

I probably should have mentioned it, but in my case it's not even Python
(Java). It's exactly the same principle - an assumption was made that has
become entrenched due to the fear of breakage. If they'd been forced to
think about encodings up-front, it shouldn't have been an issue, which was
the point I was trying to make.

In Java, it's much worse. At least with Python you can perform string-like
operations on bytes. In Java you have to convert it to characters before
you can really do anything with it, so people just use the default encoding
all the time - especially if they want the convenience of line-by-line
reading using BufferedReader ...

Tim Delaney


Re: Python 3.2 has some deadly infection

2014-06-02 Thread Chris Angelico
On Mon, Jun 2, 2014 at 7:02 PM, Tim Delaney timothy.c.dela...@gmail.com wrote:
 In Java, it's much worse. At least with Python you can perform string-like
 operations on bytes. In Java you have to convert it to characters before you
 can really do anything with it, so people just use the default encoding all
 the time - especially if they want the convenience of line-by-line reading
 using BufferedReader ...

What exactly is line-by-line reading with bytes? As I understand it,
lines are defined by characters. If you mean reading a stream of
bytes and dividing it on 0x0A, then surely you can do that, but that
assumes an ASCII-compatible encoding.
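To make that caveat concrete, here is a small sketch assuming Python 3. The helper `naive_lines` is hypothetical; the sample text uses 'Ċ' (U+010A) because its UTF-16 representation happens to contain the byte 0x0A.

```python
def naive_lines(data):
    """Split raw bytes on 0x0A without knowing the encoding (hypothetical helper)."""
    return data.split(b"\x0a")

text = "\u010aa\nb"   # 'Ċ', 'a', newline, 'b': two lines of text

utf8_parts = naive_lines(text.encode("utf-8"))
utf16_parts = naive_lines(text.encode("utf-16-le"))

# UTF-8 is ASCII-compatible: 0x0A only ever means newline, so the pieces decode cleanly.
print([p.decode("utf-8") for p in utf8_parts])

# UTF-16 is not: the split cuts through the middle of code units.
print(utf16_parts)
```

With UTF-16 the byte-level split yields three fragments instead of two lines, and none of them decodes back to the original text.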

ChrisA


Re: Python 3.2 has some deadly infection

2014-06-02 Thread Robin Becker




I probably should have mentioned it, but in my case it's not even Python
(Java). It's exactly the same principle - an assumption was made that has
become entrenched due to the fear of breakage. If they'd been forced to
think about encodings up-front, it shouldn't have been an issue, which was
the point I was trying to make.

there seems to be an implicit assumption in python land that encoded strings are 
the norm. On virtually every computer I encounter that assumption is wrong. The 
vast majority of bytes in most computers is not something that can be easily 
printed out for humans to read. I suppose some clever pythonista can figure out 
an encoding to read my .o / .so etc  files, but they are practically meaningless 
to a unicode program today. Same goes for most image formats and media files. 
Browsers routinely encounter mis/un-encoded pages.



In Java, it's much worse. At least with Python you can perform string-like
operations on bytes. In Java you have to convert it to characters before
you can really do anything with it, so people just use the default encoding
all the time - especially if they want the convenience of line-by-line
reading using BufferedReader ...

..


In python I would have preferred for bytes to remain the default io mechanism, 
at least that would allow me to decide if I need any decoding.


As the cat example

http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

showed these extra assumptions are sometimes really in the way.
--
Robin Becker



Re: Python 3.2 has some deadly infection

2014-06-02 Thread Terry Reedy

On 6/2/2014 7:10 AM, Robin Becker wrote:


there seems to be an implicit assumption in python land that encoded
strings are the norm.


I don't know why you say that. To have a stream of bytes interpreted as 
characters, open in text mode and give the encoding. Otherwise, open in 
binary mode and apply whatever encoding you want. Image programs like 
Pil or Pillow assume that bytes have image encodings. Same idea.


 On virtually every computer I encounter that assumption is wrong.

Except for the std streams (see below), it is also not part of Python.

I will just point out that bytes are given meaning by encoding meaning 
into them. Unicode attempts to reduce the hundreds of text encodings to 
just a few, and mostly to just one for external storage and transmission.



In python I would have preferred for bytes to remain the default io


Do you really think that defaulting the open mode to 'rb' rather than 
'rt' would be a better choice for newbies?



mechanism, at least that would allow me to decide if I need any decoding.


Assuming that 'rb' is actually needed more than 'rt' for you in 
particular, is it really such a burden to give a mode more often than not?



As the cat example
http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
showed these extra assumptions are sometimes really in the way.


This example is *only* about the *pre-opened* stdxyz streams. Python 
uses these to read characters from the keyboard and print characters to 
the screen in input, print, and the interactive interpreter. So they are 
open in text mode (which wraps binary read and write). The developers, 
knowing that people can and do write batch mode programs that avoid 
input and print, gave a documented way to convert the streams back to 
binary. (See the sys doc.)


The issue Armin ran into is this. He wrote a library module that makes 
sure the streams are binary. Someone else does the same. A program 
imports both modules, in either order. The conversion method referenced 
above raises an exception if one attempts to convert an already converted 
stream. Much of the extra code Armin published detects whether the stream 
is already binary or needs conversion.


The obvious solution is to enhance the conversion method so that one may 
say 'convert if needed, otherwise just pass'.
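One way such 'convert if needed' behaviour might look is sketched below, assuming Python 3. This is not an existing Python API: `as_binary` is a hypothetical helper, and it relies only on the documented `buffer` attribute of text streams rather than on detaching anything.

```python
import io

def as_binary(stream):
    """Return a binary stream for `stream`, converting only if needed.

    Hypothetical helper: text streams that wrap a byte layer expose it as
    `.buffer`; anything that is not a text stream is returned unchanged.
    """
    if isinstance(stream, io.TextIOBase):
        # Raises AttributeError for text streams with no byte layer (e.g. io.StringIO).
        return stream.buffer
    return stream

# A text stream over an in-memory byte buffer stands in for sys.stdout here.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

first = as_binary(wrapped)    # converts: returns the underlying byte stream
second = as_binary(first)     # already binary: passes through unchanged

second.write(b"\xff\x00 raw bytes\n")
```

Because the helper is idempotent, two modules calling it in either order would not trip over each other, which is the property the paragraph above asks for.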


--
Terry Jan Reedy



Re: Python 3.2 has some deadly infection

2014-06-02 Thread Gregory Ewing

Terry Reedy wrote:
The issue Armin ran into is this. He wrote a library module that makes 
sure the streams are binary.


Seems to me he made a mistake right there. A library should
*not* be making global changes like that. It can obtain
binary streams from stdin and stdout for its own use, but
it shouldn't stuff them back into sys.stdin and sys.stdout.

If he had trouble because another library did that, then
that library is broken, not Python.
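A rough sketch of that approach, assuming Python 3: a cat-like copy routine that borrows the binary layer of the standard streams for its own use and leaves sys.stdin and sys.stdout untouched. The function name `cat` and its parameters are illustrative.

```python
import io
import shutil
import sys

def cat(src=None, dst=None):
    """Copy bytes from src to dst without replacing the global std streams."""
    # Borrow the byte layer for local use only; sys.stdin/sys.stdout stay text streams.
    src = sys.stdin.buffer if src is None else src
    dst = sys.stdout.buffer if dst is None else dst
    shutil.copyfileobj(src, dst)
    dst.flush()

# Demonstration with in-memory streams, including bytes that are not valid UTF-8.
source = io.BytesIO(b"\xff\x00 not text at all\n")
sink = io.BytesIO()
cat(source, sink)
```

Called with no arguments from a script, `cat()` would act as a binary-safe filter over the real standard streams.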

--
Greg


Re: Python 3.2 has some deadly infection

2014-06-01 Thread Tim Delaney
On 1 June 2014 12:26, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
wrote:


 with cross-platform behavior preferred over system-dependent one --
 It's not clear how cross-platform behaviour has anything to do with the
 Internet age. Python has preferred cross-platform behaviour forever,
 except for those features and modules which are explicitly intended to be
 interfaces to system-dependent features. (E.g. a lot of functions in the
 os module are thin wrappers around OS features. Hence the name of the
 module.)


There is the behaviour of defaulting input and output to the system
encoding. I personally think we would all be better off if Python (and
Java, and many other languages) defaulted to UTF-8. This hopefully would
eventually have the effect of producers changing to output UTF-8 by
default, and consumers learning to manually specify an encoding when it's
not UTF-8 (due to invalid codepoints).

I'm currently working on a product that interacts with lots of other
products. These other products can be using any encoding - but most of the
functions that interact with I/O assume the system default encoding of the
machine that is collecting the data. The product has been in production for
nearly a decade, so there's a lot of pushback against changes deep in the
code for fear that it will break working systems. The fact that they are
working largely by accident appears to escape them ...

FWIW, changing to use iso-latin-1 by default would be the most sensible
option (effectively treating everything as bytes), with the option for
another encoding if/when more information is known (e.g. there's often a
call to return the encoding, and the output of that call is guaranteed to
be ASCII).

Tim Delaney


Re: Python 3.2 has some deadly infection

2014-06-01 Thread Steven D'Aprano
On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:

 On 1 June 2014 12:26, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 
 
 with cross-platform behavior preferred over system-dependent one --
 It's not clear how cross-platform behaviour has anything to do with the
 Internet age. Python has preferred cross-platform behaviour forever,
 except for those features and modules which are explicitly intended to
 be interfaces to system-dependent features. (E.g. a lot of functions in
 the os module are thin wrappers around OS features. Hence the name of
 the module.)


 There is the behaviour of defaulting input and output to the system
 encoding. 

That's a tricky one, but I think on balance that is a case where 
defaulting to the system encoding is the right thing to do. Input and 
output occur on the local system you are running on, which by definition 
isn't cross-platform. (Non-local I/O is possible, but requires work -- it 
doesn't just happen.)


 I personally think we would all be better off if Python (and
 Java, and many other languages) defaulted to UTF-8. This hopefully would
 eventually have the effect of producers changing to output UTF-8 by
 default, and consumers learning to manually specify an encoding when
 it's not UTF-8 (due to invalid codepoints).

UTF-8 everywhere should be our ultimate aim. Then we can forget about 
legacy encodings except when digging out ancient documents from archived 
floppy disks :-)


 I'm currently working on a product that interacts with lots of other
 products. These other products can be using any encoding - but most of
 the functions that interact with I/O assume the system default encoding
 of the machine that is collecting the data. The product has been in
 production for nearly a decade, so there's a lot of pushback against
 changes deep in the code for fear that it will break working systems.
 The fact that they are working largely by accident appears to escape
 them ...
 
 FWIW, changing to use iso-latin-1 by default would be the most sensible
 option (effectively treating everything as bytes), with the option for
 another encoding if/when more information is known (e.g. there's often a
 call to return the encoding, and the output of that call is guaranteed
 to be ASCII).

Python 2 does what you suggest, and it is *broken*. Python 2.7 creates 
moji-bake, while Python 3 gets it right:


[steve@ando ~]$ python2.7 -c "print u'δжç'"
δжç
[steve@ando ~]$ python3.3 -c "print(u'δжç')"
δжç


Latin-1 is one of those legacy encodings which needs to die, not to be 
entrenched as the default. My terminal uses UTF-8 by default (as it 
should), and if I use the terminal to input δжç, Python ought to see 
what I input, not Latin-1 moji-bake.

If I were to use Windows with a legacy code page, then I couldn't even 
enter δжç on the command line since none of the legacy encodings 
support that set of characters at the same time. I don't know exactly 
what I would get if I tried (say, by copying and pasting text from a 
Unicode-aware application), but I'd see that it was weird *in the shell* 
before it even reaches Python.

On the other hand, if I were to input something supported by the legacy 
encoding, let's say I entered αβγ while using ISO-8859-7 (Greek), then 
Python ought to see αβγ and not moji-bake:

py> b = 'αβγ'.encode('iso-8859-7')  # what the shell generates
py> b.decode('latin-1')  # what Python interprets those bytes as
'áâã'


Defaulting to the system encoding means that Python input and output just 
works, to the degree that input and output on your system just works. If 
your system is crippled by the use of a legacy encoding, then Python will 
at least be *no worse* than your system.
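For readers who want to see which default applies on their own machine, here is a small sketch assuming Python 3. The values printed vary from system to system, so no particular output is implied.

```python
import codecs
import locale
import sys

# The encoding the locale settings suggest for text data on this system.
system_encoding = locale.getpreferredencoding(False)
print("locale's preferred encoding:", system_encoding)

# The encoding actually attached to the pre-opened stdout text stream.
print("sys.stdout.encoding:", getattr(sys.stdout, "encoding", None))

# Whatever name the system reports, Python maps it onto one of its codecs.
codec = codecs.lookup(system_encoding)
print("resolved codec:", codec.name)
```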



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Python 3.2 has some deadly infection

2014-06-01 Thread Tim Delaney
On 2 June 2014 11:14, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
wrote:

 On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
  I'm currently working on a product that interacts with lots of other
  products. These other products can be using any encoding - but most of
  the functions that interact with I/O assume the system default encoding
  of the machine that is collecting the data. The product has been in
  production for nearly a decade, so there's a lot of pushback against
  changes deep in the code for fear that it will break working systems.
  The fact that they are working largely by accident appears to escape
  them ...
 
  FWIW, changing to use iso-latin-1 by default would be the most sensible
  option (effectively treating everything as bytes), with the option for
  another encoding if/when more information is known (e.g. there's often a
  call to return the encoding, and the output of that call is guaranteed
  to be ASCII).

 Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
 moji-bake, while Python 3 gets it right:


The purpose of my example was to show a case where no thought was put into
encodings - the assumption was that the system encoding and the remote
system encoding would be the same. This is most definitely not the case a
lot of the time.

I also should have been more clear that *in the particular situation I was
talking about* iso-latin-1 as default would be the right thing to do, not
in the general case. Quite often we won't know the correct encoding until
we've executed a command via ssh - iso-latin-1 will allow us to extract the
info we need (which will generally be 7-bit ASCII) without the possibility
of an invalid encoding. Sure we may get mojibake, but that's better than
the alternative when we don't yet know the correct encoding.


 Latin-1 is one of those legacy encodings which needs to die, not to be
 entrenched as the default. My terminal uses UTF-8 by default (as it
 should), and if I use the terminal to input δжç, Python ought to see
 what I input, not Latin-1 moji-bake.


For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that. It's not the only way, but settling on it and
being consistent is better than not having a way.
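The property being relied on here can be checked directly, assuming Python 3: every one of the 256 byte values decodes under latin-1, and encoding the result gives back the original bytes. The `reply` value is a made-up stand-in for the output of a remote command.

```python
# Every possible byte value decodes under latin-1, and the trip is lossless.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")     # never raises
round_trip = as_text.encode("latin-1")

# Made-up remote output: an ASCII field followed by bytes in an unknown encoding.
reply = b"charset=ISO-8859-7 \xe1\xe2\xe3\n"

# The ASCII part survives intact; the rest is mojibake for now, but nothing is lost.
field = reply.decode("latin-1").split()[0]
print(field)
```

The mojibake remains recoverable: re-encoding the decoded string with latin-1 returns the original bytes, which can then be decoded properly once the real encoding is known.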

Tim Delaney


Re: Python 3.2 has some deadly infection

2014-06-01 Thread Rustom Mody
On Monday, June 2, 2014 7:53:05 AM UTC+5:30, Tim Delaney wrote:
 On 2 June 2014 11:14, Steven D'Aprano steve+comp@pearwood.info wrote:
  Latin-1 is one of those legacy encodings which needs to die, not to be
 entrenched as the default. My terminal uses UTF-8 by default (as it
 should), and if I use the terminal to input δжç, Python ought to see
 what I input, not Latin-1 moji-bake.

 For some purposes, there needs to be a way to treat an arbitrary
 stream of bytes as an arbitrary stream of 8-bit
 characters. iso-latin-1 is a convenient way to do that. It's not the
 only way, but settling on it and being consistent is better than not
 having a way.

Here is a quote from the oracle docs:

http://docs.oracle.com/cd/E23824_01/html/E26033/glmbx.html#glmar

| The C locale, also known as the POSIX locale, is the POSIX system
| default locale for all POSIX-compliant systems.

In more layman language

| ASCII also known as the 'Unix locale' is the default for all *nix
| compliant systems

which is a key aspect of what I've called 'The UNIX Assumption' :
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


Re: Python 3.2 has some deadly infection

2014-05-31 Thread Marko Rauhamaa
Mark Lawrence breamore...@yahoo.co.uk:

 Some interesting comments here
 http://techtonik.rainforce.org/2014/05/python-32-has-some-deadly-infection.html
 so I'm simply asking for other opinions.

I read the article, but unfortunately I failed to see interesting
comments or opinions. There was some graphic, but it didn't say anything
to me, and the article didn't really seem to be making any argument
apart from the disappointed tone.


Marko


Re: Python 3.2 has some deadly infection

2014-05-31 Thread Steven D'Aprano
On Sat, 31 May 2014 17:10:20 +0100, Mark Lawrence wrote:

 Some interesting comments here
 http://techtonik.rainforce.org/2014/05/python-32-has-some-deadly-infection.html
 so I'm simply asking for other opinions.

Oh, Anatoly Techtonik. He's quite notorious on python-dev for wanting to 
impose his wild and sometimes wacky processes on the entire community. 
Specific examples aren't coming to mind, and I'm too lazy to search the 
archives, so I'll just make one up to give you an idea of the flavour of 
his requests:

"Twitter is the only way that developers can effectively
communicate. We must shut down all the mailing lists and the
bug tracker and move all communication immediately to 
Twitter. And by 'we' I mean 'you'."
[Not an actual quote.]

I've come to the conclusion that he occasionally has a point to his 
posts, but only at random by virtue of the scatter-gun technique. He's 
obviously widely read, but not deeply, and so he fires off a lot of 
ill-thought-out but superficially attractive proposals. Just by chance a 
few of them end up being interesting, but not *interesting enough* for 
somebody else to do the work. At this point the ideas languish, because 
he refuses to sign a contributor agreement so the Python core developers 
cannot accept anything from him.

This blog post is a strong opinion about Python, but it isn't clear what 
that opinion *actually is*. His post is rambling and unfocused and 
incoherent (art is the future). He rails against having to write PEPs, 
and decries the lack of stats, summaries, analysis and comparison, 
utterly missing the point that the purpose of the PEP process is to 
provide those stats, summaries, analysis and comparison. Reading between 
the lines, I think what he means, deep down, is that *somebody else* 
ought to gather those stats and do the analysis to support his ideas, and 
not expect him to write the PEP.

He makes at least one factually wrong claim:

I thought that C/C++ must die, because really all major 
security problems are because of it.
[actual quote]

He's talking about buffer overflows. Buffer overflows have never been 
responsible for all major security problems. Even allowing for a little 
hyperbole, buffer overflows have not been responsible for the majority of 
major security problems for a very long time. It is not 1992 any more, 
and today the single largest source of security bugs is code injection 
attacks. In Python terms that mostly means failure to sanitize SQL 
queries and the use of eval() and exec() on untrusted data.
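
To make the two Python cases concrete, here is a minimal sketch using only 
the standard library. The table, the rows and the hostile input are made up 
for illustration, and ast.literal_eval() is only a substitute for eval() 
when the untrusted data is supposed to be a plain Python literal; it is not 
a general sandbox.

```python
import ast
import sqlite3

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Made-up hostile input that tries to rewrite the WHERE clause.
hostile = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so it becomes part of
# the query and matches every row.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % hostile
).fetchall()

# Safer: a parameterized query passes the input as data, never as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

# ast.literal_eval() accepts only Python literals, so a function call
# that eval() would happily run is rejected instead.
parsed = ast.literal_eval("[1, 2, 3]")
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

The string-formatted query returns every user, while the parameterized one 
returns nothing because no user has that literal name.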

http://cwe.mitre.org/top25/

Three of the top four software errors are forms of code injection: SQL 
injection, OS command injection, cross-site scripting. The classic C 
buffer overflow comes in at number 3, so it's not an inconsiderable cause 
of security vulnerabilities even today, but it is not even close to the 
only such cause.

See also http://www.sans.org/top25-software-errors/ 

Back to the blog post... it's 2014, Python 3.3 and 3.4 have come out, why 
is he talking about 3.2?

It's interesting that he starts off by stating his graph is meaningless:

They don't measure anything - just show some lines that 
correlate to each other.

then immediately tries to draw a conclusion from those lines:

It looks like the peak of Python was on February 2011, 
and since then there was a significant drop.

I've written about the difficulty of measuring language popularity in any 
meaningful way:

http://import-that.dreamwidth.org/1388.html
http://import-that.dreamwidth.org/2873.html

Anatoly has picked the TIOBE Index, but I don't know that this is the 
best measure of language popularity. According to it, Python is more 
popular than Javascript. I love Python, but really, more popular than 
Javascript? That feels wrong to me.

In any case, I think that a better explanation for the observed dip in 
Feb 2011 is not that Python 3.2 is infected (infected by what?) but 
*regression to the mean*. Regression to the mean is a statistical 
phenomenon which basically says that all else being equal, an extreme 
value is likely to be followed by a less extreme (closer to the average) 
value.
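
A quick simulation shows the effect. This is only a sketch under a strong 
assumption: each month's score is independent Gaussian noise around a fixed 
average, which real TIOBE ratings certainly are not. The function name, the 
threshold and the seed are my own choices for illustration.

```python
import random


def mean_after_peak(n_months=100_000, threshold=2.0, seed=0):
    """Compare the average 'extreme' month with the month that follows it.

    Assumes scores are independent draws from a normal distribution with
    mean 0 and standard deviation 1 (pure noise, no trend).
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_months)]

    # Indices of months whose score is unusually high.
    peaks = [i for i in range(n_months - 1) if scores[i] > threshold]

    peak_avg = sum(scores[i] for i in peaks) / len(peaks)
    next_avg = sum(scores[i + 1] for i in peaks) / len(peaks)
    return peak_avg, next_avg


peak_avg, next_avg = mean_after_peak()
print("average extreme month:  %.2f" % peak_avg)
print("average following month: %.2f" % next_avg)
```

Under that assumption the months following a spike average out close to the 
long-run mean, without anything having "gone wrong" in between.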

Language popularity, as measured by TIOBE, is at least in part random. 
(Look at how wiggly the lines are. The wiggles represent random 
variation.) If by chance a language gets a spike in interest one month, 
it is less likely to 

Because TIOBE's results contain so much random noise, they really ought 
to smooth them out by averaging the scores over a three month window, and 
show trend lines. They don't, I believe, because random hiccoughs in the 
data provide interest: "Last month, Java was overthrown from its #1 
ranking by C. This month it has fought its way back to #1 again! Tune in 
next month to see if C can repeat its stunning victory!!!"
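
For what it's worth, the smoothing suggested above is only a few lines. 
This is a sketch of a trailing three-month moving average; the ratings are 
made-up numbers, not TIOBE data, and the function name is mine.

```python
def moving_average(values, window=3):
    """Return the trailing moving average of values over the given window.

    The result has len(values) - window + 1 entries, one for each
    position where a full window is available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        smoothed.append(sum(chunk) / window)
    return smoothed


# Made-up monthly ratings with two one-month spikes.
monthly = [4.0, 7.0, 4.0, 5.0, 9.0, 4.0]
trend = moving_average(monthly)
print(trend)
```

The one-month spikes are spread over three months each, so the smoothed 
series moves far less from one month to the next than the raw one does.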

I think that long term trend lines would be much less exciting but much 
more informative. Eyeballing the graph, it seems to me that Java and C++ 
are 

Re: Python 3.2 has some deadly infection

2014-05-31 Thread Chris Angelico
On Sun, Jun 1, 2014 at 12:26 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 TL;DR: Anatoly's blog post is long on disappointment and short on actual
 content. It feels to me that we could summarise his post as:

 I don't know what I want, I wouldn't recognise it even if I saw
 it, but Python 3 isn't it. I blame others for not living up
 to my expectations for features I cannot describe and were
 never promised.

I think that summary is accurate. When Mark posted this last night
(okay, it was last night for me, probably not for most of you), I
tried to read the post and figure out what he was actually saying...
and failed. Gave up on it and moved on. Got better things to do with
my life... like, I dunno, actually writing code, which seems to be
something that people who whine in blog posts don't do.

ChrisA