Re: why isn't Unicode the default encoding?

2006-03-21 Thread Jon Ribbens
Martin v. Löwis wrote:
> In any case, it doesn't matter what encoding the document is in: read(2) always returns two bytes.
It returns *up to* two bytes. Sorry to be picky but I think it's relevant to the topic because it illustrates how it's difficult to change the
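The "up to" distinction can be seen with a short read on a pipe. The following is only a sketch in present-day Python 3, not the 2006-era Python the thread discusses; the helper name read_exactly is hypothetical and is not something proposed in the thread.

```python
import os

def read_exactly(fd, n):
    """Keep calling os.read until n bytes have arrived or EOF is reached."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer bytes than requested
        if not chunk:                   # an empty result means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

r, w = os.pipe()
os.write(w, b"abc")   # three bytes go into the pipe
os.close(w)           # closing the write end lets the reader see EOF

first = read_exactly(r, 2)   # two bytes are available, so two come back
second = os.read(r, 2)       # only one byte is left: a short read
third = os.read(r, 2)        # nothing left and the writer is closed: EOF
os.close(r)
```

A caller that assumes every read(2) yields exactly two bytes would mishandle the second and third calls, which is why code that needs a fixed amount of data usually loops.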

why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Forgive my newbieness, but I don't quite understand why Unicode is still something that needs special treatment in Python (and perhaps elsewhere). I'm reading Dive Into Python right now, and it constantly refers to a 'regular string' versus a 'Unicode string' and how you need to convert back

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Robert Kern
John Salerno wrote: Forgive my newbieness, but I don't quite understand why Unicode is still something that needs special treatment in Python (and perhaps elsewhere). I'm reading Dive Into Python right now, and it constantly refers to a 'regular string' versus a 'Unicode string' and how you

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote:
> Well, *I* use UTF-8, but that's neither here nor there.
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is the most common. Is that true?
>> Why can't Unicode replace them so we no longer need the 'u' prefix or the encoding tricks?
> It would break a

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Jan Niklas Fingerle
John Salerno wrote:
> to convert back and forth. But why isn't Unicode considered a regular string by now? Is it for historical reasons that we still use ASCII and Latin-1?
The point is that, with a regular string, you don't know its encoding or whether it has an encoding

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Robert Kern
John Salerno wrote:
> Robert Kern wrote:
>> Well, *I* use UTF-8, but that's neither here nor there.
> I see UTF-8 a lot, but this particular book also mentions that UTF-16 is the most common. Is that true?
I think it unlikely, but I have no numbers to give. And I'll bet that that book doesn't

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Jan Niklas Fingerle
Robert Kern wrote:
>> I see UTF-8 a lot, but this particular book also mentions that UTF-16 is the most common. Is that true?
> I think it unlikely, but I have no numbers to give. And I'll bet that that book doesn't either.
I haven't got any numbers, but my guess would be

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote: I figured this might have something to do with it, but then again I thought that Unicode was created as a subset of ASCII and Latin-1 so that they would be compatible...but I guess it's never that easy. :) No, it isn't. You seem to be somewhat confused about Unicode. At

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote:
> http://www.joelonsoftware.com/articles/Unicode.html
That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 are not the same thing? Am I right to say that UTF-8 stores the first 128 Unicode code points in a single byte, and then stores higher code points
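The single-byte claim for the first 128 code points is easy to check. A small sketch in present-day Python 3 (where str is already a Unicode string); the sample characters are chosen only for illustration.

```python
# One sample character from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, e-acute, euro sign, an emoji

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Every code point from 0 to 127 encodes to exactly one byte...
ascii_all_single = all(len(chr(cp).encode("utf-8")) == 1 for cp in range(128))

# ...and code point 128 is the first that needs two.
first_two_byte = len(chr(128).encode("utf-8"))
```

Here lengths maps the four samples to 1, 2, 3 and 4 bytes respectively, so Unicode (the set of code points) and UTF-8 (one way of writing them as bytes) are indeed different things.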

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
John Salerno wrote: Robert Kern wrote: http://www.joelonsoftware.com/articles/Unicode.html That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 are not the same thing? Am I right to say that UTF-8 stores the first 128 Unicode code points in a single byte, and then

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
> I figured this might have something to do with it, but then again I thought that Unicode was created as a subset of ASCII and Latin-1 so that they would be compatible...but I guess it's never that easy. :)
The real problem is that the Python string type is used to represent two very

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Martin v. Löwis wrote: John Salerno wrote: Robert Kern wrote: http://www.joelonsoftware.com/articles/Unicode.html That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 are not the same thing? Am I right to say that UTF-8 stores the first 128 Unicode code points in a

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Martin v. Löwis wrote: The real problem is that the Python string type is used to represent two very different concepts: bytes, and characters. You can't just drop the current Python string type, and use the Unicode type instead - then you would have no good way to represent sequences of
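The bytes-versus-characters distinction can be made concrete. This is a sketch in present-day Python 3, which has separate str and bytes types; the Python of this 2006 thread used one string type for both roles.

```python
text = "L\u00f6wis"            # five characters, one of them non-ASCII
data = text.encode("utf-8")    # the same name as a sequence of bytes

char_count = len(text)         # counts characters
byte_count = len(data)         # counts bytes; the o-umlaut takes two

# Decoding with the encoding that produced the bytes restores the text.
roundtrip = data.decode("utf-8")

# Decoding with a different encoding still "works" but yields the wrong characters.
misread = data.decode("latin-1")
```

The byte sequence carries no record of its own encoding, so the final line silently produces garbage instead of raising an error.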

Re: why isn't Unicode the default encoding?

2006-03-20 Thread and-google
John Salerno wrote:
> So as it turns out, Unicode and UTF-8 are not the same thing?
Well, yes. UTF-8 is one scheme in which the whole Unicode character repertoire can be represented as bytes. Confusion arises because Windows uses the name 'Unicode' in character encoding lists, to mean UTF-16_LE,
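To illustrate that the same characters become different bytes under different schemes, here is a sketch in present-day Python 3. It uses Python's "utf-16-le" codec name for the little-endian UTF-16 form mentioned above; the sample text is an assumption for illustration.

```python
text = "A\u20ac"   # the letter A followed by the euro sign

utf8 = text.encode("utf-8")          # variable length: 1 byte + 3 bytes
utf16le = text.encode("utf-16-le")   # 2 bytes per character here, low byte first

# Both byte strings decode back to the identical Unicode text.
same_text = utf8.decode("utf-8") == utf16le.decode("utf-16-le")
```

Neither byte string is "Unicode" itself; each is one encoding of the same two code points.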

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Matt Goodall
John Salerno wrote: Martin v. Löwis wrote: The real problem is that the Python string type is used to represent two very different concepts: bytes, and characters. You can't just drop the current Python string type, and use the Unicode type instead - then you would have no good way to

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
John Salerno wrote:
> Interesting. So then the read() method, if given a numeric argument for bytes to read, would act differently depending on if you were using Unicode or not?
The read method currently returns a byte string, not a Unicode string. It's not clear to me how the numeric argument
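For comparison, present-day Python 3 answers this question by giving the numeric argument a different meaning on byte streams and text streams. The sketch below shows that later behaviour under that assumption; it does not describe the Python of this 2006 thread, and the in-memory streams stand in for real files.

```python
import io

payload = "h\u00e9llo".encode("utf-8")   # 5 characters, 6 bytes

# Binary stream: the argument counts bytes, so a character can be cut in half.
raw = io.BytesIO(payload)
first_bytes = raw.read(2)

# Text stream: the argument counts characters after decoding.
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")
first_chars = text_stream.read(2)

# The two bytes end in the middle of the e-acute, so they do not decode on their own.
try:
    first_bytes.decode("utf-8")
    split_decodes = True
except UnicodeDecodeError:
    split_decodes = False
```

The byte read returns the h plus only the first half of the two-byte character, while the text read returns two whole characters.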