Martin v. Löwis wrote:
In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes.
It returns *up to* two bytes. Sorry to be picky but I think it's
relevant to the topic because it illustrates how it's difficult
to change the
Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you need
to convert back
John Salerno wrote:
Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you
Robert Kern wrote:
Well, *I* use UTF-8, but that's neither here nor there.
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?
It would break a
John Salerno wrote:
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1?
The point is, that, with a regular string, you don't know its encoding
or whether it has an encoding
John Salerno wrote:
Robert Kern wrote:
Well, *I* use UTF-8, but that's neither here nor there.
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?
I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't
Robert Kern wrote:
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?
I think it unlikely, but I have no numbers to give. And I'll bet that that
book doesn't either.
I haven't got any numbers, but my guess would be
Robert Kern wrote:
I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)
No, it isn't. You seem to be somewhat confused about Unicode. At
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html
That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points
John Salerno wrote:
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html
That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then
I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)
The real problem is that the Python string type is used to represent
two very
Martin v. Löwis wrote:
John Salerno wrote:
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html
That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a
Martin v. Löwis wrote:
The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of
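To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It assumes Python 3, which postdates this thread and gives the two concepts separate types (`bytes` and `str`). The helper name `try_decode` and the sample values are made up for illustration.

```python
def try_decode(data, encoding="utf-8"):
    """Return the decoded text, or None if the bytes are not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# A sequence of characters and its UTF-8 byte representation.
word = "na\u00efve"                 # 5 characters
encoded = word.encode("utf-8")      # 6 bytes, because U+00EF takes two bytes

# Arbitrary binary data: a byte sequence that is not text at all.
payload = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(len(word), len(encoded))      # character count vs. byte count
print(try_decode(encoded) == word)  # bytes that are encoded text round-trip
print(try_decode(payload))          # bytes that are not UTF-8 text do not decode
```

The point of the sketch is only that some byte sequences have no meaning as characters, so a program needs a way to hold bytes that is distinct from its way of holding text.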
John Salerno wrote:
So as it turns out, Unicode and UTF-8 are not the same thing?
Well yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.
Confusion arises because Windows uses the name 'Unicode' in character
encoding lists, to mean UTF-16_LE,
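As a rough illustration of "one scheme among several", the sketch below encodes the same characters with UTF-8 and with little-endian UTF-16 and compares the byte counts. It assumes Python 3 and its standard codec names `utf-8` and `utf-16-le`; the function name `encoded_lengths` is hypothetical.

```python
def encoded_lengths(text):
    """For each character, return (char, UTF-8 byte count, UTF-16-LE byte count)."""
    return [
        (ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
        for ch in text
    ]


# U+0061 (ASCII), U+00E9 (Latin-1 range), U+20AC (euro sign).
sample = "a\u00e9\u20ac"

for ch, n_utf8, n_utf16 in encoded_lengths(sample):
    print(f"U+{ord(ch):04X}: utf-8 uses {n_utf8} byte(s), utf-16-le uses {n_utf16}")
```

The code points are the same in both cases; only the byte representation differs. Code points below 128 take a single byte in UTF-8, and the others in this sample take two or three.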
John Salerno wrote:
Martin v. Löwis wrote:
The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to
John Salerno wrote:
Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on if you were using
Unicode or not?
The read method currently returns a byte string, not a Unicode string.
It's not clear to me how the numeric argument
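For reference, here is a small sketch of how the numeric argument behaves in Python 3, which is an assumption on my part since the thread predates it. A binary stream counts bytes, a text stream counts characters, and either may return fewer than requested. The in-memory streams stand in for real files.

```python
import io

# "héllo" encoded as UTF-8: 5 characters, 6 bytes.
data = "h\u00e9llo".encode("utf-8")

# Binary stream: the argument counts bytes.
raw = io.BytesIO(data)
first_two_bytes = raw.read(2)    # stops in the middle of the two-byte character
rest = raw.read(100)             # only 4 bytes remain, so only 4 are returned

# Text stream: the argument counts characters after decoding.
text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
first_two_chars = text_stream.read(2)

print(first_two_bytes, len(rest), first_two_chars)
```

The `read(100)` call shows the "up to" behaviour mentioned earlier in the thread: the result is shorter than the requested size when less data is available.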