Re: What encoding does u'...' syntax use?

2009-02-21 Thread Aahz
In article 499f397c.7030...@v.loewis.de, "Martin v. Löwis" mar...@v.loewis.de wrote: Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Thorsten Kampe
* Martin v. Löwis (Sat, 21 Feb 2009 00:15:08 +0100) Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5' My question is: what

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 7:24 PM, Thorsten Kampe thors...@thorstenkampe.de wrote: I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32). I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
My question is: what is that encoding? The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built). Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the countless threads about the distinction between

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32). I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can encode all Unicode code points, while the latter can only encode those in the
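The UTF-16 versus UCS-2 distinction discussed here can be illustrated with a code point above U+FFFF. The following is a minimal sketch in Python 3 (an assumption; the thread is about the Python 2 interpreter), and the helper name `to_surrogate_pair` is made up for illustration. It computes a UTF-16 surrogate pair by hand and compares it with what the standard `utf-16-be` codec produces.

```python
import struct

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# U+00B5 (MICRO SIGN) fits in one 16-bit unit, so UCS-2 and UTF-16 agree.
bmp_bytes = "\u00b5".encode("utf-16-be")

# U+1D11E (MUSICAL SYMBOL G CLEF) needs two 16-bit units in UTF-16
# and has no representation in strict UCS-2.
astral_bytes = "\U0001d11e".encode("utf-16-be")
high, low = to_surrogate_pair(0x1D11E)
manual_bytes = struct.pack(">HH", high, low)
```

A strict UCS-2 implementation would have to reject the second string, whereas a UTF-16 one stores it as two code units.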

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 9:10 PM, Martin v. Löwis mar...@v.loewis.de wrote: I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32). I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can encode

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Adam Olsen
On Feb 21, 10:48 am, a...@pythoncraft.com (Aahz) wrote: In article 499f397c.7030...@v.loewis.de, "Martin v. Löwis" mar...@v.loewis.de wrote: Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it,

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore. Since we're already discussing this, I'm curious - why was UCS-2 chosen over plain UTF-16 or UTF-8 in the first place for

Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 9:45 PM, Martin v. Löwis mar...@v.loewis.de wrote: Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore. Since we're already discussing this, I'm curious -

What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case: unicode('\xb5') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in
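The failure reported in this message can be approximated in Python 3, where the implicit ASCII decoding that Python 2's `unicode('\xb5')` performed has to be spelled out. This is a sketch under that assumption, not a reproduction of the Python 2 session: decoding the byte 0xB5 as ASCII fails, while the escape `\xb5` inside a text literal simply names code point U+00B5.

```python
raw = b"\xb5"  # a single byte, like the Python 2 str '\xb5'

try:
    raw.decode("ascii")  # roughly what Python 2's unicode('\xb5') attempted
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# In a text literal the escape \xb5 names code point U+00B5 directly;
# no byte-to-character decoding step is involved.
literal = "\xb5"
```

The two expressions look alike but do different jobs: one decodes a byte, the other names a code point.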

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Stefan Behnel
Ron Garret wrote: I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case: unicode('\xb5') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Stefan Behnel
Stefan Behnel wrote: print u'\xb5' µ What you see in the last line is what the Python interpreter makes of your unicode string when passing it into stdout, which in your case seems to use a latin-1 encoding (check your environment settings for that). The "seems to" is misleading. The
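What `print` does with a unicode string can be sketched as follows (Python 3, as an assumption; the thread concerns Python 2, but the principle is the same). The string is encoded with whatever encoding the output stream reports, so the same code point becomes different bytes on differently configured terminals.

```python
import sys

micro = "\u00b5"  # MICRO SIGN

# The same code point becomes different byte sequences per encoding.
as_latin1 = micro.encode("latin-1")
as_utf8 = micro.encode("utf-8")

# The encoding print() would use depends on the environment; it may be
# None when stdout is not a regular text stream.
stdout_encoding = getattr(sys.stdout, "encoding", None)
print("stdout encoding:", stdout_encoding)
```

Which byte sequence reaches the terminal therefore says something about the environment, not about how the string is stored internally.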

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f18bd$0$31879$9b4e6...@newsspool3.arcor-online.net, Stefan Behnel stefan...@behnel.de wrote: Ron Garret wrote: I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case: unicode('\xb5') Traceback (most

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Terry Reedy
Ron Garret wrote: I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case: unicode('\xb5') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Matthew Woodcraft
Ron Garret rnospa...@flownet.com writes: Put this another way: I would have thought that when the Python parser parses u'\xb5' it would produce the same result as calling unicode('\xb5'), but it doesn't. Instead it seems to produce the same result as calling unicode('\xb5', 'latin-1'). But my
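The observation that the literal behaves like decoding with 'latin-1' has a simple explanation that can be checked directly: Latin-1 maps each byte value 0 to 255 onto the code point with the same number, and a `\xNN` escape names code point NN. A minimal Python 3 sketch (an assumption; the thread itself uses Python 2):

```python
# Latin-1 decodes byte N to code point N for every possible byte value.
latin1_is_identity = all(
    bytes([n]).decode("latin-1") == chr(n) for n in range(256)
)

# So decoding the byte 0xB5 as Latin-1 yields the same string as the
# literal escape, even though no Latin-1 decoding happens for the literal.
same_result = b"\xb5".decode("latin-1") == "\xb5"
```

The agreement is a property of Latin-1's layout, not evidence that the parser uses Latin-1.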

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Martin v. Löwis
Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5' My question is: what is that encoding? The internal representation is either

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Martin v. Löwis
u'\xb5' u'\xb5' print u'\xb5' µ Unicode literals are *in the source file*, which can only have one encoding (for a given source file). (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f3a8f.9010...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote: u'\xb5' u'\xb5' print u'\xb5' µ Unicode literals are *in the source file*, which can only have one encoding (for a given source file). (That last character shows up as a micron sign despite the

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f397c.7030...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote: Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string

Re: What encoding does u'...' syntax use?

2009-02-20 Thread Terry Reedy
Martin v. Löwis wrote: somehow have picked up a latin-1 encoding.) I think latin-1 was the default without a coding cookie line. (May be utf-8 in 3.0). It is, but that's irrelevant for the example. In the source u'\xb5' all characters are ASCII (i.e. all of letter u, single quote,
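The point that the source text `u'\xb5'` consists only of ASCII characters can be checked with a short Python 3 sketch (an assumption; the thread concerns Python 2). `ast.literal_eval` stands in for the parser here: it is given pure-ASCII source text and produces a one-character string.

```python
import ast

source = r"u'\xb5'"  # the seven characters: u ' \ x b 5 '

# Every character of the source text is ASCII, so the source file's
# encoding cannot affect how this literal is read.
source_is_ascii = all(ord(ch) < 128 for ch in source)

# Evaluating the literal gives a single character, U+00B5.
value = ast.literal_eval(source)
```

The non-ASCII character only comes into existence when the escape is interpreted, after the source bytes have already been read.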