Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-12 Thread John Nagle
kj wrote: Some people have mathphobia. I'm developing a wicked case of Unicodephobia. I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordi

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-12 Thread John Nagle
kj wrote: x = '%s' % y x = '%s' % z print y print z print y, z Bear in mind that most Python implementations assume the "console" only handles ASCII. So "print" output is converted to ASCII, which can fail. (Actually, all modern Windows and Linux systems support Un

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread Nobody
On Wed, 10 Feb 2010 12:17:51 -0800, Anthony Tolle wrote: > 4. Consider switching to Python 3.x, since there is only one string > type (unicode). However: one drawback of Python 3.x is that the repr() of a Unicode string is no longer restricted to ASCII. There is an ascii() function which behaves
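
A minimal sketch of the repr()/ascii() difference mentioned above, assuming Python 3. The sample string is made up for illustration and is not from the thread.

```python
# Python 3 sketch: repr() keeps non-ASCII characters as they are,
# while ascii() replaces them with backslash escapes.
text = "caf\u00e9"  # hypothetical sample; the escape keeps this source file pure ASCII

as_repr = repr(text)    # quoted string that still contains the non-ASCII character
as_ascii = ascii(text)  # quoted string containing the escape \xe9 instead

# Only the ascii() result is guaranteed safe for an ASCII-only console or log.
repr_is_ascii = all(ord(ch) < 128 for ch in as_repr)
ascii_is_ascii = all(ord(ch) < 128 for ch in as_ascii)

print(as_ascii)
print(repr_is_ascii, ascii_is_ascii)
```

So ascii() is the closer match to what repr() did for unicode objects in Python 2.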

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread Terry Reedy
On 2/11/2010 4:43 PM, mk wrote: Neat, except that the process of porting most projects and external libraries to P3 seems to be, how should I put it, standing still? What is important are the libraries, so more new projects can start in 3.x. There is a slow trickle of 3.x support announcement

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread Steve Holden
mk wrote: > MRAB wrote: > >> When working with Unicode in Python 2, you should use the 'unicode' type >> for text (Unicode strings) and limit the 'str' type to binary data >> (bytestrings, ie bytes) only. > > Well OK, always use u'something', that's simple -- but isn't str what I > get from files
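
One way to read MRAB's advice is "decode bytes to text as soon as they arrive, encode only when they leave". The sketch below assumes Python 3 (where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`) and assumes the data is UTF-8. The helper names `read_text` and `write_text` are hypothetical, not from the thread.

```python
import io


def read_text(stream, encoding="utf-8"):
    """Read raw bytes from a binary stream and decode them to text once."""
    raw = stream.read()          # files and sockets hand back bytes
    return raw.decode(encoding)  # from here on the program only sees text


def write_text(stream, text, encoding="utf-8"):
    """Encode text back to bytes only at the point where it leaves the program."""
    stream.write(text.encode(encoding))


# An in-memory stream stands in for a file or socket (assumed UTF-8 content).
source = io.BytesIO(b"\xc4\x84")  # the two bytes from the thread's 'nu' example
text = read_text(source)          # one character: U+0104

sink = io.BytesIO()
write_text(sink, text.lower())    # text operations happen on decoded text
result = sink.getvalue()          # bytes again, only at the output boundary
print(len(text), result)
```

The encoding has to come from somewhere outside the bytes themselves (a header, a declaration in the file, or a convention), which is why it is a parameter here.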

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread Robert Kern
On 2010-02-11 15:43 PM, mk wrote: MRAB wrote: Strictly speaking, only Unicode can be encoded. How so? Can't bytestrings containing characters of, say, koi8r encoding be encoded? I think he means that only unicode objects can be encoded using the .encode() method, as clarified by his next

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread mk
MRAB wrote: When working with Unicode in Python 2, you should use the 'unicode' type for text (Unicode strings) and limit the 'str' type to binary data (bytestrings, ie bytes) only. Well OK, always use u'something', that's simple -- but isn't str what I get from files and sockets and the like

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread MRAB
mk wrote: kj wrote: I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) (There, see? My Unicodephobia just went u

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread kj
In mk writes: >To make matters more complicated, str.encode() internally DECODES from >string into unicode: > >>> nu >'\xc4\x84' > >>> > >>> type(nu) > > >>> nu.encode() >Traceback (most recent call last): > File "", line 1, in >UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-11 Thread mk
kj wrote: I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) (There, see? My Unicodephobia just went up a notch.)

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread kj
In Duncan Booth writes: >kj wrote: >> But to ground >> the problem a bit I'll say that the exception above happens during >> the execution of a statement of the form: >> >> x = '%s %s' % (y, z) >> >> Also, I found that, with the exact same values y and z as above, >> all of the following

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread David Malcolm
On Wed, 2010-02-10 at 12:17 -0800, Anthony Tolle wrote: > On Feb 10, 2:09 pm, kj wrote: > > Some people have mathphobia. I'm developing a wicked case of > > Unicodephobia. > > [snip] > > Some general advice (Looks like I am reiterating what MRAB said -- I > type slower :): > > 1. If possible, u

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread Chris Rebert
On Wed, Feb 10, 2010 at 1:03 PM, kj wrote: > In <402ac982-0750-4977-adb2-602b19149...@m24g2000prn.googlegroups.com> Jonathan Gardner writes: >>It sounds like someone, probably beautiful soup, is trying to turn >>your strings into unicode. A full stacktrace would be useful to see >>who did what w

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread Stephen Hansen
On Wed, Feb 10, 2010 at 1:03 PM, kj wrote: > >What are y and z? > > x = "%s %s" % (table['id'], table.tr.renderContents()) > > where the variable table represents a BeautifulSoup.Tag instance. > > >Are they unicode or strings? > > The first item (table['id']) is unicode, and the second is str.
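
A rough Python 3 imitation of what goes wrong when a unicode value and a byte string meet in one `%` expression under Python 2: the byte string gets decoded with the ASCII codec. Python 3 does not do this implicitly, so the decode is spelled out here. The values are made up, and treating the bytes as UTF-8 is an assumption (the thread does not say what encoding the HTML files use).

```python
table_id = "t1"             # stands in for table['id'], the unicode value
rendered = b"\xc2\xa0cell"  # stands in for renderContents(), a byte string

# What Python 2 effectively attempts when the two are combined:
try:
    rendered.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)  # same wording as the error in the original post

# Decoding explicitly with the real encoding (assumed UTF-8 here) avoids it.
x = "%s %s" % (table_id, rendered.decode("utf-8"))

print(message)
print(ascii(x))
```

This also fits the observation that the single-argument statements worked: with only byte strings involved, no implicit decode is triggered.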

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread kj
In <402ac982-0750-4977-adb2-602b19149...@m24g2000prn.googlegroups.com> Jonathan Gardner writes: >On Feb 10, 11:09 am, kj wrote: >> FWIW, I'm using Python 2.6. The example above happens to come from >> a script that extracts data from HTML files, which are all in >> English, but they are a

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread Anthony Tolle
On Feb 10, 2:09 pm, kj wrote: > Some people have mathphobia.  I'm developing a wicked case of > Unicodephobia. > [snip] Some general advice (Looks like I am reiterating what MRAB said -- I type slower :): 1. If possible, use unicode strings for everything. That is, don't use both str and unicod

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread MRAB
kj wrote: Some people have mathphobia. I'm developing a wicked case of Unicodephobia. I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ord

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread Duncan Booth
kj wrote: > But to ground > the problem a bit I'll say that the exception above happens during > the execution of a statement of the form: > > x = '%s %s' % (y, z) > > Also, I found that, with the exact same values y and z as above, > all of the following statements work perfectly fine: > >

Re: Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread Jonathan Gardner
On Feb 10, 11:09 am, kj wrote: > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: > ordinal not in range(128) > You'll have to understand some terminology first. "codec" is a description of how to encode and decode unicode data to a stream of bytes. "decode" means you
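
To make the terminology concrete, here is a small sketch using the standard `codecs` module, assuming Python 3. The sample character is arbitrary and not taken from the thread.

```python
import codecs

# A codec bundles an encoder (text -> bytes) and a decoder (bytes -> text).
utf8 = codecs.lookup("utf-8")

encoded, chars_consumed = utf8.encode("\u00e9")  # one character in
decoded, bytes_consumed = utf8.decode(encoded)   # two bytes back out

# The ASCII codec only knows byte values 0..127, hence "ordinal not in range(128)".
ascii_codec = codecs.lookup("ascii")
try:
    ascii_codec.decode(b"\xc2")
    failed = False
except UnicodeDecodeError:
    failed = True

print(encoded, chars_consumed, bytes_consumed, failed)
```

Seeing "decode" in the error message therefore means something was turning bytes into text, whether or not the code asked for that explicitly.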

Need debugging knowhow for my creeping Unicodephobia

2010-02-10 Thread kj
Some people have mathphobia. I'm developing a wicked case of Unicodephobia. I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not i