Thanks for the quick reply.
s1 = s1.decode('Latin-1')
does not help. Is that what you had in mind?
Peter
On Feb 2, 10:38 pm, Jonathan Lundell <[email protected]> wrote:
> On Feb 2, 2012, at 2:24 PM, peter wrote:
>
> > I am reading some text from a web site, using f=urllib.urlopen(....),
> > and then s=f.read()
>
> > I then extract a bit of 's' as s1; s1 contains "Na Ponta Do Pé"
>
> > The é is encoded in a single byte as 0xE9.
>
> > If I do IS_SLUG.urlify(s1) it throws an error because 0xE9 is not a
> > valid character. I believe the encoding is ANSI. I have tried all
> > manner of encoding and decoding but cannot get anything to work. If I
> > print s1 to the console or a file, then it works fine. But most Python
> > character operations fail, presumably because they are expecting UTF-8,
> > which encodes é as two bytes.
>
> > If I do
> > s1="Na Ponta Do Pé"
> > IS_SLUG.urlify(s1)
>
> > There is no error.
>
> > Clearly I could check for 0xE9 and convert it individually, but I wonder
> > if anyone could suggest a conversion that would work for any ANSI
> > character. I have googled and experimented a lot on this with no
> > success.
>
> The page you're reading is encoded as Latin-1. You need to decode it first.
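
For what it's worth, here is a minimal sketch of the decode step, written for Python 3 and assuming the bytes really are Latin-1. The byte string `raw` is made up to match the example in the thread, and `simple_slug` is a hypothetical stand-in I wrote for illustration; it is not web2py's IS_SLUG.urlify and may not behave the same way.

```python
import re
import unicodedata

# Bytes as they might arrive from the page: 0xE9 is "é" in Latin-1.
raw = b"Na Ponta Do P\xe9"

# Decode the Latin-1 bytes into a Unicode string.
text = raw.decode("latin-1")

# Re-encode as UTF-8 if the consumer wants UTF-8 bytes;
# "é" then becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")


def simple_slug(value):
    """Hypothetical slug helper for illustration, not web2py's urlify."""
    # Split accented letters into base letter + combining mark,
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


slug = simple_slug(text)
print(slug)
```

If the function you are calling expects a UTF-8 encoded byte string rather than a Unicode string (I have not checked which one IS_SLUG.urlify wants), decoding alone may not be enough, and passing something like `utf8_bytes` might be worth trying.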