Jon Crump wrote:
> Dear All,
>
> I have some utf-8 unicode text with lines like this:
>
> ANVERS-LE-HOMONT, Maine.
> ANGOULÊME, Angoumois.
> ANDELY (le Petit), Normandie.
>
> which I'm using as-is in this line of code:
>
> place.append(line.strip())
>
> What I would prefer would be something like this:
>
> place.append(line.title().strip())
>
> which works for most lines, giving me, for example:
>
> Anvers-Le-Homont, Maine.
> and
> Andely (Le Petit), Normandie.
>
> but where there are diacritics involved, title() gives me:
>
> AngoulÊMe, Angoumois.
>
> Can anyone give the clueless a clue on how to manage such unicode
> strings more effectively?

First, don't confuse unicode and utf-8: Unicode is the character set, and UTF-8 is one way of encoding those characters as bytes. Second, convert the string to unicode and then title-case it, then convert back to utf-8 if you need to:

In [3]: s='ANGOUL\303\212ME, Angoumois'

In [5]: s
Out[5]: 'ANGOUL\xc3\x8aME, Angoumois'

In [4]: s.title()
Out[4]: 'Angoul\xc3\x8aMe, Angoumois'

In [10]: print s.title()
AngoulÊMe, Angoumois

On the byte string, title() does not recognize the two bytes of Ê as a letter, so it treats the M that follows as the start of a new word. Decoding first fixes that:

In [6]: u=s.decode('utf-8')

In [7]: u.title()
Out[7]: u'Angoul\xeame, Angoumois'

In [8]: print u.title()
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xea' in position 6: ordinal not in range(128)

Oops, print is trying to convert to a byte string with the default encoding, so you have to give it some help...

In [9]: print u.title().encode('utf-8')
Angoulême, Angoumois

Kent
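
For anyone following along on Python 3, the same decode / title-case / encode round trip can be sketched roughly as below. This is only an illustration, not code from the thread: the helper name title_case_utf8 is made up, and it assumes the lines arrive as UTF-8 bytes (for example from a file opened in binary mode).

```python
def title_case_utf8(raw: bytes) -> bytes:
    """Title-case UTF-8 encoded text and return it UTF-8 encoded again.

    Hypothetical helper for illustration only.
    """
    text = raw.decode("utf-8")      # bytes -> str (Unicode text)
    titled = text.title()           # casing now operates on whole characters
    return titled.encode("utf-8")   # str -> bytes, ready to write back out


# Sample lines from the question, written out as UTF-8 bytes.
lines = [
    b"ANVERS-LE-HOMONT, Maine.\n",
    b"ANGOUL\xc3\x8aME, Angoumois.\n",
    b"ANDELY (le Petit), Normandie.\n",
]

place = []
for line in lines:
    place.append(title_case_utf8(line.strip()))
```

If the file is opened in text mode with encoding="utf-8", the lines are already str and line.strip().title() is enough; the explicit decode and encode steps only matter when you are handling raw bytes.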