bruce wrote:

> Hi.
>
> I've got a "basic" situation that should be simple. So it must be a user
> (me) issue!
>
> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii. I'm not worried about any cruft that might get stripped out as the
> data is generated from a US site. (It's a college/class dataset.)
>
> I know this is a unicode issue. I know I need to have a much more
> robust/pythonic/correct approach. I will later, but for now, I just want to
> resolve this issue, and get it off my plate, so to speak.
>
> I've looked at stackoverflow, as well as numerous other sites, so I turn
> to the group for a pointer or two...
>
> The unicode that I'm dealing with is u'\u2013'
>
> The basic things I've done up to now are:
>
> s=content
> s=ascii_strip(s)
> s=s.replace('\u2013', '-')
> s=s.replace(u'\u2013', '-')
> s=s.replace(u"\u2013", "-")
> s=re.sub(u"\u2013", "-", s)
> print repr(s)
>
> When I look at the input content, I have:
>
> u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)

I suppose you want to replace the EN DASH (U+2013) with HYPHEN-MINUS (U+002D).
For that both

> s=s.replace(u'\u2013', '-')
> s=s.replace(u"\u2013", "-")

should work (the Python interpreter sees no difference between the two).
Let's try:

>>> s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t = s.replace(u"\u2013", "-")
>>> s == t
False
>>> s
u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
>>> t
u'English 120 Course Syllabus - Fall - 2006'

So it looks like you did not actually try the code you posted.

To remove all non-ASCII code points you can use encode():

>>> s.encode("ascii", "ignore")
'English 120 Course Syllabus  Fall  2006'

(Note that the result is a byte string, and that dropping the dashes
leaves two consecutive spaces where each one used to be.)
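
If you want both steps in one place, here is a minimal sketch. It assumes the input is already a unicode string (as the repr in the question suggests), and the helper name to_ascii is made up for illustration. It replaces the EN DASH first, then drops whatever non-ASCII code points remain, and decodes the result so you get text back rather than a byte string.

```python
def to_ascii(text):
    """Hypothetical helper: turn EN DASH into '-' and drop other non-ASCII."""
    # Replace U+2013 before encoding so the dash is kept as a hyphen-minus
    # instead of being silently dropped by the "ignore" error handler.
    text = text.replace(u"\u2013", u"-")
    # encode() yields a byte string; decode() turns it back into text.
    return text.encode("ascii", "ignore").decode("ascii")


s = u"English 120 Course Syllabus \u2013 Fall \u2013 2006"
print(to_ascii(s))
```

Doing the replace before the encode matters: in the other order the dashes are already gone by the time replace() runs.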