andy wrote:
Hi people
I'm using Beautiful Soup to rip the UK headlines from the BBC's UK page.
This works rather well, but there is the problem of HTML entities which
appear in the XML feed.
Is there an elegant/simple way to convert them into the "standard"
output? By this I mean &pound; going to £. Or do I have to use a regexp?
And where does Unicode fit into all of this?
import re

# Fredrik Lundh, http://effbot.org/zone/re-sub.html
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3].lower() == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            import htmlentitydefs
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

print unescape('&pound;')
£
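
On Python 3 the same conversion is available in the standard library, so the regexp helper should not be needed there. A minimal sketch, assuming Python 3.4 or later (the names samples and decoded are only for illustration):

```python
import html

# Named, decimal and hexadecimal references for the pound sign.
samples = ["&pound;", "&#163;", "&#xA3;"]

# html.unescape returns an ordinary str, which is Unicode text in Python 3.
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

As for where Unicode fits in: the result is Unicode text either way (a unicode object from the Python 2 function above, a str on Python 3), and it only needs encoding to bytes when it is written to a file or terminal.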
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor