andy wrote:
Hi people

I'm using Beautiful Soup to rip the UK headlines from the BBC's UK page.
This works rather well, but there is the problem of HTML entities which
appear in the XML feed.
Is there an elegant/simple way to convert them into the "standard"
output? By this I mean &pound; going to £. Or do I have to use regexps?
And where does Unicode fit into all of this?


import re

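# Fredrik Lundh, http://effbot.org/zone/re-sub.html
# Replaces HTML/XML character references and named entities in `text`
# with the corresponding Unicode characters (Python 2 code).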
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3].lower() == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            import htmlentitydefs
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

print unescape('&pound;')

£
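
For anyone reading this on Python 3: unichr, htmlentitydefs and the print
statement no longer exist there, so here is a rough port of the function
above. This is only a sketch, not Fredrik Lundh's code; the sample strings are
made up for illustration, and catching OverflowError is an addition of this
sketch. The standard library's html.unescape (Python 3.4 and later) is shown
alongside it, since in many cases it may be all that is needed.

```python
import html
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs


def unescape(text):
    """Replace character references and named entities with Unicode characters."""

    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[:3].lower() == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity such as &pound; or &amp;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # unknown or malformed reference: leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("&pound;10 &amp; &#163;20 &#xA3;30 &bogus;"))  # £10 & £20 £30 &bogus;
print(html.unescape("&pound;10 &amp; &#163;20"))              # £10 & £20
```

On the Unicode part of the question: in both versions the result is a string
of Unicode characters, not bytes. An encoding such as UTF-8 only comes into
play when that string is written to a file, a socket or the terminal.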




_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
