Re: [Tutor] Html entities, beautiful soup and unicode

spir Tue, 19 Jan 2010 03:01:31 -0800

On Tue, 19 Jan 2010 08:49:27 +0100
andy <chees...@titan.physx.u-szeged.hu> wrote:


> Hi people
> 
> I'm using beautiful soup to rip the uk headlines from the uk bbc page.
> This works rather well but there is the problem of html entities which
> appear in the xml feed.
> Is there an elegant/simple way to convert them into the "standard"
> output? By this I mean &#163; going to £? or do i have to use regexp?
> and where does unicode fit into all of this?

Ha, ha!
What do you mean exactly by converting them into the "standard" output? What
form do you expect, and what do you want to do with it?
Maybe your aim is to replace numeric HTML entities in a Python string with the
real characters, encoded in a given format, so that you can output them. Then
one way may be to use a simple regex and replace each match with a custom
function. Eg:

import re

def rep(match):
    # Python 2: called once per match of "&#NNN;"
    n = int(match.group(1))                   # decimal code point, e.g. 163
    uchar = unichr(n)                         # matching unicode char
    # utf-8 is an example; your dest format may be iso-8859-2 ?
    return uchar.encode("utf-8")              # format-encoded byte string

source = "xxx&#161;xxx&#194;xxx&#255;xxx"
pat = re.compile(r"&#(\d+);")                 # raw string; group 1 = digits
print pat.sub(rep, source)
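
For readers on Python 3, the same idea can be sketched as below. This is only
a minimal sketch, not the code from this message: unichr() and the unicode
type no longer exist there, str is already Unicode, and the names
NUMERIC_ENTITY, replace_numeric_entity and decode_numeric_entities are made up
for illustration. It also accepts hexadecimal references such as &#xA3;, which
the pattern above does not.

```python
import html
import re

# Matches decimal (&#163;) and hexadecimal (&#xA3;) numeric references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|(\d+));")

def replace_numeric_entity(match):
    """Return the character for one numeric reference (illustrative helper)."""
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code_point)
    except (ValueError, OverflowError):
        return match.group()  # leave out-of-range references untouched

def decode_numeric_entities(text):
    """Replace every numeric reference in text with its character."""
    return NUMERIC_ENTITY.sub(replace_numeric_entity, text)

source = "xxx&#161;xxx&#194;xxx&#255;xxx"
decoded = decode_numeric_entities(source)   # a str of Unicode characters
print(decoded)

# The standard library also resolves named entities such as &pound;
print(html.unescape("&#163;5 &pound;5 &amp; more"))

# Encoding to bytes happens only at the output boundary; utf-8 is an example.
encoded = decoded.encode("utf-8")
```

As for where Unicode fits in: the decoded text is a sequence of Unicode
characters, and a byte encoding such as utf-8 or iso-8859-2 is chosen only
when writing it out. If named entities should be converted too,
html.unescape() alone may be enough and the regex can be dropped.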

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/
