[webpy] Problem scraping Spanish characters

Luis Gonzalez Sun, 18 Apr 2010 00:34:13 -0700

This problem has bitten me for a long time, and I still can't find the
solution...
I wrote a web scraping script using BeautifulSoup to extract
news links from Yahoo News.
The problem is that Spanish characters, such as accented letters and
other special characters, don't show up properly.
I don't know whether this is a problem with BeautifulSoup, web.py or GAE's
urlfetch. An intensive search through all of these mailing lists hasn't
helped so far...


This is the script. As you can see, I tried in several places to make
sure that utf-8 is handled properly (though of course, I have no clue
about that...):

class news(app.page):
    def GET(self):
        web.header('Content-Type', 'text/html; charset=utf-8')
        from google.appengine.api import urlfetch
        from BeautifulSoup import BeautifulSoup
        from google.appengine.api import mail

        # Fetch the front page; the header is my attempt to ask for utf-8
        page = urlfetch.fetch("http://ar.news.yahoo.com",
                              headers={'Content-Type': 'text/html; charset=utf-8'})
        soup = BeautifulSoup(page.content, fromEncoding="utf-8")
        # Skeleton of the page that wraps the scraped headlines
        s = ['<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">',
             '<html xmlns="http://www.w3.org/1999/xhtml">',
             '<head>',
             '<meta http-equiv="Content-Language" content="es-ar" />',
             '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
             '</head>',
             '<body>']

        block = soup('ul',{"class":"headlines"})
        for i in block:
            s.append(repr(i))
        s.extend(['</body>','</html>'])
        res = '\n'.join(s)
        res = res.replace('"/', '"http://ar.news.yahoo.com/')
        return web.utf8(res)
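
One way to narrow this down, offered as a sketch rather than a diagnosis: check which charset the server actually declares and decode with that, instead of assuming utf-8. The helper names `sniff_charset` and `to_utf8` are made up for illustration, and the ISO-8859-1 sample bytes only simulate a response. With urlfetch, the inputs would presumably be `page.headers.get('Content-Type')` and `page.content`.

```python
import re


def sniff_charset(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    match = re.search(r'charset=["\']?([\w-]+)', content_type or "", re.I)
    return match.group(1).lower() if match else default


def to_utf8(raw_bytes, content_type):
    """Decode with the declared charset, then re-encode as UTF-8 bytes."""
    charset = sniff_charset(content_type)
    try:
        text = raw_bytes.decode(charset, "replace")
    except LookupError:
        # Unknown charset name: fall back to utf-8 (an assumption)
        text = raw_bytes.decode("utf-8", "replace")
    return text.encode("utf-8")


# Simulated response body: Spanish text served as ISO-8859-1 (hypothetical data)
raw = u"Elecci\u00f3n en Espa\u00f1a".encode("iso-8859-1")
fixed = to_utf8(raw, "text/html; charset=ISO-8859-1")
```

If the text comes out right at this stage but is still wrong in the browser, the mismatch is more likely in a later step, such as how the tags are turned back into strings, than in the fetch itself.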

Any hint would be highly appreciated...
Luis

