This problem has bitten me for a long time, and I still can't find the
solution...
I wrote a web scraping script using BeautifulSoup to extract news
links from Yahoo News.
The problem is that Spanish characters, such as accented vowels and
other special letters, don't show up properly.
I don't know whether this is a problem with BeautifulSoup, web.py or
GAE's urlfetch. An intensive search through all these mailing lists
hasn't helped so far...
This is the script. As you can see, I tried to make sure in several
places that utf-8 is handled properly (of course, I have no clue about
that...):
class news(app.page):
    def GET(self):
        web.header('Content-Type', 'text/html; charset=utf-8')
        from google.appengine.api import urlfetch
        from BeautifulSoup import BeautifulSoup
        from google.appengine.api import mail
        # Fetch the Yahoo News front page
        page = urlfetch.fetch("http://ar.news.yahoo.com",
                              headers={'Content-Type': 'text/html; charset=utf-8'})
        soup = BeautifulSoup(page.content, fromEncoding="utf-8")
        # Header of the page I send back
        s = ['<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">',
             '<html xmlns="http://www.w3.org/1999/xhtml">', '<head>',
             '<meta http-equiv="Content-Language" content="es-ar" />',
             '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
             '</head>', '<body>']
        # Copy every <ul class="headlines"> block into the output
        block = soup('ul', {"class": "headlines"})
        for i in block:
            s.append(repr(i))
        s.extend(['</body>', '</html>'])
        res = '\n'.join(s)
        # Turn site-relative links into absolute ones
        res = res.replace('"/', '"http://ar.news.yahoo.com/')
        return web.utf8(res)
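In case it helps narrow things down, here is a small standalone sketch
(not part of the app) of the kind of mismatch I suspect: bytes written in
one encoding and read as another. This is only an assumption on my part;
I have not confirmed which encoding Yahoo actually serves. The helper
name decode_with_fallback and the sample headline are made up for
illustration and are not part of BeautifulSoup, web.py or urlfetch.

```python
# -*- coding: utf-8 -*-
# Standalone sketch: how Spanish text gets garbled when the bytes are in
# one encoding and are decoded as another. Nothing here touches the network.


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Hypothetical helper: try each encoding in order.

    Returns a (text, encoding_used) tuple for the first encoding that
    decodes the bytes without error.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed (iso-8859-1 never does).
    return raw.decode("utf-8", "replace"), "utf-8 (with replacement)"


# Made-up sample headline: "Educación: niños y años"
headline = u"Educaci\u00f3n: ni\u00f1os y a\u00f1os"

utf8_bytes = headline.encode("utf-8")
latin1_bytes = headline.encode("iso-8859-1")

# utf-8 bytes read as iso-8859-1: no error, but each accented letter
# turns into two wrong characters.
garbled = utf8_bytes.decode("iso-8859-1")

# iso-8859-1 bytes read strictly as utf-8: this raises an error.
try:
    latin1_bytes.decode("utf-8")
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

# Trying utf-8 first and falling back recovers the text in both cases.
text_a, used_a = decode_with_fallback(utf8_bytes)
text_b, used_b = decode_with_fallback(latin1_bytes)
```

If the page turned out not to be utf-8, forcing fromEncoding="utf-8"
would be the wrong thing to do, but again, that is a guess.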
Any hint would be highly appreciated...
Luis