Ryan Ginstrom wrote:
> I am just learning python, or trying to, and am having trouble handling
> utf-8 text.
>
> I want to take a utf-8 encoded web page, and feed it to Beautiful Soup
> (http://crummy.com/software/BeautifulSoup/).
> BeautifulSoup uses SGMLParser to parse text.
>
> But although I am able to read the utf-8 encoded Japanese text from the
> web page and print it to a file without corruption, the text coming out
> of Beautiful Soup is mangled. I imagine it's because the parser thinks
> I'm giving it a string in the system encoding, which is sjis.
You're not the first person to have trouble with BS and non-ASCII text,
unfortunately. I wrote a program to test round-tripping data through BS.
It turns out that BS is being 'helpful' and converting the chars in the
range 0x80 to 0x9F to equivalent entity escapes. This might be useful if
the source text is in cp1252, but it is disastrous for utf-8, as you have
discovered.

A solution is to turn off this fixup (and a few others) by passing
avoidParserProblems=False to the BeautifulSoup constructor. Here is a
short program that successfully round-trips a selection of utf-8 chars:

from BeautifulSoup import BeautifulSoup

# Test data: all codepoints from 32-255, encoded as utf-8
data = ''.join(chr(n) for n in range(32, 256))
data = unicode(data, 'latin-1').encode('utf-8')
html = '<body>' + data + '</body>'

# avoidParserProblems=False disables the cp1252 entity-escape fixup
soup = BeautifulSoup(html, avoidParserProblems=False)
newData = soup.body.string

print repr(data)
print
print repr(newData)
assert data == newData

Kent

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
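An aside on why that cp1252 fixup is so destructive for utf-8: bytes in the 0x80-0x9F range are not stray characters in utf-8 text, they occur as continuation bytes inside ordinary multibyte sequences, so rewriting any one of them as an entity escape destroys the sequence it sits in. A minimal sketch of this (written in Python 3, unlike the Python 2 program above, so str/bytes behave differently) using a hiragana character as an example:

```python
# Why escaping bytes 0x80-0x9F corrupts utf-8: they appear as
# continuation bytes inside normal multibyte sequences.
raw = u'\u3042'.encode('utf-8')   # hiragana 'a' -> b'\xe3\x81\x82'
in_range = [hex(b) for b in raw if 0x80 <= b <= 0x9F]
print(in_range)                   # two of the three bytes fall in the range

# Rewriting byte 0x81 as a cp1252-style entity escape, as the fixup did,
# leaves behind data that no longer decodes as utf-8 at all:
mangled = raw.replace(b'\x81', b'&#129;')
try:
    mangled.decode('utf-8')
except UnicodeDecodeError:
    print('mangled data is no longer valid utf-8')
```

So the corruption is not cosmetic: once a continuation byte has been replaced, the surrounding character cannot be recovered by re-decoding.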