I am just learning Python, or trying to, and am having trouble handling utf-8 text.
I want to take a utf-8 encoded web page, and feed it to Beautiful Soup (http://crummy.com/software/BeautifulSoup/). BeautifulSoup uses SGMLParser to parse text.

But although I am able to read the utf-8 encoded Japanese text from the web page and print it to a file without corruption, the text coming out of Beautiful Soup is mangled. I imagine it's because the parser thinks I'm giving it a string in the system encoding, which is sjis.

Here is the code I am using:

# -*- coding: utf-8 -*-
# ==============================
# Test program to read in utf-8 encoded html page
# ==============================
import urllib2, pprint
from BeautifulSoup import BeautifulSoup

# utf-8 encoded content
html = urllib2.urlopen( 'http://jat.org/jtt/index.html' ).read()

# write the raw html to raw.txt
# This comes out ok
file1 = open("raw.txt", "w")
print >> file1, html
file1.close()

# write the parsed html to parsed.txt
# The Japanese text is garbled in this one
file2 = open("parsed.txt", "w")
soup = BeautifulSoup()
soup.feed( html )
print >> file2, soup.html
file2.close()
# ==============================

Any help much appreciated.

Regards,
Ryan
---
Ryan Ginstrom
http://ginstrom.com
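
To illustrate the suspicion above (utf-8 bytes being interpreted in the system encoding), here is a minimal sketch that uses only the standard library. It does not exercise BeautifulSoup or SGMLParser at all. The helper name `decode_html`, the sample string, and the choice of `shift_jis` as the "wrong" codec are assumptions made for the demonstration, not something taken from the page or the library.

```python
# -*- coding: utf-8 -*-
# Sketch: the same bytes decoded with the right and the wrong codec.
# Written to run on Python 3 (positional arguments keep it 2.7-friendly).

# Sample Japanese text (three kanji), written with escapes so the
# source file itself stays pure ASCII.
text = u"\u65e5\u672c\u8a9e"

# These are the bytes a server would send for a utf-8 encoded page.
raw = text.encode("utf-8")


def decode_html(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: turn raw page bytes into Unicode text.

    Undecodable byte sequences are replaced rather than raising,
    so a wrong codec shows up as garbled output instead of an error.
    """
    return raw_bytes.decode(encoding, "replace")


good = decode_html(raw, "utf-8")      # correct codec
bad = decode_html(raw, "shift_jis")   # wrong codec, as suspected above

print(good == text)  # the round trip preserves the text
print(bad == text)   # the wrong codec does not
```

If this is indeed the cause, decoding the page to Unicode text before handing it to the parser may help, though that has not been verified against BeautifulSoup here.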