Marc Tompkins, 06.06.2012 03:10:
> I'm trying to parse a webpage using lxml; every time I try, I'm
> rewarded with "UnicodeDecodeError: 'ascii' codec can't decode byte
> 0x?? in position?????: ordinal not in range(128)" (the byte value and
> the position occasionally change; the error never does.)
>
> The page's encoding is UTF-8:
> <meta http-equiv="content-type" content="text/html; charset=utf-8" />
> so I have tried:
> - setting HTMLParser's encoding to 'utf-8'

That's the way to do it, although the parser should be able to figure it out
by itself, given the above content type declaration.

> Here's my current version, trying everything at once:
>
> from __future__ import print_function
> import datetime
> import urllib2
> from lxml import etree
> url = 'http://www.wpc-edi.com/reference/codelists/healthcare/claim-adjustment-reason-codes/'
> page = urllib2.urlopen(url)
> pagecontents = page.read()
> pagecontents = pagecontents.decode('utf-8')
> pagecontents = pagecontents.encode('ascii', 'ignore')
> tree = etree.parse(pagecontents,
>     etree.HTMLParser(encoding='utf-8',recover=True))

parse() is meant to parse from files and file-like objects, so you are
telling it to parse from the "file path" in pagecontents, which obviously
does not exist. I admit that the error message is not helpful.

You can do this:

    connection = urllib2.urlopen(url)
    tree = etree.parse(connection, my_html_parser)

Alternatively, use fromstring() to parse from strings:

    page = urllib2.urlopen(url)
    pagecontents = page.read()
    html_root = etree.fromstring(pagecontents, my_html_parser)

See the lxml tutorial.

Also note that there's lxml.html, which provides an extended tool set for
HTML processing.

Stefan

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
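
For anyone who wants to try the parse() versus fromstring() distinction without hitting the network, here is a minimal sketch. It assumes lxml is installed. PAGE_BYTES is a made-up stand-in for the downloaded page, and io.BytesIO plays the role of the urllib2 connection; neither comes from the original post. The parser settings are the ones from the question.

```python
import io

from lxml import etree

# Hypothetical stand-in for the downloaded page: raw UTF-8 bytes that
# contain a non-ASCII character (the "é" in the title).
PAGE_BYTES = (
    b'<html><head>'
    b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    b'<title>Caf\xc3\xa9 codes</title></head>'
    b'<body><p>Adjustment reason</p></body></html>'
)

parser = etree.HTMLParser(encoding='utf-8', recover=True)

# fromstring() parses the document data itself and returns the root element.
root = etree.fromstring(PAGE_BYTES, parser)
title_from_string = root.findtext('.//title')

# parse() wants a file name or a file-like object and returns an ElementTree.
# BytesIO stands in for the object returned by urlopen().
tree = etree.parse(io.BytesIO(PAGE_BYTES), parser)
title_from_file = tree.getroot().findtext('.//title')
```

Note that the raw bytes go to the parser as they are. The decode('utf-8') followed by encode('ascii', 'ignore') from the question is not needed, and it would silently drop every non-ASCII character from the page.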