Re: Using lxml to screen scrap a site, problem with charset

2009-02-04 Thread Stefan Behnel

Tim Arnold wrote:
> "Дамјан Георгиевски" wrote in message
> news:ciqh56-ses@archaeopteryx.softver.org.mk...
>> So, I'm using lxml to screen-scrape a site that uses the Cyrillic
>> alphabet (windows-1251 encoding). The site's HTML doesn't have the
>> <meta ..content-type.. charset=..> header, but does

Re: Using lxml to screen scrap a site, problem with charset

2009-02-02 Thread Tim Arnold

"Дамјан Георгиевски" wrote in message
news:ciqh56-ses@archaeopteryx.softver.org.mk...
> So, I'm using lxml to screen-scrape a site that uses the Cyrillic
> alphabet (windows-1251 encoding). The site's HTML doesn't have the
> <meta ..content-type.. charset=..> header, but does have an HTTP header that

Using lxml to screen scrap a site, problem with charset

2009-02-01 Thread Дамјан Георгиевски

So, I'm using lxml to screen-scrape a site that uses the Cyrillic alphabet (windows-1251 encoding). The site's HTML doesn't have the <meta ..content-type.. charset=..> header, but does have an HTTP header that specifies the charset... so they are standards compliant enough. Now when I run this code:

from lxml import html
doc = htm
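
For the situation described in this thread (charset given only in the HTTP Content-Type header, not in the HTML itself), one possible approach is to decode the response bytes yourself before handing them to lxml. The following is only a minimal sketch, not the solution posted in the thread: the helper names charset_from_content_type, decode_body and parse_html are hypothetical, and the utf-8 fallback is an assumption. It relies on lxml.html.fromstring accepting an already-decoded string.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` (an assumption) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode a response body using the charset declared in the HTTP header."""
    return raw_bytes.decode(charset_from_content_type(content_type, default))


def parse_html(raw_bytes, content_type):
    """Parse the decoded text with lxml (third-party, imported lazily)."""
    from lxml import html

    return html.fromstring(decode_body(raw_bytes, content_type))
```

With urllib, the header value would typically come from response.headers["Content-Type"] and the bytes from response.read(). An alternative, if the encoding is known in advance, is to pass the raw bytes to a parser created with lxml.html.HTMLParser(encoding="windows-1251").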