Re: python fast HTML data extraction library

2009-07-26 Thread Filip
On Jul 23, 3:53 am, Paul McGuire wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal).  raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to d
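
A minimal sketch of the raw-string advice quoted above. The names `word_escaped` and `word_raw` and the sample text are made up for illustration; only `re.compile` and the `r'...'` prefix come from the quoted message.

```python
import re

# In a normal string literal, '\b' is a single backspace character (0x08),
# so the regex engine never sees a word-boundary escape.
assert '\b' == '\x08'

# Doubling the backslash works, but gets noisy in longer patterns.
word_escaped = re.compile('\\bfoo\\b')

# A raw string passes the backslashes through untouched.
word_raw = re.compile(r'\bfoo\b')

# Both spellings compile to the same pattern text.
assert word_escaped.pattern == word_raw.pattern

matches = word_raw.findall('foo food foo')
print(matches)
```

The raw form is the one that stays readable once a pattern picks up several escapes such as `\s`, `\d` or `\/`.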

Re: python fast HTML data extraction library

2009-07-26 Thread John Machin
On Jul 23, 11:53 am, Paul McGuire wrote:
> On Jul 22, 5:43 pm, Filip wrote:
> > # Needs re.IGNORECASE, and can have tag attributes, such as CLEAR="ALL">
> line_break_re = re.compile('', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good idea to allow for
> # w
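
The pattern in the quoted code did not survive in this archive, so the following is only a sketch of what the two comments ask for, not the original expression: a line-break regex that ignores case, tolerates attributes, and also accepts the XHTML self-closing form. The pattern and the sample string are assumptions.

```python
import re

# Assumed replacement pattern, not the one from the original post:
# "<br", a word boundary, any run of non-">" characters, then ">".
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

# Covers plain, upper-case with attributes, and XHTML-style tags.
sample = 'one<br>two<BR CLEAR="ALL">three<br />four<br/>five'

lines = line_break_re.sub('\n', sample).split('\n')
print(lines)
```

The `[^>]*` part would stop early on an attribute value that itself contains `>`, which is the kind of edge case that makes regex-based HTML handling fragile.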

Re: python fast HTML data extraction library

2009-07-25 Thread Aahz
In article <37da38d2-09a8-4fd2-94b4-5feae9675...@k1g2000yqf.googlegroups.com>, Filip wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow and after running my
>data scraper for some time a lot of new problems (exceptions from
>

Re: python fast HTML data extraction library

2009-07-22 Thread Paul McGuire
On Jul 22, 5:43 pm, Filip wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>
Filip - In general, parsing HTML with re's is fraught with easily-overlooked deviations from the norm. But since you have stepped up to
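
To make the "char stream with regular expressions" idea concrete, here is a rough sketch of scanning markup token by token without building a tree. It is not taken from Filip's library; `TOKEN_RE` and `iter_tokens` are hypothetical names, and the pattern deliberately ignores comments, scripts and attribute values containing `>`.

```python
import re

# Hypothetical tokenizer: first alternative matches a tag,
# second alternative matches a run of text between tags.
TOKEN_RE = re.compile(
    r'<(?P<close>/?)(?P<tag>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|(?P<text>[^<]+)'
)

def iter_tokens(html):
    """Yield (kind, value) pairs in document order, one match at a time."""
    for m in TOKEN_RE.finditer(html):
        if m.group('text') is not None:
            text = m.group('text').strip()
            if text:  # skip whitespace-only runs
                yield ('text', text)
        elif m.group('close'):
            yield ('end', m.group('tag').lower())
        else:
            yield ('start', m.group('tag').lower())

tokens = list(iter_tokens('<P>Hello <b>world</B></p>'))
print(tokens)
```

Because nothing is nested or validated, unbalanced or malformed markup does not raise an error here; the cost is that every deviation from the expected tag shape has to be anticipated in the pattern.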

python fast HTML data extraction library

2009-07-22 Thread Filip
Hello,
Some time ago I was searching for a library that would simplify mass data scraping/extraction from webpages. A Python XPath implementation seemed like the way to go. The problem was that most of the HTML on the net doesn't conform to XML standards, even the XHTML (those advertised as valid XHT