Re: python fast HTML data extraction library

2009-07-26 Thread Filip
On Jul 23, 3:53 am, Paul McGuire wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal).  raw string
> literals
> # really help keep your re expressions clean, so that you don't ever
> # have to d
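
To make the raw-string advice quoted above concrete, here is a small sketch. The pattern names and sample strings are made up for illustration and are not taken from the library under discussion.

```python
import re

# Without a raw string, every regex backslash has to be doubled.
number_plain = re.compile('\\d+\\.\\d+')
# With a raw string, the pattern reads the way the regex engine sees it.
number_raw = re.compile(r'\d+\.\d+')

# Both spellings produce exactly the same pattern text.
same_pattern = number_plain.pattern == number_raw.pattern

# The classic trap: in a normal string '\b' is a backspace character,
# so the word-boundary assertion silently disappears.
broken = re.search('\bcat', 'a cat')   # looks for backspace + 'cat'
working = re.search(r'\bcat', 'a cat')  # looks for 'cat' at a word boundary

print(same_pattern, broken, working.group(0))
```

The doubled-backslash form works, but it is easy to forget one backslash and end up with a pattern that compiles without error and matches something else entirely.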

Re: python fast HTML data extraction library

2009-07-26 Thread John Machin
On Jul 23, 11:53 am, Paul McGuire wrote:
> On Jul 22, 5:43 pm, Filip wrote:
>
> # Needs re.IGNORECASE, and can have tag attributes, such as CLEAR="ALL">
> line_break_re = re.compile('', re.UNICODE)

Just in case somebody actually uses valid XHTML :-) it might be a good idea to allow for

> # w
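
As a rough illustration of the points raised here (case-insensitive matching, tag attributes, and the XHTML self-closing form), a line-break pattern might look like the sketch below. Only the name `line_break_re` comes from the thread. The pattern itself is an assumption, since the original expression did not survive in this archive.

```python
import re

# Hypothetical pattern, not the one from the original post:
#   <br      literal tag name, any case thanks to re.IGNORECASE
#   \b       stop names such as <break> from matching
#   [^>]*?   optional attributes, e.g. CLEAR="ALL"
#   /?       optional XHTML-style self-closing slash
line_break_re = re.compile(r'<br\b[^>]*?/?>', re.IGNORECASE)

sample = 'one<br>two<BR CLEAR="ALL">three<br />four<br/>five'
converted = line_break_re.sub('\n', sample)
print(converted.split('\n'))
```

This still breaks if an attribute value contains a literal `>`, which is the kind of easily overlooked deviation that makes regex-based HTML handling fragile.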

Re: python fast HTML data extraction library

2009-07-25 Thread Aahz
In article <37da38d2-09a8-4fd2-94b4-5feae9675...@k1g2000yqf.googlegroups.com>, Filip wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow and after running my
>data scraper for some time a lot of new problems (exceptions from
>

Re: python fast HTML data extraction library

2009-07-22 Thread Paul McGuire
On Jul 22, 5:43 pm, Filip wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>

Filip - In general, parsing HTML with re's is fraught with easily-overlooked deviations from the norm. But since you have stepped up to
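
For readers unfamiliar with the stream-style approach described in the quoted text, the following is a minimal sketch of the general idea: walk the input with a regular expression and emit tokens, without building a tree. It is not Filip's library. `tag_re` and `iter_tokens` are hypothetical names, and the pattern ignores comments, scripts, and attribute values containing `>`.

```python
import re

# Hypothetical tag pattern: optional closing slash, tag name, then anything up to '>'.
tag_re = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>')

def iter_tokens(html):
    """Yield ('open' | 'close' | 'text', value) tuples in document order."""
    pos = 0
    for match in tag_re.finditer(html):
        # Any characters between the previous tag and this one are text.
        if match.start() > pos:
            yield ('text', html[pos:match.start()])
        kind = 'close' if match.group(1) else 'open'
        yield (kind, match.group(2).lower())
        pos = match.end()
    # Trailing text after the last tag.
    if pos < len(html):
        yield ('text', html[pos:])

tokens = list(iter_tokens('<p>Hi <B>there</B></p>'))
print(tokens)
```

Because nothing is held beyond the current match, this style can be fast and memory-light, but it has no notion of nesting, so malformed or unusual markup passes through unnoticed.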