On Jul 23, 3:53 am, Paul McGuire wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal). Raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to d
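
To make the raw-string point concrete, here is a minimal sketch (the pattern and sample text are made up for illustration, not taken from the library under review). Without the leading r, Python turns \b into a backspace character before the re module ever sees it:

```python
import re

# Without the r prefix, "\b" is a backspace character (0x08),
# so this pattern looks for backspace + "cat" + backspace.
plain = re.compile('\bcat\b')

# With the r prefix, the regex engine receives \b and treats it
# as a word boundary.
raw = re.compile(r'\bcat\b')

text = 'the cat sat'
plain_match = plain.search(text)  # no match
raw_match = raw.search(text)      # matches the word "cat"

print(plain_match)
print(raw_match.group(0))
```

The same applies to a literal backslash: it takes '\\\\' in an ordinary string but only r'\\' in a raw one.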
On Jul 23, 11:53 am, Paul McGuire wrote:
> On Jul 22, 5:43 pm, Filip wrote:
>
> # Needs re.IGNORECASE, and can have tag attributes, such as CLEAR="ALL">
> line_break_re = re.compile('', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for
> # w
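
One possible way to cover the cases raised above (upper-case tags, attributes, and the XHTML self-closing form) is sketched below. The pattern is an assumption on my part, since the original one did not survive in the quoted text; only the name line_break_re comes from the post.

```python
import re

# Hypothetical pattern, not the one from the original post:
#   <br     the tag name (case-insensitive via re.IGNORECASE)
#   \b      so that tags like <broken> are not matched
#   [^>]*   optional attributes and/or a trailing slash
#   >       end of tag
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

samples = ['<br>', '<BR>', '<br/>', '<br />', '<BR CLEAR="ALL">']
matched = [bool(line_break_re.fullmatch(s)) for s in samples]
not_matched = bool(line_break_re.search('<broken>'))

print(matched)
print(not_matched)
```

This still breaks on an attribute value that itself contains ">", which is the kind of corner case that makes re-based HTML handling fragile.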
In article <37da38d2-09a8-4fd2-94b4-5feae9675...@k1g2000yqf.googlegroups.com>,
Filip wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow and after running my
>data scraper for some time a lot of new problems (exceptions from
>
On Jul 22, 5:43 pm, Filip wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>
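
For readers unfamiliar with the approach, a rough sketch of stream-style processing with regular expressions might look like the following. This is not Filip's library or its API; token_re and tokens() are hypothetical names, and the pattern is deliberately simplistic.

```python
import re

# Hypothetical token pattern: either a tag (open or close),
# or a run of text up to the next "<".
token_re = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>|([^<]+)')

def tokens(html):
    """Yield ('open'|'close', tag_name) or ('text', text) in input order."""
    for m in token_re.finditer(html):
        if m.group(3) is not None:
            yield ('text', m.group(3))
        else:
            kind = 'close' if m.group(1) else 'open'
            yield (kind, m.group(2).lower())

result = list(tokens('<P>Hello <b>world</B></p>'))
for tok in result:
    print(tok)
```

No tree is built; the caller sees a flat sequence of tokens and keeps whatever state it needs. Note that a stray "<" that does not start a tag is silently skipped by finditer, so malformed input is dropped rather than reported.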
Filip -
In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to