On Jul 23, 3:53 am, Paul McGuire wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal). raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to d
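
A minimal sketch of the raw-string advice quoted above, assuming Python 3. The pattern and the sample text are made up for illustration and do not come from the library under discussion. Without the r prefix, Python turns "\b" into a backspace character before the re module ever sees it, so the word-boundary meaning is lost:

```python
import re

# Hypothetical sample text, chosen only to show the difference.
text = 'a word here'

# Plain literal: "\b" is the backspace character (0x08), so this
# pattern looks for backspace + "word" + backspace.
plain_re = re.compile('\bword\b')

# Raw literal: the backslash reaches the re module untouched, so
# "\b" means "word boundary" as intended.
raw_re = re.compile(r'\bword\b')

plain_match = plain_re.search(text)
raw_match = raw_re.search(text)

print(plain_match)        # no backspace characters in text, so no match
print(raw_match.group())  # the raw pattern finds the word
```

Using raw literals everywhere, even where they make no difference, means you never have to work out which backslashes Python will interpret first.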
On Jul 23, 11:53 am, Paul McGuire wrote:
> On Jul 22, 5:43 pm, Filip wrote:
>
> # Needs re.IGNORECASE, and can have tag attributes, such as CLEAR="ALL">
> line_break_re = re.compile('', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for
> # w
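
A hedged sketch of the line-break suggestions in this exchange: match case-insensitively, allow tag attributes, and accept the XHTML self-closing form. The pattern below is my own assumption for illustration, not the pattern used in Filip's library:

```python
import re

# Hypothetical pattern: "<br", a word boundary so that tags such as
# "<break>" are not matched, then anything up to the closing ">".
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

samples = ['<br>', '<BR>', '<br/>', '<br />', '<BR CLEAR="ALL">']
non_samples = ['<b>', '<break>', 'br']

# Every sample should match in full; none of the non-samples should.
all_match = all(line_break_re.fullmatch(s) for s in samples)
none_match = not any(line_break_re.search(s) for s in non_samples)

# Replace each line break with a newline.
cleaned = line_break_re.sub('\n', 'one<BR CLEAR="ALL">two<br />three')
print(repr(cleaned))
```

This is deliberately loose: it does not handle a ">" inside a quoted attribute value, which is one of the deviations that makes regex-based HTML processing fragile.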
In article <37da38d2-09a8-4fd2-94b4-5feae9675...@k1g2000yqf.googlegroups.com>,
Filip wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow and after running my
>data scraper for some time a lot of new problems (exceptions from
>
On Jul 22, 5:43 pm, Filip wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>
Filip -
In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to
Hello,
Some time ago I was searching for a library that would simplify mass
data scraping/extraction from webpages. Python XPath implementation
seemed like the way to go. The problem was that most of the HTML on
the net doesn't conform to XML standards, even the XHTML (those
advertised as valid XHT