Joe Farro wrote:
> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)
>
> I generally find my scraping code ends up being rather chaotic with
> querying, regex manipulations, conditional processing, conversions, etc.,
> ending up being too close together and sometimes interwoven. It's
> stressful.
Everything is cleaner than a bunch of regular expressions. It's just that
sometimes they give results more quickly, and about as reliably as you can get
without adding a JavaScript engine to your script.
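For example, a quick-and-dirty link grab with a regex might look like this
(a rough sketch, assuming the page source is already in a string named html;
the pattern only copes with double-quoted href attributes and plain <a> markup):

import re

# Crude pattern: double-quoted href, no nested tags inside the link text.
link_re = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)
links = [
    {"url": url, "link_text": text}
    for url, text in link_re.findall(html)
]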
> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing is left to be
> done down the pipeline. It's almost just a configuration file.
>
> Here is an example that would get the text and URL for every link in a
> page:
>
> $ a
>     save each: links
>         | [href]
>             save: url
>         | text
>             save: link_text
>
>
> The result would be something along these lines:
>
> {
>     'links': [
>         {
>             'url': 'http://www.something.com/hm',
>             'link_text': 'The text in the link'
>         },
>         # etc... another dict for each <a> tag
>     ]
> }
>
With Beautiful Soup you could write this:

import bs4

soup = bs4.BeautifulSoup(...)
links = [
    {
        "url": a["href"],
        "link_text": a.text
    }
    for a in soup("a")
]
and for many applications you wouldn't even bother with the intermediate
data structure.
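For instance, if all you want is to print the text and URL of each link, you
could loop over the tags directly (a quick sketch along those lines):

for a in soup("a"):
    print(a.get("href"), a.text)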
Can you give a real-world example where your DSL is significantly cleaner
than the corresponding code using bs4, or lxml.xpath, or lxml.objectify?
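For reference, a rough lxml.xpath equivalent (again a sketch, assuming the page
source is already in a string named html):

import lxml.html

tree = lxml.html.fromstring(html)
links = [
    {"url": a.get("href"), "link_text": a.text_content()}
    for a in tree.xpath("//a")
]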
> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
>
> This is my first go at something along these lines, so any feedback is
> welcomed.
Your code on GitHub looks good to me (though there are too few docstrings),
but like Alan I'm not prepared to read it completely. Do you have specific
questions?
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor