Hi Arvind, could you please share the crawler code? I also want to build a
similar crawler for newspaper websites.
--
Datameet is a community of Data Science enthusiasts in India. Know more about
us by visiting http://datameet.org
Hi Gora,
The problem with using Scrapy (or simply BeautifulSoup) is that some
newspapers generate content dynamically, using JavaScript. The possible
solutions we found were PhantomJS or Goose (a Python library). If
Nutch can handle content generated through javascript (which it
Hi,
That problem is common to whatever crawler you use. Nutch will extract
links from JavaScript, but that's it. I would use Rhino, and a custom
HtmlParser plugin for Nutch. This is admittedly non-trivial, but I know of
no open source tool that already does this.
Regards,
Gora
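To make the JavaScript problem above concrete, here is a rough Python sketch of how a crawler might guess whether a page's article text is present in the raw HTML or only appears after scripts run. It uses only the standard library. The names `visible_text` and `probably_needs_javascript` and the 200-character threshold are illustrative assumptions, not anything from the crawlers discussed in this thread.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a non-JavaScript crawler would see in the raw HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def probably_needs_javascript(html, min_chars=200):
    """Heuristic (assumed threshold): very little static text suggests
    the content is filled in by scripts after the page loads."""
    return len(visible_text(html)) < min_chars
```

Pages flagged this way could then be sent to a script-capable renderer (PhantomJS was mentioned earlier in the thread), while the rest go through a plain HTML parser. The heuristic is crude and would likely need tuning per newspaper.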
Hi Gora,
The
I know of newsrack (in fact I created the NREGA topic on the site long long
ago) but what I am looking for is a crawler of past records which I can use
for my own research. Maybe the code behind newsrack can be reused to build
such a crawler -- but I didn't see it anywhere on the site.
Anyway,
Hi Debamitro,
A couple of months ago, a few of my friends and I built a media monitoring
tool to track what traditional media was writing about the Aam Aadmi Party.
Our work can be seen here - http://aap.mediatrack.in
As part of the process, we wrote a crawler that crawls Hindu, TOI, HT and
three other