Re: [datameet] how to crawl Indian newspaper sites

2016-10-28 Thread Amit Tiwari
Hi Arvind, could you please share the crawler code? I also want to design a similar crawler for newspaper websites. -- Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

Re: [datameet] how to crawl Indian newspaper sites

2013-12-10 Thread Debamitro Chakraborti
Hi Gora, The problem with using Scrapy (or simply BeautifulSoup) is that some newspapers generate content dynamically, using JavaScript. The possible solutions we found were PhantomJS or Goose (a Python library). If Nutch can handle content generated through JavaScript (which it
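
To make the dynamic-content problem concrete, here is a minimal sketch (not code from this thread) of one way to check whether a page's text is already present in the static HTML or is only filled in by a script. It uses only Python's standard `html.parser`. The names `VisibleTextParser`, `static_text` and `probably_needs_js`, and the 200-character threshold, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text found in the static HTML, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a non-JavaScript parser would see in the page."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def probably_needs_js(html, min_chars=200):
    """Assumed heuristic: very little static text hints at script-built content."""
    return len(static_text(html)) < min_chars


if __name__ == "__main__":
    body = "word " * 60
    static_page = "<html><body><p>" + body + "</p></body></html>"
    dynamic_page = (
        '<html><body><div id="story"></div><script>'
        'document.getElementById("story").innerHTML = "' + body + '";'
        "</script></body></html>"
    )
    print(probably_needs_js(static_page))
    print(probably_needs_js(dynamic_page))
```

Pages flagged this way could be routed to a rendering step (for example PhantomJS, as mentioned above), while the rest go through the ordinary parser. The threshold would need tuning per site.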

Re: [datameet] how to crawl Indian newspaper sites

2013-12-10 Thread Gora Mohanty
Hi, That problem is common to whatever crawler you use. Nutch will extract links from JavaScript, but that's it. I would use Rhino, and a custom HtmlParser plugin for Nutch. This is admittedly non-trivial, but I know of no open source tool that already does this. Regards, Gora
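
As a rough illustration of the difference between extracting links from JavaScript and actually running it, the sketch below pulls URL-like string literals out of `<script>` blocks without executing anything. This is not Nutch's code and makes no claim about how Nutch does it; the regular expression, the `ScriptCollector` class and the `links_from_scripts` function are assumptions for illustration only.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed pattern: quoted strings that start with http(s):// or with a slash.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"'\s<>]+)["']""")


class ScriptCollector(HTMLParser):
    """Gather the raw source text of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


def links_from_scripts(html, base_url):
    """Return absolute URLs found as string literals in script source."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for script in collector.scripts:
        for match in URL_LITERAL.finditer(script):
            url = urljoin(base_url, match.group(1))
            if url not in found:  # preserve first-seen order, drop duplicates
                found.append(url)
    return found


if __name__ == "__main__":
    page = """<script>var next = "/news/page2.html";
    load('http://example.com/story/1');</script>"""
    for link in links_from_scripts(page, "http://example.com/index.html"):
        print(link)
```

A scan like this finds more pages to fetch, but it never sees text that a script builds at run time, which is why a JavaScript engine such as Rhino comes up for the content itself.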

Re: [datameet] how to crawl Indian newspaper sites

2013-12-09 Thread Debamitro Chakraborti
I know of newsrack (in fact, I created the NREGA topic on the site long ago), but what I am looking for is a crawler for past records that I can use for my own research. Maybe the code behind newsrack can be reused to build such a crawler, but I didn't see it anywhere on the site. Anyway,

Re: [datameet] how to crawl Indian newspaper sites

2013-12-09 Thread Arvind Batra
Hi Debamitro, A couple of months ago, a few of my friends and I built a media monitoring tool to track what traditional media was writing about the Aam Aadmi Party. Our work can be seen here: http://aap.mediatrack.in As part of the process, we wrote a crawler that crawls Hindu, TOI, HT and three other