Re: Injector works. But generator and fetcher don't work.

2014-06-07 Thread Manikandan Saravanan
Hey, I finally solved it! The problem was my Cassandra cluster. My Hadoop and Cassandra clusters were in two different datacenters, which caused Cassandra requests to time out. And that meant the generate phase didn’t have any input! Works like a charm now :) Regards -- Manikandan Saravanan

Nutch use a Browser or phantomjs as fetcher

2014-06-07 Thread Patrick Kirsch
Hey list, I'm sure this has been asked several times, but a quick look in the Nutch user archive did not help, so: Does anyone have documentation on, or has anyone tried, using a browser (like Chromium) or PhantomJS etc. for fetching web pages? Due to a JavaScript-heavy site, Nutch needs to see the

Re: Nutch use a Browser or phantomjs as fetcher

2014-06-07 Thread remi tassing
I'm currently looking at those separately, but an integrated option would be more efficient. Looking forward to hearing about any experience with this. On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch pkir...@zscho.de wrote: Hey list, I'm sure this issue was asked several times, but a quick look in the nutch

Re: Incremental crawling with nutch

2014-06-07 Thread Bayu Widyasanyata
Hi Ali, OK, I will share my current script. I sometimes use the -adddays parameter in the nutch generate step to force recrawling. Thanks. On Fri, Jun 6, 2014 at 11:02 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Bayu, Would you please also provide me what procedure you are going to
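For readers following the -adddays remark, here is a minimal sketch of how such a generate call could be assembled. It assumes the Nutch 1.x argument layout (crawldb path, then segments directory); the paths, the 30-day value, and the helper name build_generate_command are hypothetical and are not taken from Bayu's script. The -adddays option moves the generator's reference time forward by the given number of days, so pages that would become due within that window are selected now.

```python
from typing import List, Optional


def build_generate_command(
    crawldb: str,
    segments_dir: str,
    add_days: Optional[int] = None,
    top_n: Optional[int] = None,
    nutch_bin: str = "bin/nutch",
) -> List[str]:
    """Build the argument list for a `nutch generate` call (Nutch 1.x layout assumed)."""
    cmd = [nutch_bin, "generate", crawldb, segments_dir]
    if top_n is not None:
        # Limit the number of URLs selected for this segment.
        cmd += ["-topN", str(top_n)]
    if add_days is not None:
        # Shift the generator's "current time" forward by add_days days,
        # so pages due within that window are treated as due now.
        cmd += ["-adddays", str(add_days)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and value, for illustration only.
    print(" ".join(build_generate_command("crawl/crawldb", "crawl/segments", add_days=30)))
```

The resulting list can be passed to subprocess.run as-is; the rest of the cycle (fetch, parse, updatedb) is unchanged by this option.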

Re: Incremental crawling with nutch

2014-06-07 Thread Ali Nazemian
So you mean the only difference (besides some parameters that should be set in nutch-site.xml) is using nutch generate -adddays instead of nutch generate? What about the other parts? Could you please provide a step-by-step guide? Regards. On Sat, Jun 7, 2014 at 4:20 PM, Bayu Widyasanyata