Crawling focused only over seed file

Andrés Rincón Pacheco Sat, 14 Nov 2015 16:52:33 -0800

Hi,

I need to run Nutch so that it crawls only the URLs in the seed file, with
no new URLs added in each cycle.


I am running Nutch in the following scenarios:

1. Invoking the crawl script without the updatedb job: each cycle takes
15 minutes to run, but the same URLs are processed in every cycle. The
total Nutch execution time is around 16 hours.
Why are the URLs in every cycle the same?

2. Normal crawling (using updatedb): if I use the updatedb job, how can I
make Nutch fetch only the URLs in the seed file without adding new URLs to
the crawldb?

I am trying to run Nutch using the updatedb job with -noAdditions. Is that
what this option is for? I have read the Nutch wiki, but the behavior of
the -noAdditions option is not clear to me.
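
For reference, the kind of invocation I mean is roughly the following. This
is only a sketch: crawl/crawldb and crawl/segments are placeholder paths, and
my reading that -noAdditions updates only the URLs already in the crawldb is
an assumption I would like confirmed.

```shell
# Sketch only: crawl/crawldb and crawl/segments are placeholder paths.
# Assumption to confirm: -noAdditions updates entries that already exist
# in the crawldb and does not add newly discovered URLs.
bin/nutch updatedb crawl/crawldb -dir crawl/segments -noAdditions
```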

Conditions for both cases: each cycle is configured to process 360 URLs.
The seed file contains around 25000 URLs (the limit parameter in the crawl
bash script is 25000 and sizeFetchlist is 360).
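
As a rough check on these numbers, the sketch below estimates how many
cycles a full pass over the seed file would take. It assumes that every
cycle fetches a different batch of 360 URLs and takes 15 minutes, which is
not what I observe in scenario 1; the variable names are only illustrative.

```shell
#!/usr/bin/env bash
# Figures taken from the setup described above.
seed_urls=25000
urls_per_cycle=360
minutes_per_cycle=15

# Ceiling division: cycles needed to cover every seed URL once.
cycles=$(( (seed_urls + urls_per_cycle - 1) / urls_per_cycle ))
total_minutes=$(( cycles * minutes_per_cycle ))

echo "cycles needed: $cycles"
echo "estimated time: $(( total_minutes / 60 ))h $(( total_minutes % 60 ))m"
```

Under that assumption a full pass needs 70 cycles, or about 17.5 hours.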

Thanks,

Andres
