Hello,

I am trying to crawl a website using Nutch on a Hadoop cluster. I have
modified the crawl script to restrict sizeFetchList to 1000 (this is the
value passed as topN to the nutch generate command).
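For context, the change in the crawl script is along these lines (a sketch
assuming the stock bin/crawl script, where the variable is named
sizeFetchlist; exact lines vary by Nutch version):

# in bin/crawl: cap the per-iteration fetch list at 1000 URLs instead of
# the stock sizing, which is roughly sizeFetchlist=`expr $numSlaves \* 50000`
sizeFetchlist=1000

# the script later passes this value to generate as -topN, roughly:
# "$bin/nutch" generate $commonOptions "$CRAWL_PATH"/crawldb \
#     "$CRAWL_PATH"/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter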

However, Nutch is only generating 62 URLs, even though the unfetched URL
count is approximately 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 \
  -D mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 \
  -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false \
  -D mapreduce.map.output.compress=true \
  crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter
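
For reference, the unfetched count can be verified from the CrawlDb
statistics (the db_unfetched line), e.g.:

nutch readdb crawl/crawldb -stats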

Can anyone please look into this and let me know if I am missing something?
The crawl configuration is available here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC