Hello, I am trying to crawl a website using Nutch on a Hadoop cluster. I have modified the crawl script to restrict sizeFetchList to 1000 (which is the topN value for the nutch generate command).
However, Nutch is only generating 62 URLs, even though the unfetched URL count is approximately 5,000. I am using the command below:

nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something? Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC
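P.S. In case it helps with diagnosis, the unfetched count mentioned above can be confirmed from the CrawlDB statistics. A minimal example, assuming the same crawl/crawldb path as in the command above:

nutch readdb crawl/crawldb -stats

This prints per-status counts for the CrawlDB (e.g. db_unfetched, db_fetched), which should show how many URLs are actually eligible for the generate step.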

