Hi,

I didn't see anything wrong. Did you check whether CrawlDb entries are marked as "generated" by "_ngt_="? With generate.update.crawldb=true it may happen that, after having run generate multiple times, only 62 unfetched and not-yet-generated entries remain. Entries still carrying the marker are skipped by subsequent generate runs until updatedb clears it (or the lock expires).
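A quick way to check (a sketch: "crawldb_dump" is just an arbitrary output directory, and the CrawlDb path is taken from your command; if you run on the cluster, the dump lands on HDFS, so copy it to the local filesystem before grepping):

  # overall status counts (db_unfetched etc.)
  nutch readdb crawl/crawldb -stats

  # dump the CrawlDb as text and count entries carrying the generate marker
  nutch readdb crawl/crawldb -dump crawldb_dump -format normal
  grep -c '_ngt_' crawldb_dump/part-*

If the second count plus the fetched/gone entries roughly accounts for your 5,000 unfetched URLs, that would explain why only 62 are left to generate.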
Sebastian

On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> Hello,
>
> I am trying to crawl a website using Nutch on a Hadoop cluster. I have
> modified the crawl script to restrict the sizeFetchList to 1000 (which is
> the topN value for the nutch generate command).
>
> However, as I see, Nutch is only generating 62 URLs where the unfetched URL
> count is 5,000 (approx). I am using the below command:
>
> nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
> -numFetchers 1 -noFilter
>
> Can anyone please look into this and let me know if I am missing something?
> Please find the crawl configuration here [0].
>
> [0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
>
> Thanks & Regards,
> Karanjeet Singh
> USC

