Thanks, Sebastian. This is solved now. I looked through the code and found that Nutch limits the count of host URLs to *topN / number of reducer tasks*. Please see [0].
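As a rough illustration of that arithmetic, here is a minimal sketch. It is not the actual Generator code: the class and method names are hypothetical, and it only assumes the limit is computed with integer division, as the linked line suggests.

```java
// Hypothetical helper, not part of Nutch: mirrors the "topN / number of
// reduce tasks" calculation described above, assuming integer division.
final class GenerateLimit {

    // Returns the assumed per-reducer cap on selected URLs.
    static long perReducerLimit(long topN, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            throw new IllegalArgumentException("numReduceTasks must be positive");
        }
        // Integer division drops the fractional part.
        return topN / numReduceTasks;
    }

    public static void main(String[] args) {
        // topN 1000 split across 16 reduce tasks.
        System.out.println(perReducerLimit(1000L, 16)); // prints 62
    }
}
```

If all URLs of a single site end up in the same reducer, that one reducer's share would be all that gets generated, which would match the number I saw.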
So, I was running 16 reduce tasks with topN 1000, hence 62 URLs (1000 / 16, rounded down). I am interested to know the reason for this. Is it due to politeness?

[0]: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141

Regards,
Karanjeet Singh
USC

On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> I didn't see anything wrong. Did you check whether
> CrawlDb entries are marked as "generated"
> by "_ngt_="? With generate.update.crawldb=true
> it may happen that after having run generate
> multiple times, only 62 unfetched and not-generated
> entries remain.
>
> Sebastian
>
> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am trying to crawl a website using Nutch on a Hadoop cluster. I have
> > modified the crawl script to restrict the sizeFetchList to 1000 (which is
> > the topN value for the nutch generate command).
> >
> > However, as I see, Nutch is only generating 62 URLs where the unfetched URL
> > count is 5,000 (approx). I am using the below command:
> >
> > nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> > mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
> > -numFetchers 1 -noFilter
> >
> > Can anyone please look into this and let me know if I am missing something?
> > Please find the crawl configuration here [0].
> >
> > [0]:
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_karanjeets_crawl-2Devaluation_tree_master_nutch_conf&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=O7MP8WCf7SwgrHXMvaLfmySYST5zRY_AIRTn6cMKclA&s=QwXZBwYqg1DRis1p2p3iCS6zk4VIb-alEkjMhnzjpWg&e=
> >
> > Thanks & Regards,
> > Karanjeet Singh
> > USC

