Thanks, Sebastian.

This is solved now. I looked through the code and found that Nutch places a
limit on the number of URLs selected, which is defined by *topN / number
of reducer tasks*. Please see [0].

So, I was running 16 reduce tasks with topN 1000, and hence got 62 URLs
(1000 / 16 = 62.5, truncated to 62).
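
The arithmetic above can be sketched roughly as follows. This is only an
illustration, not the actual Generator code: the class and method names are
hypothetical, and it assumes that the limit is computed with integer division
and that all URLs of the crawled site end up in a single reducer.

```java
public class PerReducerLimit {

    // Hypothetical helper: the limit described above, topN divided by the
    // number of reduce tasks. Division of two integers truncates, so
    // 1000 / 16 gives 62 rather than 62.5.
    static long perReducerLimit(long topN, int numReduceTasks) {
        return topN / numReduceTasks;
    }

    // Assumption: every unfetched URL reaches the same reducer, so the
    // generated count is capped by that one reducer's limit.
    static long selectedFromSingleReducer(long unfetched, long topN, int numReduceTasks) {
        return Math.min(unfetched, perReducerLimit(topN, numReduceTasks));
    }

    public static void main(String[] args) {
        // Values from this thread: topN 1000, 16 reduce tasks, ~5,000 unfetched URLs.
        System.out.println(perReducerLimit(1000, 16));                 // prints 62
        System.out.println(selectedFromSingleReducer(5000, 1000, 16)); // prints 62
    }
}
```

Under the same assumptions, a single reduce task would leave the limit at the
full topN of 1000.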

I am interested to know the reason for this. Is it due to politeness?

[0]:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141

Regards,
Karanjeet Singh
USC

On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> I didn't see anything wrong. Did you check whether
> CrawlDb entries are marked as "generated"
> by "_ngt_="?  With generate.update.crawldb=true
> it may happen that after having run generate
> multiple times, only 62 unfetched and not-generated
> entries remain.
>
> Sebastian
>
> On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am trying to crawl a website using Nutch on a Hadoop cluster. I have
> > modified the crawl script to restrict sizeFetchList to 1000 (which is
> > the topN value for the nutch generate command).
> >
> > However, as far as I can see, Nutch is generating only 62 URLs although
> > the unfetched URL count is approximately 5,000. I am using the command below:
> >
> > nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> > mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
> > -numFetchers 1 -noFilter
> >
> > Can anyone please look into this and let me know if I am missing something?
> > Please find the crawl configuration here [0].
> >
> > [0]:
> > https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
> >
> > Thanks & Regards,
> > Karanjeet Singh
> > USC
> >
>
>