Hi,

I didn't see anything wrong. Did you check whether
CrawlDb entries are marked as "generated"
by the "_ngt_" metadata key? With generate.update.crawldb=true
it may happen that, after generate has been run
multiple times, only 62 unfetched and not-yet-generated
entries remain.
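
You can verify this by dumping the CrawlDb as text and counting the
entries that carry the generate marker. A minimal sketch (the crawldb
path matches your command below; the dump directory name is just an
example):

  nutch readdb crawl/crawldb -dump crawldb-dump -format normal
  grep -c '_ngt_' crawldb-dump/part-*

Entries carrying "_ngt_" are skipped by further generate runs until the
lock expires (crawl.gen.delay, one week by default) or until their
segments are fetched and updatedb is run.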

Sebastian

On 04/14/2016 03:31 AM, Karanjeet Singh wrote:
> Hello,
> 
> I am trying to crawl a website using Nutch on a Hadoop cluster. I have
> modified the crawl script to restrict the sizeFetchList to 1000 (which is
> the topN value for the nutch generate command).
> 
> However, Nutch is generating only 62 URLs, although the unfetched URL
> count is approximately 5,000. I am using the command below:
> 
> nutch generate -D mapreduce.job.reduces=16 -D mapreduce.job.maps=8 -D
> mapred.child.java.opts=-Xmx8192m -D mapreduce.map.memory.mb=8192 -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 1000
> -numFetchers 1 -noFilter
> 
> Can anyone please look into this and let me know if I am missing something?
> Please find the crawl configuration here [0].
> 
> [0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf
> 
> Thanks & Regards,
> Karanjeet Singh
> USC
> 
