Hello,

I meant that it could be a gora-mysql problem. To test it, you can run 
Nutch in local mode with generator debug logging enabled. Put this line
log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout

in your conf/log4j.properties

and run the crawl cycle through updatedb. If gora-mysql works properly, you 
should see lines like this in the output:

-shouldFetch rejected '<url>', fetchTime=<fetchTime>, curTime=<curTime>

for the URLs that were fetched in the previous cycle. If you do not see them, 
then gora-mysql has issues.
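For example, something like this (the topN value and script path are just an 
example; the cmdstdout appender sends the DEBUG lines to the console, so grep 
the command output):

```shell
# Run the generate step in local mode and look for the scheduler's
# rejection messages on stdout/stderr (paths assume a default install):
bin/nutch generate -topN 1000 2>&1 | grep "shouldFetch rejected"
```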

Good luck.
Alex.


-----Original Message-----
From: Luca Vasarelli <[email protected]>
To: user <[email protected]>
Sent: Fri, Oct 19, 2012 1:01 am
Subject: Re: Same pages crawled more than once and slow crawling


> Hi Luca,

Hi Sebastian, thanks for replying!

> But after the 5th cycle the crawler stopped?

Yes

> For Pierre this has worked...
> Any suggestions?

I can post info for each step, but please tell me which output is more 
important: the Hadoop log or the MySQL table? If the latter, which fields?

Alex says it's a MySQL problem; how can I verify after the generate step 
whether he is correct?

> Well, Nutch (resp. Hadoop) are designed to process much data. Job management 
> has some overhead (and some artificial sleeps): 5 cycles * 4 jobs 
> (generate/fetch/parse/update) = 20 jobs.
> 6s per job seems roughly ok, though it could be slightly faster.

Yes, this test is not well suited to Nutch, but as Stefan suggested, I was 
thinking of a config setting or hardcoded delay somewhere in the Nutch files 
that I could try to reduce, since I will run it on a single machine.

Luca

