Hello, I meant that it could be a gora-mysql problem. To test it, you can run Nutch in local mode with Generator debugging enabled. Put

log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout
in your conf/log4j.properties and run the crawl cycle with updatedb. If gora-mysql works properly, you should see lines of the form "shouldFetch rejected <url>, fetchTime <fetchTime> curTime <curTime>" in the output for the URLs that were fetched in the previous cycle. If you do not see them, it means gora-mysql has issues.

Good luck.
Alex.

-----Original Message-----
From: Luca Vasarelli <[email protected]>
To: user <[email protected]>
Sent: Fri, Oct 19, 2012 1:01 am
Subject: Re: Same pages crawled more than once and slow crawling

> Hi Luca,

Hi Sebastian, thanks for replying!

> But after the 5th cycle the crawler stopped?

Yes.

> For Pierre this has worked...
> Any suggestions?

I can post info for each step, but please tell me which log is most important: the Hadoop log? The MySQL table? If the latter, which fields? Alex says it's a MySQL problem; how can I verify, after the generate step, whether he is correct?

> Well, Nutch (resp. Hadoop) are designed to process much data. Job management has some overhead
> (and some artificial sleeps): 5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs.
> 6s per job seems roughly ok, though it could be slightly faster.

Yes, this test is not well designed for Nutch, but I thought, as Stefan said, that there might be a config option or hardcoded delay somewhere in the Nutch files that I could reduce, since I will use it on a single machine.

Luca
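P.S. A minimal sketch of how one could filter for the DEBUG lines Alex describes. The URL and timestamps below are made-up placeholders standing in for real generator output, and the exact message format may differ between Nutch versions:

```shell
#!/bin/sh
# Hypothetical sample of the kind of DEBUG line to look for after
# enabling the GeneratorJob logger; values here are placeholders.
sample='shouldFetch rejected http://example.com/page, fetchTime 1350630000000 curTime 1350620000000'

# In a real run you would pipe the generate step's output (or the
# Hadoop log) through the same grep instead of echoing a sample line.
echo "$sample" | grep -c "shouldFetch rejected"
```

If the count is zero after a second crawl cycle, the previously fetched URLs were not recognized, which would point at the gora-mysql store.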

