Hi Christian,

I have been busy with my own problem for a couple of days now and found that it came down to something minor. What I learned from it is this: in conf/log4j.properties, I set
log4j.logger.org.apache.nutch.crawl.Crawl=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Injector=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Generator=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Fetcher=DEBUG, cmdst

That way, I can check logs/hadoop.log to see what went on while I was away.

You could also inspect the traffic (request and response headers) with an HTTP sniffer for the URLs that run into trouble. That is how I caught mine.

Best regards...

2011/8/3 Christian Weiske <[email protected]>

> Hello Dinçer,
>
> > One more thing, will you share the stats as:
> >
> > $ bin/nutch readdb crawl-dir/crawldb -stats
> > $ bin/nutch readseg -list crawl-dir/segments/*
> >
> > When I got that error, the latter list shows that one (or more)
> > segments is not finished well. But now you can see my segments seem
> > ok. What about yours?
>
> I got the problem again, and I have a segment without data:
>
> $ bin/nutch readdb crawl/crawldb -stats
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 1915
> retry 0: 1911
> retry 2: 4
> min score: 0.0
> avg score: 0.0013519583
> max score: 1.056
> status 1 (db_unfetched): 4
> status 2 (db_fetched): 1909
> status 3 (db_gone): 1
> status 4 (db_redir_temp): 1
> CrawlDb statistics: done
>
> $ bin/nutch readseg -list crawl/segments/*
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20110801134606  1          2011-08-01T13:46:11  2011-08-01T13:46:11  1        1
> 20110801134620  35         2011-08-01T13:46:23  2011-08-01T13:46:52  36       30
> 20110801134706  257        2011-08-01T13:47:08  2011-08-01T13:47:57  257      256
> 20110801134825  720        2011-08-01T13:48:28  2011-08-01T13:50:45  720      720
> 20110801135116  684        2011-08-01T13:51:18  2011-08-01T13:53:29  684      684
> 20110803090956  201        ?                    ?                    ?        ?
> 20110803091137  201        2011-08-03T09:11:41  2011-08-03T09:12:40  201      197
> 20110803091304  21         2011-08-03T09:13:07  2011-08-03T09:13:10  21       21
>
> --
> Best regards
> Dipl.-Inf. Christian Weiske
>
> Senior Developer
> Netresearch GmbH & Co. KG
>
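
P.S. In case it helps with spotting a segment like 20110803090956 without reading the listing by eye, here is a minimal sketch in Python. It assumes the output of `bin/nutch readseg -list` keeps the six whitespace-separated columns shown in the quoted listing, and that an unfinished segment shows `?` in the fetch columns. The function name `unfinished_segments` and the `SAMPLE` text are only illustrative; they are not part of Nutch.

```python
def unfinished_segments(listing):
    """Return the names of segments whose fetch/parse columns contain '?'.

    `listing` is the text printed by `bin/nutch readseg -list`, assumed to
    have six whitespace-separated columns per data row.
    """
    unfinished = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric segment name; this skips the header
        # row and any blank or unrelated lines.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        name, _generated, start, end, fetched, parsed = fields[:6]
        if "?" in (start, end, fetched, parsed):
            unfinished.append(name)
    return unfinished


# Three rows copied from the listing quoted above, plus the header.
SAMPLE = """\
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110801135116  684        2011-08-01T13:51:18  2011-08-01T13:53:29  684      684
20110803090956  201        ?                    ?                    ?        ?
20110803091137  201        2011-08-03T09:11:41  2011-08-03T09:12:40  201      197
"""

if __name__ == "__main__":
    for segment in unfinished_segments(SAMPLE):
        print(segment)
```

To use it on a real crawl, you could save the listing to a file first and pass the file's contents to the function instead of `SAMPLE`.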

