Hi Christian,

I have been busy with my own problem for a couple of days now, and it turned
out to be caused by something minor. What I learned along the way is this: in
conf/log4j.properties, I set

log4j.logger.org.apache.nutch.crawl.Crawl=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Injector=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Generator=DEBUG, cmdst
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG, cmdst

That way I can check logs/hadoop.log afterwards to see what happened while I
was away. You could also inspect the traffic (request and response headers)
with an HTTP sniffer for the URLs that fail. That is how I caught mine.
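
To spot segments that did not finish, such as the one in your listing quoted below, something like the following might help. It is only a minimal sketch in Python: the function name `unfinished_segments` is my own invention, and it assumes the `readseg -list` output has one segment per line with `?` in the columns that were never filled in, as in your listing.

```python
def unfinished_segments(listing):
    """Return names of segments whose fetch/parse columns contain '?'.

    Assumes one segment per line, whitespace-separated columns, and '?'
    as the placeholder for missing values (an assumption based on the
    listing quoted in this thread).
    """
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "NAME":
            continue
        # Columns after NAME and GENERATED hold the fetcher/parse data.
        if "?" in fields[2:]:
            flagged.append(fields[0])
    return flagged


# Sample rows copied from the listing in this thread.
sample = (
    "NAME\tGENERATED\tFETCHER START\tFETCHER END\tFETCHED\tPARSED\n"
    "20110803090956\t201\t?\t?\t?\t?\n"
    "20110803091137\t201\t2011-08-03T09:11:41\t2011-08-03T09:12:40\t201\t197\n"
)
print(unfinished_segments(sample))
```

Reading the text from standard input instead of `sample` would let you pipe the output of `bin/nutch readseg -list crawl/segments/*` straight into it.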

Best regards...


2011/8/3 Christian Weiske <[email protected]>

> Hello Dinçer,
>
>
> > One more thing, will you share the stats as:
> >
> > $ bin/nutch readdb crawl-dir/crawldb -stats
> > $ bin/nutch readseg -list crawl-dir/segments/*
> >
> > When I got that error, the latter listing showed that one (or more)
> > segments had not finished properly. But now you can see my segments seem
> > OK. What about yours?
>
> I got the problem again, and I have a segment without data:
>
> $ bin/nutch readdb crawl/crawldb -stats
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     1915
> retry 0:        1911
> retry 2:        4
> min score:      0.0
> avg score:      0.0013519583
> max score:      1.056
> status 1 (db_unfetched):        4
> status 2 (db_fetched):  1909
> status 3 (db_gone):     1
> status 4 (db_redir_temp):       1
> CrawlDb statistics: done
>
> $ bin/nutch readseg -list crawl/segments/*
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20110801134606  1          2011-08-01T13:46:11  2011-08-01T13:46:11  1        1
> 20110801134620  35         2011-08-01T13:46:23  2011-08-01T13:46:52  36       30
> 20110801134706  257        2011-08-01T13:47:08  2011-08-01T13:47:57  257      256
> 20110801134825  720        2011-08-01T13:48:28  2011-08-01T13:50:45  720      720
> 20110801135116  684        2011-08-01T13:51:18  2011-08-01T13:53:29  684      684
> 20110803090956  201        ?                    ?                    ?        ?
> 20110803091137  201        2011-08-03T09:11:41  2011-08-03T09:12:40  201      197
> 20110803091304  21         2011-08-03T09:13:07  2011-08-03T09:13:10  21       21
>
>
>
> --
> Best regards
> Dipl.-Inf. Christian Weiske
>
> Senior Developer
> Netresearch GmbH & Co. KG
>
