Hi:

It was caused by the operating system clocks being out of sync on the slave
computers. I never thought that this could affect the crawling process.
This is enlightening -__-
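For anyone else who hits this, a minimal sketch of bringing the slave clocks
back in line, assuming ntpdate is installed and the slaves are reachable over
ssh (the host names slave1..slave3 below are placeholders):

  # confirm the skew by printing the current time on every slave
  for host in slave1 slave2 slave3; do ssh "$host" date; done

  # one-off resync against a public NTP pool, run on each skewed slave
  sudo ntpdate pool.ntp.org

  # keep the clocks in sync from now on (service name varies by distro)
  sudo service ntp start

Keeping ntpd running on every node stops the skew from creeping back in
between crawls.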
Appreciate it.

Regards,
Andy

On 11 June 2012 19:22, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi
>
> This CrawlDatum's FetchTime is tomorrow in EST:
> Fetch time: Tue Jun 12 02:59:27 EST 2012
>
>
> -----Original message-----
> > From: Andy Xue <andyxuey...@gmail.com>
> > Sent: Mon 11-Jun-2012 11:00
> > To: user@nutch.apache.org
> > Subject: Generator: 0 records selected for fetching, exiting ...
> >
> > Hi all:
> >
> > This is regarding an error I encountered when doing a distributed crawl:
> > "Generator: 0 records selected for fetching, exiting ..."
> > I understand that this is a well-answered issue, normally caused by
> > either the seed list or the URL filters (or by topN being smaller than
> > the number of reducers, which is not the case here).
> > However, mine is a little different. It runs well in local mode, but
> > fails after the first level when running in distributed mode with
> > several computers.
> >
> > After successfully crawling the first level (i.e., fetching all URLs in
> > the seed list), the generator of the second level fails with the
> > following log:
> > ========================================================================
> > ---- generating ----
> > (started at Mon Jun 11 18:33:57 2012)
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: starting at 2012-06-11 18:34:00
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: filtering: true
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: normalizing: true
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: topN: 5000000
> > 12/06/11 18:34:01 INFO mapred.FileInputFormat: Total input paths to process : 6
> > 12/06/11 18:34:04 INFO mapred.JobClient: Running job: job_201206111824_0007
> > 12/06/11 18:34:05 INFO mapred.JobClient: map 0% reduce 0%
> > 12/06/11 18:34:20 INFO mapred.JobClient: map 18% reduce 0%
> > 12/06/11 18:34:26 INFO mapred.JobClient: map 31% reduce 0%
> > 12/06/11 18:34:29 INFO mapred.JobClient: map 37% reduce 2%
> > 12/06/11 18:34:32 INFO mapred.JobClient: map 50% reduce 7%
> > 12/06/11 18:34:35 INFO mapred.JobClient: map 56% reduce 9%
> > 12/06/11 18:34:38 INFO mapred.JobClient: map 68% reduce 11%
> > 12/06/11 18:34:41 INFO mapred.JobClient: map 75% reduce 14%
> > 12/06/11 18:34:44 INFO mapred.JobClient: map 87% reduce 19%
> > 12/06/11 18:34:47 INFO mapred.JobClient: map 93% reduce 23%
> > 12/06/11 18:34:50 INFO mapred.JobClient: map 100% reduce 25%
> > 12/06/11 18:34:53 INFO mapred.JobClient: map 100% reduce 27%
> > 12/06/11 18:34:56 INFO mapred.JobClient: map 100% reduce 65%
> > 12/06/11 18:34:59 INFO mapred.JobClient: map 100% reduce 100%
> > 12/06/11 18:35:04 INFO mapred.JobClient: Job complete: job_201206111824_0007
> > 12/06/11 18:35:04 INFO mapred.JobClient: Counters: 30
> > 12/06/11 18:35:04 INFO mapred.JobClient:   Job Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Launched reduce tasks=6
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=154922
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Rack-local map tasks=2
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Launched map tasks=32
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Data-local map tasks=30
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=204710
> > 12/06/11 18:35:04 INFO mapred.JobClient:   File Input Format Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Bytes Read=643181
> > 12/06/11 18:35:04 INFO mapred.JobClient:   File Output Format Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Bytes Written=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:   FileSystemCounters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     FILE_BYTES_READ=36
> > 12/06/11 18:35:04 INFO mapred.JobClient:     HDFS_BYTES_READ=651494
> > 12/06/11 18:35:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1311646
> > 12/06/11 18:35:04 INFO mapred.JobClient:   Map-Reduce Framework
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output materialized bytes=1152
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map input records=4259
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce shuffle bytes=1128
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Spilled Records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output bytes=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=5583339520
> > 12/06/11 18:35:04 INFO mapred.JobClient:     CPU time spent (ms)=55840
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map input bytes=618447
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3712
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Combine input records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce input records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce input groups=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Combine output records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=7267901440
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce output records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=109405061120
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output records=0
> > 12/06/11 18:35:04 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> > ========================================================================
> >
> > Some diagnostics that I have done:
> > * Did a local crawl; it successfully finished two levels.
> > * Experimented with distributed crawling using different (and different
> > numbers of) URLs in the seed list.
> > * Used "nutch org.apache.nutch.net.URLFilterChecker -allCombined" to
> > test some URLs in the seed list and the un-fetched web pages in the
> > crawldb. (I did check the content of the crawldb, and it contains plenty
> > of positively scored "db_unfetched" web pages which are linked from the
> > URLs in the seed list.)
> >
> > One entry in the crawldb looks like this ("generate.min.score" is set to 0):
> > ======================================
> > http://www.ci.redmond.wa.us/   Version: 7
> > Status: 1 (db_unfetched)
> > Fetch time: Tue Jun 12 02:59:27 EST 2012
> > Modified time: Thu Jan 01 10:00:00 EST 1970
> > Retries since fetch: 0
> > Retry interval: 2592000 seconds (30 days)
> > Score: 0.05
> > Signature: null
> > Metadata:
> > ======================================
> >
> > The computers used in this distributed crawl were functioning fine
> > previously; I finished plenty of crawls with them in the past month.
> > Maybe it is because I modified some of the code (my customised filters)
> > and some properties in the configuration files.
> >
> > Any suggestions? It is driving me crazy now... Thanks for your help.
> > Even a wild thought or guess would be helpful for me to test out.
> >
> > Appreciate your time.
> >
> > Andy
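For the archive: two commands that make this failure mode easy to confirm and
to work around. The paths crawl/crawldb and crawl/segments are placeholders
for your own crawl directories:

  # dump the CrawlDatum for one URL, then compare its fetch time with
  # the output of `date` on the machine that runs the generator
  bin/nutch readdb crawl/crawldb -url http://www.ci.redmond.wa.us/

  # stop-gap while the clocks are being fixed: shift the generator's
  # notion of "now" forward so entries due tomorrow are selected today
  bin/nutch generate crawl/crawldb crawl/segments -topN 5000000 -adddays 1

The generator skips any entry whose fetch time is still in the future, so a
slave whose clock runs ahead writes fetch times that no other node considers
due yet. -adddays only papers over the symptom; syncing the clocks is the
real fix.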