Hi:

It was caused by the operating system clocks being out of sync across
these slave computers. I never thought that this could affect the
crawling process. This is enlightening -__-..
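
For the record, here is my understanding of why this happens: the
Generator only selects entries whose fetch time has already passed
according to the clock of the node evaluating them, so an entry stamped
by a slave whose clock runs ahead looks "not yet due" everywhere else.
A minimal sketch (illustrative Java, not the actual Nutch source; the
class name and the 8-hour skew are made up):

========================================================================
// Hypothetical demo, not Nutch code: an entry stamped by a fast clock
// is skipped because its fetch time is still in the future locally.
public class FetchTimeSkewDemo {
  public static void main(String[] args) {
    // Fetch time written by a slave whose clock runs ~8 hours ahead.
    long fetchTime = System.currentTimeMillis() + 8L * 3600 * 1000;
    // Current time on the node generating the next fetch list.
    long curTime = System.currentTimeMillis();

    // Only entries that are due (fetchTime <= curTime) get selected.
    if (fetchTime > curTime) {
      System.out.println("skipped: fetch time is still in the future");
    } else {
      System.out.println("selected for fetching");
    }
  }
}
========================================================================

Until the lagging clocks catch up, every generate run keeps skipping
those entries, hence "0 records selected". Keeping the slaves
synchronised (e.g. with NTP) should avoid this.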

Appreciate it.

Regards
Andy


On 11 June 2012 19:22, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi
>
> This CrawlDatum's FetchTime is tomorrow in EST
> Fetch time: Tue Jun 12 02:59:27 EST 2012
>
>
> -----Original message-----
> > From:Andy Xue <andyxuey...@gmail.com>
> > Sent: Mon 11-Jun-2012 11:00
> > To: user@nutch.apache.org
> > Subject: Generator: 0 records selected for fetching, exiting ...
> >
> > Hi all:
> >
> > This is regarding an error I encountered when doing a distributed crawl:
> > "Generator: 0 records selected for fetching, exiting ..."
> > I understand that this is a well-answered issue, normally caused by
> > either the seed list or the URL filters (or by topN being smaller than
> > the number of reducers, which is not the case here).
> > However, mine is a little different. It runs well in local mode, but
> > fails after the first level when running in distributed mode with
> > several computers.
> >
> > After successfully crawling the first level (i.e., fetching all URLs in
> > the seed list), the generator of the second level fails with the
> > following log:
> > ========================================================================
> > ---- generating ----
> > (started at Mon Jun 11 18:33:57 2012)
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: starting at 2012-06-11 18:34:00
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: filtering: true
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: normalizing: true
> > 12/06/11 18:34:00 INFO crawl.Generator: Generator: topN: 5000000
> > 12/06/11 18:34:01 INFO mapred.FileInputFormat: Total input paths to process : 6
> > 12/06/11 18:34:04 INFO mapred.JobClient: Running job: job_201206111824_0007
> > 12/06/11 18:34:05 INFO mapred.JobClient:  map 0% reduce 0%
> > 12/06/11 18:34:20 INFO mapred.JobClient:  map 18% reduce 0%
> > 12/06/11 18:34:26 INFO mapred.JobClient:  map 31% reduce 0%
> > 12/06/11 18:34:29 INFO mapred.JobClient:  map 37% reduce 2%
> > 12/06/11 18:34:32 INFO mapred.JobClient:  map 50% reduce 7%
> > 12/06/11 18:34:35 INFO mapred.JobClient:  map 56% reduce 9%
> > 12/06/11 18:34:38 INFO mapred.JobClient:  map 68% reduce 11%
> > 12/06/11 18:34:41 INFO mapred.JobClient:  map 75% reduce 14%
> > 12/06/11 18:34:44 INFO mapred.JobClient:  map 87% reduce 19%
> > 12/06/11 18:34:47 INFO mapred.JobClient:  map 93% reduce 23%
> > 12/06/11 18:34:50 INFO mapred.JobClient:  map 100% reduce 25%
> > 12/06/11 18:34:53 INFO mapred.JobClient:  map 100% reduce 27%
> > 12/06/11 18:34:56 INFO mapred.JobClient:  map 100% reduce 65%
> > 12/06/11 18:34:59 INFO mapred.JobClient:  map 100% reduce 100%
> > 12/06/11 18:35:04 INFO mapred.JobClient: Job complete: job_201206111824_0007
> > 12/06/11 18:35:04 INFO mapred.JobClient: Counters: 30
> > 12/06/11 18:35:04 INFO mapred.JobClient:   Job Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Launched reduce tasks=6
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=154922
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Rack-local map tasks=2
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Launched map tasks=32
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Data-local map tasks=30
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=204710
> > 12/06/11 18:35:04 INFO mapred.JobClient:   File Input Format Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Bytes Read=643181
> > 12/06/11 18:35:04 INFO mapred.JobClient:   File Output Format Counters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Bytes Written=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:   FileSystemCounters
> > 12/06/11 18:35:04 INFO mapred.JobClient:     FILE_BYTES_READ=36
> > 12/06/11 18:35:04 INFO mapred.JobClient:     HDFS_BYTES_READ=651494
> > 12/06/11 18:35:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1311646
> > 12/06/11 18:35:04 INFO mapred.JobClient:   Map-Reduce Framework
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output materialized bytes=1152
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map input records=4259
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce shuffle bytes=1128
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Spilled Records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output bytes=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=5583339520
> > 12/06/11 18:35:04 INFO mapred.JobClient:     CPU time spent (ms)=55840
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map input bytes=618447
> > 12/06/11 18:35:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3712
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Combine input records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce input records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce input groups=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Combine output records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=7267901440
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Reduce output records=0
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=109405061120
> > 12/06/11 18:35:04 INFO mapred.JobClient:     Map output records=0
> > 12/06/11 18:35:04 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> > ========================================================================
> >
> > Some diagnostics I have done:
> > * Did a local crawl, which successfully finished two levels;
> > * Experimented with distributed crawling using different (and different
> > numbers of) URLs in the seed list;
> > * Used "nutch org.apache.nutch.net.URLFilterChecker -allCombined" to test
> > some URLs in the seed list and the un-fetched web pages in the crawldb (I
> > did check the content of the crawldb, and it does contain plenty of
> > positively scored "db_unfetched" web pages which are linked from the URLs
> > in the seed list).
> >
> > One entry in the crawldb looks like this ("generate.min.score" is set
> > to 0):
> > ======================================
> > http://www.ci.redmond.wa.us/    Version: 7
> > Status: 1 (db_unfetched)
> > Fetch time: Tue Jun 12 02:59:27 EST 2012
> > Modified time: Thu Jan 01 10:00:00 EST 1970
> > Retries since fetch: 0
> > Retry interval: 2592000 seconds (30 days)
> > Score: 0.05
> > Signature: null
> > Metadata:
> > ======================================
> >
> >
> > The computers used in this distributed crawl were functioning fine
> > previously; I finished plenty of crawls with them in the past month.
> > Maybe it is because I modified some of the code (my customised filters)
> > and some properties in the configuration files.
> >
> > Any suggestions? It is driving me crazy now...
> > Thanks for your help. Even a wild thought or guess would be helpful for
> > me to test out.
> >
> > Appreciate your time.
> >
> > Andy
> >
>
