Could you dump the entry for this URL in the crawldb with
bin/nutch readdb crawl/crawldb -url <url>
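For example (the URL here is just a placeholder for one of your failing URLs):

bin/nutch readdb crawl/crawldb -url http://www.example.com/failing-page.html

The dumped CrawlDatum should show, among other things, the status, the next fetch time and the retry counter, which is what we want to look at here.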
It also depends on your configuration: what are the values of
db.fetch.interval.default and db.fetch.schedule.class?
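If I remember correctly, the defaults in nutch-default.xml look roughly like this (your nutch-site.xml may override them):

<property>
  <name>db.fetch.interval.default</name>
  <!-- 30 days, in seconds -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
</property>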
A guess (I have not done any crawling with Nutch for a while, so I'm not sure): the URL ends up with CrawlDatum.STATUS_FETCH_RETRY during the crawl, and the crawldb update then calls

schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
    prevModifiedTime, fetch.getFetchTime());
In AbstractFetchSchedule you can read:
/**
 * This method adjusts the fetch schedule if fetching needs to be
 * re-tried due to transient errors. The default implementation
 * sets the next fetch time 1 day in the future and increases
 * the retry counter.
 * @param url URL of the page
 * @param datum page information
 * @param prevFetchTime previous fetch time
 * @param prevModifiedTime previous modified time
 * @param fetchTime current fetch time
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than {@param datum}, but
 * implementations should make sure that it contains at least all
 * information from {@param datum}.
 */
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime, long fetchTime) {
  datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  return datum;
}
You have set adddays to 5 days, while this retry schedule pushes the next fetch only 1 day into the future, so on every generate run the failed URLs are eligible again and get refetched over and over. Try adddays with a value of 0, or change this line in AbstractFetchSchedule to use a longer retry delay:

datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
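If you prefer not to patch the Nutch sources, here is a minimal sketch of what I mean, as a custom schedule you could point db.fetch.schedule.class at (the class name, package and the 7-day delay are just made up for illustration):

package org.example.nutch; // hypothetical package, pick your own

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

// Sketch: same behaviour as DefaultFetchSchedule, except that retries
// after transient errors are pushed 7 days into the future instead of 1,
// so generate -adddays 5 no longer picks them up again immediately.
public class LongRetryFetchSchedule extends DefaultFetchSchedule {

  private static final long RETRY_DELAY_DAYS = 7; // made-up value

  @Override
  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    datum.setFetchTime(fetchTime
        + RETRY_DELAY_DAYS * (long) SECONDS_PER_DAY * 1000L);
    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    return datum;
  }
}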
[email protected] wrote:
> Hi,
>
> I am trying to use Nutch 1.1 to build a complete index of our corporate web
> sites. I am using a script based on this one:
>
> http://wiki.apache.org/nutch/Crawl
>
> The text of my script is included below. I set the crawling depth to 100 to
> make sure that everything is indexed, expecting that the process will stop
> after about 20 iterations. The problem is that the Generator keeps
> re-scheduling fetching of failed URLs. The process stopped because the max
> number of iterations (the depth) was reached. After a few iterations, only
> failing URLs were being repeatedly fetched. I checked the log for one of
> them. It failed with code 403 and was re-scheduled for fetching 94 times.
> This does not seem right.
>
> Another problem that I've noticed is that sometimes the Generator schedules
> fetching of just one URL per iteration. This happened once and I did not try
> to repeat this effect, but this does not seem right either.
>
> Here is my script text:
>
> -----------------------------------------------
> #!/bin/bash
>
> depth=100
> threads=10
> adddays=5
> topN=50000
>
> NUTCH_HOME=/data/HORUS_1/nutch1.1
> NUTCH_HEAPSIZE=8000
> nutch=$NUTCH_HOME
> JAVA_HOME=/usr/lib/jvm/java-6-sun
> export NUTCH_HOME
> export JAVA_HOME
> export NUTCH_HEAPSIZE
>
> steps=6
> echo "----- Inject (Step 1 of $steps) -----"
> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for((i=0; i < $depth; i++))
> do
> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN -adddays $adddays
> if [ $? -ne 0 ]
> then
> echo "Stopping at depth $depth. No more URLs to fetch."
> break
> fi
> segment=`ls -d $nutch/crawl/segments/* | tail -1`
>
> $nutch/bin/nutch fetch $segment -threads $threads
> if [ $? -ne 0 ]
> then
> echo "fetch $segment at depth $depth failed. Deleting it."
> rm -rf $segment
> continue
> fi
>
> # echo "--- Parsing Segment $segment ---"
> # $nutch/bin/nutch parse $segment
>
> $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
> done
>
> echo "----- Invert Links (Step 3 of $steps) -----"
> $nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*
>
> echo "----- Index (Step 4 of $steps) -----"
> $nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*
>
> echo "----- Dedup (Step 5 of $steps) -----"
> $nutch/bin/nutch dedup $nutch/crawl/preIndex
>
> echo "----- Merge Indexes (Step 6 of $steps) -----"
> $nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex
>
> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
> rm -rf $nutch/crawl/tmp/*
> -----------------------------------------------
>
> Is anyone experiencing same problems? Is there anything wrong in what I am
> doing?
>
> Regards,
>
> Arkadi