Hi Reinhard,

>-----Original Message-----
>From: reinhard schwab [mailto:[email protected]]
>Sent: Thursday, July 01, 2010 2:10 PM
>To: [email protected]
>Subject: Re: Generator problems in Nutch 1.1
>
>could you dump the entry of this url in crawl db with
>
>bin/nutch readdb crawl/crawldb -url <url>
Here it is:

URL: http://www.atnf.csiro.au/people/bkoribal/obs/C848.html
Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 01 20:51:33 GMT+10:00 2010
Modified time: Thu Jan 01 10:00:00 GMT+10:00 1970
Retries since fetch: 94
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403, url=http://www.atnf.csiro.au/people/bkoribal/obs/C848.html

Please note that this is not the only URL which was repeatedly re-fetched. There were a number of them, about 150.

>it also depends on your configuration.
>what is the value for
>db.fetch.interval.default and for
>db.fetch.schedule.class?

I did not change these parameters in nutch-default.xml:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.</description>
</property>

>
>a guess. i have not done crawling with nutch for a while, so im not sure.
>i guess it has CrawlDatum.STATUS_FETCH_RETRY when crawling and it uses
>
>  schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
>      prevModifiedTime, fetch.getFetchTime());
>
>in AbstractFetchSchedule you can read
>
>  /**
>   * This method adjusts the fetch schedule if fetching needs to be
>   * re-tried due to transient errors. The default implementation
>   * sets the next fetch time 1 day in the future and increases
>   * the retry counter.
>   * @param url URL of the page
>   * @param datum page information
>   * @param prevFetchTime previous fetch time
>   * @param prevModifiedTime previous modified time
>   * @param fetchTime current fetch time
>   * @return adjusted page information, including all original information.
>   *         NOTE: this may be a different instance than {@param datum}, but
>   *         implementations should make sure that it contains at least all
>   *         information from {@param datum}.
>   */
>  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
>      long prevFetchTime, long prevModifiedTime, long fetchTime) {
>    datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
>    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
>    return datum;
>  }
>
>you have set adddays to 5 days.
>in this case it will be refetched and refetched again.
>try adddays with value 0.
>or change the code in AbstractFetchSchedule
>
>datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);

I will try 0, thanks. The interesting thing is, I don't recall this happening with Nutch 1.0. I used it with the same script. The other problem that I mentioned, scheduling 1 new URL per iteration, is also new. I definitely did not have it with Nutch 1.0.
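If I follow your explanation correctly, setPageRetrySchedule() only pushes the next fetch one day ahead, so generating with -adddays 5 makes such an entry look due again on every pass. In case adddays 0 is not enough, I suppose an alternative to patching AbstractFetchSchedule itself would be a small custom schedule plugged in via db.fetch.schedule.class, something along these lines (untested sketch; the class name, package and the retry cap / back-off values are just placeholders I made up):

-----------------------------------------------
package org.example.nutch; // placeholder package, not part of Nutch

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

/**
 * Untested sketch: stop retrying permanently failing pages after a few
 * attempts instead of rescheduling them one day ahead forever.
 */
public class LimitedRetryFetchSchedule extends DefaultFetchSchedule {

  private static final int MAX_RETRIES = 3;       // placeholder cap on retry attempts
  private static final long RETRY_DELAY_DAYS = 7; // placeholder back-off between retries

  @Override
  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    if (datum.getRetriesSinceFetch() >= MAX_RETRIES) {
      // Give up on this URL: mark it gone and let the normal "gone"
      // schedule decide when (if ever) it should be looked at again.
      datum.setStatus(CrawlDatum.STATUS_DB_GONE);
      return setPageGoneSchedule(url, datum, prevFetchTime, prevModifiedTime, fetchTime);
    }
    // Otherwise behave like the default, but back off more than one day
    // so that generate -adddays does not pick the entry up again at once.
    datum.setFetchTime(fetchTime + RETRY_DELAY_DAYS * 24L * 60L * 60L * 1000L);
    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    return datum;
  }
}
-----------------------------------------------

If something like this worked, db.fetch.schedule.class in nutch-site.xml would then point at this class instead of org.apache.nutch.crawl.DefaultFetchSchedule.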
Regards,

Arkadi

>
>[email protected] schrieb:
>> Hi,
>>
>> I am trying to use Nutch 1.1 to build a complete index of our corporate web sites. I am using a script based on this one:
>>
>> http://wiki.apache.org/nutch/Crawl
>>
>> The text of my script is included below. I set the crawling depth to 100 to make sure that everything is indexed, expecting that the process will stop after about 20 iterations. The problem is that the Generator keeps re-scheduling fetching of failed URLs. The process stopped because the max number of iterations (the depth) was reached. After a few iterations, only failing URLs were being repeatedly fetched. I checked the log for one of them. It failed with code 403 and was re-scheduled for fetching 94 times. This does not seem right.
>>
>> Another problem that I've noticed is that sometimes the Generator schedules fetching of just one URL per iteration. This happened once and I did not try to repeat this effect, but this does not seem right either.
>>
>> Here is my script text:
>>
>> -----------------------------------------------
>> #!/bin/bash
>>
>> depth=100
>> threads=10
>> adddays=5
>> topN=50000
>>
>> NUTCH_HOME=/data/HORUS_1/nutch1.1
>> NUTCH_HEAPSIZE=8000
>> nutch=$NUTCH_HOME
>> JAVA_HOME=/usr/lib/jvm/java-6-sun
>> export NUTCH_HOME
>> export JAVA_HOME
>> export NUTCH_HEAPSIZE
>>
>> steps=6
>> echo "----- Inject (Step 1 of $steps) -----"
>> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>>
>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>> for((i=0; i < $depth; i++))
>> do
>> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>> $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN -adddays $adddays
>> if [ $? -ne 0 ]
>> then
>> echo "Stopping at depth $depth. No more URLs to fetch."
>> break
>> fi
>> segment=`ls -d $nutch/crawl/segments/* | tail -1`
>>
>> $nutch/bin/nutch fetch $segment -threads $threads
>> if [ $? -ne 0 ]
>> then
>> echo "fetch $segment at depth $depth failed. Deleting it."
>> rm -rf $segment
>> continue
>> fi
>>
>> # echo "--- Parsing Segment $segment ---"
>> # $nutch/bin/nutch parse $segment
>>
>> $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
>> done
>>
>> echo "----- Invert Links (Step 3 of $steps) -----"
>> $nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*
>>
>> echo "----- Index (Step 4 of $steps) -----"
>> $nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*
>>
>> echo "----- Dedup (Step 5 of $steps) -----"
>> $nutch/bin/nutch dedup $nutch/crawl/preIndex
>>
>> echo "----- Merge Indexes (Step 6 of $steps) -----"
>> $nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex
>>
>> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
>> rm -rf $nutch/crawl/tmp/*
>> -----------------------------------------------
>>
>> Is anyone experiencing the same problems? Is there anything wrong in what I am doing?
>>
>> Regards,
>>
>> Arkadi
>>

