Hi Reinhard,

>-----Original Message-----
>From: reinhard schwab [mailto:[email protected]]
>Sent: Thursday, July 01, 2010 2:10 PM
>To: [email protected]
>Subject: Re: Generator problems in Nutch 1.1
>
>could you dump the entry of this url in crawl db with
>
>bin/nutch readdb crawl/crawldb -url <url>
Here it is:

URL: http://www.atnf.csiro.au/people/bkoribal/obs/C848.html
Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 01 20:51:33 GMT+10:00 2010
Modified time: Thu Jan 01 10:00:00 GMT+10:00 1970
Retries since fetch: 94
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403, url=http://www.atnf.csiro.au/people/bkoribal/obs/C848.html

Please note that this is not the only URL which was repeatedly re-fetched. There were a number of them, about 150.

>it also depends on your configuration.
>what is the value for
>db.fetch.interval.default and for
>db.fetch.schedule.class?

I did not change these parameters in nutch-default.xml:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.</description>
</property>

>
>a guess. i have not done crawling with nutch for a while, so im not sure.
>i guess it has CrawlDatum.STATUS_FETCH_RETRY when crawling and it uses
>
>  schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
>      prevModifiedTime, fetch.getFetchTime());
>
>in AbstractFetchSchedule you can read
>
>  /**
>   * This method adjusts the fetch schedule if fetching needs to be
>   * re-tried due to transient errors. The default implementation
>   * sets the next fetch time 1 day in the future and increases
>   * the retry counter.
>   * @param url URL of the page
>   * @param datum page information
>   * @param prevFetchTime previous fetch time
>   * @param prevModifiedTime previous modified time
>   * @param fetchTime current fetch time
>   * @return adjusted page information, including all original information.
>   *         NOTE: this may be a different instance than {@param datum}, but
>   *         implementations should make sure that it contains at least all
>   *         information from {@param datum}.
>   */
>  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
>      long prevFetchTime, long prevModifiedTime, long fetchTime) {
>    datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
>    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
>    return datum;
>  }
>
>you have set adddays to 5 days.
>in this case it will be refetched and refetched again.
>try adddays with value 0.
>or change the code in AbstractFetchSchedule
>
>datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);

I will try 0, thanks. The interesting thing is, I don't recall this happening with Nutch 1.0. I used it with the same script. The other problem that I mentioned, scheduling 1 new URL per iteration, is also new. I definitely did not have it with Nutch 1.0.
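If I follow your explanation correctly, setPageRetrySchedule() only pushes the next fetch one day ahead, so generating with -adddays 5 makes such an entry look due again on every pass. In case adddays 0 is not enough, I suppose an alternative to patching AbstractFetchSchedule itself would be a small custom schedule plugged in via db.fetch.schedule.class, something along these lines (untested sketch; the class name, package and the retry cap / back-off values are just placeholders I made up):

-----------------------------------------------
package org.example.nutch; // placeholder package, not part of Nutch

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

/**
 * Untested sketch: stop retrying permanently failing pages after a few
 * attempts instead of rescheduling them one day ahead forever.
 */
public class LimitedRetryFetchSchedule extends DefaultFetchSchedule {

  private static final int MAX_RETRIES = 3;       // placeholder cap on retry attempts
  private static final long RETRY_DELAY_DAYS = 7; // placeholder back-off between retries

  @Override
  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    if (datum.getRetriesSinceFetch() >= MAX_RETRIES) {
      // Give up on this URL: mark it gone and let the normal "gone"
      // schedule decide when (if ever) it should be looked at again.
      datum.setStatus(CrawlDatum.STATUS_DB_GONE);
      return setPageGoneSchedule(url, datum, prevFetchTime, prevModifiedTime, fetchTime);
    }
    // Otherwise behave like the default, but back off more than one day
    // so that generate -adddays does not pick the entry up again at once.
    datum.setFetchTime(fetchTime + RETRY_DELAY_DAYS * 24L * 60L * 60L * 1000L);
    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    return datum;
  }
}
-----------------------------------------------

If something like this worked, db.fetch.schedule.class in nutch-site.xml would then point at this class instead of org.apache.nutch.crawl.DefaultFetchSchedule.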
Regards,

Arkadi

>
>[email protected] schrieb:
>> Hi,
>>
>> I am trying to use Nutch 1.1 to build a complete index of our corporate web sites. I am using a script based on this one:
>>
>> http://wiki.apache.org/nutch/Crawl
>>
>> The text of my script is included below. I set the crawling depth to 100 to make sure that everything is indexed, expecting that the process will stop after about 20 iterations. The problem is that the Generator keeps re-scheduling fetching of failed URLs. The process stopped because the max number of iterations (the depth) was reached. After a few iterations, only failing URLs were being repeatedly fetched. I checked the log for one of them. It failed with code 403 and was re-scheduled for fetching 94 times. This does not seem right.
>>
>> Another problem that I've noticed is that sometimes the Generator schedules fetching of just one URL per iteration. This happened once and I did not try to repeat this effect, but this does not seem right either.
>>
>> Here is my script text:
>>
>> -----------------------------------------------
>> #!/bin/bash
>>
>> depth=100
>> threads=10
>> adddays=5
>> topN=50000
>>
>> NUTCH_HOME=/data/HORUS_1/nutch1.1
>> NUTCH_HEAPSIZE=8000
>> nutch=$NUTCH_HOME
>> JAVA_HOME=/usr/lib/jvm/java-6-sun
>> export NUTCH_HOME
>> export JAVA_HOME
>> export NUTCH_HEAPSIZE
>>
>> steps=6
>> echo "----- Inject (Step 1 of $steps) -----"
>> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>>
>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>> for((i=0; i < $depth; i++))
>> do
>> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>> $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN -adddays $adddays
>> if [ $? -ne 0 ]
>> then
>> echo "Stopping at depth $depth. No more URLs to fetch."
>> break
>> fi
>> segment=`ls -d $nutch/crawl/segments/* | tail -1`
>>
>> $nutch/bin/nutch fetch $segment -threads $threads
>> if [ $? -ne 0 ]
>> then
>> echo "fetch $segment at depth $depth failed. Deleting it."
>> rm -rf $segment
>> continue
>> fi
>>
>> # echo "--- Parsing Segment $segment ---"
>> # $nutch/bin/nutch parse $segment
>>
>> $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
>> done
>>
>> echo "----- Invert Links (Step 3 of $steps) -----"
>> $nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*
>>
>> echo "----- Index (Step 4 of $steps) -----"
>> $nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*
>>
>> echo "----- Dedup (Step 5 of $steps) -----"
>> $nutch/bin/nutch dedup $nutch/crawl/preIndex
>>
>> echo "----- Merge Indexes (Step 6 of $steps) -----"
>> $nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex
>>
>> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
>> rm -rf $nutch/crawl/tmp/*
>> -----------------------------------------------
>>
>> Is anyone experiencing the same problems? Is there anything wrong in what I am doing?
>>
>> Regards,
>>
>> Arkadi
>>

