hi, i dumped the crawldb, and this is what i found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:

so if this url is met several times in 2 hours, does that mean that because of the "0 days" it is going to be fetched several times? will it not look at the 18000 seconds?

thx
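btw, for reference, this is roughly how i produced the dump above and how the dump can be searched for the broken entries reinhard mentions below (the paths and the url are just examples, not the exact commands i ran):

  # dump the whole crawldb as text (the output dir must not exist yet)
  $NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldb_dump

  # or inspect a single url directly (example url, not a real one)
  $NUTCH_HOME/bin/nutch readdb $crawl/crawldb -url http://intranet.example.com/somepage.html

  # search the text dump for entries whose retry interval was reset to 0
  # (-B 6 also shows the url line a few lines above each match)
  grep -B 6 "Retry interval: 0 seconds" crawldb_dump/part-*

and i guess the "(0 days)" in my entry above is just the 18000 seconds truncated to whole days in the dump output, not a separate setting.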
> Date: Thu, 3 Dec 2009 22:39:29 +0100
> From: reinhard.sch...@aon.at
> To: nutch-user@lucene.apache.org
> Subject: Re: db.fetch.interval.default
>
> hi,
>
> i have identified one source of such a problem and opened an issue in jira.
> you can apply this patch and check whether it solves your problem.
>
> https://issues.apache.org/jira/browse/NUTCH-774
>
> btw you can also check your crawldb for such items - the retry interval
> is set to 0. just dump the crawldb and search for it.
>
> regards
> reinhard
>
> BELLINI ADAM wrote:
> > hi,
> >
> > i'm crawling my intranet, and i have set db.fetch.interval.default to
> > 5 hours, but it seems it doesn't work correctly:
> >
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>18000</value>
> >   <description>The number of seconds between re-fetches of a page (5 hours).
> >   </description>
> > </property>
> >
> > the first crawl, when the crawl directory $crawl doesn't exist yet, takes
> > just 2 hours (with depth=10).
> >
> > but when performing the recrawl with the recrawl.sh script (with the
> > crawldb full), it takes about 2 hours for each depth!
> >
> > and when i checked the log file i found that one URL is fetched several
> > times! so is my 5-hour db.fetch.interval.default working correctly?
> >
> > why is it refetching the same URLs several times at each depth (depth=10)?
> > i understood that the timestamp of the pages would not allow a refetch
> > since the interval (5 hours) has not elapsed yet!
> >
> > please, can you explain how db.fetch.interval.default works? should i use
> > only one depth since i already have all the urls in the crawldb?
> >
> > i'm using this recrawl script:
> >
> > depth=10
> >
> > echo "----- Inject (Step 1 of $steps) -----"
> > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> >
> > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> > for ((i=0; i < $depth; i++))
> > do
> >   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> >
> >   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
> >   if [ $? -ne 0 ]
> >   then
> >     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> >     break
> >   fi
> >
> >   segment=`ls -d $crawl/segments/* | tail -1`
> >
> >   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> >   if [ $? -ne 0 ]
> >   then
> >     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> >     echo "runbot: Deleting segment $segment."
> >     rm $RMARGS $segment
> >     continue
> >   fi
> >
> >   echo "----- Updating Database ($steps) -----"
> >   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> > done
> >
> > echo "----- Merge Segments (Step 3 of $steps) -----"
> > $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
> > rm -rf $crawl/segments
> > mv $crawl/MERGEDsegments $crawl/segments
> >
> > echo "----- Invert Links (Step 4 of $steps) -----"
> > $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
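one more note: the top of my recrawl.sh (not pasted above) just sets the variables the script uses; hypothetically something like this - the names match the script, but the values here are only examples, not the real ones from my setup:

  # illustrative only - example values, not taken from the actual script
  NUTCH_HOME=/usr/local/nutch    # nutch installation directory
  crawl=crawl                    # the existing crawl directory
  threads=10                     # fetcher threads
  topN="-topN 1000"              # passed as-is to 'nutch generate'; leave empty to generate everything
  RMARGS="-rf"                   # flags for rm when a failed segment is deleted
  steps=4                        # only used in the echo messages

with the retry interval reset to 0 (the NUTCH-774 symptom), urls fetched and updated at one depth become immediately eligible for generation again at the next depth, which would explain why the same urls get refetched on every pass.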