Hi,
I dumped the crawldb, and this is what I found:

Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
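
(For reference, this is roughly how I dumped and searched the crawldb -
"crawldump" is just an output directory name I chose, and $NUTCH_HOME / $crawl
are the same variables as in my recrawl script quoted below:)

$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldump
# search the plain-text dump for the items reinhard describes below
# (records whose retry interval is stuck at 0):
grep "Retry interval: 0 seconds" crawldump/part-*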




So if this URL is encountered several times within 2 hours, does the "(0 days)"
mean it is going to be fetched several times?
Will it not look at the 18000 seconds?
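
My understanding is that the next fetch should simply be scheduled 18000 seconds
after the fetch time shown above. A rough sketch of the arithmetic I expect
(GNU date syntax, and assuming the default fetch schedule just adds the
interval to the fetch time):

last_fetch=$(date -d "Thu Dec 03 15:53:24 EST 2009" +%s)
next_fetch=$((last_fetch + 18000))   # 18000 s = 5 hours
date -d "@$next_fetch"               # -> Thu Dec 03 20:53:24 EST 2009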


Thanks




> Date: Thu, 3 Dec 2009 22:39:29 +0100
> From: reinhard.sch...@aon.at
> To: nutch-user@lucene.apache.org
> Subject: Re: db.fetch.interval.default
> 
> hi,
> 
> i have identified one source of such a problem and opened an issue in Jira.
> you can apply this patch and check whether it solves your problem.
> 
> https://issues.apache.org/jira/browse/NUTCH-774
> 
> btw, you can also check your crawldb for such items - the retry interval
> is set to 0.
> just dump the crawldb and search for it.
> 
> regards
> reinhard
> 
> BELLINI ADAM schrieb:
> > hi,
> >
> > I'm crawling my intranet, and I have set db.fetch.interval.default
> > to 5 hours, but it seems that it doesn't work correctly:
> >
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>18000</value>
> >   <description>The number of seconds between re-fetches of a page (5 hours).
> >   </description>
> > </property>
> >
> >
> > The first crawl, when the crawl directory $crawl doesn't exist yet, takes
> > just 2 hours (with depth=10).
> >
> >
> > But when performing the recrawl with the recrawl.sh script (with the crawldb
> > already populated), it takes about 2 hours for each depth!
> >
> > And when I checked the log file, I found that a single URL is fetched
> > several times! So is my 5-hour db.fetch.interval.default working correctly?
> >
> > Why is it refetching the same URLs several times at each depth (depth=10)?
> > My understanding was that the fetch time of the pages would prevent a refetch
> > since the interval (5 hours) has not elapsed yet.
> >
> > Can you please explain how db.fetch.interval.default works?
> > Should I use only one depth, since I already have all the URLs in the crawldb?
> >
> >
> >
> >
> > I'm using this recrawl script:
> >
> > depth=10
> >
> > echo "----- Inject (Step 1 of $steps) -----"
> > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> >
> > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> > for((i=0; i < $depth; i++))
> >
> > do
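> >   # one pass per depth: generate a fetch list, fetch it, then update the crawldb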
> >
> >   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> >
> >   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
> >
> >   if [ $? -ne 0 ]
> >   then
> >     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> >     break
> >   fi
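> >   # pick the newest segment, i.e. the one generate just created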
> >   segment=`ls -d $crawl/segments/* | tail -1`
> >
> >   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> >
> >   if [ $? -ne 0 ]
> >   then
> >     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> >     echo "runbot: Deleting segment $segment."
> >     rm $RMARGS $segment
> >     continue
> >   fi
> >
> > echo " ----- Updating Dadatabase ( $steps) -----"
> >
> >
> >   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> >
> > done
> >
> > echo "----- Merge Segments (Step 3 of $steps) -----"
> > $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
> >
> >
> > rm $RMARGS $crawl/segments
> > mv $crawl/MERGEDsegments $crawl/segments
> >
> > echo "----- Invert Links (Step 4 of $steps) -----"
> > $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
> >
> >
> >