Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki

reinhard schwab wrote:

> there is some piece of code i don't understand
>
>   public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
>     // pages are never truly GONE - we have to check them from time to time.
>     // pages with too long fetchInterval are adjusted so that they fit within
>     // maximum fetchInterval (segment retention period).
>     if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
>       datum.setFetchInterval(maxInterval * 0.9f);
>       datum.setFetchTime(curTime);
>     }
>     if (datum.getFetchTime() > curTime) {
>       return false;   // not time yet
>     }
>     return true;
>   }


First, concerning segment retention: we want to ensure that pages not
refreshed for longer than maxInterval are retried, no matter what their
status is, because we need a copy of the page in a newer segment before
the old segment can be deleted.




> why is the fetch time set here to curTime?


Because we want to fetch it now - see the next line where this condition 
is checked.
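
To make the control flow concrete, here is a minimal stand-alone sketch of the quoted method. The Datum class is a hypothetical stand-in for Nutch's CrawlDatum, and the 90-day maxInterval is an assumed value chosen only for illustration. It shows that once the fetch time has been reset to curTime, the "not time yet" test can no longer fire, so the page is selected for fetching.

```java
// Stand-alone model of the quoted shouldFetch logic. Datum is a hypothetical
// stand-in for Nutch's CrawlDatum, not the real class.
class ShouldFetchSketch {
  static class Datum {
    long fetchTime;      // next scheduled fetch, in milliseconds
    float fetchInterval; // re-fetch interval, in seconds

    Datum(long fetchTime, float fetchInterval) {
      this.fetchTime = fetchTime;
      this.fetchInterval = fetchInterval;
    }
  }

  static boolean shouldFetch(Datum datum, long curTime, int maxInterval) {
    // Scheduled further ahead than maxInterval: pull the fetch forward to now.
    if (datum.fetchTime - curTime > (long) maxInterval * 1000) {
      datum.fetchInterval = maxInterval * 0.9f;
      datum.fetchTime = curTime;
    }
    // After the reset above, fetchTime == curTime, so this test is false.
    if (datum.fetchTime > curTime) {
      return false; // not time yet
    }
    return true;
  }

  public static void main(String[] args) {
    final long dayMs = 24L * 60 * 60 * 1000;
    int maxInterval = 90 * 24 * 60 * 60; // 90 days in seconds (assumed value)
    long now = 1000L * dayMs;

    Datum farAhead = new Datum(now + 120 * dayMs, 120f * 24 * 60 * 60);
    Datum soon = new Datum(now + 10 * dayMs, 30f * 24 * 60 * 60);

    System.out.println(shouldFetch(farAhead, now, maxInterval)); // true
    System.out.println(shouldFetch(soon, now, maxInterval));     // false
  }
}
```

The page scheduled 120 days ahead is pulled forward and fetched now; the page scheduled 10 days ahead is left alone and skipped.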



> and why is the fetch interval set to maxInterval * 0.9f without
> checking the current value of fetchInterval?


Hm, indeed this looks like a bug - we should instead do this:

if (datum.getFetchInterval() > maxInterval) {
  datum.setFetchInterval(maxInterval * 0.9f);
}
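
The difference the guard makes can be sketched in isolation. The two helper methods below are hypothetical and model only the interval adjustment, not the surrounding fetch-time check; the 90-day maxInterval is an assumed value.

```java
// Compares the interval adjustment with and without the suggested guard.
// Both helpers are illustrative and not part of Nutch.
class IntervalGuardSketch {
  // Behaviour as quoted: the interval is overwritten unconditionally.
  static float unguarded(float fetchInterval, int maxInterval) {
    return maxInterval * 0.9f;
  }

  // Behaviour with the guard: only over-long intervals are shortened.
  static float guarded(float fetchInterval, int maxInterval) {
    if (fetchInterval > maxInterval) {
      return maxInterval * 0.9f;
    }
    return fetchInterval;
  }

  public static void main(String[] args) {
    int maxInterval = 90 * 24 * 60 * 60;  // 90 days in seconds (assumed value)
    float oneWeek = 7f * 24 * 60 * 60;    // shorter than maxInterval
    float halfYear = 180f * 24 * 60 * 60; // longer than maxInterval

    // Without the guard, a one-week interval is stretched to about 81 days.
    System.out.println(unguarded(oneWeek, maxInterval) > oneWeek);    // true
    // With the guard, the one-week interval is left untouched.
    System.out.println(guarded(oneWeek, maxInterval) == oneWeek);     // true
    // An over-long interval is still brought below maxInterval.
    System.out.println(guarded(halfYear, maxInterval) < maxInterval); // true
  }
}
```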



--
Best regards,
Andrzej Bialecki 



Re: AbstractFetchSchedule

2009-11-22 Thread reinhard schwab
Andrzej Bialecki wrote:
> reinhard schwab wrote:
>> there is some piece of code i don't understand
>>
>>   public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
>>     // pages are never truly GONE - we have to check them from time to time.
>>     // pages with too long fetchInterval are adjusted so that they fit within
>>     // maximum fetchInterval (segment retention period).
>>     if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
>>       datum.setFetchInterval(maxInterval * 0.9f);
>>       datum.setFetchTime(curTime);
>>     }
>>     if (datum.getFetchTime() > curTime) {
>>       return false;   // not time yet
>>     }
>>     return true;
>>   }
>
> First, concerning segment retention: we want to ensure that pages not
> refreshed for longer than maxInterval are retried, no matter what their
> status is, because we need a copy of the page in a newer segment before
> the old segment can be deleted.

thanks for the explanation, but i still don't understand.
my assumption is that the CrawlDatum entries in the oldest segments
have a fetch time nearer to the current time than those in the most
recent segments.
what is wrong with my assumption?

when does this use case happen?
if someone changes the configuration and lowers maxInterval
(db.fetch.interval.max)?
then the CrawlDatum entries belonging to the oldest segments should be
refetched first?
usually these are not the ones with fetchTime - curTime > maxInterval * 1000.
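
Plugging numbers into the condition shows how the scenario raised above would play out. This is only a sketch: it assumes maxInterval is an int number of seconds (as the (long) cast in the quoted code suggests), and the 90-day and 30-day values for db.fetch.interval.max are illustrative, not taken from this thread. It covers only the lowered-configuration case and does not claim to be the only way the condition can become true.

```java
// Evaluates the first condition of the quoted shouldFetch for a page that is
// scheduled 60 days ahead, before and after maxInterval is lowered.
class RetentionCheckSketch {
  static final long DAY_S = 24L * 60 * 60;

  // Times are in milliseconds, maxInterval is in seconds.
  static boolean tooFarAhead(long fetchTime, long curTime, int maxInterval) {
    return fetchTime - curTime > (long) maxInterval * 1000;
  }

  public static void main(String[] args) {
    long now = 0L;
    long fetchTime = now + 60 * DAY_S * 1000; // scheduled 60 days ahead

    int max90 = (int) (90 * DAY_S); // assumed original setting
    int max30 = (int) (30 * DAY_S); // assumed lowered setting

    System.out.println(tooFarAhead(fetchTime, now, max90)); // false
    System.out.println(tooFarAhead(fetchTime, now, max30)); // true

    // Without the (long) cast the product is computed in int and wraps
    // around to a negative value, which would make the comparison useless.
    System.out.println(max90 * 1000 < 0);    // true
    System.out.println((long) max90 * 1000); // 7776000000
  }
}
```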



>> why is the fetch time set here to curTime?
>
> Because we want to fetch it now - see the next line where this
> condition is checked.
>
>> and why is the fetch interval set to maxInterval * 0.9f without
>> checking the current value of fetchInterval?
>
> Hm, indeed this looks like a bug - we should instead do this:
>
> if (datum.getFetchInterval() > maxInterval) {
>   datum.setFetchInterval(maxInterval * 0.9f);
> }

i will open an issue for this bug, together with a smaller bug i
discovered some time ago, and will attach a patch.

regards
reinhard


AbstractFetchSchedule

2009-11-21 Thread reinhard schwab
there is some piece of code i don't understand

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false;   // not time yet
    }
    return true;
  }

why is the fetch time set here to curTime?
and why is the fetch interval set to maxInterval * 0.9f without
checking the current value of fetchInterval?

regards
reinhard