Hi,

> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.

Can you give an example where the fetchInterval gets larger than
maxInterval, or a (next) fetch time more than maxInterval in the
future? If that happens, it's a bug.

Indeed, the 2.x code does look wrong compared to 1.x (or at least different):

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    ...
    if (datum.getFetchTime() > curTime) {
      return false; // not time yet
    }
    return true;
  }
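If the 2.x check is the culprit, the fix could be to return true right after the clamping branch, so a page whose fetch time was pushed beyond maxInterval is fetched immediately instead of being compared against its stale fetch time. A minimal sketch, with plain longs standing in for the real WebPage accessors (this is not the actual Nutch API, and the 90-day maxInterval is an assumed value):

```java
// Sketch of the suspected fix: after clamping a fetch time that lies
// more than maxInterval in the future, return true so the page gets
// into the generate list now instead of waiting another cycle.
public class FetchScheduleSketch {
    // maxInterval in seconds; 90 days is an assumed value here
    static final long MAX_INTERVAL = 90L * 24 * 60 * 60;

    // state[0] = fetchTime (ms since epoch), state[1] = fetchInterval (s)
    static boolean shouldFetch(long[] state, long curTime) {
        long fetchTime = state[0];
        long fetchInterval = state[1];
        if (fetchTime - curTime > MAX_INTERVAL * 1000L) {
            if (fetchInterval > MAX_INTERVAL) {
                // shrink the interval so it fits within maxInterval
                state[1] = Math.round(MAX_INTERVAL * 0.9f);
            }
            state[0] = curTime;
            return true; // fetch now; don't test the stale fetchTime
        }
        return fetchTime <= curTime; // not due yet if fetch time is in the future
    }

    public static void main(String[] args) {
        long now = 1_000_000_000_000L;
        // fetch time beyond maxInterval: clamped and due immediately
        long[] overdue = { now + (MAX_INTERVAL + 1) * 1000L, MAX_INTERVAL + 1 };
        System.out.println(shouldFetch(overdue, now)); // prints true
        // ordinary future fetch time: not due
        long[] future = { now + 60_000L, 3600L };
        System.out.println(shouldFetch(future, now)); // prints false
    }
}
```

The quoted 2.x method below instead returns `fetchTime <= curTime` using the value read *before* the reset, so a clamped page is skipped until the next generate cycle; the 1.x code avoids this by re-reading the fetch time from the datum after the adjustment.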

Sebastian

On 07/21/2013 10:16 AM, vivekvl wrote:
> Hi Lewis,
> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.
> 
> @Override
> public boolean shouldFetch(String url, WebPage page, long curTime) {
>   // pages are never truly GONE - we have to check them from time to time.
>   // pages with too long fetchInterval are adjusted so that they fit within
>   // maximum fetchInterval (segment retention period).
>   long fetchTime = page.getFetchTime();
>   if (fetchTime - curTime > maxInterval * 1000L) {
>     if (page.getFetchInterval() > maxInterval) {
>       page.setFetchInterval(Math.round(maxInterval * 0.9f));
>     }
>     page.setFetchTime(curTime);
>   }
>   return fetchTime <= curTime;
> }
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Issue-in-generating-URLs-for-re-fetching-once-db-fetch-interval-max-elapses-tp4079039p4079343.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
