Hi,
> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.
Can you give an example where the fetchInterval gets larger than
maxInterval? Or a (next) fetch time more than maxInterval in the
future? If this happens, that's a bug.
Indeed, the 2.x code seems wrong compared to 1.x (or at least different):
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  ...
  if (datum.getFetchTime() > curTime) {
    return false;  // not time yet
  }
  return true;
}
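
For illustration, here is a minimal, self-contained sketch of the
maxInterval-clamping behaviour from the quoted snippet below. This is not
the actual Nutch class; the Page holder and class name are mine, and
intervals are in seconds with fetch times in epoch milliseconds, matching
the `maxInterval * 1000L` comparison:

```java
// Sketch only: illustrates clamping a too-long fetchInterval so the page
// becomes due again, as in the shouldFetch logic discussed in this thread.
public class FetchScheduleSketch {

    public static class Page {
        public long fetchTime;    // epoch millis of next scheduled fetch
        public int fetchInterval; // seconds
    }

    private final long maxInterval; // seconds

    public FetchScheduleSketch(long maxIntervalSeconds) {
        this.maxInterval = maxIntervalSeconds;
    }

    public boolean shouldFetch(Page page, long curTime) {
        long fetchTime = page.fetchTime;
        // Fetch time more than maxInterval in the future: clamp the
        // interval and reschedule so the page is picked up next cycle.
        if (fetchTime - curTime > maxInterval * 1000L) {
            if (page.fetchInterval > maxInterval) {
                page.fetchInterval = Math.round(maxInterval * 0.9f);
            }
            page.fetchTime = curTime;
        }
        return fetchTime <= curTime; // due now?
    }
}
```

Note that the first call after clamping still returns false (the local
`fetchTime` was read before the reset), so the page is only generated on
the following cycle.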
Sebastian
On 07/21/2013 10:16 AM, vivekvl wrote:
> Hi Lewis,
> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.
>
> @Override
> public boolean shouldFetch(String url, WebPage page, long curTime) {
>   // pages are never truly GONE - we have to check them from time to time.
>   // pages with too long fetchInterval are adjusted so that they fit within
>   // maximum fetchInterval (segment retention period).
>   long fetchTime = page.getFetchTime();
>   if (fetchTime - curTime > maxInterval * 1000L) {
>     if (page.getFetchInterval() > maxInterval) {
>       page.setFetchInterval(Math.round(maxInterval * 0.9f));
>     }
>     page.setFetchTime(curTime);
>   }
>   return fetchTime <= curTime;
> }
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-in-generating-URLs-for-re-fetching-once-db-fetch-interval-max-elapses-tp4079039p4079343.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>