Re: AbstractFetchSchedule
reinhard schwab wrote:
> there is some piece of code i don't understand
>
>   public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
>     // pages are never truly GONE - we have to check them from time to time.
>     // pages with too long fetchInterval are adjusted so that they fit within
>     // maximum fetchInterval (segment retention period).
>     if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
>       datum.setFetchInterval(maxInterval * 0.9f);
>       datum.setFetchTime(curTime);
>     }
>     if (datum.getFetchTime() > curTime) {
>       return false; // not time yet
>     }
>     return true;
>   }

First, concerning the segment retention - we want to enforce that pages that have not been refreshed for longer than maxInterval are retried, no matter what their status is - because we want to obtain a copy of the page in a newer segment in order to be able to delete the old segment.

> why is the fetch time set here to curTime?

Because we want to fetch it now - see the next line where this condition is checked.

> and why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval?

Hm, indeed this looks like a bug - we should instead do it like this:

    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
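
A minimal, self-contained sketch of how the proposed guard could fit into the check discussed above. `DatumSketch` and `FetchScheduleSketch` are hypothetical stand-ins, not Nutch classes, and placing the guard inside the existing `if` block is an assumption, since the message does not say where it goes. Units assume the fetch time is in milliseconds and the intervals in seconds, as the `* 1000` in the quoted code suggests.

```java
// Hypothetical stand-in for CrawlDatum: only the two fields the check touches.
final class DatumSketch {
    long fetchTimeMs;       // next scheduled fetch, in milliseconds
    float fetchIntervalSec; // refetch interval, in seconds

    DatumSketch(long fetchTimeMs, float fetchIntervalSec) {
        this.fetchTimeMs = fetchTimeMs;
        this.fetchIntervalSec = fetchIntervalSec;
    }
}

// Hypothetical stand-in for the schedule; maxIntervalSec plays the role of maxInterval.
final class FetchScheduleSketch {
    private final int maxIntervalSec;

    FetchScheduleSketch(int maxIntervalSec) {
        this.maxIntervalSec = maxIntervalSec;
    }

    boolean shouldFetch(DatumSketch datum, long curTimeMs) {
        // Scheduled further ahead than the maximum interval: pull the fetch forward to now.
        if (datum.fetchTimeMs - curTimeMs > (long) maxIntervalSec * 1000) {
            // Proposed guard: only shrink the interval if it actually exceeds the maximum.
            if (datum.fetchIntervalSec > maxIntervalSec) {
                datum.fetchIntervalSec = maxIntervalSec * 0.9f;
            }
            datum.fetchTimeMs = curTimeMs;
        }
        // Fetch only once the scheduled time has been reached.
        return datum.fetchTimeMs <= curTimeMs;
    }
}
```

With this arrangement a datum whose interval is already below the maximum keeps that interval and only has its fetch time reset, while a datum with an oversized interval is clamped as before.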
Re: AbstractFetchSchedule
Andrzej Bialecki wrote:
> reinhard schwab wrote:
>> there is some piece of code i don't understand
>>
>>   public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
>>     // pages are never truly GONE - we have to check them from time to time.
>>     // pages with too long fetchInterval are adjusted so that they fit within
>>     // maximum fetchInterval (segment retention period).
>>     if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
>>       datum.setFetchInterval(maxInterval * 0.9f);
>>       datum.setFetchTime(curTime);
>>     }
>>     if (datum.getFetchTime() > curTime) {
>>       return false; // not time yet
>>     }
>>     return true;
>>   }
>
> First, concerning the segment retention - we want to enforce that pages that were not refreshed longer than maxInterval should be retried, no matter what is their status - because we want to obtain a copy of the page in a newer segment in order to be able to delete the old segment.

thanks for the explanation, but i still don't understand.
my assumption is that the CrawlDatum entries contained in the oldest segments have a fetch time nearer to the current time than those in the most recent segments. what is wrong with my assumption?

when does this use case happen? if someone changes the configuration and sets maxInterval (db.fetch.interval.max) to a lower value? then the CrawlDatum entries belonging to the oldest segments should be refetched first? usually these are not the ones with fetchTime - curTime > maxInterval * 1000.

>> why is the fetch time set here to curTime?
>
> Because we want to fetch it now - see the next line where this condition is checked.
>
>> and why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval?
>
> Hm, indeed this looks like a bug - we should instead do like this:
>
>     if (datum.getFetchInterval() > maxInterval) {
>       datum.setFetchInterval(maxInterval * 0.9f);
>     }

i will open an issue for this bug, together with one smaller bug i discovered some time ago, and will attach a patch.

regards
reinhard
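
To make the condition in question concrete, here is a small sketch of the comparison on its own. `MaxIntervalCheckSketch` and `scheduledBeyondMax` are hypothetical names used only for illustration, and the 90-day maximum in the comments is an illustrative value. The check looks at how far in the future the next fetch is scheduled, not at how old the segment holding the page is.

```java
// Hypothetical helper isolating the first comparison in shouldFetch.
final class MaxIntervalCheckSketch {

    private MaxIntervalCheckSketch() {
    }

    // True when the next fetch lies more than maxIntervalSec seconds after curTimeMs.
    // Times are in milliseconds, the interval in seconds, hence the factor of 1000.
    static boolean scheduledBeyondMax(long fetchTimeMs, long curTimeMs, int maxIntervalSec) {
        return fetchTimeMs - curTimeMs > (long) maxIntervalSec * 1000;
    }

    // Illustration with a 90-day maximum (90 * 86_400 seconds) and curTimeMs = 0:
    //   scheduled 30 days ahead  (30 * 86_400_000 ms)  -> false, left alone
    //   scheduled 120 days ahead (120 * 86_400_000 ms) -> true, pulled forward
}
```

Under this reading, the branch applies only to entries scheduled further ahead than the maximum, for example entries whose interval was set before db.fetch.interval.max was lowered. Entries from old segments whose fetch time is close to the current time fall through to the second check in the usual way.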
AbstractFetchSchedule
there is some piece of code i don't understand

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false; // not time yet
    }
    return true;
  }

why is the fetch time set here to curTime?
and why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval?

regards
reinhard