Doug Cutting wrote:
Andrzej Bialecki wrote:

So, I would propose a deadline of Aug 8 for the last commits, and then perhaps Aug 15 for the release?


Sounds good to me.  Thanks for helping with this!

Unfortunately, the patches related to detecting unmodified content will have to wait until after the release.

Here's the problem: it is quite easy to add this checking and recording capability to all fetcher plugins and to the fetchlist generation and DB update tools, and I've done this in my local patches. However, after a while I discovered a serious problem in the way Nutch currently phases out old segment data. If we assume that we always refresh after some fixed interval (30 days, or whatever), then we can safely delete segments older than that interval. If the interval varies, we could be stuck with segments that hold very old (but still valid) data. This is very inefficient: after a while a given segment might contain only a couple of such pages, while the rest of its pages would have to be removed again and again by deduplication because newer copies exist in newer segments.
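To make the inefficiency concrete, here is a rough sketch (plain Java, not Nutch code; the SegmentLiveness and Entry names and the in-memory list are assumptions made up for illustration). It counts, per segment, how many entries are still the newest copy of their URL. With variable refetch intervals an old segment can stay at a small non-zero count for a long time, so it cannot be deleted even though nearly everything in it is superseded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; these are not Nutch classes.
final class SegmentLiveness {

    /** One fetched copy of a page as stored in some segment. */
    static final class Entry {
        final String url;
        final String segment;
        final long fetchTime;

        Entry(String url, String segment, long fetchTime) {
            this.url = url;
            this.segment = segment;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Returns, for every segment, the number of entries that are the newest
     * copy of their URL. A segment with a count of 0 holds only superseded
     * data and could be deleted safely.
     */
    static Map<String, Integer> liveCounts(List<Entry> entries) {
        // Pick the newest entry per URL (the earlier entry wins a tie).
        Map<String, Entry> newest = new HashMap<>();
        for (Entry e : entries) {
            newest.merge(e.url, e, (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
        }
        // Start every segment at zero so fully superseded segments show up.
        Map<String, Integer> live = new HashMap<>();
        for (Entry e : entries) {
            live.putIfAbsent(e.segment, 0);
        }
        for (Entry e : newest.values()) {
            live.merge(e.segment, 1, Integer::sum);
        }
        return live;
    }
}
```

A segment whose count stays at, say, 1 out of many thousands of entries is exactly the case described above: it has to be kept, and deduplication has to discard the rest of it on every pass.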

Moreover (and this is the worst problem), if such segments are lost, the information in the webdb must be updated so as to force refetching, even though "If-Modified-Since" or the MD5 indicates that the page is unchanged since the last fetch. Currently the only way to do this is to "add days" - but with a variable refetch interval that doesn't make much sense. I think we need a better way to track which pages are "missing" from the segments and have to be re-fetched, or a better DB update mechanism for when we lose some segments.

Perhaps we should extend the Page to record which segment holds the latest version of the page? But segments don't have unique IDs now (a directory name is too fragile and too easily changed) ...
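A minimal sketch of that idea, assuming each segment were given a stable UUID that is stored independently of its directory name (nothing like this exists in Nutch today; PageRecord, latestSegment and forceRefetch are hypothetical names, not fields of the real Page class):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch; PageRecord stands in for an extended Page entry.
final class LostSegmentRecovery {

    static final class PageRecord {
        final String url;
        UUID latestSegment;    // assumed new field: segment holding the newest copy
        boolean forceRefetch;  // assumed new field: ignore "unmodified" on next fetch

        PageRecord(String url, UUID latestSegment) {
            this.url = url;
            this.latestSegment = latestSegment;
        }
    }

    /**
     * Marks every page whose newest copy lived in the lost segment so that
     * it is refetched regardless of If-Modified-Since or MD5 results.
     * Returns the URLs that were marked.
     */
    static List<String> markLost(List<PageRecord> db, UUID lostSegment) {
        List<String> marked = new ArrayList<>();
        for (PageRecord page : db) {
            if (lostSegment.equals(page.latestSegment)) {
                page.forceRefetch = true;
                page.latestSegment = null; // no segment holds this page any more
                marked.add(page.url);
            }
        }
        return marked;
    }
}
```

With something like this, losing a segment would touch only the pages that actually depended on it, instead of shifting the whole DB with "add days".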

Related question: in the FetchListEntry we have a "fetch" flag. I think that after minor modifications of the FetchListTool (so that it generates only the entries that we are supposed to fetch) we could get rid of this flag, or change its semantics to mean "unconditionally fetch, even if unmodified".
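The changed semantics could look roughly like this. It is only a sketch under my own assumptions: FetchDecision and Mode are made-up names, and the real FetchListEntry has no such method.

```java
// Hypothetical sketch of the proposed "fetch" flag semantics.
final class FetchDecision {

    enum Mode {
        CONDITIONAL,   // fetch, but keep the old copy if the page is unmodified
        UNCONDITIONAL  // fetch and store even if the page looks unmodified
    }

    /**
     * Every entry in the fetchlist is fetched; the flag only selects whether
     * the "unmodified" check may short-circuit the fetch.
     *
     * @param fetchFlag     proposed meaning: "unconditionally fetch"
     * @param lastFetchTime time of the previous fetch in ms, or 0 if never fetched
     */
    static Mode mode(boolean fetchFlag, long lastFetchTime) {
        // Without a previous fetch there is nothing to compare against.
        if (fetchFlag || lastFetchTime <= 0) {
            return Mode.UNCONDITIONAL;
        }
        return Mode.CONDITIONAL;
    }
}
```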

Any comments?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
