Andrzej Bialecki wrote:
I'm planning to work on adding support in 0.8 for interleaved fetch cycles.

Great!

Then, when running an updatedb, the question of scores and metadata comes up. Other updatedb-s may have run in the meantime, not necessarily from earlier fetchlists, so the score and metadata info in the latest CrawlDb could actually be newer than what we have inside the current segment. In that case, CrawlDbReducer will see the following:

* "old" value from CrawlDb (which could actually be newer!). Even if it is old, its fetchTime could be in the future due to the trick described above. We could also get null here, if we have just discovered a new page.

* "original" value from CrawlDb, as it was recorded in the fetchlist. This one, at least, has a true fetch time, and its metadata and score are snapshots of that information at the time of "generate".

* "new" value from the Fetcher, with new score / metadata information. We will also get "new" values from redirects, which might not match any of the above values (i.e. they could arrive under URLs we have not seen before).

* "linked" values from parsers, with score / metadata contributions.

Now, the question is how to update the score, metadata, fetchTime and fetchInterval information. We need a way to determine whether the "new" value we have is in fact newer or older than the "old" value. I'm not sure how to do this; fetchTime and fetchInterval could have been modified, so they are not reliable... Perhaps we should add a "generation ID" to CrawlDatum?

Would it work to, when generating, set the fetch time for generated items to the current time? That way, the "new" value will always be a bit after the "old" time. In 0.7 we stored not the fetched-time but the time-to-next-fetch, so we had to set it into the future. But if we instead just mark it as fetched now, so that it won't be re-generated until its fetch interval has expired, that would resolve this, no?
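
In code, that generate-time stamping could look something like this (purely illustrative; class and method names below are made up, not the actual 0.8 API):

public class GenerateTimeSketch {

  /** Called for each entry selected into the fetchlist. */
  static void markGenerated(CrawlDatumSketch datum, long now) {
    datum.fetchTime = now;  // "fetched as of now", so later values compare newer
  }

  /** A page is due for (re-)generation only once its interval has expired. */
  static boolean isDue(CrawlDatumSketch datum, long now) {
    return datum.fetchTime + datum.fetchIntervalMs <= now;
  }

  /** Minimal stand-in for org.apache.nutch.crawl.CrawlDatum. */
  static class CrawlDatumSketch {
    long fetchTime;        // epoch millis
    long fetchIntervalMs;  // re-fetch interval in millis
  }
}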

Anyway, assuming we have a way to know this:

* if "new" is newer than "old", then we take all metadata from "old", overwrite it with the values from "new", and keep "new" as the result.

* if "new" is older than "old", then we overwrite its metadata with all values from "old". We do the same with fetchTime and fetchInterval.

That sounds right to me.  When is "original" used, if at all?
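
For what it's worth, those two rules might look roughly like this in the reducer's merge step (field and method names here are made up for illustration, not the actual CrawlDbReducer code):

import java.util.HashMap;
import java.util.Map;

public class MergeSketch {

  /** Merge "old" (from CrawlDb) and "new" (from the fetcher) into one result. */
  static Datum merge(Datum oldDatum, Datum newDatum) {
    if (newDatum.isNewerThan(oldDatum)) {
      // Take all metadata from "old", overwrite with values from "new", keep "new".
      Map<String, String> meta = new HashMap<>(oldDatum.metadata);
      meta.putAll(newDatum.metadata);
      newDatum.metadata = meta;
      return newDatum;
    } else {
      // "new" is stale: "old" metadata, fetchTime and fetchInterval win.
      newDatum.metadata.putAll(oldDatum.metadata);
      newDatum.fetchTime = oldDatum.fetchTime;
      newDatum.fetchInterval = oldDatum.fetchInterval;
      return newDatum;
    }
  }

  /** Minimal stand-in for CrawlDatum. */
  static class Datum {
    long fetchTime;
    int fetchInterval;
    Map<String, String> metadata = new HashMap<>();

    boolean isNewerThan(Datum other) {
      // Reliable if generate stamps fetchTime with the generation time, as above.
      return this.fetchTime > other.fetchTime;
    }
  }
}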

What about the score? I think that for new score calculations we should take the latest available score info from the "old" value.

That also sounds right. The crawl db should own the scores. Scores should not be updated by the fetcher, but only by crawldb updates.
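
Under that model, the score handling during updatedb might amount to something like the following (a sketch assuming a simple additive contribution model; the actual scoring code may well differ):

public class ScoreSketch {

  /** The fetcher's score is ignored; CrawlDb's latest score plus link contributions win. */
  static float updateScore(float oldScore, Iterable<Float> linkedContributions) {
    float score = oldScore;     // latest available score from the "old" CrawlDb value
    for (float contribution : linkedContributions) {
      score += contribution;    // contributions gathered from the parsers' outlinks
    }
    return score;
  }
}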

Updatedb would also have to lock the CrawlDb so that no other updatedb or generate could run while we modify it.

Yes, that sounds right too.
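
One way to get that exclusivity would be a lock file in the CrawlDb directory, along these lines (illustrative only, and written against the local filesystem; a real implementation would have to go through Hadoop's FileSystem API since the CrawlDb may live on DFS):

import java.io.File;
import java.io.IOException;

public class CrawlDbLockSketch {

  /** Try to take the lock; fail fast if another updatedb/generate holds it. */
  static File lock(File crawlDbDir) throws IOException {
    File lockFile = new File(crawlDbDir, ".locked");
    if (!lockFile.createNewFile()) {
      throw new IOException("CrawlDb is locked by another job: " + lockFile);
    }
    return lockFile;
  }

  /** Release the lock once the update has completed (or failed). */
  static void unlock(File lockFile) {
    if (!lockFile.delete()) {
      System.err.println("Warning: could not remove lock file " + lockFile);
    }
  }
}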

Thanks for working on this!

Doug

