Re: Adaptive fetch
Hi,

Has the patch for Adaptive Refetch been released? We use Nutch on an intranet to index large static HTML pages, so I expect this feature to play a crucial role. Please update me on this.

Thanks,
D.Saravanaraj

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Raghavendra Prabhu wrote:
>> I believe we had a recent mail about a redirection problem as well (with this patch applied). And as you said, it would be better to have more people testing the patch. Considering that this has the highest votes among add-on features, I guess it is a critical one.
>
> Ok, I'll bring this patch up to date over the weekend.
>
> --
> Best regards,
> Andrzej Bialecki
> ___. ___ ___ ___ _ _ __
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
Re: Adaptive fetch
Raghavendra Prabhu wrote:
> Hi Andrzej
>
> Can you put in the latest version of the diff for the adaptive fetch? We seem to have a problem patching against the latest release. This should help us test it.

The patch is probably out of sync; there have been many (trivial) changes in the meantime.

The best option would be to commit this functionality, if enough people consider it to be of sufficiently good quality. What prevents me from doing this is that I don't use this version on a regular basis: the original version is good enough for my use, even though not ideal. And I have a feeling that not too many people have really reviewed this patch.

So, IMHO these patches need more testing, because the potential for disruption is rather large.

--
Best regards,
Andrzej Bialecki
Re: Adaptive fetch
I believe we had a recent mail about a redirection problem as well (with this patch applied). And as you said, it would be better to have more people testing the patch.

Considering that this has the highest votes among add-on features, I guess it is a critical one.

Rgds
Prabhu

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Raghavendra Prabhu wrote:
>> Hi Andrzej
>>
>> Can you put in the latest version of the diff for the adaptive fetch? We seem to have a problem patching against the latest release. This should help us test it.
>
> The patch is probably out of sync; there have been many (trivial) changes in the meantime.
>
> The best option would be to commit this functionality, if enough people consider it to be of sufficiently good quality. What prevents me from doing this is that I don't use this version on a regular basis: the original version is good enough for my use, even though not ideal. And I have a feeling that not too many people have really reviewed this patch.
>
> So, IMHO these patches need more testing, because the potential for disruption is rather large.
>
> --
> Best regards,
> Andrzej Bialecki
Re: Adaptive fetch
Raghavendra Prabhu wrote:
> I believe we had a recent mail about a redirection problem as well (with this patch applied). And as you said, it would be better to have more people testing the patch. Considering that this has the highest votes among add-on features, I guess it is a critical one.

Ok, I'll bring this patch up to date over the weekend.

--
Best regards,
Andrzej Bialecki
Re: adaptive fetch
Raghavendra Prabhu wrote:
> Hi Andrzej
>
> After applying the patch, I found some strange behaviour. The fetch list for each URL was being created even though db.default.fetch.interval had not been reached.

You probably forgot to change the interval from days to seconds. It's now expressed in seconds. This defines the maximum allowed interval, and any pages with an interval higher than that will be refetched anyway. So if it's 30 (seconds :) ) then there is a high probability that you reach this limit before each cycle completes...

> I thought this was supposed to work in this order:
>
> 1) For the particular URL/file, get the db fetch interval (which changes).
> 2) If the current date exceeds the db fetch interval, generate the fetch list for the particular file/URL.
> 3) The fetch list checks the file's modified date and then decides whether to fetch the latest contents of the file/URL.
>
> It is supposed to function in the above manner, right? Did I miss anything?

Yes, this is how it's supposed to work.

--
Best regards,
Andrzej Bialecki
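The exchange above can be pictured with a small sketch. This is not Nutch code: `PageRecord`, `is_due`, `needs_download`, and `days_to_seconds` are hypothetical names invented for illustration, and the only facts taken from the thread are that the interval is now in seconds, that it acts as a maximum, and the three-step order Prabhu describes.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class PageRecord:
    """Hypothetical per-URL record; field names are illustrative only."""
    last_fetch_time: float  # epoch seconds of the last fetch
    fetch_interval: float   # this page's current interval, in seconds
    stored_modified: float  # modification time recorded at the last fetch


def days_to_seconds(days: float) -> float:
    """Convert an old-style interval in days to the new unit, seconds."""
    return days * SECONDS_PER_DAY


def is_due(page: PageRecord, now: float, max_interval: float) -> bool:
    """Steps 1 and 2: read the page's interval and test whether it has elapsed.

    Assumption: the configured value is modelled as a cap, so a page whose
    own interval is larger is still treated as due once the cap has passed.
    """
    effective_interval = min(page.fetch_interval, max_interval)
    return now - page.last_fetch_time >= effective_interval


def needs_download(page: PageRecord, server_modified: float) -> bool:
    """Step 3: fetch new content only if the source changed since last time."""
    return server_modified > page.stored_modified


# A page with a 7-day interval, last fetched at t=0.
page = PageRecord(last_fetch_time=0, fetch_interval=days_to_seconds(7),
                  stored_modified=0)

# A value of 30 left over from a days-based setup is read as 30 seconds,
# so the page is already due one minute after its last fetch.
print(is_due(page, now=60, max_interval=30))

# With the value converted to seconds the page is not due yet.
print(is_due(page, now=60, max_interval=days_to_seconds(30)))
```

Under these assumptions, a leftover value of 30 makes every page due almost immediately, which matches the behaviour reported: fetch lists are generated on every cycle even though the intended interval has not passed.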
Re: Adaptive fetch schedule
(Moved to the proper list)

Raghavendra Prabhu wrote:
> Hi
>
> Does the inlink value problem solve the OPIC problem that was there? That is, on a recrawl, the page would have a higher score. Does this fix that problem?

No, it doesn't. But it prevents your linkDB from growing indefinitely, which is also good.

--
Best regards,
Andrzej Bialecki
Re: Adaptive fetch
Point noted, Andrzej.

We will experiment with the schedules and let you know how it works out. It is flexible right now, I guess.

Thanks
Rgds
Prabhu

On 2/28/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Raghavendra Prabhu wrote:
>> Maybe we can add a function which will do this, so that people using crawl can make use of it (a new function, with a minor modification to the database update, so that it replaces the db.default.fetch.interval in the webdb with zero).
>
> Ah, well... there will always be this or that that you can add; the question is whether you should. Somehow I don't see that it would be needed to put this functionality in the FetchSchedule interface... that was the whole point of this patch, so that you can experiment and implement various fetch schedules as you wish. In this case I recommend that you do just that ;-)
>
> --
> Best regards,
> Andrzej Bialecki
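As a rough picture of what "implement various fetch schedules as you wish" could look like, here is a minimal sketch of a pluggable schedule. It does not reproduce the FetchSchedule interface from the patch: the class names, the `next_interval` method, and the `factor` parameter are all assumptions made up for this example.

```python
from abc import ABC, abstractmethod


class FetchScheduleSketch(ABC):
    """Hypothetical stand-in for a pluggable schedule, not the patch's API."""

    @abstractmethod
    def next_interval(self, current_interval: float, modified: bool) -> float:
        """Return the interval (in seconds) to wait before the next fetch."""


class FixedSchedule(FetchScheduleSketch):
    """Keeps whatever interval the page already has."""

    def next_interval(self, current_interval: float, modified: bool) -> float:
        return current_interval


class ImmediateRefetchSchedule(FetchScheduleSketch):
    """Models the idea of resetting the interval to zero for every page."""

    def next_interval(self, current_interval: float, modified: bool) -> float:
        return 0.0


class AdaptiveSchedule(FetchScheduleSketch):
    """Shortens the interval for pages that changed, lengthens it otherwise."""

    def __init__(self, min_interval: float, max_interval: float,
                 factor: float = 0.5) -> None:
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.factor = factor  # illustrative adjustment rate, an assumption

    def next_interval(self, current_interval: float, modified: bool) -> float:
        if modified:
            candidate = current_interval * (1 - self.factor)
        else:
            candidate = current_interval * (1 + self.factor)
        # Clamp the result to the configured bounds.
        return max(self.min_interval, min(self.max_interval, candidate))


schedule = AdaptiveSchedule(min_interval=60, max_interval=3600)
print(schedule.next_interval(1000, modified=True))
print(schedule.next_interval(1000, modified=False))
```

The point of the design is the one Andrzej makes: a zero-interval policy does not need to live in the shared interface, because it can be written as one more implementation alongside the others.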