Re: Nutch - Focused crawling

2009-11-23 Thread Julien Nioche

Hi guys,

I've separated both functionalities into separate patches on JIRA (NUTCH-769 / NUTCH-770).

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Julien Nioche lists.digitalpeb...@gmail.com

Hi Eran, There is currently no time limit implemented in the Fetcher. We

Re: Nutch - Focused crawling

2009-11-23 Thread Eran Zinman

Thanks Julien,

I can confirm this patch works perfectly and keeps the crawl rate high. We have doubled the rate of information retrieval by using a time limit on the fetch queue.

Thanks,
Eran

On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche lists.digitalpeb...@gmail.com

Nutch - Focused crawling

2009-11-21 Thread Eran Zinman

Hi, We've been using Nutch for focused crawling (right now we are crawling about 50 domains). We've encountered the long-tail problem: we've set TopN to 100,000 and generate.max.per.host to about 1500. 90% of all domains finish fetching after 30 minutes, and the other 10% take an additional 2.5

Re: Nutch - Focused crawling

2009-11-21 Thread Julien Nioche

Hi Eran, There is currently no time limit implemented in the Fetcher. We implemented one, which worked quite well in combination with another mechanism that clears the URLs from a pool if more than x successive exceptions have been encountered. This limits cases where a site or domain is not
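
The two mechanisms described in this message (a time limit on the fetch, plus clearing a host's pending URLs after more than x successive exceptions) can be illustrated with a small sketch. This is not the code from the NUTCH-769 / NUTCH-770 patches and does not use any Nutch API; the class `HostQueues`, its parameters `time_limit_s` and `max_exceptions`, and the injected `clock` and `fetch` callables are all hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class HostQueues:
    """Hypothetical per-host URL queues with a global deadline and a
    per-host cap on successive fetch failures (illustration only)."""

    def __init__(self, time_limit_s: float, max_exceptions: int,
                 clock: Callable[[], float]):
        self.queues: Dict[str, Deque[str]] = {}
        self.failures: Dict[str, int] = {}
        self.max_exceptions = max_exceptions
        self.clock = clock
        # The deadline is fixed once, when the fetch run is set up.
        self.deadline = clock() + time_limit_s

    def add(self, host: str, url: str) -> None:
        self.queues.setdefault(host, deque()).append(url)
        self.failures.setdefault(host, 0)

    def run(self, fetch: Callable[[str], None]) -> Dict[str, List[str]]:
        report: Dict[str, List[str]] = {"fetched": [], "skipped": []}
        while any(self.queues.values()):
            # Time limit: once the deadline passes, drop everything left.
            if self.clock() >= self.deadline:
                for q in self.queues.values():
                    report["skipped"].extend(q)
                    q.clear()
                break
            # Round-robin: take at most one URL per host per pass.
            for host, q in self.queues.items():
                if not q:
                    continue
                url = q.popleft()
                try:
                    fetch(url)
                    self.failures[host] = 0  # a success resets the streak
                    report["fetched"].append(url)
                except Exception:
                    self.failures[host] += 1
                    report["skipped"].append(url)
                    # More than max_exceptions failures in a row:
                    # clear the rest of this host's queue.
                    if self.failures[host] > self.max_exceptions:
                        report["skipped"].extend(q)
                        q.clear()
        return report
```

A caller would build the queues with `add(host, url)`, then call `run(fetch)` with a function that raises on failure, passing `time.monotonic` as the clock in real use. Injecting the clock keeps the deadline logic easy to exercise with a fake timer.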