Hi guys,
I've split the two features into separate patches on JIRA (NUTCH-769
/ NUTCH-770).
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/11/21 Julien Nioche lists.digitalpeb...@gmail.com
Hi Eran,
There is currently no time limit implemented in the Fetcher. We
Thanks Julien,
I can confirm this patch works perfectly and does a good job of keeping the
crawl rate up.
We have doubled the rate of information retrieval by using a time limit on
the fetch queue.
Thanks,
Eran
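
For readers of the archive, here is a minimal sketch of the two mechanisms discussed in this thread: a time limit on the fetch queue, and dropping a queue's remaining URLs once more than x successive exceptions have been seen. It is not the code from the NUTCH-769 / NUTCH-770 patches or from the Nutch Fetcher. The class name TimeLimitedFetchQueue, its methods, and the parameters deadlineMillis and maxExceptions are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the Nutch Fetcher or the JIRA patches.
final class TimeLimitedFetchQueue {
    private final Deque<String> urls = new ArrayDeque<>();
    private final long deadlineMillis;  // absolute time after which nothing more is fetched
    private final int maxExceptions;    // the "x" in "more than x successive exceptions"
    private int successiveExceptions = 0;

    TimeLimitedFetchQueue(long deadlineMillis, int maxExceptions) {
        this.deadlineMillis = deadlineMillis;
        this.maxExceptions = maxExceptions;
    }

    void add(String url) {
        urls.addLast(url);
    }

    // Returns the next URL to fetch, or null if the queue is empty.
    // Once the deadline has passed, the remaining URLs are dropped.
    // The current time is passed in so the behaviour is easy to test.
    String next(long nowMillis) {
        if (nowMillis >= deadlineMillis) {
            urls.clear();
        }
        return urls.pollFirst();
    }

    // A successful fetch resets the run of successive exceptions.
    void reportSuccess() {
        successiveExceptions = 0;
    }

    // A failed fetch extends the run; past the threshold the queue is purged.
    void reportException() {
        successiveExceptions++;
        if (successiveExceptions > maxExceptions) {
            urls.clear();
        }
    }

    int size() {
        return urls.size();
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Assumed values: a 30 minute limit and a threshold of 2 exceptions.
        TimeLimitedFetchQueue queue =
            new TimeLimitedFetchQueue(start + 30L * 60L * 1000L, 2);
        queue.add("http://example.com/a");
        queue.add("http://example.com/b");
        System.out.println("next: " + queue.next(System.currentTimeMillis()));
        System.out.println("remaining: " + queue.size());
    }
}
```

With one such queue per host, a slow or failing host stops holding up the whole fetch: its leftover URLs are simply skipped once the deadline or the exception threshold is reached.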
On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche
lists.digitalpeb...@gmail.com
Hi,
We've been using Nutch for focused crawling (right now we are crawling about
50 domains).
We've run into the long-tail problem: we've set TopN to 100,000 and
generate.max.per.host to about 1500.
90% of all domains finish fetching after 30 min, and the other 10% take an
additional 2.5
Hi Eran,
There is currently no time limit implemented in the Fetcher. We implemented
one which worked quite well in combination with another mechanism which
clears the URLs from a pool if more than x successive exceptions have been
encountered. This limits cases where a site or domain is not