Thanks Julien,
I can confirm that this patch works perfectly and does a good job of
maintaining a high crawl rate.
We have doubled the rate of information retrieval by using a time limit on
the fetch queue.
Thanks,
Eran
On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
Hi guys,
I've split the two functionalities into separate patches on JIRA (NUTCH-769
/ NUTCH-770).
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
2009/11/21 Julien Nioche
Hi Eran,
There is currently no time limit implemented in the Fetcher. We implemented
one which worked quite well in combination with another mechanism which
clears the URLs from a pool if more than x successive exceptions have been
encountered. This limits cases where a site or domain is not respo
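
For readers following the thread, the two mechanisms described above (a time limit on the fetch queues, and dropping a host's remaining URLs after more than x successive exceptions) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the NUTCH-769 / NUTCH-770 patches and not the Nutch Fetcher API: the class FetchQueueSketch, its methods and its parameters are all hypothetical names made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-host fetch queues with a global time limit
 * and a successive-exception threshold. Not taken from Nutch.
 */
class FetchQueueSketch {
    private final Map<String, Deque<String>> queues = new HashMap<>();
    private final Map<String, Integer> successiveErrors = new HashMap<>();
    private final int maxSuccessiveErrors; // the "x" from the message above
    private final long deadlineMillis;     // absolute time at which fetching stops

    FetchQueueSketch(int maxSuccessiveErrors, long deadlineMillis) {
        this.maxSuccessiveErrors = maxSuccessiveErrors;
        this.deadlineMillis = deadlineMillis;
    }

    /** Queues a URL under its host. */
    void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /**
     * Returns the next URL for the host, or null if the host has nothing left.
     * Once the time limit is reached, every queue is emptied and null is returned.
     */
    String next(String host, long nowMillis) {
        if (nowMillis >= deadlineMillis) {
            queues.clear();
            return null;
        }
        Deque<String> queue = queues.get(host);
        return (queue == null) ? null : queue.poll();
    }

    /**
     * Records the outcome of a fetch. A success resets the host's error count.
     * When a host has failed more than maxSuccessiveErrors times in a row,
     * its remaining URLs are dropped; the number of dropped URLs is returned.
     */
    int report(String host, boolean success) {
        if (success) {
            successiveErrors.remove(host);
            return 0;
        }
        int errors = successiveErrors.merge(host, 1, Integer::sum);
        if (errors <= maxSuccessiveErrors) {
            return 0;
        }
        successiveErrors.remove(host);
        Deque<String> dropped = queues.remove(host);
        return (dropped == null) ? 0 : dropped.size();
    }

    /** Total number of URLs still waiting across all hosts. */
    int pending() {
        int total = 0;
        for (Deque<String> queue : queues.values()) {
            total += queue.size();
        }
        return total;
    }
}
```

In this sketch the caller passes the current time in explicitly, which keeps the class easy to test; a real fetcher would more likely read the clock itself and take both limits from configuration.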
Hi,
We've been using Nutch for focused crawling (right now we are crawling about
50 domains).
We've encountered the long-tail problem: we've set TopN to 100,000 and
generate.max.per.host to about 1500.
90% of all domains finish fetching after 30 min, and the other 10% take an
additional 2.5 hou