Hi,
We've been using Nutch for focused crawling (right now we are crawling about
50 domains). We've run into the long-tail problem: we've set topN to 100,000
and generate.max.per.host to about 1500. 90% of all domains finish fetching
after 30 minutes, and the other 10% take an additional 2.5
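In case it helps, this is roughly how we have it set up. The per-host cap goes
in our nutch-site.xml override (sketch below), while topN is not a property but
the -topN argument we pass to the generate job, e.g.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 (example paths):

  <!-- nutch-site.xml: cap on URLs selected per host per generated segment -->
  <property>
    <name>generate.max.per.host</name>
    <value>1500</value>
  </property>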
Hi Eran,
There is currently no time limit implemented in the Fetcher. We implemented
one that worked quite well, in combination with another mechanism that clears
the URLs from a pool once more than x successive exceptions have been
encountered. This limits cases where a site or domain is not
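To give a feel for that second mechanism, here is a stripped-down sketch of
the idea (illustrative class and method names, not the actual Fetcher code):
each per-host queue counts successive fetch exceptions and is emptied once a
threshold is crossed, so a dead or misbehaving host cannot hold up the whole
fetch. A single success resets the counter.

  import java.util.LinkedList;
  import java.util.Queue;

  /** Illustrative per-host fetch queue that gives up after too many
   *  successive exceptions (a sketch, not the actual Nutch Fetcher code). */
  public class HostFetchQueue {
    private final Queue<String> urls = new LinkedList<String>();
    private final int maxExceptions;     // the "x" mentioned above
    private int successiveExceptions = 0;

    public HostFetchQueue(int maxExceptions) {
      this.maxExceptions = maxExceptions;
    }

    public void addUrl(String url) {
      urls.add(url);
    }

    public String nextUrl() {
      return urls.poll();
    }

    /** Called after each fetch attempt for this host. */
    public void reportResult(boolean threwException) {
      if (!threwException) {
        successiveExceptions = 0;        // a success resets the counter
        return;
      }
      if (++successiveExceptions >= maxExceptions) {
        // Too many failures in a row: drop the remaining URLs so the
        // fetch is not held up by a dead or blocking host.
        urls.clear();
      }
    }

    public boolean isEmpty() {
      return urls.isEmpty();
    }
  }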
I have created NUTCH-768. I am in the middle of testing a few-thousand-page
crawl against the most recently released version of Hadoop, 0.20.1.
Everything passes unit tests fine and there are no interface breaks.
Looks like it will be an easy upgrade so far :)
Dennis
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I have created NUTCH-768. I am in the middle of testing a few-thousand-page
>> crawl against the most recently released version of Hadoop, 0.20.1.
>> Everything passes unit tests fine and there are no interface breaks.
>> Looks like it will be an easy upgrade so far :)
> Great, thanks!
Hadoop seems to have a few more configuration files now. There are some
questions about which ones to move over. I think it also might be time to
upgrade Xerces from 2.6 to the current 2.9.1; I am testing with that
currently. I know with the current version there are some (ignorable but
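For anyone who hasn't looked at 0.20 yet, the old single hadoop-site.xml is
superseded there by three files, so the overrides move over roughly like this
(just the usual examples of what lands where):

  core-site.xml    - common/filesystem settings, e.g. fs.default.name
  hdfs-site.xml    - HDFS settings, e.g. dfs.replication
  mapred-site.xml  - MapReduce settings, e.g. mapred.job.tracker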
There is a piece of code I don't understand:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum
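My current reading of what the rest of the method does, as a rough Java
paraphrase (just my interpretation, not the verbatim source; I am assuming
the "maximum" in the comment is the db.fetch.interval.max setting):

  /** Rough paraphrase of the clamping logic as I read it (illustrative,
   *  not the actual source). maxIntervalSec stands for what I assume is
   *  db.fetch.interval.max (seconds); the times are epoch milliseconds. */
  public class ShouldFetchSketch {
    public static boolean shouldFetch(long fetchTimeMs, long curTimeMs,
                                      float maxIntervalSec) {
      if (fetchTimeMs - curTimeMs > (long) maxIntervalSec * 1000L) {
        // The next fetch is scheduled further out than the maximum allows
        // (the interval has grown too large, e.g. for a page marked GONE),
        // so the schedule is pulled back and the page is re-checked now.
        return true;
      }
      // Otherwise fetch only once the scheduled fetch time has arrived.
      return fetchTimeMs <= curTimeMs;
    }
  }

Is that interpretation correct?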