Nutch - Focused crawling

2009-11-21 Thread Eran Zinman
Hi, we've been using Nutch for focused crawling (right now we are crawling about 50 domains). We've encountered the long-tail problem: we've set TopN to 100,000 and generate.max.per.host to about 1500. 90% of all domains finish fetching after 30 minutes, and the other 10% take an additional 2.5

Re: Nutch - Focused crawling

2009-11-21 Thread Julien Nioche
Hi Eran, There is currently no time limit implemented in the Fetcher. We implemented one which worked quite well in combination with another mechanism which clears the URLs from a pool if more than x successive exceptions have been encountered. This limits cases where a site or domain is not
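
The two mechanisms described in this reply (an overall time limit, plus clearing a host's pending URLs after more than x successive exceptions) might look roughly like the sketch below. This is not the patch mentioned in the message and not Nutch's Fetcher code: the class HostPools and all of its methods and fields are hypothetical names made up for illustration, and timestamps are passed in explicitly to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host URL pool; an illustration only, not Nutch code. */
class HostPools {
    private final Map<String, Deque<String>> pools = new HashMap<>();
    private final Map<String, Integer> successiveErrors = new HashMap<>();
    private final int maxSuccessiveErrors; // the "x" from the message
    private final long deadlineMillis;     // absolute time limit for the whole fetch

    HostPools(int maxSuccessiveErrors, long deadlineMillis) {
        this.maxSuccessiveErrors = maxSuccessiveErrors;
        this.deadlineMillis = deadlineMillis;
    }

    /** Queues a URL for the given host. */
    void add(String host, String url) {
        pools.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL for the host, or null if none is left or time is up. */
    String next(String host, long nowMillis) {
        if (nowMillis >= deadlineMillis) {
            pools.clear(); // time limit reached: drop everything still pending
            return null;
        }
        Deque<String> queue = pools.get(host);
        return queue == null ? null : queue.poll();
    }

    /** A successful fetch resets the host's run of exceptions. */
    void reportSuccess(String host) {
        successiveErrors.remove(host);
    }

    /**
     * Records an exception for the host. Once more than maxSuccessiveErrors
     * have occurred in a row, the host's pool is cleared.
     * Returns the number of URLs dropped (0 if the pool was kept).
     */
    int reportException(String host) {
        int errors = successiveErrors.merge(host, 1, Integer::sum);
        if (errors <= maxSuccessiveErrors) {
            return 0;
        }
        successiveErrors.remove(host);
        Deque<String> dropped = pools.remove(host);
        return dropped == null ? 0 : dropped.size();
    }

    /** Number of URLs still queued for the host. */
    int pending(String host) {
        Deque<String> queue = pools.get(host);
        return queue == null ? 0 : queue.size();
    }
}
```

In a sketch like this, a fetcher thread would call next() before each request and report the outcome afterwards, so a slow or failing host stops holding up the tail of the crawl once either limit is hit.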

Re: Nutch upgrade to Hadoop

2009-11-21 Thread Dennis Kubes
I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :) Dennis Andrzej Bialecki wrote:

Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki
Dennis Kubes wrote: I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :) Great, thanks!

Re: Nutch upgrade to Hadoop

2009-11-21 Thread Dennis Kubes
Hadoop seems to have a few more configuration files now. There are some questions about which ones to move over. I think it also might be time to upgrade xerces from 2.6 to the current 2.9.1. I am testing with that currently. I know with the current version there are some (ignorable but

AbstractFetchSchedule

2009-11-21 Thread reinhard schwab
there is some piece of code I don't understand:

public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum