Nutch whole web crawl in EC2 hangs and fetches few URLs

2009-11-22 Thread VidyaMN
Hi, I want to use Nutch in EC2 to crawl around 100 million URLs, extracting only questions and answers from http://answers.yahoo.com. I'm a Nutch newbie, so apologies for any basic queries. I have the following questions: 1. I chose to use the individual fetch, generate, updatedb etc. CLI over the

Yahoo Answers subdirectory exclusion filter

2009-11-22 Thread VidyaMN
Hi, I want to exclude some Yahoo Answers URLs from crawling. A few examples are as follows: 1. http://answers.yahoo.com/question/?link=answerqid=20091122033318AA3huLM 2. http://answers.yahoo.com/question/index?link=answerqid=20091122033342AAOM4wP 3.

Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki
reinhard schwab wrote: there is some piece of code I don't understand: public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages with too long fetchInterval are adjusted so that they fit

Re: AbstractFetchSchedule

2009-11-22 Thread reinhard schwab
Andrzej Bialecki wrote: reinhard schwab wrote: there is some piece of code I don't understand: public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages with too long fetchInterval are