Hi,
I want to use Nutch in EC2 to crawl around 100 million URLs, extracting only
questions and answers from http://answers.yahoo.com. I'm a Nutch newbie, so
apologies for any basic queries. I have the following questions:
1. I chose to use the individual fetch, generate, updatedb etc. CLI over the
Hi,
I want to exclude some Yahoo Answers URLs from crawling.
A few examples are as follows:
1. http://answers.yahoo.com/question/?link=answerqid=20091122033318AA3huLM
2. http://answers.yahoo.com/question/index?link=answerqid=20091122033342AAOM4wP
3.
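
One way to approach this kind of exclusion is a regular expression that rejects the unwanted URL shape. Below is a minimal sketch in plain Java, not Nutch's own filter code. It assumes the URLs to drop are the ones carrying "link=answer" in the query string, as in the examples above; the class name, the accept() helper and the "kept" URL in main() are hypothetical and only there for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

public class ExcludeFilterSketch {

    // Assumption: unwanted URLs live under /question/ and carry
    // "link=answer" as a query parameter (taken from the examples above).
    private static final Pattern EXCLUDE = Pattern.compile(
        "^http://answers\\.yahoo\\.com/question/.*[?&]link=answer.*");

    // Returns true if the URL should be kept, false if it should be skipped.
    public static boolean accept(String url) {
        return !EXCLUDE.matcher(url).matches();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://answers.yahoo.com/question/?link=answerqid=20091122033318AA3huLM",
            "http://answers.yahoo.com/question/index?link=answerqid=20091122033342AAOM4wP",
            // Hypothetical URL without "link=answer", used only to show a kept case.
            "http://answers.yahoo.com/question/index?qid=20091122033318AA3huLM");

        for (String url : urls) {
            // "+" marks a URL that would be kept, "-" one that would be excluded.
            System.out.println((accept(url) ? "+ " : "- ") + url);
        }
    }
}
```

In a Nutch setup the same expression would normally not be compiled by hand; as far as I know, the regex URL filter reads patterns from conf/regex-urlfilter.txt, where a line starting with "-" excludes matching URLs and a line starting with "+" includes them. Whether this particular pattern is the right one depends on which URLs actually need to be excluded.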
reinhard schwab wrote:
There is a piece of code I don't understand:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time to time.
// pages with too long fetchInterval are adjusted so that they fit
Andrzej Bialecki wrote:
reinhard schwab wrote:
There is a piece of code I don't understand:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time to time.
// pages with too long fetchInterval are