Make Nutch to crawl internal urls only

James Ford Wed, 09 May 2012 08:09:36 -0700

Hello,

How can I crawl only the domains of my injected seed URLs, without
adding external URLs to the database?


Let's say I have 5k URLs in my seed list, and I want Nutch to crawl everything (or
a few million URLs) for each domain in the fastest way possible.

What settings should I use?

I will have topN at about 20k, and I want db_unfetched to be around 20k
for each iteration.

What should I set "db.max.outlinks.per.page" to? I was wondering about
setting it to 4, to get 4*5k=20k for the first iteration?
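
To spell out that arithmetic, here is a rough sketch. It assumes every seed
page is fetched successfully and yields at least 4 distinct outlinks that
survive URL filtering, so the result is an upper bound rather than a
prediction. The function name max_new_urls is made up for illustration and
is not part of Nutch.

```python
def max_new_urls(fetched_pages: int, max_outlinks_per_page: int) -> int:
    """Upper bound on URLs one iteration can add to the crawl db.

    Hypothetical helper: assumes no duplicates and no filtered links.
    """
    return fetched_pages * max_outlinks_per_page


seed_count = 5_000   # URLs in the seed list
outlink_cap = 4      # proposed value for db.max.outlinks.per.page

first_round = max_new_urls(seed_count, outlink_cap)
print(first_round)  # 20000
```

In practice the real number would likely be lower, since duplicate links and
failed fetches reduce the count.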

Can anyone help me? 

Thanks,
James Ford

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.
