Hi

On Wed, 9 May 2012 08:09:09 -0700 (PDT), James Ford <[email protected]> wrote:
Hello,

I am wondering how to crawl only the domains of an injected seed without
adding external URLs to the database?

Check db.ignore.external.links.
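
For example, a minimal sketch of the override, assuming it goes in conf/nutch-site.xml of a standard Nutch install. As far as I know the check compares hosts, so links to other subdomains of the same domain would be dropped as well:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Ignore outlinks that lead to a different host than the page they were found on -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```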


Let's say I have 5k URLs in my seed, and I want Nutch to crawl everything (or
some million URLs) for each domain in the fastest way possible.

What settings should I use?

Well, the fastest is of course no delay and the maximum number of threads, but that's usually not a good idea: you will overload your connection or the servers.
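
A rough sketch of the two settings involved, again assuming conf/nutch-site.xml. The values are illustrative assumptions, not recommendations; tune them to your bandwidth and to what the target servers can tolerate:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Total number of fetcher threads (illustrative value) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server (illustrative value) -->
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
  </property>
</configuration>
```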


I will have topN at about 20k, and I want the db_unfetched to be around 20k
for each iteration?

There is no guarantee on the db_unfetched count unless each page has exactly the same number of outlinks. If your crawl is limited to a few domains then just crawl until there's nothing left to crawl.


What should I set "db.max.outlinks.per.page" to? I was wondering about
setting it to 4, to get 4*5k=20k for the first iteration?

It's set to 100 by default. There's no reason to change it unless some pages have more than 100 outlinks and the pages they link to have no other inlinks.
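
If you do need to change it, the override would look roughly like this in conf/nutch-site.xml (file location assumed). The value shown is just the default:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of outlinks processed per page (default shown) -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
  </property>
</configuration>
```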


Can anyone help me?

Thanks,
James Ford

--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex
