Re: Make Nutch to crawl internal urls only

Markus Jelsma Wed, 09 May 2012 12:07:57 -0700

Hi

On Wed, 9 May 2012 08:09:09 -0700 (PDT), James Ford <[email protected]> wrote:

Hello,
I am wondering how to only crawl the domains of an injected seed without
adding external URLs to the database?


Check db.ignore.external.links.
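
For illustration, here is a minimal sketch of how that override could be written out as a nutch-site.xml fragment. The property name is the one mentioned above; the value "true", the helper name render_site_config, and the use of Python to generate the file are assumptions for the example, not something prescribed by Nutch.

```python
import xml.etree.ElementTree as ET


def render_site_config(overrides):
    """Render Hadoop-style <configuration> XML from a dict of overrides.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumption: "true" makes Nutch ignore outlinks that lead to external hosts.
site_xml = render_site_config({"db.ignore.external.links": "true"})
print(site_xml)
```

The printed fragment would go into conf/nutch-site.xml, which overrides the defaults shipped in nutch-default.xml.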

Let's say I have 5k urls in my seed, and I want nutch to crawl everything (or
some million urls) for each domain in the fastest way possible.

What settings should I use?

Well, the fastest is of course no delay and the maximum number of threads, but that's usually not a good idea. You will overload your connection or the servers.
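
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes two simple limits: each host is asked for one page every delay_s seconds, and each thread finishes one page every avg_fetch_s seconds. The function name and every number except the 5k hosts and 20k topN from the question are made up for illustration; this is not a model of how Nutch's fetcher actually schedules requests.

```python
def estimated_fetch_rate(hosts, delay_s, threads, avg_fetch_s):
    """Rough upper bound on pages per second under two limits.

    Politeness limit: each host serves one request every delay_s seconds.
    Thread limit: each thread finishes one page every avg_fetch_s seconds.
    Both are simplifying assumptions for illustration only.
    """
    politeness_limit = hosts / delay_s
    thread_limit = threads / avg_fetch_s
    return min(politeness_limit, thread_limit)


# Illustrative numbers: 5,000 hosts and a 20k segment from the question,
# the delay, thread count and per-page time are assumed.
rate = estimated_fetch_rate(hosts=5000, delay_s=5.0, threads=50, avg_fetch_s=1.0)
seconds_for_segment = 20000 / rate
print(f"{rate:.0f} pages/s, about {seconds_for_segment:.0f} s for 20k pages")
```

With many hosts the per-host delay is rarely the bottleneck in this model; threads and bandwidth are. Once only a few hosts still have unfetched pages, the delay dominates.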

I will have topN at about 20k, and I want the db_unfetched to be around 20k
for each iteration?

There is no guarantee of db_unfetched unless each page has exactly the same number of outlinks. If your crawl is limited to a few domains then just crawl until there's nothing left to crawl.
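
A toy simulation shows why the unfetched count cannot be pinned down in advance. This is a sketch under stated assumptions: the link graph, the URL names and the function crawl_rounds are invented for the example, and the loop is a simplification of generate/fetch/update rounds, not Nutch's CrawlDb logic.

```python
def crawl_rounds(links, seeds, top_n):
    """Simulate crawl rounds on a toy link graph.

    links maps a URL to its outlinks. Returns the number of unfetched
    URLs after each round. A simplification for illustration only.
    """
    fetched = set()
    unfetched = list(dict.fromkeys(seeds))  # keep order, drop duplicates
    history = []
    while unfetched:
        # Take up to top_n URLs for this round.
        batch, unfetched = unfetched[:top_n], unfetched[top_n:]
        fetched.update(batch)
        known = fetched | set(unfetched)
        # Newly discovered outlinks join the unfetched list.
        for url in batch:
            for out in links.get(url, []):
                if out not in known:
                    known.add(out)
                    unfetched.append(out)
        history.append(len(unfetched))
    return history


# Hypothetical two-site graph with internal links only.
toy_links = {
    "a/": ["a/1", "a/2", "a/3"],
    "a/1": ["a/", "a/2"],
    "a/2": ["a/4"],
    "b/": ["b/1"],
    "b/1": ["b/"],
}
history = crawl_rounds(toy_links, seeds=["a/", "b/"], top_n=2)
print(history)
```

The count per round depends entirely on how many new links each fetched page happens to expose, and the loop simply ends when the list is empty.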

What should I set "db.max.outlinks.per.page" to? I was wondering about
setting it to 4, to get 4*5k=20k for the first iteration?

It's set to 100 by default. There's no reason to change it unless some pages have more than 100 outlinks and the target pages have no other inlinks.


Can anyone help me?

Thanks,
James Ford

--
View this message in context:

http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.


--
Markus Jelsma - CTO - Openindex
