Hi

On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford <[email protected]> wrote:
Thanks for your reply.

The problem I have with the settings you suggested above is that the normalizing done in the generator step takes too long after some iterations (that's why I want to keep the crawldb at a reasonable size).

Then disable normalizing and filtering in that step. There's usually no good reason to do it there unless you have a very specific set-up and exotic requirements.
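As a sketch of how to do that from the command line: in Nutch 1.x the generate step accepts -noFilter and -noNorm switches to skip URL filtering and normalization during selection. The crawldb and segments paths below are illustrative assumptions, not taken from this thread:

```shell
# Generate a fetch list without running URL filters or normalizers.
# crawl/crawldb and crawl/segments are example paths for a typical local layout.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -noFilter -noNorm
```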


It seems that I can crawl and index about one million URLs in a 24-hour period from the first run, but this number decreases by a large amount if I continue to crawl. This is because the normalize step can take up to one hour after some iterations, once the crawldb has grown.

Are you running Nutch in local mode?


I don't see why the generator step is taking so long. Surely it can't take that much time to select X URLs from a database of about 10 million URLs?

It certainly can! The GeneratorMapper is quite CPU-intensive: it calculates a lot of things for most records, and then the reducer limits records by host or domain, which takes a lot of additional CPU time and RAM.

You should disable filtering and normalizing, but that will only help for a short while. Once the CrawlDB grows again you will need a cluster to do the work.
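To illustrate why the per-host limiting described above costs CPU and RAM, here is a minimal Python sketch of the idea. The names (select_topn, max_per_host) are hypothetical and not Nutch's API: every candidate has to be scored, sorted, and checked against a per-host counter that lives in memory.

```python
from collections import defaultdict
from urllib.parse import urlparse

def select_topn(urls_with_scores, topn, max_per_host):
    """Pick up to `topn` URLs, taking at most `max_per_host` from any one host.

    A simplified, in-memory stand-in for the generator's select/limit logic.
    """
    per_host = defaultdict(int)   # running count of selected URLs per host
    selected = []
    # Highest-scoring URLs first, mirroring the generator's score ordering.
    for url, score in sorted(urls_with_scores, key=lambda p: -p[1]):
        host = urlparse(url).netloc
        if per_host[host] >= max_per_host:
            continue  # host quota exhausted, skip this URL
        per_host[host] += 1
        selected.append(url)
        if len(selected) == topn:
            break
    return selected

# Example: with a per-host cap of 2, the third a.com URL is skipped.
urls = [("http://a.com/1", 3.0), ("http://a.com/2", 2.0),
        ("http://a.com/3", 1.5), ("http://b.com/1", 1.0)]
print(select_topn(urls, topn=3, max_per_host=2))
# → ['http://a.com/1', 'http://a.com/2', 'http://b.com/1']
```

On 10 million records this sorting and per-host bookkeeping is exactly the kind of work that grows with the CrawlDB, which is why a single local machine eventually stops keeping up.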


Thanks,
James Ford

--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex
