Hi

On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford <[email protected]> wrote:
Thanks for your reply.

The problem I have with the settings you suggested above is that the normalizing done in the generator step takes too long after some iterations (that's why I want to keep the crawldb at a reasonable size).

Then disable normalizing and filtering in that step. There's usually no good reason to do it there unless you have a very specific set-up and exotic requirements.
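As a sketch of how to do that from the command line: in Nutch 1.x the generate step accepts -noFilter and -noNorm switches to skip URL filtering and normalization during selection. The crawldb and segments paths below are illustrative assumptions, not taken from this thread:

```shell
# Generate a fetch list without running URL filters or normalizers.
# crawl/crawldb and crawl/segments are example paths for a typical local layout.
bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -noFilter -noNorm
```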


It seems that I can crawl and index about one million URLs in a 24-hour period from the first run, but this number decreases by a large amount if I continue to crawl. This is because the normalize step can take up to one hour after some iterations, once the crawldb has grown.

Are you running Nutch in local mode?


I don't see why the generator step is taking so long. Surely it can't take that much time to select X URLs from a database of about 10 million URLs?

It certainly can! The GeneratorMapper is quite CPU-intensive: it calculates a lot of things for most records, and then the reducer limits records by host or domain, which takes a lot of additional CPU time and RAM.

You should disable filtering and normalizing, but that will only help for a short while. Once the CrawlDB grows again you will need a cluster to do the work.
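To illustrate why the per-host limiting described above costs CPU and RAM, here is a minimal Python sketch of the idea. The names (select_topn, max_per_host) are hypothetical and not Nutch's API: every candidate has to be scored, sorted, and checked against a per-host counter that lives in memory.

```python
from collections import defaultdict
from urllib.parse import urlparse

def select_topn(urls_with_scores, topn, max_per_host):
    """Pick up to `topn` URLs, taking at most `max_per_host` from any one host.

    A simplified, in-memory stand-in for the generator's select/limit logic.
    """
    per_host = defaultdict(int)   # running count of selected URLs per host
    selected = []
    # Highest-scoring URLs first, mirroring the generator's score ordering.
    for url, score in sorted(urls_with_scores, key=lambda p: -p[1]):
        host = urlparse(url).netloc
        if per_host[host] >= max_per_host:
            continue  # host quota exhausted, skip this URL
        per_host[host] += 1
        selected.append(url)
        if len(selected) == topn:
            break
    return selected

# Example: with a per-host cap of 2, the third a.com URL is skipped.
urls = [("http://a.com/1", 3.0), ("http://a.com/2", 2.0),
        ("http://a.com/3", 1.5), ("http://b.com/1", 1.0)]
print(select_topn(urls, topn=3, max_per_host=2))
# → ['http://a.com/1', 'http://a.com/2', 'http://b.com/1']
```

On 10 million records this sorting and per-host bookkeeping is exactly the kind of work that grows with the CrawlDB, which is why a single local machine eventually stops keeping up.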


Thanks,
James Ford

--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex
