Hi
On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford
<[email protected]> wrote:
> Thanks for your reply.
> The problem I have with using the settings you suggested above is
> that the normalizing done in the generator step takes too long after
> some iterations (that's why I want the CrawlDB to stay at a
> reasonable size).
Then disable normalizing and filtering in that step. There's usually no
good reason to do it there unless you have a very specific set-up and
exotic requirements.
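In Nutch 1.x you can pass this on the command line when generating a
segment; a sketch, with placeholder paths and an illustrative -topN:

```shell
# Generate a fetch list while skipping URL filtering and normalizing
# in the Generator job (-noFilter / -noNorm flags of the generate command).
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm
```

Filtering and normalizing still happen at inject and updatedb time, so
the CrawlDB stays clean; you only skip the redundant pass in generate.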
> It seems that I can crawl and index about one million URLs in a 24h
> period from the first init. But this number decreases by a large
> amount if I continue to crawl. This is because the normalize step can
> take up to one hour after some iterations, once the CrawlDB gets
> bigger.
Do you run Nutch in local mode?
> I don't see why the generator step is taking so long. Surely it can't
> take that much time to select X URLs from a database of about 10
> million URLs?
Certainly it can! The GeneratorMapper is quite CPU-intensive: it
calculates a lot of things for most records, and then the reducer
limits records by host or domain, taking a lot of additional CPU time
and RAM.
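The per-host or per-domain limiting in the reducer is driven by
configuration; in Nutch 1.x the relevant properties look roughly like
this in nutch-site.xml (the values here are illustrative, not
recommendations):

```xml
<!-- Cap the number of URLs per host (or domain) in each fetch list.
     Illustrative values; -1 disables the limit. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```

Grouping and counting by host or domain is exactly the work that makes
the reduce phase expensive as the CrawlDB grows.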
You can disable filtering and normalizing, but this will only help for
a short while. If the CrawlDB keeps growing you must use a cluster to
do the work.
> Thanks,
> James Ford
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex