Re: Make Nutch to crawl internal urls only

James Ford Thu, 10 May 2012 01:48:02 -0700

Thanks for your reply.

The problem with the settings you suggested above is that the normalizing
part of the Generator step takes too long after some iterations (that's why
I want to keep the crawldb at a reasonable size).


It seems that I can crawl and index about one million URLs in a 24h period
after the first initialization. But this number drops considerably if I
continue to crawl, because the normalize step can take up to one hour after
some iterations, once the crawldb has grown.

I don't see why the Generator step takes so long. Surely selecting X URLs
from a database of about 10 million URLs can't take that much time?
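
To make that expectation concrete, here is a minimal sketch of the selection
on its own. It is not Nutch code: the crawldb is modeled as a hypothetical
in-memory list of (url, score) pairs, and select_top_n is an illustrative
name. It only shows that picking the top X entries needs one pass over the
data, so any per-URL work such as normalizing would come on top of this.

```python
import heapq
import random


def select_top_n(entries, n):
    """Return the n highest-scoring (url, score) pairs, best first.

    heapq.nlargest makes a single pass over entries and keeps only
    n candidates in memory at a time.
    """
    return heapq.nlargest(n, entries, key=lambda entry: entry[1])


# Hypothetical stand-in for a crawldb: made-up URLs with random scores.
random.seed(0)
crawldb = [("http://example.com/page%d" % i, random.random())
           for i in range(100_000)]

top = select_top_n(crawldb, 1000)
print(len(top), "URLs selected")
```

The list size here is kept small so the sketch runs quickly; the same call
works unchanged on a larger list.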

Thanks,
James Ford

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
Sent from the Nutch - User mailing list archive at Nabble.com.
