Just adding to what Markus said: even in distributed mode, the generate and update steps will take more and more time as your crawldb gets bigger. There are quite a few things you can do to alleviate that, e.g. set a minimal score for generation, or generate multiple segments in one go, fetch them one by one, and then update them all at the same time. Having said that, if your crawldb contains only 10M URLs, deactivating the normalisation as Markus suggested will be the best thing to do in the short term.

One last comment: even if you are on a single machine, you should run Nutch in pseudo-distributed mode rather than in local mode. This way you'll be able to monitor your crawl using the Hadoop web interfaces and run more than one mapper and reducer.
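The suggestions above can be sketched as a small crawl script. This is only a sketch, assuming Nutch 1.x command-line tools; the crawl/ paths, the -topN value, the segment count, and the score threshold are placeholder assumptions to be adapted to your setup:

```shell
#!/bin/sh
# Sketch of the workflow described above (assumes Nutch 1.x).
# All paths and numbers are illustrative placeholders.

CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Generate several segments in one go; generate.min.score makes the
# generator skip low-scoring URLs, so it has less work per iteration.
bin/nutch generate -D generate.min.score=0.25 \
  $CRAWLDB $SEGMENTS -topN 50000 -maxNumSegments 4

# Fetch (and parse) each new segment one by one...
for seg in $SEGMENTS/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# ...then update the crawldb with all of them in a single pass,
# paying the updatedb cost once instead of once per segment.
bin/nutch updatedb $CRAWLDB $SEGMENTS/*
```

The point of -maxNumSegments plus a single updatedb at the end is that the expensive crawldb scan happens once per batch of segments rather than once per fetch cycle.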
HTH

Julien

On 10 May 2012 09:56, Markus Jelsma <[email protected]> wrote:
> Hi
>
> On Thu, 10 May 2012 01:47:34 -0700 (PDT), James Ford <
> [email protected]> wrote:
>> Thanks for your reply.
>>
>> The problem I have with using the suggested settings you described
>> above is that the normalizing in the Generator step is taking too long
>> after some iterations (that's why I want to keep the crawldb at a
>> reasonable size).
>
> Then disable normalizing and filtering in that step. There's usually no
> good reason to do it unless you have some very specific set-up and exotic
> requirements.
>
>> It seems that I can crawl and index about one million URLs in a 24h
>> period from the first init. But this number decreases by a large amount
>> if I continue to crawl. This is due to the fact that the normalize step
>> can take up to one hour after some iterations, when the crawldb is
>> getting bigger.
>
> You run Nutch local?
>
>> I don't see why the generator step is taking so long. It can't take that
>> much time selecting X URLs from a database of about 10 million URLs?
>
> Certainly! The GeneratorMapper is quite CPU-intensive: it calculates a
> lot of things for most records, and then the reducer limits records by
> host or domain, taking a lot of additional CPU time and RAM.
>
> You must disable filtering and normalizing, but this will only help for a
> short while. If the CrawlDB grows again you must use a cluster to do the
> work.
>
>> Thanks,
>> James Ford
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Markus Jelsma - CTO - Openindex

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
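For reference, "disable normalizing and filtering in that step" from the quoted thread maps to switches on the generate command. A sketch, assuming Nutch 1.x; check the flag names against your version's usage output, and treat the paths and -topN value as placeholders:

```shell
# Skip URL filtering and normalization in the generate step
# (the step reported as slow above). Placeholder paths and -topN.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm
```

This avoids re-running every URL in the crawldb through the normalizer and filter chain on each generate cycle, which is where the per-iteration slowdown comes from.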

