Hi
> I'm testing Nutch 1.2 in pseudo-distributed and local mode. I have a
> database with around 126M URLs. They are all injected and ready to
> fetch. When generating segments, there is always first a phase of low
> and stable memory use, and near the end of the operation memory grows.

Generation consists of 2 separate jobs (selection and partition). Do you
know which one is causing the issue? Is it during the map or the reduce
stage? The only thing I can think of is the map holding the count of URLs
per host. Do you limit the number of URLs per host?

> I am not sure what is normal here: how much memory does segment
> generation of 126M URLs require? I have seen 7 GB of memory filled, and
> then the JVM crash with a GC overhead limit error, among others. When I
> run with topN 10000000 it works, but the memory consumption is very
> high too. I don't know whether this is normal. I've been reading
> NUTCH-844 and other memory problem reports, but I don't know whether
> they apply to segment generation. Maybe it is a problem with running in
> pseudo-distributed or local mode, maybe it is a memory leak, or maybe
> it is normal.

It is worth investigating. Could you call jstack on the process when it
starts to take a fair amount of memory? This could give us an indication.

> By the way, how do you guys scale the generation of segments, database
> updates, etc.? Using crawl.database.update and generating small
> segments?

The trouble with generating small segments is that when the crawldb gets
large, you spend most of the time generating / updating and
proportionally little time fetching and parsing. It is more efficient to
generate multiple segments (using -maxNumSegments) once, fetch and parse
each segment, then update the whole lot against the crawldb. The obvious
way of scaling is, of course, to use more than one machine in your
cluster.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
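P.S. In case it helps, here is roughly how I would take the stack dump I
suggested. This assumes local mode, where the whole job runs in a single
JVM; the pid and file names below are just illustrative:

```shell
# List running JVMs to find the process id of the Generator job.
jps -l

# Take a thread dump of that process (replace 12345 with the real pid)
# while memory is climbing; repeat it a few times, a few seconds apart,
# so we can see where the job is spending its time.
jstack 12345 > generator-stack-1.txt
```

Attaching a couple of those dumps to a mail or a JIRA issue would give us
something concrete to look at.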
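And a rough sketch of the generate-several-segments-then-update cycle I
described, as shell commands against a Nutch 1.x install; the paths and
the -topN / -maxNumSegments values are illustrative, not recommendations:

```shell
# Generate several segments in a single pass over the crawldb.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -maxNumSegments 10

# Fetch and parse each generated segment in turn.
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# Update the crawldb once, against all the segments together.
bin/nutch updatedb crawl/crawldb crawl/segments/*
```

This way you pay the cost of reading the large crawldb once per batch of
segments instead of once per segment.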

