Hi
> I'm testing Nutch 1.2 in pseudo-distributed and local mode. I have a
> database with around 126M URLs. They are all injected and ready to
> fetch. When generating segments, there is always first a phase of low
> and stable memory use, and near the end of the operation memory grows.

Generation consists of 2 separate jobs (selection and partition). Do you
know which one is causing the issue? Is it during the map or the reduce
stage? The only thing I can think of is the map holding the count of URLs
per host. Do you limit the number of URLs per host?

> I am not sure what is normal here: how much memory does segment
> generation of 126M URLs require? I have seen 7 GB of memory filled, and
> then the JVM crash with a GC overhead limit error, among others. When I
> run with topN 10000000 it works, but the memory consumption is very
> high too. I don't know whether this is normal. I've been reading
> NUTCH-844 and other memory problem reports, but I don't know whether
> they apply to segment generation. Maybe it is a problem with running in
> pseudo-distributed or local mode, maybe it is a memory leak, or maybe
> it is normal.

It is worth investigating. Could you call jstack on the process when it
starts to take a fair amount of memory? This could give us an indication.

> By the way, how do you guys scale the generation of segments, database
> updates, etc.? Using crawl.database.update and generating small
> segments?

The trouble with generating small segments is that when the crawldb gets
large, you spend most of the time generating / updating and
proportionally little time fetching and parsing. It is more efficient to
generate multiple segments (using -maxNumSegments) once, fetch and parse
each segment, then update the whole lot against the crawldb. The obvious
way of scaling is, of course, to use more than one machine in your
cluster.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
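P.S. In case it helps, here is roughly how I would take the stack dump I
suggested. This assumes local mode, where the whole job runs in a single
JVM; the pid and file names below are just illustrative:

```shell
# List running JVMs to find the process id of the Generator job.
jps -l

# Take a thread dump of that process (replace 12345 with the real pid)
# while memory is climbing; repeat it a few times, a few seconds apart,
# so we can see where the job is spending its time.
jstack 12345 > generator-stack-1.txt
```

Attaching a couple of those dumps to a mail or a JIRA issue would give us
something concrete to look at.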
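And a rough sketch of the generate-several-segments-then-update cycle I
described, as shell commands against a Nutch 1.x install; the paths and
the -topN / -maxNumSegments values are illustrative, not recommendations:

```shell
# Generate several segments in a single pass over the crawldb.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -maxNumSegments 10

# Fetch and parse each generated segment in turn.
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# Update the crawldb once, against all the segments together.
bin/nutch updatedb crawl/crawldb crawl/segments/*
```

This way you pay the cost of reading the large crawldb once per batch of
segments instead of once per segment.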

