On Thursday 22 December 2011 19:36:29 Bai Shen wrote:
> How does the whole multiple-segments thing work?

Use the generator to create multiple segments in one go.
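
Something along these lines should do it (paths and numbers are only
illustrative; -maxNumSegments needs a reasonably recent 1.x generator):

  # generate up to 4 segments of at most 50k URLs each in a single run
  bin/nutch generate crawl/crawldb crawl/segments \
    -topN 50000 -maxNumSegments 4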
> 
> And the only stack trace I get is the OOM exception.  I haven't found
> anything else indicating what's using up all of the memory.

The Hadoop task logs should provide more detail.
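
If you are on a 0.20-era Hadoop the per-task logs are the place to look; a
rough sketch (exact paths differ per distribution):

  # each task attempt leaves stdout/stderr/syslog under the tasktracker's
  # log directory
  ls $HADOOP_LOG_DIR/userlogs/
  grep -ril "OutOfMemoryError" $HADOOP_LOG_DIR/userlogs/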

> 
> If I use a shell script to execute the nutch commands instead of a Java
> program, I don't get the OOM exception.

Ah, then there may be a big leak in that Java program.

> And they're both just infinite
> loops that call the various nutch parts in order.
> 
> On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma
> 
> <markus.jel...@openindex.io> wrote:
> > On Monday 19 December 2011 15:57:02 Bai Shen wrote:
> > > AFAIK, mapred.map.child.java.opts is not set, but I'll double check.
> > > 
> > > When you say threads, you're referring to fetcher threads, correct? 
> > > I'm using the default ten threads.  And the JVM reuse is set to -1, so
> > > it shouldn't be reusing them.
> > 
> > That sounds fine.
> > 
> > > The problem only occurs after several hours of
> > > crawling.
> > 
> > Ah, you might want to debug all your Hadoop options now. It may fail
> > during processing of your mapper output. This is very tedious to debug,
> > but you must follow the stack trace when it happens again. Most likely
> > it's just a Hadoop issue.
> > 
> > Also, try to fetch fewer URLs per segment but more segments.
> > 
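
To expand on the older advice above: once the generator has produced several
smaller segments you can simply loop over them. A rough sketch, with example
paths:

  # process every segment the generator produced
  for segment in crawl/segments/*; do
    bin/nutch fetch "$segment"
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"
  done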
> > > On Fri, Dec 16, 2011 at 3:13 PM, Markus Jelsma
> > > 
> > > <markus.jel...@openindex.io> wrote:
> > > > Are you running with too many threads perhaps? It takes up additional
> > > > RAM. Also, you must really verify that that is the actual heap space
> > > > that is allocated. We usually use mapred.map.child.java.opts to set
> > > > heap space for mappers specifically. The child opts is, in our case,
> > > > used by the datanodes and jobtrackers.
> > > > 
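
As an aside, this is roughly what that looks like in practice (property names
assume a CDH3/0.20-era Hadoop, the segment path is made up, and it assumes the
tool passes generic -D options through to Hadoop):

  # give the mappers of this job ~2 GB of heap; reducers have the analogous
  # mapred.reduce.child.java.opts. Putting the properties in mapred-site.xml
  # makes them the cluster-wide default instead.
  bin/nutch fetch -D mapred.map.child.java.opts=-Xmx2048m \
    crawl/segments/20111222123456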
> > > > > The fetcher reduce jobs are what failed.  Two completed, but the
> > > > > third died.  It tried to run on all three data nodes with the same
> > > > > results.
> > > > > 
> > > > > MapReduce Child Java Maximum Heap Size is set to 1073741824
> > > > > 
> > > > > The description states the following.
> > > > > 
> > > > > The maximum heap size, in bytes, of the Java child process. This
> > > > > number will be formatted and concatenated with the 'base' setting for
> > > > > 'mapred_child_java_opts' to pass to Hadoop. Can be made final (see
> > > > > below) to prevent clients from overriding it. Will be part of
> > > > > generated client configuration.
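
(For reference: 1073741824 bytes is 1 GiB, i.e. the same as -Xmx1024m.)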
> > > > > 
> > > > > I thought there was another heap setting, but I'm not sure where to
> > > > > find it in Cloudera.
> > > > > 
> > > > > On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma
> > > > > 
> > > > > <markus.jel...@openindex.io> wrote:
> > > > > > What jobs exit with OOM? What is your heap size for the mapper
> > > > > > and reducer?
> > > > > > 
> > > > > > On Friday 16 December 2011 17:13:45 Bai Shen wrote:
> > > > > > > I've tried running Nutch in local, pseudo, and full distributed
> > > > > > > mode, and I keep getting OutOfMemoryErrors.  I'm running Nutch
> > > > > > > using a slightly modified version of the Crawler code that's
> > > > > > > included.  Basically, I've modified it to continuously crawl
> > > > > > > instead of stopping after a set number of cycles.
> > > > > > > 
> > > > > > > I have Hadoop set not to reuse JVMs, so I'm not sure what the
> > > > > > > leak is.  Any suggestions on what to look at?
> > > > > > > 
> > > > > > > Thanks.
> > > > > > 
> > > > > > --
> > > > > > Markus Jelsma - CTO - Openindex
> > 
> > --
> > Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex
