I know how to make the generator create multiple segments, but I wasn't
sure how to have Nutch deal with them after that.  What is the benefit of
multiple smaller segments versus one large one?
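
My rough guess at what the loop would look like is below.  This is only a
sketch: it assumes a Nutch 1.x Generator that supports -maxNumSegments and
that each segment then gets its own fetch/parse/updatedb pass, and the
paths, -topN value and segment count are placeholders, so please correct me
if I have the flow wrong.

  #!/bin/sh
  # Rough sketch only; paths and numbers are placeholders.
  CRAWLDB=crawl/crawldb
  SEGDIR=crawl/segments

  # Generate several smaller segments in one go instead of one big one
  # (assumes the Generator's -maxNumSegments option is available).
  bin/nutch generate $CRAWLDB $SEGDIR -topN 25000 -maxNumSegments 5

  # Give each new segment its own fetch/parse/updatedb pass.  Segment
  # directories are named by timestamp, so the newest five sort last.
  for SEG in `ls -d $SEGDIR/2* | tail -5`; do
    bin/nutch fetch $SEG -threads 10
    bin/nutch parse $SEG
    bin/nutch updatedb $CRAWLDB $SEG
  done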

If I get a chance, I'll try it again and take a look.  But right now I've
just been running the shell script instead since that doesn't have any of
the problems.

On Fri, Dec 23, 2011 at 5:08 AM, Markus Jelsma
<markus.jel...@openindex.io>wrote:

>
>
> On Thursday 22 December 2011 19:36:29 Bai Shen wrote:
> > How does the whole multiple segments work?
>
> Use the generator to create multiple segments in one go.
> >
> > And the only stack trace I get is the OOM exception.  I haven't found
> > anything else indicating what's using up all of the memory.
>
> The log should provide more.
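
This is roughly where I've been looking so far; the locations are my
assumption based on a stock Cloudera-style layout, so they may well differ:

  # Task attempt logs on the worker nodes (location is an assumption):
  grep -R -A 20 "java.lang.OutOfMemoryError" /var/log/hadoop/userlogs 2>/dev/null
  # Nutch's own log when running in local mode:
  grep -A 20 "java.lang.OutOfMemoryError" logs/hadoop.log 2>/dev/null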
>
> >
> > If I use a shell script to execute the nutch commands instead of a java
> > program I don't get the OOM exception.
>
> Ah, there may be a big leak in that shellscript.
>
> > And they're both just infinite
> > loops that call the various nutch parts in order.
> >
> > On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma
> >
> > <markus.jel...@openindex.io>wrote:
> > > On Monday 19 December 2011 15:57:02 Bai Shen wrote:
> > > > AFAIK, mapred.map.child.java.opts is not set, but I'll double check.
> > > >
> > > > When you say threads, you're referring to fetcher threads, correct?
> > > > I'm using the default ten threads.  And the JVM reuse is set to -1,
> > > > so it shouldn't be reusing them.
> > >
> > > That sounds fine.
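
One thing I realize I should double-check on my end: if the setting behind
that is mapred.job.reuse.jvm.num.tasks (my assumption), then as far as I
understand Hadoop's semantics -1 means "reuse without limit" and 1 is what
disables reuse, so my -1 may be doing the opposite of what I intended.
Quick way to see the effective value:

  # Property name is my assumption; -1 = unlimited reuse, 1 = no reuse.
  grep -r -A 1 "mapred.job.reuse.jvm.num.tasks" conf/ /etc/hadoop/conf/ 2>/dev/null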
> > >
> > > > The problem only occurs after several hours of
> > > > crawling.
> > >
> > > Ah, you might want to debug all your Hadoop options now. It may fail
> > > during processing of your mapper output. This is very tedious to debug,
> > > but you must follow the stack trace when it happens again. Most likely
> > > just a Hadoop issue.
> > >
> > > Also, try fetching fewer URLs per segment but more segments.
> > >
> > > > On Fri, Dec 16, 2011 at 3:13 PM, Markus Jelsma
> > > >
> > > > <markus.jel...@openindex.io>wrote:
> > > > > Are you running with too many threads perhaps? It takes up additional
> > > > > RAM. Also, you must really verify that that is the actual heap space
> > > > > that is allocated. We usually use mapred.map.child.java.opts to set
> > > > > heap space for mappers specifically. The child opts is, in our case,
> > > > > used by the datanodes and jobtrackers.
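
For what it's worth, here is the shape of that as I understand it.  The -D
form assumes the Nutch tools go through Hadoop's ToolRunner, and the heap
values and segment path are placeholders, not recommendations:

  # Give map and reduce tasks their own heap, separate from the generic
  # mapred.child.java.opts.  Values and segment path are placeholders.
  bin/nutch fetch \
      -D mapred.map.child.java.opts=-Xmx512m \
      -D mapred.reduce.child.java.opts=-Xmx1500m \
      crawl/segments/20111216120000 -threads 10

  # Or set the same two properties once in conf/mapred-site.xml so every
  # job picks them up.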
> > > > >
> > > > > > The fetcher reduce jobs are what failed.  Two completed, but the
> > > > > > third died.  It tried to run on all three data nodes with the same
> > > > > > results.
> > > > > >
> > > > > > MapReduce Child Java Maximum Heap Size is set to 1073741824
> > > > > >
> > > > > > The description states the following.
> > > > > >
> > > > > > The maximum heap size, in bytes, of the Java child process. This
> > > > > > number will be formatted and concatenated with the 'base' setting
> > > > > > for 'mapred_child_java_opts' to pass to Hadoop. Can be made final
> > > > > > (see below) to prevent clients from overriding it. Will be part of
> > > > > > generated client configuration.
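
Doing the arithmetic on that value: 1073741824 bytes is exactly 1 GiB, so
the child tasks are getting roughly a -Xmx1024m heap.

  echo $((1073741824 / 1024 / 1024))   # 1024 (MiB), i.e. about -Xmx1024m per child task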
> > > > > >
> > > > > > I thought there was another heap setting, but I'm not sure where
> > > > > > to find it in Cloudera.
> > > > > >
> > > > > > On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma
> > > > > >
> > > > > > <markus.jel...@openindex.io>wrote:
> > > > > > > What jobs exit with OOM? What is your heap size for the mapper
> > > > > > > and reducer?
> > > > > > >
> > > > > > > On Friday 16 December 2011 17:13:45 Bai Shen wrote:
> > > > > > > > I've tried running Nutch in local, pseudo, and fully
> > > > > > > > distributed mode, and I keep getting OutOfMemoryErrors.  I'm
> > > > > > > > running Nutch using a slightly modified version of the Crawler
> > > > > > > > code that's included.  Basically, I've modified it to
> > > > > > > > continuously crawl instead of stopping after a set number of
> > > > > > > > cycles.
> > > > > > > >
> > > > > > > > I have hadoop set not to reuse JVMs, so I'm not sure what the
> > > > > > > > leak is.  Any suggestions on what to look at?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > >
> > > > > > > --
> > > > > > > Markus Jelsma - CTO - Openindex
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex
>
