Hi,

Of course setting a bigger heap helps, but most of the time only
temporarily. Can you see in the logs what types of documents are being parsed?

In the case of HTML documents crawled from the wild web, a single document
can cause the heap to explode. By default the CyberNeko parser (in HtmlParser)
is used for HTML documents. I patched this library so that there are limits
on the number of elements that are loaded during a parse. (I'm still trying
to find a way to contribute this back into the codebase.)
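
For reference, here is a minimal sketch of that idea, not my actual patch
(which changes the library internally): this version uses NekoHTML's public
filter property instead, and the 10,000 cap and class name are only
illustrative values.

import org.apache.xerces.xni.Augmentations;
import org.apache.xerces.xni.QName;
import org.apache.xerces.xni.XMLAttributes;
import org.apache.xerces.xni.XNIException;
import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.cyberneko.html.filters.DefaultFilter;
import org.cyberneko.html.parsers.DOMParser;

// Sketch only: aborts the parse once a document exceeds an element limit,
// so a single pathological page cannot blow up the heap.
public class ElementLimitFilter extends DefaultFilter {
    private static final int MAX_ELEMENTS = 10_000; // illustrative cap
    private int count;

    @Override
    public void startElement(QName element, XMLAttributes attrs, Augmentations augs)
            throws XNIException {
        if (++count > MAX_ELEMENTS) {
            // XNIException is unchecked; the caller can catch it and skip the document.
            throw new XNIException("Element limit exceeded, aborting parse");
        }
        super.startElement(element, attrs, augs);
    }

    // Builds a NekoHTML DOM parser with the filter installed via the
    // documented filters property.
    public static DOMParser newLimitedParser() throws Exception {
        DOMParser parser = new DOMParser();
        parser.setProperty("http://cyberneko.org/html/properties/filters",
                new XMLDocumentFilter[] { new ElementLimitFilter() });
        return parser;
    }
}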

Ferdy.

On Wed, Aug 8, 2012 at 10:03 PM, Niccolò Becchi <[email protected]> wrote:

> If you are using Nutch in a Hadoop cluster and you have enough memory, try
> these parameters:
>
> <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx1600m -XX:-UseGCOverheadLimit
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp</value>
> </property>
>
> On Wed, Aug 8, 2012 at 9:32 PM, Bai Shen <[email protected]> wrote:
>
> > Is this something other people are seeing?  I was parsing 10k urls when I
> > got this exception.  I'm running Nutch 2 head as of Aug 6 with the default
> > memory settings (1 GB).
> >
> > Just wondering if anybody else has experienced this on Nutch 2.
> >
> > Thanks.
> >
>
