It was crawling HTML files when it started throwing the exception.
Unfortunately, I didn't keep copies of the files or URLs.

On Thu, Aug 9, 2012 at 3:07 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> Of course setting a bigger heap helps, but most of the time only
> temporarily. Can you see in the logs what types of documents are being
> parsed?
>
> In the case of HTML documents crawled in the wild, a single document can
> cause the heap to explode. By default the CyberNeko parser (in HtmlParser)
> is used for HTML documents. I hacked this library so that there are limits
> on the number of elements that are loaded during a parse. (I'm still trying
> to find a way to contribute this back into the codebase.)
>
> Ferdy.
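
For anyone hitting the same problem, below is a minimal sketch of the
element-limit idea Ferdy describes above. It is not his patch and not how
Nutch's HtmlParser is wired internally; it only shows one way to make
NekoHTML give up on a document once a configurable number of elements has
been seen, so a single pathological page cannot exhaust the heap. The class
names, the SAX-based approach and the 10000-element cap are illustrative
assumptions, not part of Nutch.

// Sketch only: abort parsing a single HTML document once too many elements
// have been seen, instead of letting it blow up the heap.
import org.cyberneko.html.parsers.SAXParser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;

public class BoundedHtmlParse {

  // Thrown once the cap is hit so the caller can skip or truncate the page.
  static class ElementLimitExceeded extends SAXException {
    ElementLimitExceeded(int limit) {
      super("more than " + limit + " elements in document");
    }
  }

  static class CountingHandler extends DefaultHandler {
    private final int limit; // maximum number of start tags to accept
    private int seen;

    CountingHandler(int limit) { this.limit = limit; }

    @Override
    public void startElement(String uri, String local, String qName,
        Attributes atts) throws SAXException {
      if (++seen > limit) {
        throw new ElementLimitExceeded(limit); // abort this document
      }
      // ...normal handling (text extraction, outlinks, etc.) would go here...
    }
  }

  public static void main(String[] args) throws Exception {
    SAXParser parser = new SAXParser(); // NekoHTML's org.xml.sax.XMLReader
    parser.setContentHandler(new CountingHandler(10000));
    try {
      parser.parse(new InputSource(new StringReader(
          "<html><body><p>small page</p></body></html>")));
      System.out.println("parsed within limit");
    } catch (ElementLimitExceeded e) {
      System.out.println("skipped oversized document: " + e.getMessage());
    }
  }
}

In a real parse plugin you would probably record the document as truncated
or failed when the limit is hit rather than letting the exception bubble up,
but the point is the same: bound the amount of work done per document.
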
>
> On Wed, Aug 8, 2012 at 10:03 PM, Niccolò Becchi <[email protected]>
> wrote:
>
> > If you are using Nutch in a Hadoop cluster and you have enough memory,
> > try these parameters:
> >
> > <property>
> >     <name>mapred.child.java.opts</name>
> >     <value>-Xmx1600m -XX:-UseGCOverheadLimit
> > -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp</value>
> > </property>
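
A short note on those flags, for anyone copying them: -Xmx1600m raises each
child task's maximum heap to 1600 MB, -XX:-UseGCOverheadLimit stops the JVM
from failing with "GC overhead limit exceeded" when garbage collection is
consuming most of the CPU, and -XX:+HeapDumpOnOutOfMemoryError together with
-XX:HeapDumpPath=/var/tmp write a heap dump under /var/tmp when an
OutOfMemoryError does occur, so the offending objects can be inspected
afterwards.
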
> >
> > On Wed, Aug 8, 2012 at 9:32 PM, Bai Shen <[email protected]>
> > wrote:
> >
> > > Is this something other people are seeing?  I was parsing 10k URLs
> > > when I got this exception.  I'm running Nutch 2 head as of Aug 6 with
> > > the default memory settings (1 GB).
> > >
> > > Just wondering if anybody else has experienced this on Nutch 2.
> > >
> > > Thanks.
> > >
> >
>
