It was crawling HTML files when it started throwing the exception. Unfortunately, I didn't keep copies of the files or urls.
On Thu, Aug 9, 2012 at 3:07 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> Of course setting a bigger heap sure helps, but most of the time only
> temporarily. Can you see in the logs what types of documents are parsed?
>
> In the case of html documents crawled on the wild web, a single document
> can cause the heap to explode. By default the cyberneko parser (in
> HtmlParser) is used for html documents. I hacked this library so that
> there are limits on the number of elements that are loaded during a
> parse. (I'm still trying to find a way to contribute this back into the
> codebase.)
>
> Ferdy.
>
> On Wed, Aug 8, 2012 at 10:03 PM, Niccolò Becchi <[email protected]> wrote:
>
> > If you are using Nutch on a Hadoop cluster and you have enough memory,
> > try these parameters:
> >
> > <property>
> >   <name>mapred.child.java.opts</name>
> >   <value>-Xmx1600m -XX:-UseGCOverheadLimit
> >   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp</value>
> > </property>
> >
> > On Wed, Aug 8, 2012 at 9:32 PM, Bai Shen <[email protected]> wrote:
> >
> > > Is this something other people are seeing? I was parsing 10k urls
> > > when I got this exception. I'm running Nutch 2 head as of Aug 6 with
> > > the default memory settings (1 GB).
> > >
> > > Just wondering if anybody else has experienced this on Nutch 2.
> > >
> > > Thanks.
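
In case it helps anyone hitting the same problem: below is a rough sketch of how one might cap the element count with a NekoHTML filter. Ferdy's actual patch isn't in the codebase, so the class name, the limit value, and the choice of aborting with an XNIException are just my assumptions, not what Nutch does.

    import org.apache.xerces.xni.Augmentations;
    import org.apache.xerces.xni.QName;
    import org.apache.xerces.xni.XMLAttributes;
    import org.apache.xerces.xni.XNIException;
    import org.apache.xerces.xni.parser.XMLDocumentFilter;
    import org.cyberneko.html.filters.DefaultFilter;
    import org.cyberneko.html.parsers.DOMParser;
    import org.xml.sax.InputSource;

    import java.io.StringReader;

    // Counts start tags and aborts the parse once a cap is reached, so a
    // single pathological page can't blow up the heap. The cap and the
    // exception type are illustrative only.
    public class ElementLimitFilter extends DefaultFilter {

        private final int maxElements;
        private int count;

        public ElementLimitFilter(int maxElements) {
            this.maxElements = maxElements;
        }

        @Override
        public void startElement(QName element, XMLAttributes attrs, Augmentations augs)
                throws XNIException {
            if (++count > maxElements) {
                throw new XNIException("element limit of " + maxElements + " exceeded");
            }
            super.startElement(element, attrs, augs);
        }

        public static void main(String[] args) throws Exception {
            DOMParser parser = new DOMParser();
            // NekoHTML lets you install custom filters through this property.
            parser.setProperty("http://cyberneko.org/html/properties/filters",
                    new XMLDocumentFilter[] { new ElementLimitFilter(100000) });
            parser.parse(new InputSource(new StringReader("<html><body><p>ok</p></body></html>")));
            System.out.println(parser.getDocument().getDocumentElement().getNodeName());
        }
    }

With the real Nutch HtmlParser the filter would have to be wired in wherever the parser is constructed; the standalone main() above is just there to show the filter mechanism.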

