How does the whole multiple-segments approach work? The only stack trace I get is the OOM exception; I haven't found anything else indicating what's using up all of the memory.
If I use a shell script to execute the Nutch commands instead of a Java program, I don't get the OOM exception, and both are just infinite loops that call the various Nutch steps in order (a rough sketch of such a loop is at the end of this message).

On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> On Monday 19 December 2011 15:57:02 Bai Shen wrote:
> > AFAIK, mapred.map.child.java.opts is not set, but I'll double check.
> >
> > When you say threads, you're referring to fetcher threads, correct? I'm
> > using the default ten threads. And the JVM reuse is set to -1, so it
> > shouldn't be reusing them.
>
> That sounds fine.
>
> > The problem only occurs after several hours of crawling.
>
> Ah, you might want to debug all your Hadoop options now. It may fail during
> processing of your mapper output. This is very tedious to debug, but you
> must follow the stack trace when it happens again. Most likely just a
> Hadoop issue.
>
> Also, try to fetch fewer URLs but more segments.
>
> > On Fri, Dec 16, 2011 at 3:13 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> > > Are you running with too many threads perhaps? It takes up additional
> > > RAM. Also, you must really verify that that is the actual heap space
> > > that is allocated. We usually use mapred.map.child.java.opts to set
> > > heap space for mappers specifically. The child opts is, in our case,
> > > used by the datanodes and jobtrackers.
> > >
> > > > The fetcher reduce jobs are what failed. Two completed, but the
> > > > third died. It tried to run on all three data nodes with the same
> > > > results.
> > > >
> > > > MapReduce Child Java Maximum Heap Size is set to 1073741824
> > > >
> > > > The description states the following.
> > > >
> > > > The maximum heap size, in bytes, of the Java child process. This
> > > > number will be formatted and concatenated with the 'base' setting
> > > > for 'mapred_child_java_opts' to pass to Hadoop. Can be made final
> > > > (see below) to prevent clients from overriding it. Will be part of
> > > > generated client configuration.
> > > >
> > > > I thought there was another heap setting, but I'm not sure where to
> > > > find it in Cloudera.
> > > >
> > > > On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma
> > > > <markus.jel...@openindex.io> wrote:
> > > > > What jobs exit with OOM? What is your heap size for the mapper
> > > > > and reducer?
> > > > >
> > > > > On Friday 16 December 2011 17:13:45 Bai Shen wrote:
> > > > > > I've tried running Nutch in local, pseudo, and fully distributed
> > > > > > mode, and I keep getting OutOfMemoryErrors. I'm running Nutch
> > > > > > using a slightly modified version of the Crawler code that's
> > > > > > included. Basically, I've modified it to continuously crawl
> > > > > > instead of stopping after a set number of cycles.
> > > > > >
> > > > > > I have Hadoop set not to reuse JVMs, so I'm not sure what the
> > > > > > leak is. Any suggestions on what to look at?
> > > > > >
> > > > > > Thanks.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex
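For reference, a minimal sketch of the kind of shell loop described above: an endless generate/fetch/parse/updatedb cycle driving the Nutch 1.x commands. The crawldb/linkdb/segment paths, the urls/ seed directory, and the -topN value are placeholder assumptions rather than values from this thread, and picking the newest segment with ls assumes local mode on the local filesystem rather than HDFS.

  #!/bin/sh
  # Sketch of a continuous Nutch 1.x crawl loop (local mode assumed;
  # paths, seed dir and -topN are placeholders, not values from this thread).
  CRAWLDB=crawl/crawldb
  LINKDB=crawl/linkdb
  SEGMENTS=crawl/segments

  bin/nutch inject $CRAWLDB urls/                  # seed the crawldb once

  while true; do
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000
    SEGMENT=$(ls -d $SEGMENTS/2* | tail -1)        # newest segment (local fs only)
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb $CRAWLDB $SEGMENT
    bin/nutch invertlinks $LINKDB -dir $SEGMENTS
  done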
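On the heap settings discussed in the quoted messages: the Nutch 1.x jobs run through Hadoop's ToolRunner, so if the cluster's Hadoop build supports the per-task-type properties (the map-side mapred.map.child.java.opts is named above; the reduce-side mapred.reduce.child.java.opts is assumed by analogy), the heap for a single job can be raised with generic -D options. Since it is the fetcher's reduce tasks that die here, the reduce-side value is the interesting one; the -Xmx numbers below are illustrative only.

  # Sketch: raise per-task heap for one fetch job via Hadoop generic options.
  # -D properties must come before the positional arguments; values are examples.
  bin/nutch fetch \
      -D mapred.map.child.java.opts=-Xmx1024m \
      -D mapred.reduce.child.java.opts=-Xmx2048m \
      $SEGMENT

The same two properties can also be set cluster-wide in mapred-site.xml (or via the Cloudera Manager setting quoted above) instead of per job.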
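And a short sketch of Markus's "fetch fewer URLs but more segments" suggestion: cap each segment with a small -topN and run several short generate/fetch/update cycles per round, so each fetch and its reduce step handle less data at a time. The counts here are arbitrary placeholders.

  # Sketch: several small segments per round instead of one large one.
  for i in 1 2 3 4; do
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 2500
    SEGMENT=$(ls -d $SEGMENTS/2* | tail -1)        # newest segment (local fs)
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb $CRAWLDB $SEGMENT
  done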