On Thursday 22 December 2011 19:36:29 Bai Shen wrote:
> How does the whole multiple segments work?

Use the generator to create multiple segments in one go.
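Roughly something like this; the paths and numbers are made-up examples, and it
assumes a 1.x Generator that knows the -maxNumSegments option:

  # One generate run, several smaller segments. -topN limits how many URLs are
  # selected, -maxNumSegments lets the run write several segments instead of one.
  bin/nutch generate crawl/crawldb crawl/segments -topN 25000 -maxNumSegments 4

  # Then fetch/parse/update each freshly generated segment in turn
  # (assuming crawl/segments holds only the new segments).
  for seg in crawl/segments/*; do
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
  done

Smaller segments keep each fetch job, and especially its reduce step, small
enough to stay inside the child heap.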
> And the only stack trace I get is the OOM exception. I haven't found
> anything else indicating what's using up all of the memory.

The log should provide more.

> If I use a shell script to execute the nutch commands instead of a java
> program I don't get the OOM exception.

Ah, then there may be a big leak in that Java program.

> And they're both just infinite loops that call the various nutch parts
> in order.
>
> On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > On Monday 19 December 2011 15:57:02 Bai Shen wrote:
> > > AFAIK, mapred.map.child.java.opts is not set, but I'll double check.
> > >
> > > When you say threads, you're referring to fetcher threads, correct?
> > > I'm using the default ten threads. And the JVM reuse is set to -1,
> > > so it shouldn't be reusing them.
> >
> > That sounds fine.
> >
> > > The problem only occurs after several hours of crawling.
> >
> > Ah, you might want to debug all your hadoop options now. It may fail
> > during processing of your mapper output. This is very tedious to
> > debug but you must follow the stack trace when it happens again.
> > Most likely just a hadoop issue.
> >
> > Also, try to fetch fewer URLs but more segments.
> >
> > > On Fri, Dec 16, 2011 at 3:13 PM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > > > Are you running with too many threads perhaps? It takes up
> > > > additional RAM. Also, you must really verify that that is the
> > > > actual heap space that is allocated. We usually use
> > > > mapred.map.child.java.opts to set heap space for mappers
> > > > specifically. The child opts is, in our case, used by the
> > > > datanodes and jobtrackers.
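For example, something along these lines; the -Xmx values and the segment path
are made up, and it assumes the Nutch tools go through Hadoop's ToolRunner so
the -D properties are picked up (otherwise put the same properties in
mapred-site.xml, or wherever Cloudera lets you override it):

  # Example values only. The generic mapred.child.java.opts applies to all child
  # tasks; where mapred.map.child.java.opts and mapred.reduce.child.java.opts are
  # supported by your Hadoop build, they take precedence for map and reduce tasks
  # respectively, so the fetcher's reduce step can get a bigger heap on its own.
  bin/nutch fetch \
    -D mapred.child.java.opts=-Xmx512m \
    -D mapred.map.child.java.opts=-Xmx512m \
    -D mapred.reduce.child.java.opts=-Xmx1536m \
    crawl/segments/20111222123456

Since it is the fetcher's reduce step that dies, bumping only the reduce opts
may already be enough.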
> > > > > The fetcher reduce jobs are what failed. Two completed, but the
> > > > > third died. It tried to run on all three data nodes with the
> > > > > same results.
> > > > >
> > > > > MapReduce Child Java Maximum Heap Size is set to 1073741824
> > > > >
> > > > > The description states the following.
> > > > >
> > > > > The maximum heap size, in bytes, of the Java child process. This
> > > > > number will be formatted and concatenated with the 'base'
> > > > > setting for 'mapred_child_java_opts' to pass to Hadoop. Can be
> > > > > made final (see below) to prevent clients from overriding it.
> > > > > Will be part of generated client configuration.
> > > > >
> > > > > I thought there was another heap setting, but I'm not sure where
> > > > > to find it in Cloudera.
> > > > >
> > > > > On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > > > > What jobs exit with OOM? What is your heap size for the mapper
> > > > > > and reducer?
> > > > > >
> > > > > > On Friday 16 December 2011 17:13:45 Bai Shen wrote:
> > > > > > > I've tried running Nutch in local, pseudo, and full
> > > > > > > distributed mode, and I keep getting OutOfMemoryErrors. I'm
> > > > > > > running Nutch using a slightly modified version of the
> > > > > > > Crawler code that's included. Basically, I've modified it to
> > > > > > > continuously crawl instead of stopping after a set number of
> > > > > > > cycles.
> > > > > > >
> > > > > > > I have hadoop set not to reuse JVMs, so I'm not sure what
> > > > > > > the leak is. Any suggestions on what to look at?
> > > > > > >
> > > > > > > Thanks.
> > > > > >
> > > > > > --
> > > > > > Markus Jelsma - CTO - Openindex
> >
> > --
> > Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex
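P.S. One note on the JVM reuse mentioned above, since the value is easy to read
backwards: with mapred.job.reuse.jvm.num.tasks, 1 means one task per child JVM
(no reuse, the default) and -1 means a child JVM is reused for an unlimited
number of tasks. A quick sketch, again with a made-up segment path and assuming
-D is honoured:

  # Force a fresh child JVM per task while hunting the leak
  # (1 = no reuse, -1 = unlimited reuse).
  bin/nutch fetch -D mapred.job.reuse.jvm.num.tasks=1 crawl/segments/20111222123456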