What would I do with the heap dump though?  Run one of those Java heap
analyzers looking for memory leaks or something?  I have no experience
with those.  I saw there was a bug fix in Solr 1.4.1 for a 100 byte
memory leak occurring on each commit, but it would take thousands of
commits to make that add up to anything, right?
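
A rough back-of-the-envelope check of that, assuming the leak really is
a flat 100 bytes per commit and nothing else grows with it.  The class
and method names are made up for illustration only:

```java
public class LeakEstimate {
    /**
     * Number of commits needed to leak at least targetBytes when each
     * commit leaks bytesPerCommit (integer division, rounded up).
     */
    static long commitsToLeak(long targetBytes, long bytesPerCommit) {
        return (targetBytes + bytesPerCommit - 1) / bytesPerCommit;
    }

    public static void main(String[] args) {
        long leakPerCommit = 100;   // bytes per commit, the figure quoted above
        long oneMiB = 1L << 20;
        long oneGiB = 1L << 30;
        System.out.println("Commits to leak 1 MiB: " + commitsToLeak(oneMiB, leakPerCommit));
        System.out.println("Commits to leak 1 GiB: " + commitsToLeak(oneGiB, leakPerCommit));
    }
}
```

That works out to roughly 10,500 commits per MiB and about 10.7 million
commits per GiB, so on its own a leak of that size seems unlikely to
exhaust a heap within a few days.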

-----Original Message-----
From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=<path to where you want the file to go>, so then  
you have something to look at versus a Gedankenexperiment :)
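
One way to confirm the two flags above actually reached the JVM is to
look at the arguments the running JVM reports about itself.  This is
only a sketch: the class and method names are hypothetical, it only
sees flags passed on the command line, and it reports on the JVM it
runs in, so it would have to be called from inside the Tomcat process
(a test servlet or JSP, say) to tell you anything about Solr.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class HeapDumpFlagCheck {
    static final String ENABLE_FLAG = "-XX:+HeapDumpOnOutOfMemoryError";
    static final String PATH_PREFIX = "-XX:HeapDumpPath=";

    /** True if the heap-dump-on-OOM flag appears in the given JVM arguments. */
    static boolean heapDumpEnabled(List<String> jvmArgs) {
        return jvmArgs.contains(ENABLE_FLAG);
    }

    /** The configured heap dump path, or null if no path flag was given. */
    static String heapDumpPath(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(PATH_PREFIX)) {
                return arg.substring(PATH_PREFIX.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments this JVM was started with (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap dump on OOM enabled: " + heapDumpEnabled(jvmArgs));
        System.out.println("Heap dump path: " + heapDumpPath(jvmArgs));
    }
}
```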

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

> Greetings, we are running one master and four slaves of our multicore
> Solr setup.  We just served searches for our catalog of 8 million
> products with this farm during Black Friday and Cyber Monday, our
> busiest days of the year, and the servers did not break a sweat!
> Index size is about 28GB.
> However, twice now recently during a time of low load we have had a
> fire drill where I have seen Tomcat/Solr fail and become unresponsive
> after some OOM heap errors.  Solr wouldn't even serve up its admin
> pages.  I've had to go in and manually knock Tomcat out of memory and
> then restart it.  These Solr slaves are load balanced and the load
> balancers always probe the Solr slaves so if they stop serving up
> searches they are automatically removed from the load balancer.  When
> all four fail at the same time we have an issue!
> My question is this.  Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?  The load balancer kicks them
> all out at the same time each time.  Each slave only talks to the
> master and not to each other, but the master shows no errors in the
> logs at all.
> Something must be triggering this though.  The only other odd thing I
> saw in the logs was after the first OOM errors were recorded, the
> slaves started occasionally not being able to get to the master.
> This behavior makes me a little nervous...    =:-o  eek!
> Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat
> Platform: RHEL with Sun JRE 1.6.0_18 on dual quad Xeon machines with
> 64GB memory etc etc

+1 530-265-2225

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
