entire farm fails at the same time with OOM issues

Robert Petersen Tue, 30 Nov 2010 15:04:42 -0800

Greetings, we are running one master and four slaves in our multicore
Solr setup.  We just served searches for our catalog of 8 million
products with this farm during Black Friday and Cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.

However, twice recently, during a time of low load, we have had a fire
drill where Tomcat/Solr failed and became unresponsive after some OOM
heap errors.  Solr wouldn't even serve up its admin pages.  I've had to
go in, manually kill the Tomcat process, and then restart it.  These
Solr slaves are load balanced, and the load balancers continuously probe
them, so any slave that stops serving searches is automatically removed
from the pool.  When all four fail at the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly experience OOM heap errors in the
exact same minute and go dead?  The load balancer kicks them all out at
the same time each time.  Each slave talks only to the master and not to
the other slaves, but the master shows no errors in its logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was that, after the first OOM errors were recorded, the
slaves occasionally could not reach the master.
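
One way to confirm the "same minute" observation is to pull the first
OOM timestamp from each slave's Tomcat log and compare them.  The sketch
below is only illustrative: it assumes Tomcat's default
java.util.logging layout, where a timestamp line such as
"Nov 30, 2010 2:15:07 PM ..." precedes each message, and it assumes an
English locale for month and AM/PM names.  The function names and the
host keys are made up for the example.

```python
import re
from datetime import datetime

# Assumed layout of Tomcat's default java.util.logging output, e.g.
# "Nov 30, 2010 2:15:07 PM org.apache.solr.core.SolrCore execute".
TS_RE = re.compile(r"^([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)")
TS_FORMAT = "%b %d, %Y %I:%M:%S %p"


def first_oom_minute(lines):
    """Return the minute of the first OutOfMemoryError, or None if absent."""
    last_ts = None
    for line in lines:
        match = TS_RE.match(line)
        if match:
            # Remember the most recent timestamp line seen so far.
            last_ts = datetime.strptime(match.group(1), TS_FORMAT)
        if "java.lang.OutOfMemoryError" in line and last_ts is not None:
            # Truncate to the minute so hosts can be compared directly.
            return last_ts.replace(second=0)
    return None


def oom_minutes_by_host(logs):
    """Map each host name (hypothetical keys) to its first OOM minute."""
    return {host: first_oom_minute(lines) for host, lines in logs.items()}
```

Here `logs` would be a dict mapping a slave name to the lines of its
Tomcat log.  If every slave comes back with the same minute, that minute
is the one to look up in the master's log.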

 

This behavior makes me a little nervous...  =:-o  eek!

Environment: Lucid Imagination distro of Solr 1.4 on Tomcat

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad-core Xeon machines
with 64GB of memory
