Greetings, we are running one master and four slaves in our multicore Solr setup. We just served searches for our catalog of 8 million products with this farm during Black Friday and Cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB.
However, twice recently, during periods of low load, we have had a fire drill: Tomcat/Solr failed and became unresponsive after some OOM heap errors. Solr wouldn't even serve its admin pages. I had to go in, manually kill the Tomcat process, and then restart it.

The Solr slaves are load balanced, and the load balancers continually probe them, so any slave that stops serving searches is automatically removed from the load balancer. When all four fail at the same time, we have an issue!

My question is this: why in the world would all of my slaves, after running fine for some days, suddenly experience OOM heap errors at the exact same minute and go dead? The load balancer kicks them all out at the same time, each time. Each slave talks only to the master and not to the other slaves, but the master shows no errors in its logs at all. Something must be triggering this, though. The only other odd thing I saw in the logs was that, after the first OOM errors were recorded, the slaves occasionally could not reach the master.

This behavior makes me a little nervous... =:-o eek!

Environment: Lucid Imagination distro of Solr 1.4 on Tomcat
Platform: RHEL with Sun JRE 1.6.0_18 on dual quad Xeon machines with 64GB memory etc etc