What would I do with the heap dump though? Run one of those Java heap analyzers looking for memory leaks or something? I have no experience with those. I saw there was a bug fix in Solr 1.4.1 for a 100-byte memory leak occurring on each commit, but it would take thousands of commits to make that add up to anything, right?
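As a rough sanity check on that last point, the arithmetic can be sketched in a few lines of Python. Only the 100 bytes per commit figure comes from the thread; the 1 GiB heap size and the function names are assumptions made up for illustration, since the actual heap setting is not given here.

```python
# Back-of-the-envelope estimate of how much heap a fixed per-commit leak eats.
# Only the 100-byte figure comes from the thread; everything else is assumed.

LEAK_BYTES_PER_COMMIT = 100      # leak size mentioned for the Solr 1.4.1 fix
ASSUMED_HEAP_BYTES = 1024 ** 3   # hypothetical 1 GiB heap, not from the thread


def leaked_bytes(commits, per_commit=LEAK_BYTES_PER_COMMIT):
    """Total bytes leaked after the given number of commits."""
    return commits * per_commit


def commits_to_fill(heap_bytes, per_commit=LEAK_BYTES_PER_COMMIT):
    """Smallest number of commits whose leak reaches heap_bytes (ceiling division)."""
    return -(-heap_bytes // per_commit)


if __name__ == "__main__":
    for commits in (1_000, 10_000, 100_000):
        print(commits, "commits leak", leaked_bytes(commits), "bytes")
    print("commits to fill the assumed heap:", commits_to_fill(ASSUMED_HEAP_BYTES))
```

Under these assumptions, 10,000 commits leak about 1 MB, and it would take more than ten million commits to fill 1 GiB, so that leak on its own seems unlikely to explain every slave failing in the same minute.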

-----Original Message-----
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<path to where you want the file to go>, so then you have something to look at versus a Gedankenexperiment :)

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

> Greetings, we are running one master and four slaves of our multicore
> solr setup. We just served searches for our catalog of 8 million
> products with this farm during black Friday and cyber Monday, our
> busiest days of the year, and the servers did not break a sweat!
> Index size is about 28GB.
>
> However, twice now recently during a time of low load we have had a
> fire drill where I have seen tomcat/solr fail and become unresponsive
> after some OOM heap errors. Solr wouldn't even serve up its admin
> pages. I've had to go in and manually knock tomcat out of memory and
> then restart it. These solr slaves are load balanced and the load
> balancers always probe the solr slaves so if they stop serving up
> searches they are automatically removed from the load balancer. When
> all four fail at the same time we have an issue!
>
> My question is this. Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead? The load balancer kicks them
> all out at the same time each time. Each slave only talks to the
> master and not to each other, but the master show no errors in the
> logs at all. Something must be triggering this though. The only other
> odd thing I saw in the logs was after the first OOM errors were
> recorded, the slaves started occasionally not being able to get to
> the master.
>
> This behavior makes me a little nervous... =:-o eek!
>
> Environment: Lucid Imagination distro of Solr 1.4 on Tomcat
>
> Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
> 64GB memory etc etc

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g