Shawn,

Thanks for pointing me in the right direction. After consulting the document you linked, I *think* the problem may be too large a heap, which could be causing long garbage collection pauses and hence the ZK timeouts.
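(As an aside, if long GC pauses do turn out to be the cause, my understanding is that the ZooKeeper client timeout can also be raised as a stopgap while the heap is tuned. A rough sketch of the solrcloud section of solr.xml, assuming the newer Solr 4.x solr.xml format -- the 30000 value is only an example and, as noted in your reply below, cannot exceed 20x the ZooKeeper tickTime:

  <solr>
    <solrcloud>
      <!-- other solrcloud settings (host, hostPort, hostContext, ...) omitted -->
      <!-- example only: 30 seconds instead of the 15-second default;
           must stay within 20 * tickTime as configured on the ZooKeeper ensemble -->
      <int name="zkClientTimeout">30000</int>
    </solrcloud>
  </solr>
)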
We have around 20G of memory on these machines, with heap min/max set to 6G and 10G respectively (-Xms6G -Xmx10G). The rest was set aside for disk cache. Why did we choose 6-10? No reason other than wanting to allot enough for disk cache and then throw everything else at Solr. Does this sound about right?

I took some screenshots from VisualVM and our NewRelic reporting, as well as the relevant portions of our SolrConfig.xml. Any thoughts/comments would be greatly appreciated.

http://postimg.org/gallery/4t73sdks/1fc10f9c/

Thanks

On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 3/22/2014 1:23 PM, Software Dev wrote:
>> We have 2 collections with 1 shard each, replicated over 5 servers in the
>> cluster. We see a lot of flapping (down or recovering) on one of the
>> collections. When this happens, the other collection hosted on the same
>> machine is still marked as active. It takes a fairly long time (~30
>> minutes) for the affected collection to come back online, if at all. I
>> find that it's usually more reliable to completely shut down Solr on the
>> affected machine and bring it back up with its core disabled. We then
>> re-enable the core once it's marked as active.
>>
>> A few questions:
>>
>> 1) What is the healthcheck in SolrCloud? Put another way, what is failing
>> that marks one collection as down but the other on the same machine as up?
>>
>> 2) Why does recovery take forever when a node goes down, even if it's only
>> down for 30 seconds? Our index is only 7-8G and we are running on SSDs.
>>
>> 3) What can be done to diagnose and fix this problem?
>
> Unless you are actually using the ping request handler, the healthcheck
> config will not matter. Or were you referring to something else?
>
> Referencing the logs you included in your reply: The EofException
> errors happen because your client code times out and disconnects before
> the request it made has completed. That is most likely just a symptom
> that has nothing at all to do with the problem.
>
> Read the following wiki page. What I'm going to say below will
> reference information you can find there:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Relevant side note: The default ZooKeeper client timeout is 15 seconds.
> A typical ZooKeeper config defines tickTime as 2 seconds, and the
> timeout cannot be configured to be more than 20 times the tickTime,
> which means it cannot go beyond 40 seconds. The default timeout of
> 15 seconds is usually more than enough, unless you are having
> performance problems.
>
> If you are not actually taking Solr instances down, then the fact that
> you are seeing the log replay messages indicates to me that something is
> taking so much time that the connection to ZooKeeper times out. When the
> instance finally responds, it will attempt to recover the index, which
> means it will first replay the transaction log and then it might
> replicate the index from the shard leader.
>
> Replaying the transaction log is likely the reason it takes so long to
> recover. The wiki page I linked above has a "slow startup" section that
> explains how to fix this.
>
> There is some kind of underlying problem that is causing the ZooKeeper
> connection to time out. It is most likely garbage collection pauses or
> insufficient RAM to cache the index, possibly both.
>
> You did not indicate how much total RAM you have or how big your Java
> heap is.
> As the wiki page mentions in the SSD section, SSD is not a
> substitute for having enough RAM to cache a significant percentage of
> your index.
>
> Thanks,
> Shawn
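For reference, the "slow startup" fix on that wiki page boils down to enabling a hard autoCommit with openSearcher=false, so the transaction logs are rotated frequently and stay small enough to replay quickly on recovery. A rough sketch of the relevant solrconfig.xml section -- the 15-second maxTime is only an illustration and should be tuned to the indexing rate:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <!-- hard commit: flushes segments and starts a new transaction log;
         openSearcher=false keeps it cheap because no new searcher is opened -->
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

Soft commits (or explicit commits from the client) still control when new documents become visible to searches; the hard autoCommit above only limits how much uncommitted data accumulates in the transaction log.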