Hi Al, I don't see attachments to this message. It might be easier to open a jira and upload the logs there.
-Flavio On Sep 7, 2012, at 10:28 PM, Alan Horn wrote: > > Hi, > > We have a five node cluster, recently upgraded from 3.3.5 to 3.4.3. Was > running fine for a few weeks after the upgrade, then the following sequence > of events occurred : > > 1. All servers stopped responding to 'ruok' at the same time > 2. Our local supervisor process restarted all of them at the same time > (yes, this is bad, we didn't expect it to fail this way :) > 3. The cluster would not serve requests after this. Appeared to be unable to > complete an election. > > We tried various things at this point, none of which worked : > > * Moved around the restart order of the nodes (e.g. 4 thru 0, instead of 0 > thru 4) > * Reduced number of running nodes from 5 -> 3 to simplify the quorum, by only > starting up 0, 1 & 2, in one test, and 0, 2 & 4 in the other > * Removed the *Epoch files from version-2/ snapshot directory > * Put the same version2/snapshot.xxxxx file on each server in the cluster > * Added the (same on all nodes) last txlog onto each cluster > * Kept only the last snapshot plus txlog unique on each server > * Moved leaderServes=no to leaderServes=yes > * Removed all files and started up with empty data as a control. This worked, > but of course isn't terribly useful :) > > Finally, I brought the data up on a single node running in standalone and > this worked (yay!) So at this point we brought the single node back into > service and have kept the other four available to debug why the election is > failing. > > We downgraded the four nodes to 3.3.5, and then they completed the election > and started serving as expected. > We did a rolling upgrade to 3.4.3, and everything was fine until we restarted > the leader, whereupon we encountered the same re-election loop as before. > > We're a bit out of ideas at this point, so I was hoping someone from this > list might have some useful input. > > Output from two followers and a leader during this condition are attached. > > Cheers, > > Al > > >
