Don't suppose you could share some snippets of the log messages you're seeing a lot?
On Wed, Apr 29, 2015 at 1:29 PM, Karol Dudzinski <[email protected]> wrote: > Did a bit more investigation today though I didn't try the prod snapshot > in a test cluster. I did however find a couple of other strange things > which I'd initially missed. > > When the leader of this particular cluster goes for a reboot, I see many > notification printouts from FastLeaderElection. On some occasions I've > seen nearly 50. In other clusters I usually see 5 or thereabouts. The > number of these notifications seems to correlate with how long election > took. What could cause it to require exchanging so many messages? > > This cluster also seems to often timeout while syncing after selecting the > leader. On the new leader I see the expected number of "Follower sid: ..." > messages but it doesn't always get to the "Synchronizing with Follower" > part. On the leader I then get a "Unexpected exception causing shutdown > while sock still open" message followed by a GOODBYE for one of the > followers. At the same time on that follower I see a "Exception when > following the leader" caused by a read timeout in > Learner.registerWithLeader. Does anything happen prior to syncing that > could take a long time? > > Thanks, > Karol > > > On 28 Apr 2015, at 20:22, Karol Dudzinski <[email protected]> > wrote: > > > > Well these are prod clusters so my ability to experiment is rather > limited. I can take a copy of the snapshot and try both 3 node and 5 in a > test cluster. > > > > One thing I forgot to mention is that in most clusters the number of > election notification log lines I see is typically, give or take, the same > as the number of participants. In this cluster however, it's typically 2 > or 3 times as many notifications as the number of participants. > > > > My gut feeling is it's more likely to be due to load as the 5 node > cluster is much busier and the election time has been increasing over time > (as has load). I have no idea exactly what load though, whether it's > number of clients, frequency of transactions, total data size, etc. I > don't understand why though but that may just be my limited knowledge of > the election protocol. > > > > Karol > > > >> On 28 Apr 2015, at 19:54, Camille Fournier <[email protected]> wrote: > >> > >> Just out of curiosity, if you start the 5 node cluster up with only 3 of > >> the nodes to begin with (like, config 5, but only bring up 3 processes), > >> does it speed up the leader election or is it still slow? > >> > >> C > >> > >> On Tue, Apr 28, 2015 at 1:41 PM, Karol Dudzinski < > [email protected]> > >> wrote: > >> > >>> Hi, > >>> > >>> We're seeing some rather strange leader election in one of our > clusters. > >>> The duration reported by the "FOLLOWING - LEADER ELECTION TOOK" log > line > >>> (and equivalent for the leader) seems to vary hugely. During one > rolling > >>> reboot, I saw the number reported as small as 39ms and as large as 57 > >>> seconds (difference in units is not a typo). The average is just > about 10 > >>> seconds and std dev also about 10 seconds. So the time taken is not > only > >>> quite large, it's also very variable. > >>> > >>> We have other clusters but the average election time in those is in the > >>> hundreds of millis with std dev in a similar ballpark. I guess one > >>> difference is the "slow" cluster is 5 participants while the others > are 3, > >>> which may be a factor but I wouldn't expect it to make two orders of > >>> magnitude difference! > >>> > >>> So my question is, what factors contribute to the election time > reported > >>> by these log lines? And what can we do to speed this up? > >>> > >>> As far as I understand from logs and a quick browse through the code > that > >>> time is the time to select a leader. Syncing up to the leader happens > >>> after that. The syncing part I can understand will vary depending on > load > >>> but I don't see why selecting the leader would. > >>> > >>> Thanks, > >>> Karol >
