Thanks for this - yes, I realised that was a different method but wondered if that was another way of achieving the same thing.

Having pondered it, I nearly ripped everything out to replace with JRaft on Sunday but then figured it *must* be something I was doing - there was no way this wouldn't work. I then took my simple test state machine where I first tried out Ratis and implemented these in there yesterday afternoon and found it worked fine - with it working, I could then compare and contrast settings and config and see what was happening... I found it was indeed a stupid mistake I'd made.

On one host the peer naming was number-based, and it turns out one server had n0, n1 and n2 while the others had n1, n2 and n3.

Apologies... but for others in the same situation, this is worth checking!

Gordon

On Tue, Jun 15, 2021 at 10:25 AM Tsz Wo Sze <[email protected]> wrote:

> Hi Gordon,
>
> Yes, there is a TestRaftServerNoLeaderTimeout for
> testing notifyExtendedNoLeader(..).
>
> The bold line is for *notifyLeaderChanged(..)*, which is a different
> method. The notifyExtendedNoLeader(..) method should work (i.e. it will be
> called after NO_LEADER_TIMEOUT_KEY). Have you overridden it in your
> StateMachine? Otherwise, it is a no-op.
>
> Hope it helps.
> Tsz-Wo
>
> On Fri, Jun 11, 2021 at 11:39 PM Gordon Jahn <[email protected]> wrote:
>
>> Hi folks,
>>
>> Further to this, I'm wondering if I never see the notifyExtendedNoLeader
>> called as I see these exceptions:
>>
>> [grpc-default-executor-5] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed requestVote n2->n1#0: org.apache.ratis.protocol.exceptions.ServerNotReadyException: n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
>> [grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division - n0@group-726F75706964: receive requestVote(PRE_VOTE, n2, group-726F75706964, 23, (t:23, i:566))
>> [grpc-default-executor-5] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed requestVote n2->n1#0: org.apache.ratis.protocol.exceptions.ServerNotReadyException: n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
>> [grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division - n0@group-726F75706964: receive requestVote(PRE_VOTE, n2, group-726F75706964, 23, (t:23, i:566))
>> [grpc-default-executor-5] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed requestVote n2->n1#0: org.apache.ratis.protocol.exceptions.ServerNotReadyException: n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
>> [grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division - n0@group-726F75706964: receive requestVote(PRE_VOTE, n2, group-726F75706964, 23, (t:23, i:566))
>> [grpc-default-executor-5] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed requestVote n2->n1#0: org.apache.ratis.protocol.exceptions.ServerNotReadyException: n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
>> [grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division - n0@group-726F75706964: receive requestVote(PRE_VOTE, n2, group-726F75706964, 23, (t:23, i:566))
>>
>> Is this a known issue? Is there a basic test case somewhere showing
>> notifyExtendedNoLeader working with gRPC and what needs to be configured
>> (is it more than
>> properties.setTimeDuration(RaftServerConfigKeys.Notification.NO_LEADER_TIMEOUT_KEY,
>> TimeDuration.valueOf(5000, TimeUnit.MILLISECONDS)); ?). Equally, would
>> a different transport work out of the box?
>>
>> Alternatively, is there a reason that ServerState.java looks like this:
>>
>> void setLeader(RaftPeerId newLeaderId, Object op) {
>>   if (!Objects.equals(leaderId, newLeaderId)) {
>>     String suffix;
>>     if (newLeaderId == null) {
>>       // reset the time stamp when a null leader is assigned
>>       lastNoLeaderTime = Timestamp.currentTime();
>>       suffix = "";
>>     } else {
>>       Timestamp previous = lastNoLeaderTime;
>>       lastNoLeaderTime = null;
>>       suffix = ", leader elected after " + previous.elapsedTimeMs() + "ms";
>>       *server.getStateMachine().event().notifyLeaderChanged(getMemberId(), newLeaderId);*
>>     }
>>     LOG.info("{}: change Leader from {} to {} at term {} for {}{}",
>>         getMemberId(), leaderId, newLeaderId, getCurrentTerm(), op, suffix);
>>     leaderId = newLeaderId;
>>     if (leaderId != null) {
>>       server.finishTransferLeadership();
>>     }
>>   }
>> }
>>
>> It seems like moving bold line down, outside the else block would mean
>> the state machine is notified when the leader moves to *null* which would
>> give the notification that no leader was present and that data in the store
>> should not be trusted (in this case, my application would resign primary
>> status).
>>
>> Any input on this approach would be useful... it's painful to be so
>> close to a working solution but just unable to actually be told the server
>> state has changed (or even access it as all access to the server state
>> seems to sit behind non-public methods).
>>
>> Regards,
>> Gordon
>>
>> On Fri, Jun 11, 2021 at 1:00 PM Gordon Jahn <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I'm not sure if I'm missing something incredibly simple or just going
>>> about this the wrong way...
>>>
>>> I've implemented a simple Ratis State Machine (extending
>>> BaseStateMachine as per the examples) in order to write a primary-server
>>> selection.
>>>
>>> My design is:
>>>
>>> * Use Ratis as a high availability data store; I don't care which node
>>> is actually the leader of the Ratis group - they can fight it out, and all
>>> my state machines get messages - that's great
>>> * For my application servers, maybe only 2 of 3 can ever be the primary
>>> server and they share the availability and priority via Ratis messages
>>> * The Ratis state machine picks algorithmically the machine to be my
>>> application primary (based on the enabled, priority and ID fields shared)
>>> * The state machine notifies the rest of my application when it is / is
>>> not the primary server
>>> * The machine should not be able to be primary server if it's not
>>> connected to the Ratis group
>>>
>>> I have most of this working, but cannot figure out where to get a
>>> notification / event from Ratis that the peer is not part of the majority
>>> group.
>>>
>>> Implementing notifyFollowerSlowness(RoleInfoProto roleInfoProto) and
>>> notifyExtendedNoLeader(RoleInfoProto roleInfoProto), and setting timeouts
>>> in the config, doesn't seem to result in these being called if I start 3
>>> Ratis nodes then shut 2 down. The state machine's reset method also isn't
>>> called (I thought it might if it was no longer part of a quorum, and then
>>> reinitialise might be called when it rejoined).
>>>
>>> Should I be able to see the disconnection of my state machine somewhere
>>> so I can trigger an event or is this just the wrong approach to take? Is
>>> there a leader election example anywhere to demonstrate this?
>>>
>>> Thanks in advance,
>>> Gordon
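
For readers following the design in the original message, the primary-selection rule (enabled, priority and ID fields) might look roughly like the sketch below. It is plain Java with no Ratis dependency. The `Candidate` class, the tie-break on smallest ID, and the `hasLeader` flag are all assumptions made for illustration; in a real state machine that flag would presumably be maintained from callbacks such as notifyLeaderChanged(..) and notifyExtendedNoLeader(..).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PrimarySelector {
    /** Hypothetical view of the fields each application server shares via Ratis messages. */
    static final class Candidate {
        final String id;
        final boolean enabled;
        final int priority;

        Candidate(String id, boolean enabled, int priority) {
            this.id = id;
            this.enabled = enabled;
            this.priority = priority;
        }
    }

    /**
     * Deterministic choice, so every replica computes the same answer from the same data:
     * enabled candidates only, highest priority first, ties broken by smallest ID.
     * The ordering rule is an assumption made for this sketch.
     */
    static Optional<String> pickPrimary(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.enabled)
                .max(Comparator.comparingInt((Candidate c) -> c.priority)
                        .thenComparing((Candidate c) -> c.id, Comparator.<String>reverseOrder()))
                .map(c -> c.id);
    }

    /** A node acts as primary only if it is selected AND currently sees a Raft leader. */
    static boolean isPrimary(String localId, List<Candidate> candidates, boolean hasLeader) {
        return hasLeader && pickPrimary(candidates).map(localId::equals).orElse(false);
    }

    public static void main(String[] args) {
        List<Candidate> shared = List.of(
                new Candidate("app1", true, 10),
                new Candidate("app2", true, 20),
                new Candidate("app3", false, 30)); // disabled, so never eligible
        System.out.println(pickPrimary(shared).orElse("none")); // prints app2
        System.out.println(isPrimary("app2", shared, true));    // prints true
        System.out.println(isPrimary("app2", shared, false));   // prints false
    }
}
```

Because the rule is a pure function of the replicated data, each node can evaluate it locally; the only node-specific input is whether that node currently believes the group has a leader.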

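
For anyone wanting to check for the peer-naming mistake described at the top of this thread, a quick sanity check is to compare the peer IDs each host is configured with before starting the servers. The sketch below is plain Java and does not touch Ratis; the host names and the idea of collecting each host's peer list into a map are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PeerConfigCheck {
    /**
     * Returns the hosts whose configured peer IDs differ from the first host's.
     * An empty result means every host agrees on the peer names. If the first
     * host is the odd one out, all the other hosts are reported instead.
     */
    static List<String> findMismatchedHosts(Map<String, List<String>> peersByHost) {
        List<String> mismatched = new ArrayList<>();
        Set<String> reference = null;
        for (Map.Entry<String, List<String>> e : peersByHost.entrySet()) {
            Set<String> ids = new TreeSet<>(e.getValue()); // order-insensitive comparison
            if (reference == null) {
                reference = ids; // the first host acts as the reference
            } else if (!ids.equals(reference)) {
                mismatched.add(e.getKey());
            }
        }
        return mismatched;
    }

    public static void main(String[] args) {
        // Hypothetical host names; the peer IDs mirror the mismatch from this thread.
        Map<String, List<String>> peersByHost = new LinkedHashMap<>();
        peersByHost.put("hostA", List.of("n1", "n2", "n3"));
        peersByHost.put("hostB", List.of("n1", "n2", "n3"));
        peersByHost.put("hostC", List.of("n0", "n1", "n2"));
        System.out.println(findMismatchedHosts(peersByHost)); // prints [hostC]
    }
}
```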