Hi folks,

Further to this, I'm wondering whether the reason I never see
notifyExtendedNoLeader called is related to these exceptions:

[grpc-default-executor-5] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed
requestVote n2->n1#0:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
[grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division
- n0@group-726F75706964: receive requestVote(PRE_VOTE, n2,
group-726F75706964, 23, (t:23, i:566))
[grpc-default-executor-5] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed
requestVote n2->n1#0:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
[grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division
- n0@group-726F75706964: receive requestVote(PRE_VOTE, n2,
group-726F75706964, 23, (t:23, i:566))
[grpc-default-executor-5] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed
requestVote n2->n1#0:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
[grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division
- n0@group-726F75706964: receive requestVote(PRE_VOTE, n2,
group-726F75706964, 23, (t:23, i:566))
[grpc-default-executor-5] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService - n0: Failed
requestVote n2->n1#0:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
n0@group-726F75706964 is not in [RUNNING]: current state is STARTING
[grpc-default-executor-5] INFO org.apache.ratis.server.RaftServer$Division
- n0@group-726F75706964: receive requestVote(PRE_VOTE, n2,
group-726F75706964, 23, (t:23, i:566))

Is this a known issue?  Is there a basic test case somewhere showing
notifyExtendedNoLeader working with gRPC, and what needs to be configured?
Is it more than the following?

  properties.setTimeDuration(RaftServerConfigKeys.Notification.NO_LEADER_TIMEOUT_KEY,
      TimeDuration.valueOf(5000, TimeUnit.MILLISECONDS));

Equally, would a different transport work out of the box?

Alternatively, is there a reason that ServerState.java looks like this:

  void setLeader(RaftPeerId newLeaderId, Object op) {
    if (!Objects.equals(leaderId, newLeaderId)) {
      String suffix;
      if (newLeaderId == null) {
        // reset the time stamp when a null leader is assigned
        lastNoLeaderTime = Timestamp.currentTime();
        suffix = "";
      } else {
        Timestamp previous = lastNoLeaderTime;
        lastNoLeaderTime = null;
        suffix = ", leader elected after " + previous.elapsedTimeMs() + "ms";
        *server.getStateMachine().event().notifyLeaderChanged(getMemberId(), newLeaderId);*
      }
      LOG.info("{}: change Leader from {} to {} at term {} for {}{}",
          getMemberId(), leaderId, newLeaderId, getCurrentTerm(), op, suffix);
      leaderId = newLeaderId;
      if (leaderId != null) {
        server.finishTransferLeadership();
      }
    }
  }

It seems that moving the bold line down, outside the else block, would mean
the state machine is also notified when the leader changes to *null*. That
would give the notification that no leader is present and that data in the
store should not be trusted (in this case, my application would resign
primary status).
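
As a minimal sketch of that change, here is a simplified, self-contained
stand-in. It is not the actual Ratis ServerState: LeaderTracker and
LeaderListener are hypothetical names used only for illustration, and peer
IDs are plain strings. It only shows the effect of notifying on every leader
change, including the change to null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Simplified, hypothetical model of the leader bookkeeping; NOT the Ratis ServerState class.
class LeaderTracker {
  /** Hypothetical callback standing in for the state machine's notifyLeaderChanged. */
  interface LeaderListener {
    void leaderChanged(String newLeaderId); // null means "no leader known"
  }

  private final LeaderListener listener;
  private String leaderId; // null until a leader is known

  LeaderTracker(LeaderListener listener) {
    this.listener = listener;
  }

  void setLeader(String newLeaderId) {
    if (!Objects.equals(leaderId, newLeaderId)) {
      // Notification sits outside any null/non-null branch, so it fires for
      // every change of leader, including the change to null.
      listener.leaderChanged(newLeaderId);
      leaderId = newLeaderId;
    }
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    LeaderTracker tracker = new LeaderTracker(id -> events.add(String.valueOf(id)));
    tracker.setLeader("n1"); // leader elected
    tracker.setLeader("n1"); // unchanged, so no event
    tracker.setLeader(null); // leader lost: the listener is told
    tracker.setLeader("n2"); // new leader elected
    System.out.println(events); // prints [n1, null, n2]
  }
}
```

With this shape, an application listener could treat a null leader ID as the
signal to resign primary status. Whether the real notifyLeaderChanged callers
tolerate a null leader ID is an open question here.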

Any input on this approach would be useful. It's painful to be so close to a
working solution but unable to be told that the server state has changed (or
even to read it, as all access to the server state seems to sit behind
non-public methods).

Regards,
Gordon

On Fri, Jun 11, 2021 at 1:00 PM Gordon Jahn <[email protected]> wrote:

> Hi folks,
>
> I'm not sure if I'm missing something incredibly simple or just going
> about this the wrong way...
>
> I've implemented a simple Ratis State Machine (extending BaseStateMachine
> as per the examples) in order to write a primary-server selection.
>
> My design is:
>
> * Use Ratis as a high availability data store; I don't care which node is
> actually the leader of the Ratis group - they can fight it out, and all my
> state machines get messages - that's great
> * For my application servers, maybe only 2 of 3 can ever be the primary
> server and they share the availability and priority via Ratis messages
> * The Ratis state machine picks algorithmically the machine to be my
> application primary (based on the enabled, priority and ID fields shared)
> * The state machine notifies the rest of my application when it is / is
> not the primary server
> * The machine should not be able to be primary server if it's not
> connected to the Ratis group
>
> I have most of this working, but cannot figure out where to get a
> notification / event from Ratis that the peer is not part of the majority
> group.
>
> Implementing notifyFollowerSlowness(RoleInfoProto roleInfoProto) and
> notifyExtendedNoLeader(RoleInfoProto roleInfoProto), and setting timeouts
> in the config, doesn't seem to result in these being called if I start 3
> Ratis nodes then shut 2 down. The state machine's reset method also isn't
> called (I thought it might if it was no longer part of a quorum, and then
> reinitialise might be called when it rejoined).
>
> Should I be able to see the disconnection of my state machine somewhere so
> I can trigger an event or is this just the wrong approach to take?  Is
> there a leader election example anywhere to demonstrate this?
>
> Thanks in advance,
> Gordon
>
