Primary / Secondary raft — Ratis implementation

Adam Zionts Thu, 27 Feb 2025 16:02:13 -0800

Hi there,
We're looking for guidance on patterns for a server design that involves
two Raft clusters: a primary and a secondary.

The primary cluster is responsible for front-line communications and
managing database state. The secondary cluster is responsible for off-line
processing: anything that's CPU or I/O intensive. The primary logs user
requests and delegates work to the secondary. After each step in the
processing chain, the secondary reports back with an update, then waits for
the primary to update state and trigger the next step.

We've run into a few problems in our implementation:
- The client is instantiated as part of the state machine, since messages
are sent between primary <--> secondary throughout the process. Because the
client is created before the leader is established, it seems to have
trouble communicating with the leader. To work around this, we've created a
wrapper around the client that refreshes itself when the leader changes.
- Only the leader in each cluster (primary and secondary) triggers a
call-back to the other cluster, since otherwise every replica would issue
the same call-back and the number of messages would explode.
- For some reason, any rebuild causes the server to hang indefinitely
without any logging to indicate a crash (e.g. no StateMachineUpdater
catching exceptions).
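
A minimal sketch of the pattern in the first two points, for illustration
only. It assumes the state machine is notified somehow when local
leadership or the other cluster's leader changes. Every name here
(LeaderAwareClient, onLocalLeadership, onRemoteLeaderChanged,
callbackIfLeader) is hypothetical and none of it is Ratis API; the client
type is left generic so the factory can build whatever client is really in
use.

```java
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical wrapper; not part of the Ratis API. */
final class LeaderAwareClient<C> {
  // Builds a client aimed at the given remote leader id (assumed factory).
  private final Function<String, C> factory;
  private String remoteLeaderId;   // null until the other cluster's leader is known
  private C client;                // null until first use after a leader change
  private boolean localIsLeader;   // false until this node is told it leads

  LeaderAwareClient(Function<String, C> factory) {
    this.factory = Objects.requireNonNull(factory);
  }

  /** Call from whatever hook reports this node's own leadership. */
  synchronized void onLocalLeadership(boolean isLeader) {
    localIsLeader = isLeader;
  }

  /** Call from whatever hook reports the other cluster's leader. */
  synchronized void onRemoteLeaderChanged(String newLeaderId) {
    if (!Objects.equals(newLeaderId, remoteLeaderId)) {
      remoteLeaderId = newLeaderId;
      client = null; // drop the stale client; a real one would also be closed
    }
  }

  /**
   * Runs the call-back only on the local leader, and only once the remote
   * leader is known. The client is built lazily, so nothing is created
   * before a leader exists. Returns whether the call-back ran.
   */
  synchronized boolean callbackIfLeader(Consumer<C> callback) {
    if (!localIsLeader || remoteLeaderId == null) {
      return false;
    }
    if (client == null) {
      client = factory.apply(remoteLeaderId);
    }
    callback.accept(client);
    return true;
  }
}
```

Followers call callbackIfLeader as well and simply get false back, so a
log entry applied on every replica produces one outbound message rather
than one per replica.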

Right now we're still in relatively early stages of our implementation,
though we have a working product that uses Ratis as our distributed
consensus model for the back-end. Before we get too deep into the
implementation, though, we'd love some guidance (or warnings!) on how to
manage a primary / secondary cluster and ensure health. Has anyone built
something like that before?

All the best,
Adam Zionts
