Hi Adam,

> ... the client is created before the leader is established, it seems to
> have trouble communicating to the leader. ...

Provided that the group is configured correctly, a client finds the leader
automatically: a request sent to a non-leader fails with
NotLeaderException and the client then retries.  Could the group be
misconfigured?
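
To make the retry behaviour concrete, here is a minimal, self-contained
sketch of the idea.  It does not use the Ratis API: the NotLeader and Peer
classes, the leader hint, and the send method are all hypothetical names
made up for illustration.  It only shows why a client created before the
election can still reach the leader, as long as it knows every peer in the
group.

```java
import java.util.List;

final class LeaderRetrySketch {
  /** Hypothetical stand-in for a "not leader" reply; -1 means no leader is known yet. */
  static final class NotLeader extends Exception {
    final int suggestedLeader;
    NotLeader(int suggestedLeader) {
      super("not leader");
      this.suggestedLeader = suggestedLeader;
    }
  }

  /** Hypothetical peer that accepts a request only when it is the leader. */
  static final class Peer {
    final int id;
    final int knownLeader;
    Peer(int id, int knownLeader) {
      this.id = id;
      this.knownLeader = knownLeader;
    }
    String handle(String request) throws NotLeader {
      if (id != knownLeader) {
        throw new NotLeader(knownLeader);
      }
      return "peer" + id + " applied " + request;
    }
  }

  /** Sends to any peer first, then follows the hint (or tries the next peer) on failure. */
  static String send(List<Peer> peers, String request, int maxAttempts) {
    int target = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return peers.get(target).handle(request);
      } catch (NotLeader e) {
        // Use the hint when there is one; otherwise move on to the next peer.
        target = e.suggestedLeader >= 0 ? e.suggestedLeader : (target + 1) % peers.size();
      }
    }
    throw new IllegalStateException("no leader found after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    List<Peer> peers = List.of(new Peer(0, 2), new Peer(1, 2), new Peer(2, 2));
    System.out.println(send(peers, "x", 5));
  }
}
```

If the client's peer list does not match the actual group, this loop can
never reach the leader, which is why a misconfigured group is the first
thing to rule out.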

> For some reason, any rebuild causes the server to hang indefinitely ...

You can take a thread dump of the hung server (for example with jstack) to
see what the threads are waiting for.
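
If attaching an external tool is inconvenient, the same information can be
collected from inside the JVM.  The following is a minimal sketch using only
the standard java.lang.management API; the class name ThreadDump and the
output layout are my own choices, not anything provided by Ratis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDump {
  /** Returns the state, awaited lock and full stack trace of every live thread. */
  static String dump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    boolean monitors = mx.isObjectMonitorUsageSupported();
    boolean synchronizers = mx.isSynchronizerUsageSupported();
    StringBuilder sb = new StringBuilder();
    for (ThreadInfo info : mx.dumpAllThreads(monitors, synchronizers)) {
      sb.append('"').append(info.getThreadName()).append("\" ").append(info.getThreadState());
      if (info.getLockName() != null) {
        // The lock or condition this thread is blocked or waiting on.
        sb.append(" on ").append(info.getLockName());
      }
      if (info.getLockOwnerName() != null) {
        sb.append(" held by \"").append(info.getLockOwnerName()).append('"');
      }
      sb.append('\n');
      for (StackTraceElement frame : info.getStackTrace()) {
        sb.append("    at ").append(frame).append('\n');
      }
    }
    // Cycles of threads waiting on each other's locks, if the JVM can detect them.
    long[] deadlocked = synchronizers ? mx.findDeadlockedThreads() : mx.findMonitorDeadlockedThreads();
    sb.append("deadlocked threads: ").append(deadlocked == null ? 0 : deadlocked.length).append('\n');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(dump());
  }
}
```

Threads in BLOCKED or WAITING state inside the state machine, or inside the
client it owns, are the ones to look at first.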

> ... some guidance (or warnings!) on how to manage a primary / secondary
> cluster and ensure health. ...

In Apache Ozone, we have a similar situation with the Ozone Manager (OM)
and the Storage Container Manager (SCM).  Both use Ratis for HA.  To avoid
distributed deadlock, communication is always initiated from OM to SCM; SCM
never calls back into OM.  You may consider doing it in a similar way.
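
As a rough illustration of how that could map onto a primary / secondary
setup, here is a small sketch in which only the primary initiates calls.
Everything in it is hypothetical (the Primary and Secondary classes, the
submit/poll methods, the in-memory maps); it is not how Ozone is
implemented, and it assumes the primary can poll for results instead of
being called back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class OneWaySketch {
  /** Hypothetical secondary: it only answers calls and never calls the primary. */
  static final class Secondary {
    private final Map<Integer, String> results = new HashMap<>();

    void submit(int jobId, String input) {
      // Stand-in for the CPU- or I/O-intensive work.
      results.put(jobId, input.toUpperCase());
    }

    Optional<String> poll(int jobId) {
      return Optional.ofNullable(results.get(jobId));
    }
  }

  /** Hypothetical primary: it initiates both the hand-off and the status check. */
  static final class Primary {
    private final Secondary secondary;
    private final Map<Integer, String> state = new HashMap<>();

    Primary(Secondary secondary) {
      this.secondary = secondary;
    }

    void delegate(int jobId, String input) {
      state.put(jobId, "PENDING");
      secondary.submit(jobId, input);
    }

    void refresh(int jobId) {
      // The primary pulls the result; the secondary never pushes it.
      secondary.poll(jobId).ifPresent(result -> state.put(jobId, "DONE:" + result));
    }

    String status(int jobId) {
      return state.get(jobId);
    }
  }

  public static void main(String[] args) {
    Primary primary = new Primary(new Secondary());
    primary.delegate(1, "abc");
    primary.refresh(1);
    System.out.println(primary.status(1));
  }
}
```

Because calls flow in one direction only, the secondary never waits on the
primary while the primary is waiting on it, so that particular wait cycle
cannot form.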

If you have particular questions on Ratis, please feel free to let us know.

Hope it helps.
Tsz-Wo


On Thu, Feb 27, 2025 at 3:36 PM Adam Zionts <[email protected]> wrote:

> Hi there,
> We're looking for some guidance on patterns and implementation for a
> server implementation that involves two raft clusters — a primary & a
> secondary.
>
> The primary cluster is responsible for front-line communications and
> managing database state. The secondary cluster is responsible for off-line
> processing; anything that's CPU or I/O intensive. The primary logs user
> requests and delegates work to the secondary, who reports back with an
> update along the processing chain & waits for the primary to update state
> and trigger the next step.
>
> This has caused us to run into a few problems in our implementation:
> - The client is instantiated as part of the state machine, since messages
> are sent between primary <--> secondary throughout the process. Since the
> client is created before the leader is established, it seems to have
> trouble communicating to the leader. We've created a wrapper around the
> client that refreshes itself when the leader changes for this purpose
> - Only the leader in both the primary & the secondary triggers a call-back
> to the other, since otherwise there is an exponential explosion.
> - For some reason, any rebuild causes the server to hang indefinitely
> without any logging to indicate a crash (e.g. no StateMachineUpdater
> catching exceptions)
>
> Right now we're still in relatively early stages of our implementation,
> though we have a working product that uses Ratis as our distributed
> consensus model for the back-end. Before we get too deep into the
> implementation, though, we'd love some guidance (or warnings!) on how to
> manage a primary / secondary cluster and ensure health. Has anyone built
> something like that before?
>
> All the best,
> Adam Zionts
>
