Hi Brent,

AFAIK, there is no way to tell the controller to suspend a specific state
transition. Even if you reject the transition (and rejection is not
officially supported either), the controller will probably retry it in
each subsequent rebalance pipeline run.

Alternatively, from your description, I think "Slave" covers 2 states in
your system: 1. a new Slave that is out of sync, and 2. a sync-ed Slave.
Is it possible for you to define a customized state model that
differentiates these 2 states? E.g. Offline -> Syncing -> Slave, etc.
Even simpler, is it OK to restrict the definition of Slave to the 2nd case?
Meaning that before a partition has synced with the Master, it shall not
mark itself as a Slave. This implies the Offline -> Slave transition would
take longer, but once it is done, the Slave partition would be fully ready.
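
To illustrate the first option, here is a rough, library-free sketch of a
transition table with an intermediate Syncing state. It does not use the
Helix API; the class name, the enum, and the exact set of edges (including
the paths back to Offline) are my own assumptions for illustration only.
The point is that there is no Syncing -> Master edge, so an out-of-sync
replica can never be asked to go straight to Master.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of the suggested state machine (NOT the Helix API).
public class SyncingStateModelSketch {
    enum State { OFFLINE, SYNCING, SLAVE, MASTER }

    // Assumed legal direct transitions. SYNCING has no edge to MASTER,
    // so only a fully synced SLAVE is eligible for promotion.
    static final Map<State, Set<State>> NEXT = Map.of(
        State.OFFLINE, Set.of(State.SYNCING),
        State.SYNCING, Set.of(State.SLAVE, State.OFFLINE),
        State.SLAVE,   Set.of(State.MASTER, State.OFFLINE),
        State.MASTER,  Set.of(State.SLAVE));

    // Returns true only if `to` is directly reachable from `from`.
    static boolean canTransition(State from, State to) {
        return NEXT.getOrDefault(from, Set.of()).contains(to);
    }
}
```

With a table like this, the participant would only report Slave after the
Syncing -> Slave transition completes, i.e. after the initial sync is done.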

BTW, we encourage users to use inclusive language. Maybe you can
consider switching to the LeaderStandby SMD? We might deprecate the
MasterSlave SMD in the near future.

Best Regards,
Jiajun


On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:

> I had asked a question a while back about how to deal with a failed state
> transition (
> http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E)
> and the correct answer there was to throw an exception to cause an ERROR
> state in the state machine.
>
> I have a slightly different but related question now.  I'm using
> the org.apache.helix.model.MasterSlaveSMD.  In our system, for a Slave
> partition to become fully in-sync with a Master partition can take a long
> time (maybe 30 minutes).  Under normal circumstances, until a Slave has
> finished syncing data from a Master, it should not be eligible for
> promotion to Master.
>
> So let's say a node (maybe newly added to the cluster) is the Slave for
> partition 22 and has been online for 10 minutes (not long enough to have
> sync-ed everything from the existing partition 22 Master) and receives a
> state transition from Helix saying it should go from Slave->Master.  Is it
> possible to temporarily reject that transition without going into ERROR
> state for that partition?  ERROR state seems like slightly the wrong thing
> because while it's not a valid transition right now, it will be a valid
> transition 20 minutes from now when the initial sync completes.
>
> Is there a way to get this functionality to "fail" a transition, but not
> fully go into ERROR state?  Or is there a different way I should be
> thinking about solving this problem?  I was thinking this could potentially
> be a frequent occurrence when new nodes are added to the cluster.
>
> Thank you for your time and help as always!
>
> ~Brent
>
