Ah! OK, that's interesting. It seems like I *was* thinking about it the
wrong way. Rather than try to stop the transition from STANDBY -> LEADER,
I need to make sure I don't let a node become STANDBY until it is ready to
be promoted to LEADER if necessary.
You said it does not hurt to have some long-running transition tasks (e.g.
OFFLINE -> STANDBY in my case). So when Helix sends the transition message
to move from OFFLINE -> STANDBY for a given partition, how do I signal to
Helix that I'm still working? Is it literally just that I don't return
from the agent Java callback until the transition is complete? e.g.:
@Transition(to = "STANDBY", from = "OFFLINE")
public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
  // this node has been asked to become standby for this partition
  doLongRunningOperationToBootstrapNode(message); // this could take 30 minutes
  // returning now means the node has successfully transitioned to STANDBY
}
Are there any default state transition timeouts or any other properties I
need to worry about updating?
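
To make the question concrete, here's a self-contained sketch of the wait
loop I'd run inside that callback. BootstrapWaiter and its names are my own
invention, not Helix APIs; the timeout is just a safety valve, since my
understanding is that throwing from the callback pushes the partition to
ERROR:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not a Helix API): block inside the OFFLINE -> STANDBY
// callback until the bootstrap reports that it has caught up, or give up
// after maxWait. Returning normally from the callback signals success to
// Helix; throwing would move the partition to ERROR.
final class BootstrapWaiter {
    static void awaitReady(BooleanSupplier isCaughtUp,
                           Duration pollInterval,
                           Duration maxWait) {
        Instant deadline = Instant.now().plus(maxWait);
        while (!isCaughtUp.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException(
                    "bootstrap did not finish within " + maxWait);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```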
To your other point, I think parallel bootstraps may be OK. I was hoping a
StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the
number of those, but that seems like it applies to ALL transitions on that
node, not just a particular type like OFFLINE->STANDBY. I suspect I'd have
to use that carefully to make sure I'm never blocking a STANDBY -> LEADER
transition while waiting on OFFLINE -> STANDBY transitions.
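
For reference, this is roughly what I was experimenting with (written from
memory, so the exact API usage may be off):

```java
// Given an org.apache.helix.model.ClusterConfig for the cluster, throttle to
// at most 2 concurrent transitions per instance. As far as I can tell, this
// applies to every transition on the node, not just OFFLINE -> STANDBY.
List<StateTransitionThrottleConfig> throttles = Arrays.asList(
    new StateTransitionThrottleConfig(
        StateTransitionThrottleConfig.RebalanceType.ANY,
        StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
        2));
clusterConfig.setStateTransitionThrottleConfigs(throttles);
```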
Thanks for changing my perspective on how I was seeing this problem
Jiajun. Very helpful!
~Brent
On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]> wrote:
> In short, the key to this solution is to prevent the STANDBY -> LEADER
> message from being sent before partitions are truly ready. We do not restrict
> SYNCING -> STANDBY messages at all. So the controller will send the
> SYNCING -> STANDBY message and wait until the transition (bootstrap) is done.
> After that, it can go ahead and bring up the LEADER. It does not hurt to have
> some long-running transition tasks in the system (as long as they are not
> STANDBY -> LEADER; since the LEADER is serving traffic, we don't want a big
> gap with no LEADER), because that is what happens anyway. But this also means
> bootstraps may happen in parallel. I'm not sure if this fits your need.
>
> Best Regards,
> Jiajun
>
>
> On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:
>
>> Thank you for the response Jiajun!
>>
>> On the inclusivity thing, I'm glad to hear we're moving to different
>> terminology. Our code actually wraps the MS state machine and renames the
>> terminology to "Leader" and "Follower" everywhere visible to our users and
>> operators for similar reasons. :-) I thought the Leader/Standby SMD was
>> a bit different, which was why I wasn't using it, but looking at the
>> definition, I guess the only difference is that it doesn't define an
>> ERROR state like the MS SMD does. So for the rest of this thread, let's
>> use the LEADER/STANDBY terminology instead.
>>
>> For context, I have 1000-2000 shards of a database where each shard can
>> be 100GB+ in size so bootstrapping nodes is expensive. Your logic on
>> splitting up the STANDBY state into two states like SYNCING and STANDBY
>> makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not
>> sure how I can prevent the state from transitioning from SYNCING to STANDBY
>> until the node is ready (i.e. has an up-to-date copy of the leader's
>> data). Based on what you were saying, is it possible to have the Helix
>> controller tell a node it's in SYNCING state, but then have the node decide
>> when it's safe to transition itself to STANDBY? Or can state transition
>> cancellation be used if the node isn't ready? Or can I just let the
>> transition time out if the node isn't ready?
>>
>> This seems like it would be a pretty common problem with large,
>> expensive-to-move data (e.g. a shard of a large database), especially when
>> adding a new node to an existing system and needing to bootstrap it from
>> nothing. I suspect people do this and I'm just thinking about it the wrong
>> way or there's a Helix strategy that I'm just not grasping correctly.
>>
>> For the LinkedIn folks on the list, what does Espresso do for
>> bootstrapping new nodes and avoiding this problem of them getting promoted
>> to LEADER before they're ready? It seems like a similar problem to mine
>> (stateful node with large data that needs a leader/standby setup).
>>
>> Thanks again!
>>
>> ~Brent
>>
>> On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]>
>> wrote:
>>
>>> Hi Brent,
>>>
>>> AFAIK, there is no way to tell the controller to suspend a certain state
>>> transition. Even if you reject the transition (although rejection is not
>>> officially supported either), the controller will probably retry repeatedly
>>> in subsequent rebalance pipelines.
>>>
>>> Alternatively, from your description, I think "Slave" covers 2 states in
>>> your system: 1. a new Slave that is out of sync, and 2. a synced Slave. Is
>>> it possible to define a customized state model that differentiates these 2
>>> states? e.g. Offline -> Syncing -> Slave.
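>>> If you go that route, the customized model could be registered via a
>>> StateModelDefinition built roughly like the sketch below (written from
>>> memory, so treat the builder calls as approximate):

```java
// Sketch of a customized state model: OFFLINE -> SYNCING -> SLAVE -> MASTER.
// Builder method names are from memory and may be approximate.
StateModelDefinition.Builder builder =
    new StateModelDefinition.Builder("MasterSyncingSlave");
builder.initialState("OFFLINE");
builder.addState("MASTER", 0);   // lower number = higher priority
builder.addState("SLAVE", 1);
builder.addState("SYNCING", 2);
builder.addState("OFFLINE", 3);
builder.upperBound("MASTER", 1);          // at most one MASTER per partition
builder.dynamicUpperBound("SLAVE", "R");  // SLAVE count follows replica count
builder.addTransition("OFFLINE", "SYNCING");
builder.addTransition("SYNCING", "SLAVE");
builder.addTransition("SLAVE", "MASTER");
builder.addTransition("MASTER", "SLAVE");
builder.addTransition("SLAVE", "SYNCING");
builder.addTransition("SYNCING", "OFFLINE");
StateModelDefinition syncingModel = builder.build();
```
>>>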
>>> Even simpler, is it OK to restrict the definition of Slave to the 2nd
>>> case? Meaning that before a partition syncs with the Master, it shall not
>>> mark itself as a Slave. This implies the Offline -> Slave transition would
>>> take longer, but once it is done, the Slave partition would be fully ready.
>>>
>>> BTW, we encourage users to use inclusive language. Maybe you can
>>> consider switching to the LeaderStandby SMD? We might deprecate the
>>> MasterSlave SMD in the near future.
>>>
>>> Best Regards,
>>> Jiajun
>>>
>>>
>>> On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:
>>>
>>>> I had asked a question a while back about how to deal with a failed
>>>> state transition (
>>>> http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E)
>>>> and the correct answer there was to throw an exception to cause an ERROR
>>>> state in the state machine.
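>>>> (For concreteness, the pattern from that thread looks roughly like the
>>>> sketch below; isFullySynced and promoteToMaster are stand-ins for our own
>>>> code, not Helix APIs.)

```java
// Sketch only: isFullySynced and promoteToMaster are stand-ins for our own
// code, not Helix APIs. Throwing from the callback is what triggers the
// ERROR state described above.
@Transition(to = "MASTER", from = "SLAVE")
public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
  if (!isFullySynced(message.getPartitionName())) {
    throw new IllegalStateException(
        "partition " + message.getPartitionName() + " has not finished syncing");
  }
  promoteToMaster(message.getPartitionName());
}
```
>>>>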
>>>>
>>>> I have a slightly different but related question now. I'm using
>>>> the org.apache.helix.model.MasterSlaveSMD. In our system, it can take a
>>>> long time (maybe 30 minutes) for a Slave partition to become fully in sync
>>>> with a Master partition. Under normal circumstances, until a Slave has
>>>> finished syncing data from a Master, it should not be eligible for
>>>> promotion to Master.
>>>>
>>>> So let's say a node (maybe newly added to the cluster) is the Slave for
>>>> partition 22 and has been online for 10 minutes (not long enough to have
>>>> sync-ed everything from the existing partition 22 Master) and receives a
>>>> state transition from Helix saying it should go from Slave->Master. Is it
>>>> possible to temporarily reject that transition without going into ERROR
>>>> state for that partition? The ERROR state seems like the wrong fit here,
>>>> because while it's not a valid transition right now, it will be a valid
>>>> transition 20 minutes from now when the initial sync completes.
>>>>
>>>> Is there a way to get this functionality to "fail" a transition, but
>>>> not fully go into ERROR state? Or is there a different way I should be
>>>> thinking about solving this problem? I was thinking this could potentially
>>>> be a frequent occurrence when new nodes are added to the cluster.
>>>>
>>>> Thank you for your time and help as always!
>>>>
>>>> ~Brent
>>>>
>>>