In short, the controller considers a state transition still ongoing if 1.
the transition message exists with status = READ, and 2. the instance is
alive (its liveInstance znode exists). The participant does not need to
send any additional signal until it updates the current state, meaning the
transition is either done or has ended in ERROR. The controller reads the
current state to get that information. And yes, the callback should not
return until the transition is complete.

The state transition timeout is configurable. By default, a regular
partition state transition does not have a timeout. If anything goes wrong
during execution, the Helix logic on the participant will catch it, throw
an exception, and the partition will end up in the ERROR state.
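
To make that concrete, here is a minimal sketch of the participant-side
contract (the class name and the bootstrapPartition helper are placeholders
I made up, not Helix APIs): returning from the callback means the
transition succeeded, and any exception that escapes it lands the partition
in ERROR.

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.Transition;

public class MyLeaderStandbyStateModel extends StateModel {

  @Transition(to = "STANDBY", from = "OFFLINE")
  public void onBecomeStandbyFromOffline(Message message,
      NotificationContext context) {
    // May take 30 minutes; the controller simply waits while this runs.
    bootstrapPartition(message.getResourceName(), message.getPartitionName());
    // Returning normally tells the controller the transition is done. If
    // bootstrapPartition throws, Helix catches it and the partition ends up
    // in ERROR.
  }

  private void bootstrapPartition(String resource, String partition) {
    // Placeholder for your own sync logic; throw if the bootstrap fails.
  }
}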

Regarding throttling, StateTransitionThrottleConfig would be the
recommended way. If it does not fit your needs, an alternative is to use a
separate thread pool with a limited thread count to execute the bootstrap
tasks. onBecomeStandbyFromOffline would then submit the task to this
thread pool and wait for the result. This is another way to control
critical system resource usage. It is obviously more complicated, so
please evaluate this method carefully : )
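
Building on the sketch above, a rough sketch of that thread pool option
could look like the following (the pool size, class name, and
bootstrapPartition helper are my own assumptions, not an official pattern):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.Transition;

public class ThrottledLeaderStandbyStateModel extends StateModel {

  // Shared, bounded pool: at most 2 bootstraps run on this instance at a time.
  private static final ExecutorService BOOTSTRAP_POOL =
      Executors.newFixedThreadPool(2);

  @Transition(to = "STANDBY", from = "OFFLINE")
  public void onBecomeStandbyFromOffline(Message message,
      NotificationContext context) throws Exception {
    // Queue the long-running bootstrap and block until it finishes, so the
    // controller only sees STANDBY once the replica is actually in sync.
    Future<?> result = BOOTSTRAP_POOL.submit(
        () -> bootstrapPartition(message.getResourceName(),
            message.getPartitionName()));
    // get() blocks and rethrows any failure, which ends up as ERROR in Helix.
    result.get();
  }

  @Transition(to = "LEADER", from = "STANDBY")
  public void onBecomeLeaderFromStandby(Message message,
      NotificationContext context) {
    // Promotion stays fast: it does not go through BOOTSTRAP_POOL, so a
    // queued bootstrap can never delay a STANDBY -> LEADER transition.
  }

  private void bootstrapPartition(String resource, String partition) {
    // Placeholder for the (possibly 30-minute) data sync.
  }
}

The important property is that STANDBY -> LEADER never waits behind the
bounded pool, which is the critical-resource control mentioned above.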

Best Regards,
Jiajun


On Thu, Aug 5, 2021 at 2:03 PM Brent <[email protected]> wrote:

> Ah!  OK, that's interesting. It seems like I *was* thinking about it the
> wrong way.  Rather than try to stop the transition from STANDBY -> LEADER,
> I need to make sure I don't let a node become STANDBY until it is ready to
> be promoted to LEADER if necessary.
>
> You said it does not hurt to have some long-running transition tasks (e.g.
> OFFLINE -> STANDBY in my case).  So when Helix sends the transition message
> to move from OFFLINE -> STANDBY for a given partition, how do I signal to
> Helix that I'm still working?  Is it literally just that I don't return
> from the agent Java callback until the transition is complete?  e.g.:
>
> @Transition(to = "STANDBY", from = "OFFLINE")
> public void onBecomeStandbyFromOffline(Message message,
> NotificationContext context) {
>     // this node has been asked to become standby for this partition
>
>     DoLongRunningOperationToBootstrapNode(message);  // this could take 30
> minutes
>
>     // returning now means the node has successfully transitioned to
> STANDBY
> }
>
> Are there any default state transition timeouts or any other properties I
> need to worry about updating?
>
> To your other point, I think parallel bootstraps may be OK.  I was hoping
> a StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the
> number of those, but that seems like it applies to ALL transitions on that
> node, not just a particular type like OFFLINE->STANDBY.  I suspect I'd have
> to use that carefully to make sure I'm never blocking a STANDBY -> LEADER
> transition while waiting on OFFLINE -> STANDBY transitions.
>
> Thanks for changing my perspective on how I was seeing this problem
> Jiajun.  Very helpful!
>
> ~Brent
>
> On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]>
> wrote:
>
>> In short, the key to this solution is to prevent the STANDBY -> LEADER
>> message from being sent before partitions are truly ready. We do not
>> restrict SYNCING -> STANDBY messages at all. So the controller will send
>> the SYNCING -> STANDBY message and wait until the transition (bootstrap)
>> is done. After that, it can go ahead and bring up the LEADER. It does not
>> hurt to have some long-running transition tasks in the system (as long as
>> they are not STANDBY -> LEADER, because the LEADER is serving traffic and
>> we don't want a big gap with no LEADER), because that is what happens
>> here anyway. But this also means bootstraps will run in parallel. I'm not
>> sure if this fits your need.
>>
>> Best Regards,
>> Jiajun
>>
>>
>> On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:
>>
>>> Thank you for the response Jiajun!
>>>
>>> On the inclusivity thing, I'm glad to hear we're moving to different
>>> terminology.  Our code actually wraps the MS state machine and renames the
>>> terminology to "Leader" and "Follower" everywhere visible to our users and
>>> operators for similar reasons.  :-)   I thought the Leader/Standby SMD was
>>> a bit different which was why I wasn't using it, but looking at the
>>> definition, I guess the only difference is it doesn't seem to define an
>>> ERROR state like the MS SMD does.  So for the rest of this thread, let's
>>> use the LEADER/STANDBY terminology instead.
>>>
>>> For context, I have 1000-2000 shards of a database where each shard can
>>> be 100GB+ in size so bootstrapping nodes is expensive.  Your logic on
>>> splitting up the STANDBY state into two states like SYNCING and STANDBY
>>> makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not
>>> sure how I can prevent the state from transitioning from SYNCING to STANDBY
>>> until the node is ready (i.e. has an up-to-date copy of the leader's
>>> data).  Based on what you were saying, is it possible to have the Helix
>>> controller tell a node it's in SYNCING state, but then have the node decide
>>> when it's safe to transition itself to STANDBY?  Or can state transition
>>> cancellation be used if the node isn't ready?  Or can I just let the
>>> transition time out if the node isn't ready?
>>>
>>> This seems like it would be a pretty common problem with large,
>>> expensive-to-move data (e.g. a shard of a large database), especially when
>>> adding a new node to an existing system and needing to bootstrap it from
>>> nothing.  I suspect people do this and I'm just thinking about it the wrong
>>> way or there's a Helix strategy that I'm just not grasping correctly.
>>>
>>> For the LinkedIn folks on the list, what does Espresso do for
>>> bootstrapping new nodes and avoiding this problem of them getting promoted
>>> to LEADER before they're ready?  It seems like a similar problem to mine
>>> (stateful node with large data that needs a leader/standby setup).
>>>
>>> Thanks again!
>>>
>>> ~Brent
>>>
>>> On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]>
>>> wrote:
>>>
>>>> Hi Brent,
>>>>
>>>> AFAIK, there is no way to tell the controller to suspend a certain
>>>> state transition. Even if you reject the transition (although rejection
>>>> is not officially supported either), the controller will probably keep
>>>> retrying it in subsequent rebalance pipelines.
>>>>
>>>> Alternatively, from your description, I think "Slave" means 2 states in
>>>> your system: 1. a new Slave that is out of sync, and 2. a sync-ed Slave.
>>>> Is it possible for you to define a customized state model that
>>>> differentiates these 2 states? Offline -> Syncing -> Slave, etc.
>>>> Even simpler, is it OK to restrict the definition of Slave to the 2nd
>>>> case? Meaning before a partition syncs with the Master, it shall not mark
>>>> itself as a Slave. This implies the Offline -> Slave transition would take
>>>> longer, but once it is done, the Slave partition would be fully ready.
>>>>
>>>> BTW, we encourage users to use inclusive language. Maybe you can
>>>> consider switching to the LeaderStandby SMD? We might deprecate the
>>>> MasterSlave SMD in the near future.
>>>>
>>>> Best Regards,
>>>> Jiajun
>>>>
>>>>
>>>> On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:
>>>>
>>>>> I had asked a question a while back about how to deal with a failed
>>>>> state transition (
>>>>> http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E)
>>>>> and the correct answer there was to throw an exception to cause an ERROR
>>>>> state in the state machine.
>>>>>
>>>>> I have a slightly different but related question now.  I'm using
>>>>> the org.apache.helix.model.MasterSlaveSMD.  In our system, it can take a
>>>>> long time (maybe 30 minutes) for a Slave partition to become fully
>>>>> in-sync with a Master partition.  Under normal circumstances, until a
>>>>> Slave has finished syncing data from a Master, it should not be eligible
>>>>> for promotion to Master.
>>>>>
>>>>> So let's say a node (maybe newly added to the cluster) is the Slave
>>>>> for partition 22 and has been online for 10 minutes (not long enough to
>>>>> have sync-ed everything from the existing partition 22 Master) and
>>>>> receives a state transition from Helix saying it should go from
>>>>> Slave->Master.  Is
>>>>> it possible to temporarily reject that transition without going into ERROR
>>>>> state for that partition?  ERROR state seems like slightly the wrong thing
>>>>> because while it's not a valid transition right now, it will be a valid
>>>>> transition 20 minutes from now when the initial sync completes.
>>>>>
>>>>> Is there a way to get this functionality to "fail" a transition, but
>>>>> not fully go into ERROR state?  Or is there a different way I should be
>>>>> thinking about solving this problem?  I was thinking this could
>>>>> potentially be a frequent occurrence when new nodes are added to the
>>>>> cluster.
>>>>>
>>>>> Thank you for your time and help as always!
>>>>>
>>>>> ~Brent
>>>>>
>>>>
