You're welcome : )

On Fri, Aug 6, 2021, 8:20 AM Brent <[email protected]> wrote:
Great information. Thank you again, I really appreciate all the thoughtful & detailed responses.

On Thu, Aug 5, 2021 at 9:45 PM Wang Jiajun <[email protected]> wrote:

In short, the controller considers a state transition to be still ongoing if: 1) the message exists with status = READ, and 2) the instance is alive (its liveInstance znode exists). The participant does not need to send any additional signal until it updates its current state, meaning the transition is either done or in ERROR. The controller will read the current states to get that information. And yes, the call should not return.

State transition timeout is configurable. By default, a regular partition state transition has no timeout. If anything goes wrong during execution, the Helix logic on the participant will catch it and throw an Exception, and the partition will end up in the ERROR state.

Regarding throttling, StateTransitionThrottleConfig would be the recommended way. If it does not fit your needs, an alternative is to use a separate thread pool with a limited thread count to execute the bootstrap task. onBecomeStandbyFromOffline, on the other hand, should submit the task to this thread pool and wait for the result. This is another way to control critical system resource usage. It is more complicated, obviously, so please evaluate this method carefully : )

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 2:03 PM Brent <[email protected]> wrote:

Ah! OK, that's interesting. It seems like I *was* thinking about it the wrong way. Rather than trying to stop the transition from STANDBY -> LEADER, I need to make sure I don't let a node become STANDBY until it is ready to be promoted to LEADER if necessary.

You said it does not hurt to have some long-running transition tasks (e.g. OFFLINE -> STANDBY in my case). So when Helix sends the transition message to move from OFFLINE -> STANDBY for a given partition, how do I signal to Helix that I'm still working? Is it literally just that I don't return from the agent Java callback until the transition is complete? e.g.:

    @Transition(to = "STANDBY", from = "OFFLINE")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // this node has been asked to become standby for this partition

        DoLongRunningOperationToBootstrapNode(message); // this could take 30 minutes

        // returning now means the node has successfully transitioned to STANDBY
    }

Are there any default state transition timeouts or any other properties I need to worry about updating?

To your other point, I think parallel bootstraps may be OK. I was hoping a StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the number of those, but that seems to apply to ALL transitions on that node, not just a particular type like OFFLINE -> STANDBY. I suspect I'd have to use it carefully to make sure I'm never blocking a STANDBY -> LEADER transition while waiting on OFFLINE -> STANDBY transitions.

Thanks for changing my perspective on how I was seeing this problem, Jiajun. Very helpful!

~Brent
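Putting the two emails above together, here is a minimal sketch of the bounded thread-pool variant of that callback. The class name, the pool size, and the bootstrapPartition() helper are illustrative assumptions, not Helix APIs; only the annotations and the callback signature come from Helix:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    @StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
    public class BootstrapThrottledStateModel extends StateModel {
      // Shared, bounded executor: at most two bootstraps run concurrently on this
      // instance. The pool size and bootstrapPartition() are illustrative.
      private static final ExecutorService BOOTSTRAP_POOL = Executors.newFixedThreadPool(2);

      @Transition(to = "STANDBY", from = "OFFLINE")
      public void onBecomeStandbyFromOffline(Message message, NotificationContext context)
          throws Exception {
        // Queue the long-running bootstrap and block until it finishes, so the
        // transition only completes once this replica is fully caught up.
        Future<?> done = BOOTSTRAP_POOL.submit(() -> bootstrapPartition(message.getPartitionName()));
        done.get(); // an exception here propagates and moves the partition to ERROR
      }

      @Transition(to = "LEADER", from = "STANDBY")
      public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // Fast: by the time a replica reaches STANDBY it is already in sync.
      }

      private void bootstrapPartition(String partition) {
        // Application-specific: copy/replay data until this replica is caught up.
      }
    }

With this shape, a flood of OFFLINE -> STANDBY messages simply queues up on the pool, while STANDBY -> LEADER stays quick because the replica is already caught up once it reports STANDBY.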
On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]> wrote:

In short, the key to this solution is to prevent the STANDBY -> LEADER message from being sent before partitions are truly ready. We do not restrict SYNCING -> STANDBY messages at all. So the controller will send the SYNCING -> STANDBY message and wait until the transition (bootstrap) is done. After that, it can go ahead and bring up the LEADER. It does not hurt to have some long-running transition tasks in the system (as long as they are not STANDBY -> LEADER; LEADER is serving traffic, so we don't want a big gap with no LEADER), because that is what is happening there anyway. But this also means bootstraps run in parallel. I'm not sure if this fits your need.

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:

Thank you for the response, Jiajun!

On the inclusivity thing, I'm glad to hear we're moving to different terminology. Our code actually wraps the MS state machine and renames the terminology to "Leader" and "Follower" everywhere visible to our users and operators, for similar reasons. :-) I thought the Leader/Standby SMD was a bit different, which was why I wasn't using it, but looking at the definition, the only difference seems to be that it doesn't define an ERROR state like the MS SMD does. So for the rest of this thread, let's use the LEADER/STANDBY terminology instead.

For context, I have 1000-2000 shards of a database where each shard can be 100GB+ in size, so bootstrapping nodes is expensive. Your logic on splitting the STANDBY state into two states like SYNCING and STANDBY makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not sure how I can prevent the state from transitioning from SYNCING to STANDBY until the node is ready (i.e. has an up-to-date copy of the leader's data). Based on what you were saying, is it possible to have the Helix controller tell a node it's in the SYNCING state, but then have the node decide when it's safe to transition itself to STANDBY? Or can state transition cancellation be used if the node isn't ready? Or can I just let the transition time out if the node isn't ready?

This seems like it would be a pretty common problem with large, expensive-to-move data (e.g. a shard of a large database), especially when adding a new node to an existing system and needing to bootstrap it from nothing. I suspect people do this and I'm just thinking about it the wrong way, or there's a Helix strategy that I'm just not grasping correctly.

For the LinkedIn folks on the list, what does Espresso do for bootstrapping new nodes and avoiding this problem of them getting promoted to LEADER before they're ready? It seems like a similar problem to mine (a stateful node with large data that needs a leader/standby setup).

Thanks again!

~Brent
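If the intermediate SYNCING state route were taken, a customized state model could be registered for it. Below is a minimal sketch using Helix's StateModelDefinition.Builder; the model name, transition priorities, and constraint values are assumptions for illustration, not a definition that ships with Helix:

    import org.apache.helix.model.StateModelDefinition;

    public class LeaderStandbySyncingModel {
      // Hypothetical model name; register it with HelixAdmin#addStateModelDef.
      public static final String NAME = "LeaderStandbySyncing";

      public static StateModelDefinition build() {
        StateModelDefinition.Builder builder = new StateModelDefinition.Builder(NAME);

        // States and their priorities (lower number = higher priority).
        builder.addState("LEADER", 0);
        builder.addState("STANDBY", 1);
        builder.addState("SYNCING", 2);
        builder.addState("OFFLINE", 3);
        builder.addState("DROPPED", 4);
        builder.initialState("OFFLINE");

        // Legal transitions; the long bootstrap happens in SYNCING -> STANDBY.
        builder.addTransition("OFFLINE", "SYNCING", 0);
        builder.addTransition("SYNCING", "STANDBY", 1);
        builder.addTransition("STANDBY", "LEADER", 2);
        builder.addTransition("LEADER", "STANDBY", 3);
        builder.addTransition("STANDBY", "OFFLINE", 4);
        builder.addTransition("SYNCING", "OFFLINE", 5);
        builder.addTransition("OFFLINE", "DROPPED", 6);

        // At most one LEADER per partition; the remaining replicas sit in STANDBY.
        builder.upperBound("LEADER", 1);
        builder.dynamicUpperBound("STANDBY", "R");

        return builder.build();
      }
    }

As the later replies above conclude, the same effect can also be had with the stock LeaderStandby model by simply not completing OFFLINE -> STANDBY until the replica is caught up.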
On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]> wrote:

Hi Brent,

AFAIK, there is no way to tell the controller to suspend a certain state transition. Even if you reject the transition (although rejection is not officially supported either), the controller will probably retry it repeatedly in the next rebalance pipelines.

Alternatively, from your description, I think "Slave" means 2 states in your system: 1. a new Slave that is out of sync, and 2. a synced Slave. Is it possible to define a customized model that differentiates these 2 states? Offline -> Syncing -> Slave, etc.

Even simpler, is it OK to restrict the definition of Slave to the 2nd case? Meaning that before a partition syncs with the Master, it shall not mark itself as Slave. It implies the Offline -> Slave transition would take a longer time, but once it is done, the Slave partition would be fully ready.

BTW, we encourage users to use inclusive language. Maybe you can consider changing to the LeaderStandby SMD? We might deprecate the MasterSlave SMD in the near future.

Best Regards,
Jiajun

On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:

I had asked a question a while back about how to deal with a failed state transition (http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E), and the correct answer there was to throw an exception to cause an ERROR state in the state machine.

I have a slightly different but related question now. I'm using the org.apache.helix.model.MasterSlaveSMD. In our system, it can take a long time (maybe 30 minutes) for a Slave partition to become fully in sync with a Master partition. Under normal circumstances, until a Slave has finished syncing data from a Master, it should not be eligible for promotion to Master.

So let's say a node (maybe newly added to the cluster) is the Slave for partition 22, has been online for 10 minutes (not long enough to have synced everything from the existing partition 22 Master), and receives a state transition from Helix saying it should go from Slave -> Master. Is it possible to temporarily reject that transition without going into the ERROR state for that partition? ERROR state seems like slightly the wrong thing, because while it's not a valid transition right now, it will be a valid transition 20 minutes from now when the initial sync completes.

Is there a way to "fail" a transition without fully going into the ERROR state? Or is there a different way I should be thinking about solving this problem? I was thinking this could potentially be a frequent occurrence when new nodes are added to the cluster.

Thank you for your time and help as always!

~Brent
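For reference, a sketch of the StateTransitionThrottleConfig option recommended earlier in the thread. The ZooKeeper address, cluster name, and the limit of 2 are placeholders, and the config API may vary slightly by Helix version; note also, as discussed above, that the throttle is keyed by rebalance type and scope rather than by a specific transition such as OFFLINE -> STANDBY:

    import java.util.Collections;

    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.api.config.StateTransitionThrottleConfig;
    import org.apache.helix.model.ClusterConfig;

    public class ThrottleSetup {
      public static void main(String[] args) {
        // Placeholders: ZooKeeper address and cluster name.
        ConfigAccessor accessor = new ConfigAccessor("localhost:2181");
        ClusterConfig clusterConfig = accessor.getClusterConfig("MyCluster");

        // Allow at most 2 partitions per instance to be in recovery-type
        // transitions (such as bringing up new replicas) at the same time.
        clusterConfig.setStateTransitionThrottleConfigs(Collections.singletonList(
            new StateTransitionThrottleConfig(
                StateTransitionThrottleConfig.RebalanceType.RECOVERY_BALANCE,
                StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
                2)));

        accessor.setClusterConfig("MyCluster", clusterConfig);
      }
    }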
