Great information.  Thank you again, I really appreciate all the thoughtful
& detailed responses.

On Thu, Aug 5, 2021 at 9:45 PM Wang Jiajun <[email protected]> wrote:

> In short, the controller considers a state transition still ongoing if 1.
> the message exists with status = READ, and 2. the instance is alive (its
> liveInstance znode exists). The participant does not need to send any
> additional signal until it updates the current state, meaning the
> transition is either done or in ERROR. The controller reads the current
> states to get that information. And yes, the call should not return until
> then.
>
> State transition timeout is configurable. By default, a regular partition
> state transition does not have a timeout. If anything goes wrong during
> execution, the Helix logic on the participant will catch it and throw an
> Exception, and the partition will end up in the ERROR state.
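>
> For example (illustrative only; bootstrapPartition below is a placeholder
> for your own logic, not a Helix API), the participant handler can surface a
> failure by throwing, and Helix will move the partition to ERROR:
>
> @Transition(to = "STANDBY", from = "OFFLINE")
> public void onBecomeStandbyFromOffline(Message message,
>                                        NotificationContext context) {
>     try {
>         bootstrapPartition(message.getPartitionName());  // placeholder for your own logic
>     } catch (Exception e) {
>         // Rethrowing marks this partition as ERROR on this instance.
>         throw new HelixException("Bootstrap failed for "
>             + message.getPartitionName(), e);
>     }
> }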
>
> Regarding throttling, StateTransitionThrottleConfig would be the
> recommended way. If it does not fit your needs, an alternative is to use a
> separate threadpool with a limited thread count to execute the bootstrap
> task; onBecomeStandbyFromOffline would then submit the task to this
> threadpool and wait for the result. This is another way to control usage of
> critical system resources. It is obviously more complicated, so please
> evaluate this method carefully : )
>
> Best Regards,
> Jiajun
>
>
> On Thu, Aug 5, 2021 at 2:03 PM Brent <[email protected]> wrote:
>
>> Ah!  OK, that's interesting. It seems like I *was* thinking about it the
>> wrong way.  Rather than try to stop the transition from STANDBY -> LEADER,
>> I need to make sure I don't let a node become STANDBY until it is ready to
>> be promoted to LEADER if necessary.
>>
>> You said it does not hurt to have some long-running transition tasks
>> (e.g. OFFLINE -> STANDBY in my case).  So when Helix sends the transition
>> message to move from OFFLINE -> STANDBY for a given partition, how do I
>> signal to Helix that I'm still working?  Is it literally just that I don't
>> return from the agent Java callback until the transition is complete?  e.g.:
>>
>> @Transition(to = "STANDBY", from = "OFFLINE")
>> public void onBecomeStandbyFromOffline(Message message,
>> NotificationContext context) {
>>     // this node has been asked to become standby for this partition
>>
>>     DoLongRunningOperationToBootstrapNode(message);  // this could take
>> 30 minutes
>>
>>     // returning now means the node has successfully transitioned to
>> STANDBY
>> }
>>
>> Are there any default state transition timeouts or any other properties I
>> need to worry about updating?
>>
>> To your other point, I think parallel bootstraps may be OK.  I was hoping
>> a StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the
>> number of those, but that seems like it applies to ALL transitions on that
>> node, not just a particular type like OFFLINE->STANDBY.  I suspect I'd have
>> to use that carefully to make sure I'm never blocking a STANDBY -> LEADER
>> transition while waiting on OFFLINE -> STANDBY transitions.
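>>
>> For reference, this is roughly how I was reading the throttle API (please
>> correct me if the usage below is off; the limit of 2 is just an example):
>>
>> StateTransitionThrottleConfig throttle = new StateTransitionThrottleConfig(
>>     StateTransitionThrottleConfig.RebalanceType.LOAD_BALANCE,
>>     StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
>>     2);  // at most 2 pending transitions per instance
>> ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
>> clusterConfig.setStateTransitionThrottleConfigs(Collections.singletonList(throttle));
>> configAccessor.setClusterConfig(clusterName, clusterConfig);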
>>
>> Thanks for changing my perspective on how I was seeing this problem
>> Jiajun.  Very helpful!
>>
>> ~Brent
>>
>> On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]>
>> wrote:
>>
>>> In short, the key to this solution is to prevent the STANDBY -> LEADER
>>> message before partitions are truly ready. We do not restrict SYNCING ->
>>> STANDBY messages at all, so the controller will send the SYNCING ->
>>> STANDBY message and wait until the transition (bootstrap) is done. After
>>> that, it can go ahead and bring up the LEADER. It does not hurt to have
>>> some long-running transition tasks in the system (as long as they are not
>>> STANDBY -> LEADER; since the LEADER serves traffic, we don't want a big
>>> gap with no LEADER), because that is what happens in practice anyway. But
>>> this also means bootstraps may run in parallel. I'm not sure if this fits
>>> your need.
>>>
>>> Best Regards,
>>> Jiajun
>>>
>>>
>>> On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:
>>>
>>>> Thank you for the response Jiajun!
>>>>
>>>> On the inclusivity thing, I'm glad to hear we're moving to different
>>>> terminology.  Our code actually wraps the MS state machine and renames the
>>>> terminology to "Leader" and "Follower" everywhere visible to our users and
>>>> operators for similar reasons.  :-)   I thought the Leader/Standby SMD was
>>>> a bit different, which was why I wasn't using it, but looking at the
>>>> definition, I guess the only difference is it doesn't seem to define an
>>>> ERROR state like the MS SMD does.  So for the rest of this thread, let's
>>>> use the LEADER/STANDBY terminology instead.
>>>>
>>>> For context, I have 1000-2000 shards of a database where each shard can
>>>> be 100GB+ in size, so bootstrapping nodes is expensive.  Your logic on
>>>> splitting up the STANDBY state into two states like SYNCING and STANDBY
>>>> makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not
>>>> sure how I can prevent the state from transitioning from SYNCING to STANDBY
>>>> until the node is ready (i.e. has an up-to-date copy of the leader's
>>>> data).  Based on what you were saying, is it possible to have the Helix
>>>> controller tell a node it's in SYNCING state, but then have the node decide
>>>> when it's safe to transition itself to STANDBY?  Or can state transition
>>>> cancellation be used if the node isn't ready?  Or can I just let the
>>>> transition time out if the node isn't ready?
>>>>
>>>> This seems like it would be a pretty common problem with large,
>>>> expensive-to-move data (e.g. a shard of a large database), especially when
>>>> adding a new node to an existing system and needing to bootstrap it from
>>>> nothing.  I suspect people do this and I'm just thinking about it the wrong
>>>> way or there's a Helix strategy that I'm just not grasping correctly.
>>>>
>>>> For the LinkedIn folks on the list, what does Espresso do for
>>>> bootstrapping new nodes and avoiding this problem of them getting promoted
>>>> to LEADER before they're ready?  It seems like a similar problem to mine
>>>> (stateful node with large data that needs a leader/standby setup).
>>>>
>>>> Thanks again!
>>>>
>>>> ~Brent
>>>>
>>>> On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Brent,
>>>>>
>>>>> AFAIK, there is no way to tell the controller to suspend a certain
>>>>> state transition. Even if you reject the transition (although rejection
>>>>> is not officially supported either), the controller will probably keep
>>>>> retrying it in subsequent rebalance pipelines.
>>>>>
>>>>> Alternatively, from your description, I think "Slave" covers 2 states
>>>>> in your system: 1. a new Slave that is out of sync, and 2. a fully
>>>>> synced Slave. Is it possible to define a customized state model that
>>>>> differentiates these 2 states? Offline -> Syncing -> Slave, etc.
>>>>> Even simpler, is it OK to restrict the definition of Slave to the 2nd
>>>>> case? Meaning that before a partition has synced with the Master, it
>>>>> shall not mark itself as Slave. This implies the Offline -> Slave
>>>>> transition would take longer, but once it is done, the Slave partition
>>>>> would be fully ready.
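>>>>>
>>>>> A rough sketch of what such a customized state model definition could
>>>>> look like (the model name, state names, and priorities below are only
>>>>> illustrative, and DROPPED/ERROR handling is omitted for brevity):
>>>>>
>>>>> StateModelDefinition.Builder builder =
>>>>>     new StateModelDefinition.Builder("LeaderSyncingStandby");
>>>>> builder.initialState("OFFLINE");
>>>>> builder.addState("LEADER", 0);
>>>>> builder.addState("STANDBY", 1);
>>>>> builder.addState("SYNCING", 2);
>>>>> builder.addState("OFFLINE", 3);
>>>>> builder.upperBound("LEADER", 1);           // at most one LEADER per partition
>>>>> builder.dynamicUpperBound("STANDBY", "R"); // up to replica-count STANDBYs
>>>>> builder.addTransition("OFFLINE", "SYNCING");
>>>>> builder.addTransition("SYNCING", "STANDBY");
>>>>> builder.addTransition("STANDBY", "LEADER");
>>>>> builder.addTransition("LEADER", "STANDBY");
>>>>> builder.addTransition("STANDBY", "SYNCING");
>>>>> builder.addTransition("SYNCING", "OFFLINE");
>>>>> StateModelDefinition syncingModel = builder.build();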
>>>>>
>>>>> BTW, we encourage users to use inclusive language. Maybe you can
>>>>> consider switching to the LeaderStandby SMD? We might deprecate the
>>>>> MasterSlave SMD in the near future.
>>>>>
>>>>> Best Regards,
>>>>> Jiajun
>>>>>
>>>>>
>>>>> On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I had asked a question a while back about how to deal with a failed
>>>>>> state transition (
>>>>>> http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E)
>>>>>> and the correct answer there was to throw an exception to cause an ERROR
>>>>>> state in the state machine.
>>>>>>
>>>>>> I have a slightly different but related question now.  I'm using
>>>>>> the org.apache.helix.model.MasterSlaveSMD.  In our system, it can take a
>>>>>> long time (maybe 30 minutes) for a Slave partition to become fully in
>>>>>> sync with a Master partition.  Under normal circumstances, until a Slave
>>>>>> has finished syncing data from a Master, it should not be eligible for
>>>>>> promotion to Master.
>>>>>>
>>>>>> So let's say a node (maybe newly added to the cluster) is the Slave
>>>>>> for partition 22 and has been online for 10 minutes (not long enough to
>>>>>> have sync-ed everything from the existing partition 22 Master) and 
>>>>>> receives
>>>>>> a state transition from Helix saying it should go from Slave->Master.  Is
>>>>>> it possible to temporarily reject that transition without going into 
>>>>>> ERROR
>>>>>> state for that partition?  ERROR state seems like slightly the wrong 
>>>>>> thing
>>>>>> because while it's not a valid transition right now, it will be a valid
>>>>>> transition 20 minutes from now when the initial sync completes.
>>>>>>
>>>>>> Is there a way to get this functionality to "fail" a transition, but
>>>>>> not fully go into ERROR state?  Or is there a different way I should be
>>>>>> thinking about solving this problem?  I was thinking this could 
>>>>>> potentially
>>>>>> be a frequent occurrence when new nodes are added to the cluster.
>>>>>>
>>>>>> Thank you for your time and help as always!
>>>>>>
>>>>>> ~Brent
>>>>>>
>>>>>
