Great information. Thank you again, I really appreciate all the thoughtful & detailed responses.
On Thu, Aug 5, 2021 at 9:45 PM Wang Jiajun <[email protected]> wrote:

In short, the controller considers a state transition to still be ongoing if: 1. the message exists with status = READ, and 2. the instance is alive (its liveInstance znode exists). The participant does not need to send any additional signal until it updates the current states, meaning the transition is either done or in ERROR. The controller will read the current states to get that information. And yes, the call should not return.

The state transition timeout is configurable. By default, a regular partition state transition has no timeout. If anything goes wrong during execution, the Helix logic on the participant will catch it and throw an Exception, and the partition will end up in the ERROR state.

Regarding throttling, StateTransitionThrottleConfig would be the recommended way. If it does not fit your needs, an alternative is to use a separate threadpool with a limited thread count to execute the bootstrap task; onBecomeStandbyFromOffline would then submit the task to this threadpool and wait for the result. This is another way to control critical system resource usage. It is obviously more complicated, so please evaluate this method carefully : )

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 2:03 PM Brent <[email protected]> wrote:

Ah! OK, that's interesting. It seems like I *was* thinking about it the wrong way. Rather than trying to stop the transition from STANDBY -> LEADER, I need to make sure I don't let a node become STANDBY until it is ready to be promoted to LEADER if necessary.

You said it does not hurt to have some long-running transition tasks (e.g. OFFLINE -> STANDBY in my case). So when Helix sends the transition message to move from OFFLINE -> STANDBY for a given partition, how do I signal to Helix that I'm still working? Is it literally just that I don't return from the agent Java callback until the transition is complete? e.g.:

    @Transition(to = "STANDBY", from = "OFFLINE")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // this node has been asked to become standby for this partition

        DoLongRunningOperationToBootstrapNode(message); // this could take 30 minutes

        // returning now means the node has successfully transitioned to STANDBY
    }

Are there any default state transition timeouts or any other properties I need to worry about updating?

To your other point, I think parallel bootstraps may be OK. I was hoping a StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the number of those, but it seems like that applies to ALL transitions on that node, not just a particular type like OFFLINE -> STANDBY. I suspect I'd have to use it carefully to make sure I'm never blocking a STANDBY -> LEADER transition while waiting on OFFLINE -> STANDBY transitions.

Thanks for changing my perspective on how I was seeing this problem, Jiajun. Very helpful!

~Brent
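One way the bounded-threadpool idea above could look is sketched below: the OFFLINE -> STANDBY callback hands the expensive sync to a small fixed-size pool and blocks on the result, so only a couple of bootstraps run on an instance at a time. This is a minimal sketch, not an official pattern: the ShardStateModel class, the pool size of 2, and doLongRunningBootstrap are hypothetical placeholders; only Message, NotificationContext, the @Transition/@StateModelInfo annotations, and the JDK ExecutorService come from real APIs.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    @StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
    public class ShardStateModel extends StateModel {
      // Shared, fixed-size pool: at most 2 bootstraps run on this instance at a time.
      // The pool size is a made-up value; tune it to the node's I/O and network capacity.
      private static final ExecutorService BOOTSTRAP_POOL = Executors.newFixedThreadPool(2);

      @Transition(to = "STANDBY", from = "OFFLINE")
      public void onBecomeStandbyFromOffline(Message message, NotificationContext context)
          throws Exception {
        // Submit the expensive sync to the bounded pool, then block this callback
        // until it finishes. Not returning is the signal to Helix that the
        // OFFLINE -> STANDBY transition is still in progress.
        Future<?> bootstrap = BOOTSTRAP_POOL.submit(
            () -> doLongRunningBootstrap(message.getPartitionName()));
        bootstrap.get(); // may take ~30 minutes; rethrows if the bootstrap fails
      }

      private void doLongRunningBootstrap(String partition) {
        // hypothetical: copy / catch up this partition's data from the current LEADER
      }
    }

Because bootstrap.get() rethrows any failure from the task, a broken sync still surfaces as an exception and the partition lands in the ERROR state, consistent with the behavior described above.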
On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]> wrote:

In short, the key to this solution is to prevent the STANDBY -> LEADER message before partitions are truly ready. We do not restrict SYNCING -> STANDBY messages at all, so the controller will send the SYNCING -> STANDBY message and wait until the transition (bootstrap) is done. After that, it can go ahead and bring up the LEADER.

It does not hurt to have some long-running transition tasks in the system (as long as they are not STANDBY -> LEADER; because the LEADER is serving traffic, we don't want a big gap with no LEADER), because that is exactly what happens here. But this also means bootstraps may run in parallel. I'm not sure if this fits your need.

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:

Thank you for the response, Jiajun!

On the inclusivity point, I'm glad to hear we're moving to different terminology. Our code actually wraps the MasterSlave state machine and renames the terminology to "Leader" and "Follower" everywhere visible to our users and operators for similar reasons. :-) I thought the Leader/Standby SMD was a bit different, which was why I wasn't using it, but looking at the definition, the only difference seems to be that it doesn't define an ERROR state like the MasterSlave SMD does. So for the rest of this thread, let's use the LEADER/STANDBY terminology instead.

For context, I have 1000-2000 shards of a database where each shard can be 100GB+ in size, so bootstrapping nodes is expensive. Your logic on splitting the STANDBY state into two states like SYNCING and STANDBY makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not sure how I can prevent the state from transitioning from SYNCING to STANDBY until the node is ready (i.e. has an up-to-date copy of the leader's data). Based on what you were saying, is it possible to have the Helix controller tell a node it's in the SYNCING state, but then have the node decide when it's safe to transition itself to STANDBY? Or can state transition cancellation be used if the node isn't ready? Or can I just let the transition time out if the node isn't ready?

This seems like it would be a pretty common problem with large, expensive-to-move data (e.g. a shard of a large database), especially when adding a new node to an existing system and needing to bootstrap it from nothing. I suspect people do this and I'm just thinking about it the wrong way, or there's a Helix strategy that I'm just not grasping correctly.

For the LinkedIn folks on the list, what does Espresso do for bootstrapping new nodes and avoiding this problem of them getting promoted to LEADER before they're ready? It seems like a similar problem to mine (stateful node with large data that needs a leader/standby setup).

Thanks again!

~Brent
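For the OFFLINE -> SYNCING -> STANDBY -> LEADER idea discussed above, registering a custom state model definition could look roughly like the sketch below, assuming Helix's StateModelDefinition.Builder and ZKHelixAdmin APIs. The model name, state priorities, and bounds are illustrative; the DROPPED state and error handling are omitted for brevity, and a matching participant state model would still be needed.

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.StateModelDefinition;

    public class SyncingLeaderStandbyDef {
      public static void register(String zkAddress, String clusterName) {
        StateModelDefinition.Builder builder =
            new StateModelDefinition.Builder("SyncingLeaderStandby");

        // Lower number = higher priority when the controller computes assignments.
        builder.addState("LEADER", 0);
        builder.addState("STANDBY", 1);
        builder.addState("SYNCING", 2);
        builder.addState("OFFLINE", 3);
        builder.initialState("OFFLINE");

        // Legal transitions: a replica must pass through SYNCING before STANDBY,
        // and only a STANDBY replica can be promoted to LEADER.
        builder.addTransition("OFFLINE", "SYNCING");
        builder.addTransition("SYNCING", "STANDBY");
        builder.addTransition("STANDBY", "LEADER");
        builder.addTransition("LEADER", "STANDBY");
        builder.addTransition("STANDBY", "OFFLINE");
        builder.addTransition("SYNCING", "OFFLINE");

        // At most one LEADER per partition; STANDBY count follows the replica setting.
        builder.upperBound("LEADER", 1);
        builder.dynamicUpperBound("STANDBY", "R");

        HelixAdmin admin = new ZKHelixAdmin(zkAddress);
        admin.addStateModelDef(clusterName, "SyncingLeaderStandby", builder.build());
      }
    }

With this split, the long-running sync lives in the SYNCING -> STANDBY callback, so STANDBY always means "fully caught up and promotable."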
On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]> wrote:

Hi Brent,

AFAIK, there is no way to tell the controller to suspend a certain state transition. Even if you reject the transition (although rejection is not officially supported either), the controller will likely keep retrying it in subsequent rebalance pipelines.

Alternatively, from your description, I think "Slave" means two states in your system: 1. a new Slave that is out of sync, and 2. a synced Slave. Is it possible to define a customized model that differentiates these two states? Offline -> Syncing -> Slave, etc.

Even simpler, is it OK to restrict the definition of Slave to the second case? Meaning that before a partition syncs with the Master, it shall not mark itself as the Slave. This implies the Offline -> Slave transition would take a longer time, but once it is done, the Slave partition would be fully ready.

BTW, we encourage users to use inclusive language. Maybe you can consider switching to the LeaderStandby SMD? We might deprecate the MasterSlave SMD in the near future.

Best Regards,
Jiajun

On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:

I had asked a question a while back about how to deal with a failed state transition (http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E) and the correct answer there was to throw an exception to cause an ERROR state in the state machine.

I have a slightly different but related question now. I'm using the org.apache.helix.model.MasterSlaveSMD. In our system, it can take a long time (maybe 30 minutes) for a Slave partition to become fully in sync with a Master partition. Under normal circumstances, until a Slave has finished syncing data from a Master, it should not be eligible for promotion to Master.

So let's say a node (maybe newly added to the cluster) is the Slave for partition 22, has been online for 10 minutes (not long enough to have synced everything from the existing partition 22 Master), and receives a state transition from Helix saying it should go from Slave -> Master. Is it possible to temporarily reject that transition without going into the ERROR state for that partition? ERROR state seems like slightly the wrong thing, because while it's not a valid transition right now, it will be a valid transition 20 minutes from now when the initial sync completes.

Is there a way to "fail" a transition but not fully go into the ERROR state? Or is there a different way I should be thinking about solving this problem? I was thinking this could potentially be a frequent occurrence when new nodes are added to the cluster.

Thank you for your time and help as always!

~Brent
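On the StateTransitionThrottleConfig option mentioned earlier in the thread, a rough sketch of wiring a per-instance limit through the ClusterConfig follows, assuming the constructor takes a rebalance type, a throttle scope, and a maximum number of pending transitions. The limit of 3 is illustrative, and the caveat Brent raises above still applies: an INSTANCE-scoped throttle counts the throttled transitions on that node as a whole, not just OFFLINE -> STANDBY bootstraps.

    import java.util.Collections;

    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.model.ClusterConfig;
    import org.apache.helix.model.StateTransitionThrottleConfig;

    public class ThrottleSetup {
      public static void limitLoadBalanceTransitions(String zkAddress, String clusterName) {
        ConfigAccessor accessor = new ConfigAccessor(zkAddress);
        ClusterConfig clusterConfig = accessor.getClusterConfig(clusterName);

        // Allow at most 3 pending load-balance transitions per instance at a time.
        // The value is a made-up example, not a recommendation.
        StateTransitionThrottleConfig throttle = new StateTransitionThrottleConfig(
            StateTransitionThrottleConfig.RebalanceType.LOAD_BALANCE,
            StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
            3);

        clusterConfig.setStateTransitionThrottleConfigs(Collections.singletonList(throttle));
        accessor.setClusterConfig(clusterName, clusterConfig);
      }
    }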
