You're welcome : )

On Fri, Aug 6, 2021, 8:20 AM Brent <[email protected]> wrote:
Great information. Thank you again, I really appreciate all the thoughtful & detailed responses.

On Thu, Aug 5, 2021 at 9:45 PM Wang Jiajun <[email protected]> wrote:

In short, the controller considers a state transition to be still ongoing if: 1) the message exists with status = READ, and 2) the instance is alive (its liveInstance znode exists). The participant does not need to send any additional signal until it updates its current state, meaning the transition is either done or in ERROR. The controller will read the current states to get that information. And yes, the call should not return.

State transition timeout is configurable. By default, a regular partition state transition has no timeout. If anything goes wrong during execution, the Helix logic on the participant will catch it and throw an Exception, and the partition will end up in the ERROR state.

Regarding throttling, StateTransitionThrottleConfig would be the recommended way. If it does not fit your needs, an alternative is to use a separate thread pool with a limited thread count to execute the bootstrap task. onBecomeStandbyFromOffline, on the other hand, should submit the task to this thread pool and wait for the result. This is another way to control critical system resource usage. It is more complicated, obviously, so please evaluate this method carefully : )

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 2:03 PM Brent <[email protected]> wrote:

Ah! OK, that's interesting. It seems like I *was* thinking about it the wrong way. Rather than trying to stop the transition from STANDBY -> LEADER, I need to make sure I don't let a node become STANDBY until it is ready to be promoted to LEADER if necessary.

You said it does not hurt to have some long-running transition tasks (e.g. OFFLINE -> STANDBY in my case). So when Helix sends the transition message to move from OFFLINE -> STANDBY for a given partition, how do I signal to Helix that I'm still working? Is it literally just that I don't return from the agent Java callback until the transition is complete? e.g.:

    @Transition(to = "STANDBY", from = "OFFLINE")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // this node has been asked to become standby for this partition

        DoLongRunningOperationToBootstrapNode(message); // this could take 30 minutes

        // returning now means the node has successfully transitioned to STANDBY
    }

Are there any default state transition timeouts or any other properties I need to worry about updating?

To your other point, I think parallel bootstraps may be OK. I was hoping a StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the number of those, but that seems to apply to ALL transitions on that node, not just a particular type like OFFLINE -> STANDBY. I suspect I'd have to use it carefully to make sure I'm never blocking a STANDBY -> LEADER transition while waiting on OFFLINE -> STANDBY transitions.

Thanks for changing my perspective on how I was seeing this problem, Jiajun. Very helpful!

~Brent
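Putting the two emails above together, here is a minimal sketch of the bounded thread-pool variant of that callback. The class name, the pool size, and the bootstrapPartition() helper are illustrative assumptions, not Helix APIs; only the annotations and the callback signature come from Helix:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    @StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
    public class BootstrapThrottledStateModel extends StateModel {
      // Shared, bounded executor: at most two bootstraps run concurrently on this
      // instance. The pool size and bootstrapPartition() are illustrative.
      private static final ExecutorService BOOTSTRAP_POOL = Executors.newFixedThreadPool(2);

      @Transition(to = "STANDBY", from = "OFFLINE")
      public void onBecomeStandbyFromOffline(Message message, NotificationContext context)
          throws Exception {
        // Queue the long-running bootstrap and block until it finishes, so the
        // transition only completes once this replica is fully caught up.
        Future<?> done = BOOTSTRAP_POOL.submit(() -> bootstrapPartition(message.getPartitionName()));
        done.get(); // an exception here propagates and moves the partition to ERROR
      }

      @Transition(to = "LEADER", from = "STANDBY")
      public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // Fast: by the time a replica reaches STANDBY it is already in sync.
      }

      private void bootstrapPartition(String partition) {
        // Application-specific: copy/replay data until this replica is caught up.
      }
    }

With this shape, a flood of OFFLINE -> STANDBY messages simply queues up on the pool, while STANDBY -> LEADER stays quick because the replica is already caught up once it reports STANDBY.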
On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <[email protected]> wrote:

In short, the key to this solution is to prevent the STANDBY -> LEADER message from being sent before partitions are truly ready. We do not restrict SYNCING -> STANDBY messages at all. So the controller will send the SYNCING -> STANDBY message and wait until the transition (bootstrap) is done. After that, it can go ahead and bring up the LEADER. It does not hurt to have some long-running transition tasks in the system (as long as they are not STANDBY -> LEADER; LEADER is serving traffic, so we don't want a big gap with no LEADER), because that is what is happening there anyway. But this also means bootstraps run in parallel. I'm not sure if this fits your need.

Best Regards,
Jiajun

On Thu, Aug 5, 2021 at 9:42 AM Brent <[email protected]> wrote:

Thank you for the response, Jiajun!

On the inclusivity thing, I'm glad to hear we're moving to different terminology. Our code actually wraps the MS state machine and renames the terminology to "Leader" and "Follower" everywhere visible to our users and operators, for similar reasons. :-) I thought the Leader/Standby SMD was a bit different, which was why I wasn't using it, but looking at the definition, the only difference seems to be that it doesn't define an ERROR state like the MS SMD does. So for the rest of this thread, let's use the LEADER/STANDBY terminology instead.

For context, I have 1000-2000 shards of a database where each shard can be 100GB+ in size, so bootstrapping nodes is expensive. Your logic on splitting the STANDBY state into two states like SYNCING and STANDBY makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not sure how I can prevent the state from transitioning from SYNCING to STANDBY until the node is ready (i.e. has an up-to-date copy of the leader's data). Based on what you were saying, is it possible to have the Helix controller tell a node it's in the SYNCING state, but then have the node decide when it's safe to transition itself to STANDBY? Or can state transition cancellation be used if the node isn't ready? Or can I just let the transition time out if the node isn't ready?

This seems like it would be a pretty common problem with large, expensive-to-move data (e.g. a shard of a large database), especially when adding a new node to an existing system and needing to bootstrap it from nothing. I suspect people do this and I'm just thinking about it the wrong way, or there's a Helix strategy that I'm just not grasping correctly.

For the LinkedIn folks on the list, what does Espresso do for bootstrapping new nodes and avoiding this problem of them getting promoted to LEADER before they're ready? It seems like a similar problem to mine (a stateful node with large data that needs a leader/standby setup).

Thanks again!

~Brent
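If the intermediate SYNCING state route were taken, a customized state model could be registered for it. Below is a minimal sketch using Helix's StateModelDefinition.Builder; the model name, transition priorities, and constraint values are assumptions for illustration, not a definition that ships with Helix:

    import org.apache.helix.model.StateModelDefinition;

    public class LeaderStandbySyncingModel {
      // Hypothetical model name; register it with HelixAdmin#addStateModelDef.
      public static final String NAME = "LeaderStandbySyncing";

      public static StateModelDefinition build() {
        StateModelDefinition.Builder builder = new StateModelDefinition.Builder(NAME);

        // States and their priorities (lower number = higher priority).
        builder.addState("LEADER", 0);
        builder.addState("STANDBY", 1);
        builder.addState("SYNCING", 2);
        builder.addState("OFFLINE", 3);
        builder.addState("DROPPED", 4);
        builder.initialState("OFFLINE");

        // Legal transitions; the long bootstrap happens in SYNCING -> STANDBY.
        builder.addTransition("OFFLINE", "SYNCING", 0);
        builder.addTransition("SYNCING", "STANDBY", 1);
        builder.addTransition("STANDBY", "LEADER", 2);
        builder.addTransition("LEADER", "STANDBY", 3);
        builder.addTransition("STANDBY", "OFFLINE", 4);
        builder.addTransition("SYNCING", "OFFLINE", 5);
        builder.addTransition("OFFLINE", "DROPPED", 6);

        // At most one LEADER per partition; the remaining replicas sit in STANDBY.
        builder.upperBound("LEADER", 1);
        builder.dynamicUpperBound("STANDBY", "R");

        return builder.build();
      }
    }

As the later replies above conclude, the same effect can also be had with the stock LeaderStandby model by simply not completing OFFLINE -> STANDBY until the replica is caught up.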
On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <[email protected]> wrote:

Hi Brent,

AFAIK, there is no way to tell the controller to suspend a certain state transition. Even if you reject the transition (although rejection is not officially supported either), the controller will probably retry it repeatedly in the next rebalance pipelines.

Alternatively, from your description, I think "Slave" means 2 states in your system: 1. a new Slave that is out of sync, and 2. a synced Slave. Is it possible to define a customized model that differentiates these 2 states? Offline -> Syncing -> Slave, etc.

Even simpler, is it OK to restrict the definition of Slave to the 2nd case? Meaning that before a partition syncs with the Master, it shall not mark itself as Slave. It implies the Offline -> Slave transition would take a longer time, but once it is done, the Slave partition would be fully ready.

BTW, we encourage users to use inclusive language. Maybe you can consider changing to the LeaderStandby SMD? We might deprecate the MasterSlave SMD in the near future.

Best Regards,
Jiajun

On Wed, Aug 4, 2021 at 3:41 PM Brent <[email protected]> wrote:

I had asked a question a while back about how to deal with a failed state transition (http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%[email protected]%3E), and the correct answer there was to throw an exception to cause an ERROR state in the state machine.

I have a slightly different but related question now. I'm using the org.apache.helix.model.MasterSlaveSMD. In our system, it can take a long time (maybe 30 minutes) for a Slave partition to become fully in sync with a Master partition. Under normal circumstances, until a Slave has finished syncing data from a Master, it should not be eligible for promotion to Master.

So let's say a node (maybe newly added to the cluster) is the Slave for partition 22, has been online for 10 minutes (not long enough to have synced everything from the existing partition 22 Master), and receives a state transition from Helix saying it should go from Slave -> Master. Is it possible to temporarily reject that transition without going into the ERROR state for that partition? ERROR state seems like slightly the wrong thing, because while it's not a valid transition right now, it will be a valid transition 20 minutes from now when the initial sync completes.

Is there a way to "fail" a transition without fully going into the ERROR state? Or is there a different way I should be thinking about solving this problem? I was thinking this could potentially be a frequent occurrence when new nodes are added to the cluster.

Thank you for your time and help as always!

~Brent
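For reference, a sketch of the StateTransitionThrottleConfig option recommended earlier in the thread. The ZooKeeper address, cluster name, and the limit of 2 are placeholders, and the config API may vary slightly by Helix version; note also, as discussed above, that the throttle is keyed by rebalance type and scope rather than by a specific transition such as OFFLINE -> STANDBY:

    import java.util.Collections;

    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.api.config.StateTransitionThrottleConfig;
    import org.apache.helix.model.ClusterConfig;

    public class ThrottleSetup {
      public static void main(String[] args) {
        // Placeholders: ZooKeeper address and cluster name.
        ConfigAccessor accessor = new ConfigAccessor("localhost:2181");
        ClusterConfig clusterConfig = accessor.getClusterConfig("MyCluster");

        // Allow at most 2 partitions per instance to be in recovery-type
        // transitions (such as bringing up new replicas) at the same time.
        clusterConfig.setStateTransitionThrottleConfigs(Collections.singletonList(
            new StateTransitionThrottleConfig(
                StateTransitionThrottleConfig.RebalanceType.RECOVERY_BALANCE,
                StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
                2)));

        accessor.setClusterConfig("MyCluster", clusterConfig);
      }
    }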
