Hi, Bo

  As Kishore commented, your offline->slave state transition callback needs
some logic to determine whether a bootstrap or a catch-up is needed to
transition a replica to slave.  A common approach is to persist the data
version of each local partition somewhere and, during offline->slave, compare
the local version (if there is one) with the current Master's version to
decide whether a bootstrap (if the local version is missing or too old) or a
catch-up is needed.
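
  To make the version comparison concrete, here is a minimal sketch.  It is
not Helix code: RecoveryPlanner, decide() and maxCatchUpLag are hypothetical
names, and it assumes the data version is a monotonically increasing number
that you persist yourself.  Treating a local version that is ahead of the
Master as a reason to bootstrap is also an assumption; your store may want to
handle that case differently.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of Helix.
class RecoveryPlanner {
    enum Action { BOOTSTRAP, CATCH_UP }

    /**
     * @param localVersion  persisted version of the local partition, empty if none
     * @param masterVersion current version reported by the Master
     * @param maxCatchUpLag largest version gap still worth replaying (assumed knob)
     */
    static Action decide(OptionalLong localVersion, long masterVersion, long maxCatchUpLag) {
        if (!localVersion.isPresent()) {
            return Action.BOOTSTRAP;   // no local data at all
        }
        long lag = masterVersion - localVersion.getAsLong();
        if (lag < 0 || lag > maxCatchUpLag) {
            return Action.BOOTSTRAP;   // diverged from the Master, or too far behind
        }
        return Action.CATCH_UP;        // a lag of 0 simply means nothing to replay
    }
}
```

  Your offline->slave callback would call something like decide() first and
then either copy a full snapshot or replay the missing updates.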


  There is one more difference in how Helix handles a participant restart vs.
a ZK session loss.  When a participant starts (or restarts), it creates a new
StateModel (by calling createNewStateModel() in your StateModelFactory) for
each partition.  However, if a participant loses its ZK session and comes back
(with a new session), it reuses the existing StateModel for the partitions that
were there before instead of creating new ones.  You may leverage this to tell
whether a participant has been restarted or has just re-established its ZK
connection.
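
  One way to use this is to keep a flag on the StateModel instance itself.
The sketch below is a plain Java stand-in rather than a real Helix StateModel
subclass; the class, enum and method names are hypothetical.  In a real
participant the flag would live in your own StateModel class and be checked
in the offline->slave callback.

```java
// Hypothetical stand-in for a per-partition state model; not a Helix class.
class PartitionStateModelSketch {
    enum Origin { FRESH_INSTANCE, REUSED_INSTANCE }

    // Instance state survives as long as the same object is reused, so it
    // is lost on a process restart but kept across a ZK session change.
    private boolean sawOfflineToSlave = false;

    /** Call at the start of the offline->slave transition. */
    Origin onOfflineToSlave() {
        Origin origin = sawOfflineToSlave ? Origin.REUSED_INSTANCE : Origin.FRESH_INSTANCE;
        sawOfflineToSlave = true;
        return origin;
    }
}
```

  Note that a fresh instance only tells you the object is new (process
restart, or a partition newly assigned to this node), so it is safest to
combine it with the version check rather than rely on it alone.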


  In addition, the Delayed feature in DelayedAutoRebalancer is a little
different from what you may have understood.  When you lose a participant
(e.g., it crashed or is in maintenance), you lose one replica for some
partitions.  In this situation, Helix will usually bring up a new replica on
some other live node immediately to maintain the required replica count.
However, this may hurt performance, since bringing up a new replica can
require a data bootstrap on the new node.  If you expect the original
participant to be back online soon and you can tolerate running with one or
more replicas missing in the short term, you can set a delay time.  Helix will
not bring up a new replica until this delay has elapsed.  Hope that makes it
clearer.
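
  As a rough model of the timing described above (this is an illustration
of the behaviour, not the rebalancer's actual code, and all names in it are
hypothetical), the decision amounts to:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the delay window; not Helix's implementation.
class DelayedReplacementSketch {
    /**
     * @param wentOffline when the participant was first seen offline
     * @param now         current time
     * @param delay       configured delay; zero or negative means "no delay"
     * @return true if a replacement replica should be brought up elsewhere
     */
    static boolean shouldBringUpReplacement(Instant wentOffline, Instant now, Duration delay) {
        if (delay.isZero() || delay.isNegative()) {
            return true;                               // replace immediately
        }
        return !now.isBefore(wentOffline.plus(delay)); // wait out the delay window
    }
}
```

  With a 10-minute delay, a participant that comes back after 3 minutes gets
its partitions back without any replica having been rebuilt on another node.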




Thanks

Lei




Lei Xia


Data Infra/Helix

[email protected]
www.linkedin.com/in/lxia1

________________________________
From: Bo Liu <[email protected]>
Sent: Monday, January 22, 2018 11:12:48 PM
To: [email protected]
Subject: differentiate between bootstrap and a soft failure

Hi There,

I am using FULL_AUTO with MasterSlave and DelayedAutoRebalancer. How can a 
participant differentiate between these two cases:

1) when a participant first joins a cluster, it will be asked to transition
from OFFLINE to SLAVE. Since the participant doesn't have any data for this
partition, it needs to bootstrap and download data from another participant or 
a data source.
2) when a participant loses its ZK session, the controller will automatically 
change the participant to be OFFLINE in ZK. If the participant manages to 
establish a new session to ZK before the delayed time threshold, the controller 
will send a request to it to switch from OFFLINE to SLAVE. In this case, the 
participant already has the data for the partition, so it doesn't need to 
bootstrap from other data sources.

--
Best regards,
Bo
