Re: Failure detection time

kishore g Sun, 03 Mar 2013 20:59:44 -0800

Thanks Ming, good catch. Do you mind submitting a patch and adding a test
case?


https://issues.apache.org/jira/browse/HELIX-55

Thanks,
Kishore G





On Sun, Mar 3, 2013 at 10:34 AM, Ming Fang <[email protected]> wrote:

> I've tried setting the zk.session.timeout property from my participants, but I
> don't think it's working.
> Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems
> the session timeout is set to the same value as helixmanager.flappingTimeWindow.
> That looks like a bug, since these two values serve different purposes.
>
> As a temporary workaround, this hack works:
>
>             manager = HelixManagerFactory.getZKHelixManager(
>                 CLUSTER_NAME, instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS);
>             {
>                 // hack to set sessionTimeout
>                 Field sessionTimeout =
>                     ZKHelixManager.class.getDeclaredField("_sessionTimeout");
>                 sessionTimeout.setAccessible(true);
>                 sessionTimeout.setInt(manager, 1000);
>             }
>
> Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout
> = 1000.
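
The server-side change matters because a ZooKeeper server bounds whatever session timeout the client requests. The following is a minimal sketch of that bounding, assuming the documented defaults of 2x tickTime for minSessionTimeout and 20x tickTime for maxSessionTimeout. The class and method names are illustrative, and using -1 for "not configured" is a convention of this sketch.

```java
// Sketch of how a ZooKeeper server bounds a client's requested session timeout.
// Class, method, and parameter names are illustrative, not ZooKeeper's own API.
final class SessionTimeoutSketch {
    /** Returns the timeout the server would grant, in milliseconds. */
    static int negotiate(int requestedMs, int tickTimeMs,
                         int minSessionTimeoutMs, int maxSessionTimeoutMs) {
        // -1 means "not configured" in this sketch: fall back to 2x and 20x tickTime.
        int min = minSessionTimeoutMs == -1 ? 2 * tickTimeMs : minSessionTimeoutMs;
        int max = maxSessionTimeoutMs == -1 ? 20 * tickTimeMs : maxSessionTimeoutMs;
        // Clamp the request into the [min, max] range.
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // With a tickTime of 2000 ms and no explicit bounds, a 1000 ms request is raised.
        System.out.println(negotiate(1000, 2000, -1, -1));  // prints 4000
        // With tickTime = 500 and minSessionTimeout = 1000, the 1000 ms request is granted.
        System.out.println(negotiate(1000, 500, 1000, -1)); // prints 1000
    }
}
```

So a 1000 ms client-side timeout only takes effect if the server's lower bound is at or below 1000 ms.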
>
> On Mar 3, 2013, at 1:53 AM, kishore g <[email protected]> wrote:
>
> There are two kinds of failover: planned (e.g., during a software upgrade) and
> unplanned (e.g., a node crash).
>
> For planned failover, add a JVM shutdown hook that invokes
> helixmanager.disconnect(), and then stop the process with kill <pid>. This lets
> Helix detect the failure almost immediately, in about 5-15 milliseconds.
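
A minimal sketch of such a shutdown hook follows. To keep it self-contained, the Disconnectable interface is a hypothetical stand-in for the Helix manager (only its disconnect() call is used), and the class and thread names are illustrative.

```java
// Hypothetical stand-in for the Helix manager; only disconnect() is needed here.
interface Disconnectable {
    void disconnect();
}

final class GracefulShutdownSketch {
    /** Builds (but does not register) the hook thread, so it can be exercised directly. */
    static Thread newDisconnectHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    /** Registers the hook with the JVM and returns it. */
    static Thread install(Disconnectable manager) {
        Thread hook = newDisconnectHook(manager);
        // Shutdown hooks run on normal exit and on SIGTERM (plain `kill <pid>`),
        // but not when the process is killed with `kill -9`.
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        install(() -> System.out.println("disconnected from cluster"));
        // Returning from main ends the JVM, which runs the hook.
    }
}
```

In a real participant, the lambda would be replaced by a call to disconnect() on the connected manager instance.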
>
> For unplanned failover, detection time is determined by the ZooKeeper session
> timeout, which defaults to 30 seconds. You can make this more aggressive, e.g.
> 5, 10, or 15 seconds; the recommended value is 15 seconds. Change it by
> setting the system property "zk.session.timeout" to 15*1000 (milliseconds).
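
A minimal sketch of setting that property programmatically, assuming it is read when the manager is created, so it has to be set beforehand. The class and method names are illustrative. Passing -Dzk.session.timeout=15000 on the java command line is equivalent. Note that the follow-up in this thread reports the property not taking effect in that build.

```java
// Sketch: set and read back the "zk.session.timeout" system property.
// Class and method names are illustrative; only the property name comes from the thread.
final class SessionTimeoutPropertySketch {
    static final String PROPERTY = "zk.session.timeout";
    static final int DEFAULT_MS = 30 * 1000; // default mentioned in the thread

    /** Sets the requested session timeout, in milliseconds. */
    static void configure(int timeoutMs) {
        System.setProperty(PROPERTY, Integer.toString(timeoutMs));
    }

    /** Reads the property back, falling back to the default when it is unset. */
    static int effectiveTimeoutMs() {
        return Integer.getInteger(PROPERTY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        configure(15 * 1000);
        System.out.println(effectiveTimeoutMs());
        // Create and connect the manager only after this point.
    }
}
```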
>
> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold
> can be tuned if you have an unreliable network or excessive GC pauses.
> You probably don't need to tune these, but let me know if you need additional
> info on them.
>
> Failover time depends on the number of partitions, nodes, resources, etc. in
> the system. For a 1000-partition system with 10 nodes, failover time for one
> node might be 200-300 milliseconds.
>
> Jason has made a lot of performance improvements on another branch that
> might improve this time further.
>
> thanks,
> Kishore G
>
>
>
>
>
>
>
>
> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <[email protected]> wrote:
>
>> How can I tune the amount of time it takes for detecting a failed node,
>> e.g. kill -9?
>> Is it by setting "helixmanager.flappingTimeWindow"?
>>
>> What is the fastest possible time for a failover?
>
>
>
>
