Thanks Ming, good catch. Do you mind submitting a patch and adding a test case?
https://issues.apache.org/jira/browse/HELIX-55

Thanks,
Kishore G

On Sun, Mar 3, 2013 at 10:34 AM, Ming Fang <[email protected]> wrote:

> I've tried setting the zk.session.timeout property from my participants, but I
> don't think it's working.
> Looking at org.apache.helix.manager.zk.ZKHelixManager line 155, it seems
> the session timeout is set to the same value as helixmanager.flappingTimeWindow.
> That looks like a bug, since these two values serve different purposes.
>
> As a temporary workaround, this hack works:
>
>     manager = HelixManagerFactory.getZKHelixManager(CLUSTER_NAME,
>         instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS);
>     {
>         // hack to set sessionTimeout
>         Field sessionTimeout =
>             ZKHelixManager.class.getDeclaredField("_sessionTimeout");
>         sessionTimeout.setAccessible(true);
>         sessionTimeout.setInt(manager, 1000);
>     }
>
> Also, on the ZooKeeper side I set tickTime = 500 and minSessionTimeout
> = 1000.
>
> On Mar 3, 2013, at 1:53 AM, kishore g <[email protected]> wrote:
>
> There are two kinds of failover: planned (during a software upgrade) and
> unplanned (node crash, etc.).
>
> For planned, you should add a JVM shutdown hook from which you invoke
> helixmanager.disconnect(), and then invoke kill <pid>. This will allow Helix
> to detect the failure almost immediately, within about 5-15 milliseconds.
>
> For unplanned, it is determined by the ZooKeeper session timeout, which is by
> default set to 30 seconds. You can change this to be more aggressive, like
> 5, 10 or 15 seconds. The recommended value is 15 seconds. You can change this by
> setting the system property "zk.session.timeout" = 15*1000.
>
> helixmanager.flappingTimeWindow and helixmanager.maxDisconnectThreshold
> can be tuned in case you have bad network conditions or excessive GCs.
> You probably don't need to tune these, but let me know if you need additional
> info on this.
>
> Failover time depends on the number of partitions, nodes, resources, etc. in the
> system. For a 1000-partition system with 10 nodes, failover time for one
> node might be 200-300 milliseconds.
>
> Jason has done a lot of performance improvements on another branch that
> might improve this time further.
>
> thanks,
> Kishore G
>
> On Sat, Mar 2, 2013 at 9:53 PM, Ming Fang <[email protected]> wrote:
>
>> How can I tune the amount of time it takes to detect a failed node,
>> e.g. kill -9?
>> Is it by setting "helixmanager.flappingTimeWindow"?
>>
>> What is the fastest possible time for a failover?
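
For the planned-failover case described in the thread (a JVM shutdown hook that calls disconnect() before a plain kill <pid>), a minimal sketch might look like the following. The class name GracefulShutdownSketch, the Disconnectable interface, and both helper methods are hypothetical and exist only so the example compiles without Helix on the classpath; in a real participant you would pass the manager's own disconnect() method instead.

```java
public class GracefulShutdownSketch {

    /** Hypothetical stand-in for the single manager method this sketch needs. */
    public interface Disconnectable {
        void disconnect();
    }

    /** Builds (but does not register) a thread that calls disconnect() when run. */
    public static Thread newDisconnectHook(Disconnectable manager) {
        return new Thread(manager::disconnect, "helix-disconnect-hook");
    }

    /**
     * Registers the hook with the JVM. It runs on normal exit and on SIGTERM
     * (plain `kill <pid>`), but not on SIGKILL (`kill -9`), which is why the
     * unplanned case still depends on the ZooKeeper session timeout.
     */
    public static Thread registerDisconnectHook(Disconnectable manager) {
        Thread hook = newDisconnectHook(manager);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Assuming manager is the HelixManager obtained as in the workaround above, usage would be roughly GracefulShutdownSketch.registerDisconnectHook(manager::disconnect), called once after the manager connects.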
