On Fri, Mar 7, 2014 at 12:15 PM, Josh Elser <[email protected]> wrote:
> On 3/7/14, 12:01 PM, Terry P. wrote:
>
>> Greetings folks,
>> It seems network woes will never go away for this Accumulo 1.4.2 project
>> :-(
>>
>> They rebooted one of the two "redundant switches" last night, but of
>> course zero redundancy actually took place and the Master lost his
>> zookeeper lock as did one of the Datanodes after 60 seconds and shut
>> itself down.
>>
>
> By datanode you mean tserver? Hadoop datanodes don't communicate with
> ZooKeeper.
>
>> The 60 second period is odd, because I see that
>> instance.zookeeper.timeout is actually set to 30s, but I do recall that
>> often by default zookeeper clients retry 2 times before bailing so maybe
>> that's why.
>>
>
> It won't always be 30s before it's seen; I've seen it much quicker too.
> I'm not sure about the retries off the top of my head.

Most likely you were seeing the effects of ACCUMULO-1572, in which a
ZooKeeper disconnect causes Accumulo to fail before the session expires.
It is fixed in 1.5.1 and in the to-be-released 1.4.5. If you think you're
seeing something else, it would be good to hear about it.
