On 3/7/14, 12:01 PM, Terry P. wrote:
Greetings folks,
It seems network woes will never go away for this Accumulo 1.4.2 project :-(
They rebooted one of the two "redundant switches" last night, but of
course zero redundancy actually took place and the Master lost his
zookeeper lock as did one of the Datanodes after 60 seconds and shut
itself down.
By datanode you mean tserver? Hadoop datanodes don't communicate with
ZooKeeper.
The 60 second period is odd, because I see that
instance.zookeeper.timeout is actually set to 30s, but I do recall that
often by default zookeeper clients retry 2 times before bailing so maybe
that's why.
It won't always take the full 30s for a lost lock to be noticed; I've
seen it happen much more quickly too. I'm not sure about the retries off
the top of my head.
My question: is it safe / advisable to increase the zookeeper timeout
to, say, 60 seconds? Where can I set that in a file to ensure the
change is durable?
Yes, as long as you're cognizant of the fact that it will take longer to
notice an actual failure. If a tserver dies or hangs, it could now take
up to twice as long (60s instead of 30s) for that to be noticed, which
would add latency to your application.
You should set that property in accumulo-site.xml. Make sure to place it
on all nodes in the cluster. I believe you will also have to restart
Accumulo for it to take effect.
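For reference, the entry would look roughly like the sketch below. The
60s value is just the figure from your question, not a recommendation,
and I'm assuming your accumulo-site.xml already has the usual
<configuration> root element that this property goes inside.

```xml
<!-- Sketch only: add inside the existing <configuration> element of
     accumulo-site.xml on every node. 60s is the value from the question. -->
<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>
```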
Thanks in advance,
Terry