Hello, I have an 8-node cluster. Under heavy load a tserver goes down. We have a systemd unit file that restarts it automatically, but after the restart tablets stay unassigned for an hour.
In the log of the restarted tserver I see:

WARN: Saw (possibly) transient exception communicating with zookeeper

and then this error:

KeeperErrorCode = ConnectionLoss for /accumulo/<instance>/xxx
KeeperErrorCode = ConnectionLoss
    at KeeperException.create(KeeperException.java:102)
    at KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
    at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
    xxxxx

Any suggestions?

-S