That log message only reports that the connection to ZooKeeper failed;
it doesn't say what led to the failure. You'll probably have to gather
additional evidence to track down the problem. Check the manager
(master) and tserver logs from before the crash, as well as the
ZooKeeper logs. If you can catch the manager or a tserver in a bad
state, try to capture a jstack of its process. Also check the system
logs for messages such as the oom-killer running and killing your
processes.
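
For the system log check, here is a rough Python sketch that scans log
lines for the messages the Linux kernel usually writes around an OOM
kill. The patterns are my assumption of the common wording ("invoked
oom-killer", "Out of memory: Kill process", "Killed process"); the exact
text varies between kernel versions, and find_oom_events is just a
helper name made up for this example.

```python
import re

# Assumed wording of kernel OOM messages; adjust for your kernel version.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
    re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_events(lines):
    """Return (line_number, pid, process_name, text) for each OOM-related line.

    pid and process_name are None when the matching message does not carry them.
    """
    events = []
    for number, line in enumerate(lines, start=1):
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if match:
                groups = match.groupdict()
                events.append(
                    (number, groups.get("pid"), groups.get("name"), line.rstrip())
                )
                break  # report each line once, using the first pattern that matches
    return events
```

You could feed it the output of dmesg or the lines of your syslog file
(the location depends on the distribution) and compare the timestamps of
any hits with the time the tserver went down.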

On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA]
<ligade_shail...@bah.com> wrote:
>
> Hello,
>
> I have an 8-node cluster. Under heavy load a tserver goes down. We have a
> systemd unit file to auto-restart it, but that leaves tablets unassigned
> for an hour.
>
> In the log of the restarted tserver I see
> WARN: Saw (possibly) transient exception communicating with zookeeper
> and then error
> KeeperErrorCode = ConnectionLoss for /accumulo/<instance>/xxx
> KeeperErrorCode = ConnectionLoss
>     at KeeperException.create(KeeperException.java:102)
>     at KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> xxxxx
>
> Any suggestions?
>
> -S
