Thanks Christopher,

It is actually the OOM killer. So how can I prevent it? I have Xmx/Xms set to 
80G on a 128G machine, so some process is hogging the memory. Under normal usage 
I don't see the issue, but under bulk ingest I do. 
I am going to try reducing the heap and test, but I really don't want to starve 
the tserver either. I added more tservers, hoping that reducing the number of 
tablets per tserver might help, but it didn't.
Do you recommend setting oom_score_adj to, say, -100? 
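
For reference, here is a minimal sketch of one way to see which processes 
hold the most resident memory on a Linux host, along with their current 
oom_score_adj (the kernel accepts values from -1000 to 1000 there, with -1000 
exempting a process from the OOM killer). The function names and the 
proc_root parameter are hypothetical helpers for illustration, not part of 
Accumulo.

```python
import os
import re

def parse_status(status_text):
    """Return (name, rss_kib) parsed from the text of /proc/<pid>/status.

    rss_kib is None when there is no VmRSS line (e.g. kernel threads).
    """
    name = re.search(r"^Name:\s+(.+)$", status_text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return (name.group(1) if name else "?", int(rss.group(1)) if rss else None)

def top_rss(limit=5, proc_root="/proc"):
    """List (rss_kib, pid, name, oom_score_adj) tuples, largest RSS first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kib = parse_status(f.read())
            with open(os.path.join(proc_root, entry, "oom_score_adj")) as f:
                adj = int(f.read())
        except OSError:
            continue  # process exited or is not readable by this user
        if rss_kib is not None:
            rows.append((rss_kib, int(entry), name, adj))
    return sorted(rows, reverse=True)[:limit]
```

Calling top_rss() on a tserver host while a bulk ingest is running would show 
whether the tserver JVM itself or some other process is growing; it only 
reports what is resident at that moment, so it may need to be sampled 
repeatedly.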

Appreciate your help

-S

-----Original Message-----
From: Christopher <ctubb...@apache.org> 
Sent: Monday, November 22, 2021 12:23 PM
To: accumulo-user <user@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 tserver goes down under heavy ingest

That log message is basically just reporting that the connection to ZK failed. 
It's not very helpful in determining what led to that. You'll probably have to 
gather additional evidence to track down the problem.
Check the master and tserver logs prior to the crash, as well as the ZooKeeper 
logs. If you can detect the manager or a tserver in a bad state, try to capture 
a jstack of its process ID. Also check for system log messages, such as the 
oom-killer running and killing your processes.
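
As a rough illustration of the system-log check, the sketch below scans log 
lines for the kernel's "Killed process" report. The regular expression is an 
assumption based on the usual wording of that message, which differs between 
kernel versions, and find_oom_kills is a hypothetical helper name.

```python
import re

# Assumed wording of the kernel's OOM-kill report: "Killed process <pid> (<name>)".
# Adjust the pattern if your kernel logs the event differently.
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def find_oom_kills(lines):
    """Yield (pid, process_name, raw_line) for each OOM-kill report in lines."""
    for line in lines:
        match = KILLED.search(line)
        if match:
            yield int(match.group(1)), match.group(2), line.rstrip("\n")
```

The lines can come from wherever the host keeps kernel messages, for example 
the output of dmesg or journalctl -k saved to a file and read line by line.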

On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
>
> Hello,
>
> I have an 8-node cluster. Under heavy load a tserver goes down. We have a 
> systemd unit file to auto-restart it, but that leaves tablets unassigned for 
> an hour.
>
> In the log of the restarted tserver I see
> WARN: Saw (possibly) transient exception communicating with zookeeper 
> and then the error KeeperErrorCode = ConnectionLoss for 
> /accumulo/<instance>/xxx KeeperErrorCode = ConnectionLoss
>     at KeeperException.create(KeeperException.java:102)
>     at KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> xxxxx
>
> Any suggestions?
>
> -S
