Yes, we are using the native library. I was thinking of reducing the heap to 65G.
-S

-----Original Message-----
From: Christopher <ctubb...@apache.org>
Sent: Monday, November 22, 2021 7:20 PM
To: accumulo-user <user@accumulo.apache.org>
Subject: Re: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

I don't know how to tune the oom killer, but I do wonder why you would need an 80G Java heap. That seems excessive to me. Are you using the native map library?

On Mon, Nov 22, 2021 at 7:06 PM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
>
> Thanks Christopher,
>
> It is actually the oom killer. So how can I prevent it? I mean I have Xmx/Xms set
> to 80G on a 128G machine. So some process is hogging the memory. On normal
> usage I don't see the issue but under bulk ingest I see the issue.
> I am going to try to reduce heap and test, but I really don't want to starve
> tserver either. I added more tservers, hoping that reducing number of tablets
> per tserver might help, but it didn't.
> Do you recommend to set oom_score_adj say -100?
>
> Appreciate your help
>
> -S
>
> -----Original Message-----
> From: Christopher <ctubb...@apache.org>
> Sent: Monday, November 22, 2021 12:23 PM
> To: accumulo-user <user@accumulo.apache.org>
> Subject: [External] Re: acumulo 1.10.0 tserver goes down under heavy
> ingest
>
> That log message is basically just reporting that the connection to ZK
> failed. It's not very helpful in determining what led to that. You'll
> probably have to gather additional evidence to track down the problem.
> Check the master and tserver logs prior to the crash, as well as the
> ZooKeeper logs. If you can detect the manager or a tserver in a bad state,
> try to capture a jstack of its process ID. Also check for system log
> messages, such as the oom-killer running and killing your processes.
>
> On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA]
> <ligade_shail...@bah.com> wrote:
> >
> > Hello,
> >
> > I have an 8 node cluster. Under heavy load a tserver goes down. We have a
> > systemd unit file to auto restart it, but that causes unassigned tablets
> > for an hour.
> >
> > In the log of the restarted tserver I see
> > WARN: Saw (possibly) transient exception communicating with
> > zookeeper and then error KeeperErrorCode = ConnectionLoss for
> > /accumulo/<instance >/xxx KeeperErrorCode = ConnectionLoss
> > at KeeperException.create(KeeperException.java:102)
> > at KeeperException.create(KeeperException.java:54)
> > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
> > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
> > at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> > xxxxx
> >
> > Any suggestions?
> >
> > -S
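
Editor's note on the heap question in this thread: the arithmetic behind "80G heap on a 128G machine" versus the proposed 65G can be sketched as below. This is only a rough illustration, not a tuning recommendation. The function name `headroom_gib` and the `native_map` and `other` parameters are hypothetical; the thread gives only the machine size and the two heap values. The native map is included as a parameter because, when the native map library is in use, the in-memory map is allocated outside the Java heap, so it competes with the heap for the same physical RAM.

```python
# Hypothetical helper (not part of Accumulo): how much RAM is left once the
# JVM heap and other known consumers are subtracted from the machine total.
# All sizes are in GiB.

def headroom_gib(total_ram, java_heap, native_map=0, other=0):
    """Return memory not claimed by the heap, the native map, or other processes."""
    return total_ram - java_heap - native_map - other


if __name__ == "__main__":
    total = 128  # machine RAM stated in the thread
    for heap in (80, 65):  # current and proposed heap sizes from the thread
        left = headroom_gib(total, heap)
        print(f"heap={heap}G leaves {left}G for everything else")
```

If the native map size and the footprint of other processes on the node are known, passing them as `native_map` and `other` shows how little slack remains before the kernel's oom-killer is likely to step in during bulk ingest.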