Yes, we are using the native library. I was thinking of reducing the heap to 65G.

-S
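
For reference, the arithmetic behind that change can be sketched roughly as follows. With the native map library, the in-memory maps are allocated outside the Java heap, so the tserver's footprint is approximately heap plus native map plus JVM overhead. The 128G, 80G, and 65G figures come from this thread; the native map size and the overhead allowance are placeholder assumptions, not values measured on this cluster.

```python
def headroom_gb(total_ram_gb, heap_gb, native_map_gb, jvm_overhead_gb=0):
    """Return the RAM left for the OS, page cache, and other processes."""
    return total_ram_gb - (heap_gb + native_map_gb + jvm_overhead_gb)


TOTAL_RAM_GB = 128     # from the thread
NATIVE_MAP_GB = 16     # ASSUMPTION: placeholder for the native map size
JVM_OVERHEAD_GB = 4    # ASSUMPTION: placeholder for metaspace, threads, GC, etc.

for heap_gb in (80, 65):   # current and proposed heap sizes from the thread
    left = headroom_gb(TOTAL_RAM_GB, heap_gb, NATIVE_MAP_GB, JVM_OVERHEAD_GB)
    print(f"heap={heap_gb}G leaves about {left}G for everything else")
```

Substituting the real native map size (the `tserver.memory.maps.max` setting) for the placeholder shows how much room other processes on the node actually have.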

-----Original Message-----
From: Christopher <ctubb...@apache.org> 
Sent: Monday, November 22, 2021 7:20 PM
To: accumulo-user <user@accumulo.apache.org>
Subject: Re: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

I don't know how to tune the oom killer, but I do wonder why you would need an 
80G Java heap. That seems excessive to me. Are you using the native map library?
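
On the `oom_score_adj` question quoted below: the value lives in `/proc/<pid>/oom_score_adj` and the kernel accepts -1000 to 1000, with lower values making a process a less likely victim. A minimal sketch of setting it, assuming a Linux host and enough privilege to lower the value; the helper names are illustrative only and not part of Accumulo:

```python
from pathlib import PurePosixPath

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000   # range accepted by the Linux kernel


def oom_adj_path(pid):
    """Build the procfs path that holds a process's OOM score adjustment."""
    return str(PurePosixPath("/proc") / str(pid) / "oom_score_adj")


def validate_oom_adj(value):
    """Reject values the kernel would refuse."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]")
    return value


def set_oom_adj(pid, value):
    """Write the adjustment; lowering it normally requires root."""
    with open(oom_adj_path(pid), "w") as f:
        f.write(f"{validate_oom_adj(value)}\n")
```

Calling `set_oom_adj(pid, -100)` with the tserver's process ID would apply the value from the question. Since the tserver is started from a systemd unit, the `OOMScoreAdjust=` directive in that unit is another way to set it. Either way this only changes which process gets killed first; it does not reduce the memory pressure itself.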

On Mon, Nov 22, 2021 at 7:06 PM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
>
> Thanks Christopher,
>
> It is actually the oom killer. So how can I prevent it? I have Xmx/Xms set 
> to 80G on a 128G machine, so some process is hogging the memory. Under normal 
> usage I don't see the issue, but under bulk ingest I do.
> I am going to try reducing the heap and test, but I really don't want to starve 
> the tserver either. I added more tservers, hoping that reducing the number of tablets 
> per tserver might help, but it didn't.
> Do you recommend setting oom_score_adj to, say, -100?
>
> Appreciate your help
>
> -S
>
> -----Original Message-----
> From: Christopher <ctubb...@apache.org>
> Sent: Monday, November 22, 2021 12:23 PM
> To: accumulo-user <user@accumulo.apache.org>
> Subject: [External] Re: acumulo 1.10.0 tserver goes down under heavy 
> ingest
>
> That log message is basically just reporting that the connection to ZK 
> failed. It's not very helpful in determining what led to that. You'll 
> probably have to gather additional evidence to track down the problem.
> Check the master and tserver logs prior to the crash, as well as the 
> ZooKeeper logs. If you can detect the manager or a tserver in a bad state, 
> try to capture a jstack of its process ID. Also check for system log 
> messages, such as the oom-killer running and killing your processes.
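
The system log check suggested above can be sketched as a small filter. This assumes the kernel's usual "Killed process <pid> (<name>)" wording in its OOM report; the function name is illustrative only.

```python
import re

# The kernel logs the victim as "Killed process <pid> (<name>)".
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process name) for each OOM kill found in the log lines."""
    hits = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Feeding it the lines from `dmesg` or `journalctl -k` output and comparing the reported PIDs against the tserver's PID before the restart would confirm whether the oom-killer took it down.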
>
> On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA] 
> <ligade_shail...@bah.com> wrote:
> >
> > Hello,
> >
> > I have an 8-node cluster. Under heavy load a tserver goes down. We have a 
> > systemd unit file to auto-restart it, but that leaves tablets unassigned for an 
> > hour.
> >
> > In the log of the restarted tserver I see
> > WARN: Saw (possibly) transient exception communicating with 
> > zookeeper and then error KeeperErrorCode = ConnectionLoss for 
> > /accumulo/<instance>/xxx KeeperErrorCode = ConnectionLoss
> >     at KeeperException.create(KeeperException.java:102)
> >     at KeeperException.create(KeeperException.java:54)
> >     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
> >     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
> >     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> > xxxxx
> >
> > Any suggestions?
> >
> > -S
