Thanks Dave! Are there any mitigations I can use to work around this until 2.1.4 is released? I suppose on the standby servers I could schedule a cron job to restart the GC process every few hours. I'm not sure how long Accumulo can safely operate without a running GC, so that may be something I need to test for my particular database size and usage.
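
Also, just to check that I understand the failure mode you described: is it roughly the pattern below? This is only a sketch of my mental model against the plain ZooKeeper client API, not the actual ServiceLock code, and the class name, connect string, and lock path are placeholders I made up. The part I want to confirm is that each retry registers a brand-new Watcher object that the client then keeps around for as long as the watch never fires.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class StandbyWatcherBuildupSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string for the three primary nodes running ZooKeeper.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});

    // Placeholder standing in for the gc lock path under /accumulo/<instance>/gc/lock.
    String lockPath = "/accumulo/INSTANCE_ID/gc/lock";

    // Standby-style retry loop: every pass registers a brand-new Watcher object.
    // The active gc never gives up the lock, so none of these watches ever fire,
    // and the ZooKeeper client keeps each Watcher instance reachable. That is
    // what I understand to be slowly filling the standby gc's heap.
    while (true) {
      Watcher freshWatcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
          // a real implementation would retry the lock acquisition here
        }
      };
      zk.exists(lockPath, freshWatcher); // registers freshWatcher in the client
      Thread.sleep(5_000);
    }
  }
}

If that is the right picture, then bouncing the standby gc processes on a schedule should at least clear out the accumulated watchers until we can move to a release with your fix.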
On Mon, Aug 26, 2024 at 1:39 PM Dave Marion <dlmar...@apache.org> wrote:

> Thanks for reporting this. Based on the information you provided I was able
> to create https://github.com/apache/accumulo/pull/4838. It appears that the
> Manager, Monitor, and SimpleGarbageCollector are creating multiple instances
> of ServiceLock when in a loop waiting to acquire the lock (when they are the
> standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper
> client, which is likely causing the problem you are having. The Manager and
> Monitor operate a little differently and thus do not exhibit the same OOME
> problem.
>
> On 2024/08/26 12:13:50 Craig Portoghese wrote:
> > Wasn't sure if this was bug territory or an issue with cluster
> > configuration.
> >
> > In my dev environment, I have a 5-server AWS EMR cluster using Accumulo
> > 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high
> > availability mode so there are 3 primary nodes with Zookeeper running. On
> > the primary nodes I run the manager, monitor, and gc processes. On the 2
> > core nodes (with DataNode on them) I run just tablet servers.
> >
> > The manager and monitor processes on the 2nd and 3rd servers are fine, no
> > problems about not being the leader for their process. However, the 2nd and
> > 3rd GC processes will repeatedly complain in a DEBUG "Failed to acquire
> > lock". It will complain that there is already a gc lock, and then create an
> > ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of
> > this complaint loop, it will turn into an error "Called
> > determineLockOwnership() when ephemeralNodeName == null", which it spams
> > forever, filling up the server and eventually killing the server.
> >
> > This has happened in multiple environments. Is it an issue with GC's
> > ability to hold elections? Should I be putting the standby GC processes on
> > a different node than the one running one of the zookeepers?
> > Below are samples of the two log types:
> >
> > 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> > 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> > 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> > 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
> >
> > 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> > java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
> >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
> >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]