Thanks Dave! Are there any mitigations I can use to work around this until 2.1.4 is released? I suppose on the standby servers I could schedule a cron job to restart the GC process every few hours. I'm not sure how long Accumulo can safely operate without a running GC, so that may be something I need to test for my particular database size and usage.
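
Also, just to check that I understand the failure mode you described: is it roughly the pattern below? This is only a sketch of my mental model against the plain ZooKeeper client API, not the actual ServiceLock code, and the class name, connect string, and lock path are placeholders I made up. The part I want to confirm is that each retry registers a brand-new Watcher object that the client then keeps around for as long as the watch never fires.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class StandbyWatcherBuildupSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string for the three primary nodes running ZooKeeper.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});

    // Placeholder standing in for the gc lock path under /accumulo/<instance>/gc/lock.
    String lockPath = "/accumulo/INSTANCE_ID/gc/lock";

    // Standby-style retry loop: every pass registers a brand-new Watcher object.
    // The active gc never gives up the lock, so none of these watches ever fire,
    // and the ZooKeeper client keeps each Watcher instance reachable. That is
    // what I understand to be slowly filling the standby gc's heap.
    while (true) {
      Watcher freshWatcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
          // a real implementation would retry the lock acquisition here
        }
      };
      zk.exists(lockPath, freshWatcher); // registers freshWatcher in the client
      Thread.sleep(5_000);
    }
  }
}

If that is the right picture, then bouncing the standby gc processes on a schedule should at least clear out the accumulated watchers until we can move to a release with your fix.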
On Mon, Aug 26, 2024 at 1:39 PM Dave Marion <dlmar...@apache.org> wrote:

> Thanks for reporting this. Based on the information you provided I was able
> to create https://github.com/apache/accumulo/pull/4838. It appears that the
> Manager, Monitor, and SimpleGarbageCollector are creating multiple instances
> of ServiceLock when in a loop waiting to acquire the lock (when they are the
> standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper
> client, which is likely causing the problem you are having. The Manager and
> Monitor operate a little differently and thus do not exhibit the same OOME
> problem.
>
> On 2024/08/26 12:13:50 Craig Portoghese wrote:
> > Wasn't sure if this was bug territory or an issue with cluster
> > configuration.
> >
> > In my dev environment, I have a 5-server AWS EMR cluster using Accumulo
> > 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high
> > availability mode so there are 3 primary nodes with Zookeeper running. On
> > the primary nodes I run the manager, monitor, and gc processes. On the 2
> > core nodes (with DataNode on them) I run just tablet servers.
> >
> > The manager and monitor processes on the 2nd and 3rd servers are fine, no
> > problems about not being the leader for their process. However, the 2nd and
> > 3rd GC processes will repeatedly complain in a DEBUG "Failed to acquire
> > lock". It will complain that there is already a gc lock, and then create an
> > ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of
> > this complaint loop, it will turn into an error "Called
> > determineLockOwnership() when ephemeralNodeName == null", which it spams
> > forever, filling up the server and eventually killing the server.
> >
> > This has happened in multiple environments. Is it an issue with GC's
> > ability to hold elections? Should I be putting the standby GC processes on
> > a different node than the one running one of the zookeepers?
> > Below are samples of the two log types:
> >
> > 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> > 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> > 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> > 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
> >
> > 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> > java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
> >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
> >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]