I wasn't sure if this was bug territory or an issue with cluster
configuration.

In my dev environment, I have a 5-server AWS EMR cluster running Accumulo
2.1.2, Hadoop 3.3.6, and ZooKeeper 3.5.10. The cluster is in
high-availability mode, so there are 3 primary nodes, each running
ZooKeeper. On the primary nodes I run the manager, monitor, and gc
processes. On the 2 core nodes (which also run DataNode) I run just the
tablet servers.
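
For reference, the intended layout is roughly what Accumulo 2.1's
conf/cluster.yaml would express (EMR handles process placement itself, so
this is only a sketch of which process runs where; every hostname except
coreNode1, which appears in the logs below, is a placeholder):

manager:
  - primaryNode1.example.domain
  - primaryNode2.example.domain
  - primaryNode3.example.domain

monitor:
  - primaryNode1.example.domain
  - primaryNode2.example.domain
  - primaryNode3.example.domain

gc:
  - primaryNode1.example.domain
  - primaryNode2.example.domain
  - primaryNode3.example.domain

tserver:
  - coreNode1.example.domain
  - coreNode2.example.domain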

The manager and monitor processes on the 2nd and 3rd servers are fine; they
don't complain about not being the leader for their role. However, the 2nd
and 3rd GC processes repeatedly log "Failed to acquire lock" at DEBUG. Each
attempt notes that another process already holds the gc lock, then creates
a new ephemeral node: #0000000001, then #0000000002, and so on. After about
8 hours of this retry loop, it turns into the error "Called
determineLockOwnership() when ephemeralNodeName == null", which is logged
nonstop, filling up the server and eventually killing it.
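
To see what the contention looks like from ZooKeeper's side, the candidate
lock nodes under gc/lock can be listed with a plain ZooKeeper client, along
the lines of the sketch below (the connect string is a placeholder; the
instance id is the one from my logs):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class GcLockInspector {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string; point it at one of the ZooKeeper servers.
    String connect = "primaryNode1.example.domain:2181";
    // Instance id taken from the log lines below.
    String lockPath = "/accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock";

    ZooKeeper zk = new ZooKeeper(connect, 30_000, event -> { });
    try {
      // Each GC candidate creates a sequential ephemeral child here. The
      // child with the lowest sequence number holds the lock; each other
      // candidate should be watching the child immediately before its own.
      List<String> children = zk.getChildren(lockPath, false);
      Collections.sort(children);
      children.forEach(System.out::println);
    } finally {
      zk.close();
    }
  }
}

When the loop is running, the active GC's zlock#...#0000000000 node is
always present, and the standby's higher-numbered node appears and is then
deleted, which matches the "deleting all at path" DEBUG line below.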

This has happened in multiple environments. Is it an issue with the GC's
ability to hold elections? Should I be putting the standby GC processes on
nodes other than the ones running ZooKeeper? Below are samples of the two
log types:

2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to
acquire ZooKeeper lock for garbage collector
2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer
ThriftMetrics initialize
2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure
custom half-async Thrift server
2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage
collector listening on coreNode1.example.domain:9998
2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG:
[zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node
/accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
created
2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG:
[zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on
/accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
[zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process
with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
[zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior
node
/accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG:
[zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in
tryLock(), deleting all at path:
/accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC
ZooKeeper lock, will retry

2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling
watcher
java.lang.IllegalStateException: Called determineLockOwnership() when
ephemeralNodeName == null
        at
org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274)
~[accumulo-core-2.1.2.jar:2.1.2]
        at
org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354)
~[accumulo-core-2.1.2.jar:2.1.2]
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532)
~[zookeeper-3.5.10.jar:3.5.10--1]
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
~[zookeeper-3.5.10.jar:3.5.10--1]
