Wasn't sure if this was bug territory or an issue with cluster configuration.
In my dev environment, I have a 5-server AWS EMR cluster running Accumulo 2.1.2, Hadoop 3.3.6, and ZooKeeper 3.5.10. The cluster is in high availability mode, so there are 3 primary nodes with ZooKeeper running. On the primary nodes I run the manager, monitor, and GC processes. On the 2 core nodes (which also host the DataNodes) I run only tablet servers.

The manager and monitor processes on the 2nd and 3rd primary nodes are fine; they sit in standby without complaining about not holding the leader lock for their process. However, the 2nd and 3rd GC processes repeatedly log a DEBUG message, "Failed to acquire lock". Each one reports that there is already a GC lock, then creates a new ephemeral node: #0000000001, then #0000000002, and so on. After about 8 hours of this loop, it turns into an error, "Called determineLockOwnership() when ephemeralNodeName == null", which it logs endlessly, filling the server's disk and eventually killing the server. This has happened in multiple environments.

Is this an issue with the GC's ability to hold elections? Should I be putting the standby GC processes on a different node than the ones running ZooKeeper?

Below are samples of the two log types:

2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry

2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
        at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
        at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]
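For context on what the DEBUG lines above appear to be doing, here is a minimal, hypothetical sketch of the standard ZooKeeper ephemeral-sequential election recipe: each candidate creates a sequential ephemeral node under the lock path, and if it is not the lowest sequence it watches the node just before it. This is not Accumulo's actual ServiceLock code; the connect string, lock path, and "zlock#" prefix are placeholders borrowed from the logs for illustration.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GcLockSketch {

  public static void main(String[] args) throws Exception {
    // Placeholder path; the real path is /accumulo/<instance-id>/gc/lock as seen in the logs.
    String lockPath = "/accumulo/INSTANCE_ID/gc/lock";
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, (WatchedEvent e) -> {});

    // Create a candidate node; ZooKeeper appends an increasing suffix
    // (the #0000000001, #0000000002, ... counting up in the GC logs).
    String myNode = zk.create(lockPath + "/zlock#", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = myNode.substring(myNode.lastIndexOf('/') + 1);

    List<String> children = zk.getChildren(lockPath, false);
    Collections.sort(children);

    if (children.get(0).equals(myName)) {
      System.out.println("Lowest sequence number: this process holds the GC lock");
    } else {
      // Not the lowest: watch the node immediately before ours and wait for it to go away,
      // analogous to the "Establishing watch on prior node" DEBUG message.
      int idx = children.indexOf(myName);
      String prior = children.get(idx - 1);
      System.out.println("Lock held by " + prior + ", watching it");
      zk.exists(lockPath + "/" + prior,
          event -> System.out.println("Prior node changed: " + event));
    }
  }
}

In my case the standby GC never reaches the "watch and wait" state for long; instead it deletes its own node and retries, which matches the "Failed to acquire lock in tryLock(), deleting all at path" line above.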