I ran into what looks like a deadlock in blockUntilConnected and wanted to
give a high-level description in case someone can help me debug the issue.
I can try to put together a reproducible example, but for reasons that will
become apparent, that isn't straightforward.

I am using Curator within a custom Kafka Connect source. As a result, I
have one Connect worker process on each of 11 nodes, with up to 12 tasks
(threads) per node, each with its own Curator client. Every node is also
running ZooKeeper, so I initialize the Curator clients by pointing them at
localhost:2181. On 9 nodes everything works perfectly, but on the other 2,
all tasks appear to hang in blockUntilConnected (specifically here:
https://github.com/apache/curator/blob/ae309a29643afc6df511d1d9a162526ce474598b/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L224).
I found this after noticing no activity in my Kafka Connect logs and
grabbing stack traces with jstack on the offending nodes.
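
For context, the per-task setup boils down to something like the sketch
below. This is a simplification of the actual connector code; the retry
policy is a placeholder, but the overall shape (many threads, each starting
its own client against the local ZooKeeper) is accurate:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class PerTaskClients {
        public static void main(String[] args) {
            // Roughly what each node ends up doing: every Kafka Connect task
            // thread builds, starts, and blocks on its own Curator client
            // against the local ZooKeeper (up to 12 task threads per node).
            for (int i = 0; i < 12; i++) {
                new Thread(() -> {
                    CuratorFramework client = CuratorFrameworkFactory.newClient(
                            "localhost:2181", new ExponentialBackoffRetry(1000, 3));
                    client.start();
                    try {
                        // On the two bad nodes, every task is parked here.
                        client.blockUntilConnected();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    // ... the task then goes on to use the client ...
                }).start();
            }
        }
    }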

I also wrote a small test program (sketched below) that just initializes a
client and calls blockUntilConnected (nothing else) and ran it on one of
the affected nodes while the tasks were hung; it hangs in the same place
indefinitely. Meanwhile, zookeeper-shell against localhost works just fine,
and if I point a Curator client at one of the other nodes (not localhost),
it connects without any problem.
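
The test program is essentially just this (the retry policy here is an
arbitrary placeholder):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class BlockUntilConnectedTest {
        public static void main(String[] args) throws Exception {
            // Single client, nothing else: start it and wait for the
            // initial connection to the local ZooKeeper.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "localhost:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();
            System.out.println("waiting for connection...");
            client.blockUntilConnected();   // never returns on the two bad nodes
            System.out.println("connected");
            client.close();
        }
    }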

Could this be a deadlock caused by initializing multiple Curator clients
concurrently from different threads in the same process?
