Hey Helix folks,

We ran into a fun issue recently.  Between the release of Apache Helix
v1.0.3 on April 14 and the release of v1.0.4 on June 9, it looks like a
backward-incompatible change may have been introduced on June 3rd that
makes Helix v1.0.4 not work correctly on Zookeeper 3.4.x clusters.

I acknowledge that Zookeeper 3.4.x reached end-of-life on June 1st, 2020 (
https://lists.apache.org/thread/xckr6nnsg9rxchkbvltkvt7hr2d0mhbo), and
that certainly factors in, but it's the version our organization is
currently supporting.  So unfortunately we're stuck between a rock and a
hard place at the moment:
- We can't go back to v1.0.2 because it lacks the Log4j fixes
- We can't use v1.0.3 due to the corruption issue
- We can't move ahead to v1.0.4 due to the compatibility issue with
  Zookeeper

I have a fork we were previously using (
https://github.com/brentwritescode/helix/releases/tag/1.0.2-with-log4j-2.17.1),
but that's not a long-term solution either.

The issue is a bit subtle.  From v1.0.2 to v1.0.3, the org.apache.zookeeper
version requirement in the helix/zookeeper-api module was bumped from
3.4.13 to 3.5.9:
- v1.0.2:
https://github.com/apache/helix/blob/c219050f8dc02c25451493f96575b56fabbf2c1e/zookeeper-api/pom.xml#L58
- v1.0.3:
https://github.com/apache/helix/blob/46b705f7d47990fa7bf1feeb6c64457e3d80af22/zookeeper-api/pom.xml#L54
That version bump, in and of itself, was not a breaking change.

And then from v1.0.3 to v1.0.4, some code changes were introduced in this
PR (https://github.com/apache/helix/pull/2138/files) that relied
specifically on that 3.5.x Zookeeper version.  For example, the "import
org.apache.zookeeper.AsyncCallback.Create2Callback" that was added to
"helix/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java"
in that PR introduces a backward-incompatible change.
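
For anyone who wants to check which side of this they're on, here is a
minimal sketch (not anything from the Helix codebase) that probes the
classpath for the Create2Callback interface named above.  The class name
ZkClientApiCheck and the helper isClassPresent are hypothetical names I
made up for illustration.  Note that this only tells you whether the
Zookeeper client jar on your classpath exposes the 3.5.x API; it says
nothing about what the Zookeeper server you connect to supports.

```java
public final class ZkClientApiCheck {
    // Binary name of the nested interface imported by ZkAsyncCallbacks.java.
    static final String CREATE2_CALLBACK =
        "org.apache.zookeeper.AsyncCallback$Create2Callback";

    // Hypothetical helper: returns true if the named class can be loaded
    // from this class's class loader, without initializing it.
    static boolean isClassPresent(String binaryName) {
        try {
            Class.forName(binaryName, false,
                ZkClientApiCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassPresent(CREATE2_CALLBACK)) {
            System.out.println(
                "Create2Callback found: client jar exposes the 3.5.x API");
        } else {
            System.out.println(
                "Create2Callback not found: client jar is older or missing");
        }
    }
}
```

Running this with the same classpath as the application gives a quick
read on which Zookeeper client jar is actually being resolved.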

So the net result is that, unfortunately, there has been a drift over the
past two versions (from v1.0.2 to v1.0.4) that has rendered Zookeeper 3.4.x
clusters incompatible with Apache Helix.

I wanted to post this here:

1.  To see if you were all aware of it (since it may hit other customers as
well and we were a bit blind-sided by it)
2.  To see if you had any ideas on how to work with/around this

Our long-term plan will obviously be to get on newer Zookeeper clusters as
soon as we can, but that's likely not going to be a quick turn-around for
us.  In the short-term we'll need to revert to our v1.0.2 fork.

Does the team happen to have any other comments or suggestions on dealing
with this issue?  Is this correctable at the project level (I suspect that
will be tough)?

Thanks much!

~Brent
