Hi Helix community,

Nice to e-meet you guys. I am pretty new to this project and this is my first time writing to this mailing list, so I apologize in advance for any mistakes.

I am trying to implement a state model requirement for my system but have not been able to achieve it. I am hoping someone here can point me in the right direction.

GOAL

My system is a typical multi-node + multi-resource system with the following properties:

1. Every partition should have one & only one *online* replica at any given point in time.
2. The ONLINE -> OFFLINE transition is not instantaneous (it typically takes minutes).
3. Offline partitions have no special role - they can be dropped as soon as they become offline.

If it helps in understanding better, my application is a tool which copies data from Kafka to Hadoop. Having two ONLINE replicas of the same partition at the same time means I am duplicating that data in Hadoop.

WHAT I HAVE TRIED

I was able to successfully modify the Quickstart <https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/examples/Quickstart.java> script to imitate my use-case, so I believe Helix can handle this scenario. But when I do the same in my application, I see that Helix fires the ONLINE -> OFFLINE and OFFLINE -> ONLINE transitions (to the 2 corresponding nodes) almost simultaneously.

I want Helix to signal "ONLINE -> OFFLINE", wait until the partition has actually gone offline, and only then fire the "OFFLINE -> ONLINE" transition on the new node.

I have implemented my *@Transition(from = "ONLINE", to = "OFFLINE")* function in such a way that it waits for the partition to go offline (using *latch.await()* <https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CountDownLatch.html#await-->) and only then returns. I have confirmed this from the application logs.

My application differs from my Quickstart app in the following ways (at least, these are the ones known to me; I am building upon someone else's project, so there might be code that I am not aware of):

1. The rebalancing algo is *not* AUTO - I am using my own custom logic to distribute partitions among nodes.
2. I have enabled nodes to auto-join, i.e.
   *props.put(ZKHelixManager.ALLOW_PARTICIPANT_AUTO_JOIN, String.valueOf(true));*

Is it possible for me to achieve this behaviour with these settings enabled?

DEBUG LOGS / CODE

If it helps, this is what I see in Zookeeper after adding a 2nd node to my cluster, which previously had 1 node with 1 resource with 6 partitions - https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt

As you can see <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt-L15>, there are a few partitions which have 2 ONLINE replicas at the same time. After a while the draining replica goes away, but in that duration my data gets duplicated, which is the problem I want to overcome.

I cannot understand how this is possible when I have set up these bounds <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java-L36> in my model definition <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java>.

I would really appreciate it if anyone here could give me any clues as to what I might be doing wrong (or whether what I am trying to achieve is even possible with Helix).

Thank you so much for building such a wonderful tool and for having this mailing list to help us out.

Regards,
Akshesh Doshi
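
P.S. In case it helps to see the shape of the blocking handler, here is a simplified, self-contained sketch of the latch pattern described above. It is not my actual code: the class *DrainingPartition*, its method names and the timeout are made up for illustration, and the Helix *@Transition* annotation is left out so that it compiles without Helix on the classpath. My real handler uses the untimed *latch.await()*; the timed variant here is only so the sketch cannot hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-partition state model (illustration only).
class DrainingPartition {
    // Released once the background copy for this partition has fully stopped.
    private final CountDownLatch drained = new CountDownLatch(1);
    private volatile String state = "ONLINE";

    // Called by the worker thread after it has flushed its data and stopped.
    void markDrained() {
        drained.countDown();
    }

    // In the real application this logic sits inside the method annotated with
    // @Transition(from = "ONLINE", to = "OFFLINE"). It blocks until the
    // partition has drained (or the timeout expires) and only then returns.
    boolean onBecomeOfflineFromOnline(long timeout, TimeUnit unit)
            throws InterruptedException {
        boolean finished = drained.await(timeout, unit);
        if (finished) {
            state = "OFFLINE";
        }
        return finished;
    }

    String state() {
        return state;
    }
}
```

The point of the sketch is only that the handler does not return before the partition has drained; whether the controller waits for that return before starting the other node's OFFLINE -> ONLINE transition is exactly what I am unsure about.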
