Hi Akshesh,

How did you set up your resource? I notice it is in CUSTOMIZED mode. According to this page <https://helix.apache.org/0.9.7-docs/tutorial_rebalance.html>, in CUSTOMIZED mode both the replica location and the state are defined by the application instead of Helix. I think you should use FULL_AUTO, or at least SEMI_AUTO, and then try again.

Best Regards,
Jiajun

On Mon, Jun 1, 2020 at 10:34 PM Akshesh Doshi <[email protected]> wrote:

> Hi Helix community
>
> Nice to e-meet you guys. I am pretty new to this project and it is my
> first time writing to this mailing list - I apologize in advance for any
> mistakes.
>
> I am trying to implement a system's state model requirement here but am
> not able to achieve it. Hoping someone here can point me in the right
> direction.
>
>
> GOAL
> My system is a typical multi-node + multi-resource system with the
> following properties:
> 1. Any partition should have one & only one *online* replica at any
> given point of time.
> 2. The ONLINE -> OFFLINE transition is not instantaneous (typically takes
> minutes).
> 3. Offline partitions have no special role - they can be dropped as soon
> as they become offline.
>
> If it helps in understanding better, my application is a tool which copies
> data from Kafka to Hadoop.
> And having two ONLINE partitions at the same time means I am duplicating
> this data in Hadoop.
>
>
> WHAT I HAVE TRIED
> I was able to successfully modify the Quickstart
> <https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/examples/Quickstart.java>
> script to imitate my use-case so I believe Helix can handle this scenario.
> But when I do it in my application I see that Helix fires the ONLINE ->
> OFFLINE & OFFLINE -> ONLINE transitions (to the corresponding 2 nodes)
> almost simultaneously. I want Helix to signal "ONLINE -> OFFLINE", then
> wait until the partition goes offline and only then fire the "OFFLINE ->
> ONLINE" transition to the new upcoming node.
> I have implemented my *@Transition(from = "ONLINE", to = "OFFLINE")* function
> in such a way that it waits for the partition to go offline (using
> *latch.await()*
> <https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CountDownLatch.html#await-->)
> and only then returns (I have confirmed this from application logs).
>
> My application is different from my Quickstart app in the following ways
> (or at least, these are the ones known to me; I am building upon someone
> else's project so there might be code that I am not aware of):
> 1. The rebalancing algo is *not* AUTO - I am using my own custom logic to
> distribute partitions among nodes
> 2. I have enabled nodes to auto-join i.e.
> *props.put(ZKHelixManager.ALLOW_PARTICIPANT_AUTO_JOIN,
> String.valueOf(true));*
> Is it possible for me to achieve this system with these settings enabled?
>
>
> DEBUG LOGS / CODE
> If it helps, this is what I see in Zookeeper after adding a 2nd node to my
> cluster which had 1 node with 1 resource with 6 partitions -
> https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt
> As you can see
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt-L15>,
> there are a few partitions which have 2 ONLINE replicas at the same time
> (after a while the draining replica goes away but in that duration, my data
> gets duplicated, which is the problem I want to overcome). I cannot
> understand how this is possible when I have set up these bounds
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java-L36>
> in my model definition
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java>.
>
>
> I would really appreciate it if anyone here could give me any clues as to
> what I might be doing wrong (or whether what I am trying to achieve is even
> possible with Helix).
>
> Thank you so much for building such a wonderful tool and having this
> mailing list to help us out.
>
>
> Regards
> Akshesh Doshi
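
For readers following the thread, the blocking ONLINE -> OFFLINE handler described in the quoted message can be sketched roughly as below. This is only an illustration in plain Java: `DrainingPartition`, `markDrained` and `onBecomeOfflineFromOnline` are hypothetical names, not Helix classes or the poster's actual code, and the timeout is an added assumption (the original message mentions a plain `latch.await()`).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for one replica's state model; not a Helix class. */
class DrainingPartition {
    // Released once the application has finished flushing in-flight data.
    private final CountDownLatch drained = new CountDownLatch(1);
    private volatile String state = "ONLINE";

    /** Called by the application's worker when the drain has completed. */
    void markDrained() {
        drained.countDown();
    }

    /**
     * Body of the ONLINE -> OFFLINE handler: block until the drain finishes
     * or the (assumed) timeout expires. Returns true only if the replica
     * actually reached OFFLINE.
     */
    boolean onBecomeOfflineFromOnline(long timeoutMillis) {
        try {
            if (!drained.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // still draining, state stays ONLINE
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
        state = "OFFLINE";
        return true;
    }

    String state() {
        return state;
    }
}
```

In a real participant this logic would sit inside the method annotated with `@Transition(from = "ONLINE", to = "OFFLINE")`. Blocking there only delays the transition on that participant; whether the OFFLINE -> ONLINE message to the other node is held back until then is decided by the controller, which is the question discussed in this thread.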
