Hi Akshesh,

How did you set up your resource? I notice it is in CUSTOMIZED mode.
According to this page
<https://helix.apache.org/0.9.7-docs/tutorial_rebalance.html>, in that mode
both the replica placement and the replica state are defined by the
application rather than by Helix. I think you should use FULL_AUTO, or at
least SEMI_AUTO, and then try again.

Best Regards,
Jiajun


On Mon, Jun 1, 2020 at 10:34 PM Akshesh Doshi <[email protected]>
wrote:

> Hi Helix community
>
> Nice to e-meet you guys. I am pretty new to this project and it is my
> first time writing to this mailing list - I apologize in advance for any
> mistakes.
>
> I am trying to implement a system's state model requirement here but am
> not able to achieve it. Hoping anyone here could point me in the right
> direction.
>
>
> GOAL
> My system is a typical multi-node + multi-resource system with the
> following properties:
> 1. Any partition should have one & only one *online* replica at any
> given point of time.
> 2. The ONLINE -> OFFLINE transition is not instantaneous (typically takes
> minutes).
> 3. Offline partitions have no special role - they can be dropped as soon
> as they become offline.
>
> If it helps in understanding better, my application is a tool which copies
> data from Kafka to Hadoop.
> And having two ONLINE partitions at the same time means I am duplicating
> this data in Hadoop.
>
>
> WHAT I HAVE TRIED
> I was able to successfully modify the Quickstart script
> <https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/examples/Quickstart.java>
> to imitate my use-case, so I believe Helix can handle this scenario.
> But when I do it in my application I see that Helix fires the ONLINE ->
> OFFLINE & OFFLINE -> ONLINE transitions (to the corresponding 2 nodes)
> almost simultaneously. I want Helix to signal "ONLINE -> OFFLINE", then
> wait until the partition goes offline and only then fire the "OFFLINE ->
> ONLINE" transition to the new upcoming node.
> I have implemented my *@Transition(from = "ONLINE", to = "OFFLINE")* function
> in such a way that it waits for the partition to go offline (using
> *latch.await()*
> <https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CountDownLatch.html#await-->)
> and only then returns (I have confirmed this from application logs).
>
> My application is different from my Quickstart app in the following ways
> (or at least, these are the ones known to me, I am building upon someone
> else's project so there might be code that I am not aware of):
> 1. The rebalancing algo is *not* AUTO - I am using my own custom logic to
> distribute partitions among nodes
> 2. I have enabled nodes to auto-join i.e. 
> *props.put(ZKHelixManager.ALLOW_PARTICIPANT_AUTO_JOIN,
> String.valueOf(true));*
> Is it possible for me to achieve this system with these settings enabled?
>
>
> DEBUG LOGS / CODE
> If it helps, this is what I see in Zookeeper after adding a 2nd node to my
> cluster which had 1 node with 1 resource with 6 partitions -
> https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt
> As you can see
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-zookeeper-output-txt-L15>,
> there are a few partitions which have 2 ONLINE replicas at the same time
> (after a while the draining replica goes away but in that duration, my data
> gets duplicated, which is the problem I want to overcome). I cannot
> understand how this is possible when I have set up these bounds
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java-L36>
> in my model definition
> <https://gist.github.com/akki/1d80c97463198275b3abe39350688bda#file-onlineofflinestatemodel-java>.
>
>
>
> I would really appreciate it if anyone here could give me any clues as to
> what I might be doing wrong (or whether what I am trying to achieve is even
> possible with Helix).
>
> Thank you so much for building such a wonderful tool and having this
> mailing list to help us out.
>
>
> Regards
> Akshesh Doshi
>
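
For readers following the thread, the handoff ordering requested in the
quoted message (the old replica blocks in ONLINE -> OFFLINE until it has
drained, and only then does the new replica go ONLINE) can be sketched in
plain Java. This is a minimal illustration, not Helix code: the class
HandoffSketch, the Replica type, and the method names are hypothetical, and
the sequencing that a Helix controller would perform is replaced here by a
direct method call. It only shows the latch-based waiting pattern the message
describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class HandoffSketch {

    /** Hypothetical stand-in for one replica of a partition; not a Helix class. */
    static class Replica {
        private final String node;
        private final List<String> events;
        private final CountDownLatch drained = new CountDownLatch(1);

        Replica(String node, List<String> events) {
            this.node = node;
            this.events = events;
        }

        /** Called by the application's worker once in-flight data is flushed. */
        void markDrained() {
            events.add(node + " drained");
            drained.countDown();
        }

        /**
         * Mirrors a blocking ONLINE -> OFFLINE handler: it returns only after
         * the replica has drained, or gives up after the timeout.
         */
        boolean onBecomeOfflineFromOnline(long timeout, TimeUnit unit)
                throws InterruptedException {
            boolean ok = drained.await(timeout, unit);
            if (ok) {
                events.add(node + " OFFLINE");
            }
            return ok;
        }

        /** Mirrors an OFFLINE -> ONLINE handler on the new owner. */
        void onBecomeOnlineFromOffline() {
            events.add(node + " ONLINE");
        }
    }

    /** Runs one handoff from node1 to node2 and returns the ordered event log. */
    static List<String> handOff() throws InterruptedException {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        Replica oldOwner = new Replica("node1", events);
        Replica newOwner = new Replica("node2", events);

        // Assumption: draining happens on a separate application thread.
        Thread worker = new Thread(oldOwner::markDrained);
        worker.start();

        // The new owner is brought online only after the old one reports OFFLINE.
        if (oldOwner.onBecomeOfflineFromOnline(5, TimeUnit.SECONDS)) {
            newOwner.onBecomeOnlineFromOffline();
        }
        worker.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(handOff());
    }
}
```

In this sketch the second transition cannot start before the first one
returns because both are issued from the same thread. With Helix, whether the
controller enforces that ordering depends on the rebalance mode and the state
model constraints discussed above.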
