Hi Hunter,

Due to various limitations and constraints at this moment, I cannot go down the path of the Task Framework.

Thanks,
Santosh
On Tue, May 12, 2020 at 7:23 PM Hunter Lee <[email protected]> wrote:

> Alternative idea:
>
> Have you considered using Task Framework's targeted jobs for this use case? You could make the jobs long-running, and this way you save yourself the trouble of having to implement the routing layer (simply specifying which partition to target in your JobConfig would do it).
>
> Task Framework doesn't actively terminate running threads on the worker (Participant) nodes, so you could achieve the effect of "draining" a node by letting previously assigned tasks finish, i.e. by not actively canceling them in your cancel() logic.
>
> Hunter
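A minimal sketch of the pattern Hunter describes, written against the Helix Task Framework API; the class, resource, and partition names are placeholders, and the "drain" effect comes entirely from leaving cancel() empty:

    import java.util.Collections;
    import org.apache.helix.task.JobConfig;
    import org.apache.helix.task.Task;
    import org.apache.helix.task.TaskResult;

    // A long-running task that drains rather than being killed: Task Framework
    // invokes cancel() but does not interrupt the worker thread, so an empty
    // cancel() lets in-flight work run to completion.
    public class LongRunningJobTask implements Task {
      private final Runnable job; // the actual long-running work (placeholder)

      public LongRunningJobTask(Runnable job) {
        this.job = job;
      }

      @Override
      public TaskResult run() {
        job.run(); // may take hours; nothing here preempts it
        return new TaskResult(TaskResult.Status.COMPLETED, "job finished");
      }

      @Override
      public void cancel() {
        // Intentionally a no-op: previously assigned tasks finish naturally,
        // which produces the draining behavior Hunter describes.
      }

      // A targeted job pinned to one partition of an existing resource, per
      // Hunter's suggestion (resource/partition/command names are placeholders).
      public static JobConfig.Builder targetedJobConfig() {
        return new JobConfig.Builder()
            .setTargetResource("MyResource")
            .setTargetPartitions(Collections.singletonList("MyResource_0"))
            .setCommand("RunLongJob");
      }
    }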
> On Tue, May 12, 2020 at 1:02 AM santosh gujar <[email protected]> wrote:
>
>> Hi Lei,
>>
>> Thanks a lot for your time and response.
>>
>> Some more context about the Helix partition that I mentioned in my earlier email: my thinking is to map multiple long-running jobs to a Helix partition by running some hash function over the job (the simplest being taking a mod of the job id).
>>
>> "What exactly do you need to do to bring a job from OFFLINE to STARTUP?"
>> I added STARTUP to track the fact that a partition could be hosted on two nodes simultaneously; I doubt the OFFLINE->UP->OFFLINE model can give me that information.
>>
>> "Once the job (partition) on node-1 goes OFFLINE, Helix will bring up the job on node-2 (OFFLINE->UP)."
>> I think it may not work in my case. Here is how I see the implications:
>> 1. While node-1 is draining, old jobs continue to run, but I want new jobs (for the same partition) to be hosted by the new node. Think of it as a partition moving from one node to the other, but over a long time (hours), determined by when all existing jobs running on node-1 finish.
>> 2. As per your suggestion, node-2 serves the partition only when node-1 is offline, so it cannot satisfy point 1 above.
>> One workaround is to handle the UP->OFFLINE transition event in the application, save the information about node-1 somewhere, and use it later to distinguish old jobs from new jobs. But this information would be stored outside Helix, and I wanted to avoid that. What attracted me to Helix is its auto-rebalancing capability and its central storage for cluster state, which I can use for my routing logic.
>> 3. A job could run for hours, so a drain can go on for a long time.
>>
>> "How long would you expect OFFLINE->UP to take here? If it is fast, the switch should be fast."
>> OFFLINE->UP is fast. As I describe above, it's the drain on the previously serving node that is slow; the existing jobs cannot be preempted to move to the new node.
>>
>> Regards,
>> Santosh
>>
>> On Tue, May 12, 2020 at 10:40 AM Lei Xia <[email protected]> wrote:
>>
>>> Hi, Santosh
>>>
>>> One question: what exactly do you need to do to bring a job from OFFLINE to STARTUP? Can we simply use an OFFLINE->UP->OFFLINE model? On OFFLINE->UP you get the job started and ready to serve requests; on UP->OFFLINE you block until the job gets drained.
>>>
>>> With this state model, you can start to drain a node by disabling it. Once a node is disabled, Helix will send the UP->OFFLINE transition to all partitions on that node; in your implementation of the UP->OFFLINE transition, you block until the job completes. Once the job (partition) on node-1 goes OFFLINE, Helix will bring up the job on node-2 (OFFLINE->UP). Does this work for you? How long would you expect OFFLINE->UP to take here? If it is fast, the switch should be fast.
>>>
>>>
>>> Lei
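A minimal sketch of the blocking state model Lei proposes, using Helix's participant state-model annotations; awaitJobsFor is a hypothetical helper that tracks the jobs running for a partition:

    import java.util.concurrent.CountDownLatch;
    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    @StateModelInfo(initialState = "OFFLINE", states = {"UP", "OFFLINE"})
    public class DrainOnOfflineStateModel extends StateModel {

      @Transition(to = "UP", from = "OFFLINE")
      public void onBecomeUpFromOffline(Message message, NotificationContext context) {
        // Fast path: start accepting new jobs for this partition.
      }

      @Transition(to = "OFFLINE", from = "UP")
      public void onBecomeOfflineFromUp(Message message, NotificationContext context) {
        try {
          // Block until every job mapped to this partition has finished.
          // Helix will not bring the partition UP on another node until
          // this transition returns, which gives Lei's drain semantics.
          awaitJobsFor(message.getPartitionName()).await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      // Hypothetical helper: a latch that opens once the partition's jobs finish.
      private CountDownLatch awaitJobsFor(String partition) {
        return new CountDownLatch(0); // placeholder
      }
    }

Note that, as Santosh observes above, this model keeps the partition on only one node at a time: node-2 comes UP only after node-1 finishes draining, so new jobs cannot be routed to node-2 during the drain.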
>>> On Mon, May 11, 2020 at 9:02 PM santosh gujar <[email protected]> wrote:
>>>
>>>> Yes, there would be a database.
>>>> So far I have the following state model for a partition: OFFLINE->STARTUP->UP->DRAIN->OFFLINE. But I don't know how to express the following:
>>>> 1. How to trigger DRAIN (this is for when, for example, we decide to take a node out for maintenance).
>>>> 2. Once a drain has started, I expect the Helix rebalancer to kick in and simultaneously bring the partition up on another node in the STARTUP state.
>>>> 3. Once all jobs on node-1 are done, I need a manual way to move it to OFFLINE and move the partition on the other node to the UP state.
>>>>
>>>> It might be that my thinking about how to fit this into the Helix model is entirely wrong, but essentially the above is the sequence I want to achieve. Any pointers will be of great help. The constraint is that these are long-running jobs that cannot be moved immediately to another node.
>>>>
>>>> Regards,
>>>> Santosh
>>>>
>>>> On Tue, May 12, 2020 at 1:25 AM kishore g <[email protected]> wrote:
>>>>
>>>>> I was thinking exactly in that direction; having two states is the right thing to do. Before we get there, one more question:
>>>>>
>>>>> - When you get a request for a job, how do you know if that job is old or new? Is there a database that provides the mapping between job and node?
>>>>>
>>>>> On Mon, May 11, 2020 at 12:44 PM santosh gujar <[email protected]> wrote:
>>>>>
>>>>>> Thank you, Kishore.
>>>>>>
>>>>>> During the drain process N2 will start new jobs; requests related to old jobs need to go to N1, and requests for new jobs need to go to N2. Thus, during a drain on N1, the partition could be present on both nodes.
>>>>>>
>>>>>> My current thinking is that in Helix I somehow need to model this as partition P having two different states on these two nodes, e.g. N1 could have partition P in the DRAIN state while N2 has partition P in the STARTUP state.
>>>>>> I don't know if my thinking about states is correct, but I am looking for any pointers.
>>>>>>
>>>>>> Regards,
>>>>>> Santosh
>>>>>>
>>>>>> On Tue, May 12, 2020 at 1:01 AM kishore g <[email protected]> wrote:
>>>>>>
>>>>>>> What happens to requests during the drain process, i.e. when you put N1 out of service and N2 is waiting for N1 to finish the jobs: where will the requests for P go, N1 or N2?
>>>>>>>
>>>>>>> On Mon, May 11, 2020 at 12:19 PM santosh gujar <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am looking for some clues or inputs on how to achieve the following.
>>>>>>>>
>>>>>>>> I am working on a service that involves running stateful, long-running jobs on a node. These long-running jobs cannot be preempted and continued on other nodes.
>>>>>>>>
>>>>>>>> Problem requirements:
>>>>>>>> 1. In Helix nomenclature, let's say a Helix partition P involves J such jobs running on a node (N1).
>>>>>>>> 2. When I put the node in a drain, I want Helix to assign a new node to this partition, so that P is also started on the new node (N2).
>>>>>>>> 3. N1 can be put out of service only when all running jobs (J) on it are over; at that point only N2 will serve requests for P.
>>>>>>>>
>>>>>>>> Questions:
>>>>>>>> 1. Can the drain process be modeled using Helix?
>>>>>>>> 2. If yes, is there any recipe or pointer for a Helix state model?
>>>>>>>> 3. Is there any custom way to trigger state transitions? From the documentation, I gather that the Helix controller in full-auto mode triggers state transitions only when the number of partitions changes or the cluster changes (node addition or deletion).
>>>>>>>> 4. I guess a spectator will be needed for custom routing logic in such cases; any pointers for the same?
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>> Santosh
>>>
>>> --
>>> Lei Xia
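Question 2 above asks for a recipe for a Helix state model. A minimal sketch of how the OFFLINE->STARTUP->UP->DRAIN->OFFLINE model discussed in this thread might be declared with Helix's StateModelDefinition.Builder; the definition name, priorities, and bounds are assumptions, not a tested recipe:

    import org.apache.helix.model.StateModelDefinition;

    public class DrainStateModelDef {
      public static StateModelDefinition build() {
        StateModelDefinition.Builder builder =
            new StateModelDefinition.Builder("StartupUpDrainOffline"); // hypothetical name
        builder.initialState("OFFLINE");
        // Lower number = higher priority when the rebalancer computes assignments.
        builder.addState("UP", 1);
        builder.addState("STARTUP", 2);
        builder.addState("DRAIN", 3);
        builder.addState("OFFLINE", 4);
        builder.addState("DROPPED", 5);
        // Legal transitions for the drain flow described in the thread.
        builder.addTransition("OFFLINE", "STARTUP", 1);
        builder.addTransition("STARTUP", "UP", 2);
        builder.addTransition("UP", "DRAIN", 3);
        builder.addTransition("DRAIN", "OFFLINE", 4);
        builder.addTransition("OFFLINE", "DROPPED", 5);
        // At most one replica serves (UP) while another may warm up (STARTUP)
        // on the new node and the old node sits in DRAIN.
        builder.upperBound("UP", 1);
        builder.upperBound("STARTUP", 1);
        builder.upperBound("DRAIN", 1);
        return builder.build();
      }
    }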

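And a minimal sketch of the spectator-based routing raised in question 4, using Helix's RoutingTableProvider; the cluster, resource, and instance names are placeholders, and the job-to-partition mapping uses the mod hash mentioned earlier in the thread:

    import java.util.List;
    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.model.InstanceConfig;
    import org.apache.helix.spectator.RoutingTableProvider;

    public class JobRouter {
      public static void main(String[] args) throws Exception {
        // Connect as a SPECTATOR; a router reads cluster state but hosts no partitions.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            "JobCluster", "router-1", InstanceType.SPECTATOR, "localhost:2181");
        manager.connect();

        // RoutingTableProvider keeps an up-to-date partition -> state -> instances view.
        RoutingTableProvider routingTable = new RoutingTableProvider();
        manager.addExternalViewChangeListener(routingTable);

        // Map a job to a partition by taking a mod of the job id.
        long jobId = 42L;
        int numPartitions = 8;
        String partition = "JobResource_" + (jobId % numPartitions);

        // Requests for old jobs go to the draining node; new jobs go to the UP node.
        List<InstanceConfig> draining = routingTable.getInstances("JobResource", partition, "DRAIN");
        List<InstanceConfig> serving = routingTable.getInstances("JobResource", partition, "UP");
        System.out.println("drain hosts: " + draining + "; up hosts: " + serving);
      }
    }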