Re: Long running jobs and node drain

santosh gujar Mon, 11 May 2020 21:03:00 -0700

Yes, there would be a database.
So far I have the following state model for the partition:
OFFLINE->STARTUP->UP->DRAIN->OFFLINE. But I don't know how to express
the following:
1. How to trigger a drain (for example, when we decide to take a node out
for maintenance).
2. Once a drain has started, I expect the Helix rebalancer to kick in and
bring the partition up simultaneously on another node in STARTUP state.
3. Once all jobs on node1 are done, I need a manual way to move it to
OFFLINE and move the other partition to the UP state.
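
The three steps above can be pictured with a small plain-Python sketch.
It is not the Helix API: Replica, start_drain and finish_drain are
hypothetical names used only for illustration, and running_jobs stands in
for whatever count the job database would report.

```python
# Hypothetical sketch in plain Python; none of these names come from Helix.
# Allowed transitions, taken from OFFLINE->STARTUP->UP->DRAIN->OFFLINE.
ALLOWED = {
    "OFFLINE": {"STARTUP"},
    "STARTUP": {"UP"},
    "UP": {"DRAIN"},
    "DRAIN": {"OFFLINE"},
}


class Replica:
    """One copy of partition P hosted on one node."""

    def __init__(self, node, state="OFFLINE"):
        self.node = node
        self.state = state

    def transition(self, target):
        # Reject any move that the state model does not list.
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.node}: {self.state} -> {target} not allowed")
        self.state = target


def start_drain(old, new):
    # Steps 1 and 2: the old node starts draining while the new node
    # brings the same partition up in STARTUP.
    old.transition("DRAIN")
    new.transition("STARTUP")


def finish_drain(old, new, running_jobs):
    # Step 3: only when no jobs remain on the old node does it go
    # OFFLINE and the new node move to UP.
    if running_jobs > 0:
        raise RuntimeError(f"{running_jobs} job(s) still running on {old.node}")
    old.transition("OFFLINE")
    new.transition("UP")
```

For example, with n1 = Replica("N1", "UP") and n2 = Replica("N2"),
calling start_drain(n1, n2) leaves N1 in DRAIN and N2 in STARTUP, and
finish_drain(n1, n2, 0) leaves N1 in OFFLINE and N2 in UP.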


It might be possible that my thinking about how to fit this into the
Helix model is entirely wrong, but essentially the above is the sequence
I want to achieve. Any pointers will be of great help. The constraint is
that these are long-running jobs that cannot be moved immediately to
another node.

Regards,
Santosh

On Tue, May 12, 2020 at 1:25 AM kishore g <[email protected]> wrote:

> I was thinking exactly in that direction - having two states is the right
> thing to do. Before we get there, one more question -
>
> - When you get a request for a job, how do you know if that job is old or
> new? Is there a database that provides the mapping between job and node?
>
> On Mon, May 11, 2020 at 12:44 PM santosh gujar <[email protected]>
> wrote:
>
>> Thank You Kishore,
>>
>> During the drain process N2 will start new jobs; requests related to old
>> jobs need to go to N1 and requests for new jobs need to go to N2. Thus,
>> during the drain on N1, the partition could be present on both nodes.
>>
>> My current thinking is that in Helix I somehow need to model it
>> as partition P with two different states on these two nodes, e.g. N1
>> could have partition P in DRAIN state and N2 could have partition P in
>> STARTUP state.
>> I don't know if my thinking about states is correct, but I am looking
>> for any pointers.
>>
>> Regards
>> Santosh
>>
>> On Tue, May 12, 2020 at 1:01 AM kishore g <[email protected]> wrote:
>>
>>> What happens to requests during the drain process, i.e. when you put N1
>>> out of service and while N2 is waiting for N1 to finish the jobs, where
>>> will the requests for P go: N1 or N2?
>>>
>>> On Mon, May 11, 2020 at 12:19 PM santosh gujar <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am looking for some clues or inputs on how to achieve the following.
>>>>
>>>> I am working on a service that involves running stateful long-running
>>>> jobs on a node. These long-running jobs cannot be preempted and
>>>> continued on other nodes.
>>>>
>>>> Problem requirements:
>>>> 1. In Helix nomenclature, let's say there is a Helix partition P that
>>>> involves J such jobs running on a node (N1).
>>>> 2. When I put the node into drain, I want Helix to assign a new node
>>>> (N2) to this partition, so that P is also started on N2.
>>>> 3. N1 can be put out of service only when all running jobs (J) on it
>>>> are over; at this point only N2 will serve requests for P.
>>>>
>>>> Questions:
>>>> 1. Can the drain process be modeled using Helix?
>>>> 2. If yes, is there any recipe / pointers for a Helix state model?
>>>> 3. Is there any custom way to trigger state transitions? From the
>>>> documentation, I gather that the Helix controller in full-auto mode
>>>> triggers state transitions only when the number of partitions changes
>>>> or the cluster changes (node addition or deletion).
>>>> 4. I guess a spectator will be needed for custom routing logic in such
>>>> cases; any pointers for the same?
>>>>
>>>> Thank You
>>>> Santosh
>>>>
>>>
