Re: Long running jobs and node drain

Lei Xia Mon, 11 May 2020 22:11:00 -0700

Hi, Santosh

  One question: what exactly do you need to do to bring a job from OFFLINE
to STARTUP? Can we simply use an OFFLINE->UP->OFFLINE model? In OFFLINE->UP
you get the job started and ready to serve requests. In UP->OFFLINE you
block until the jobs are drained.


 With this state model, you can start draining a node by disabling it. Once
a node is disabled, Helix will send the UP->OFFLINE transition to all
partitions on that node. In your implementation of the UP->OFFLINE
transition, you block until the job completes. Once the job (partition) on
node-1 goes OFFLINE, Helix will bring up the job on node-2 (OFFLINE->UP).
Does this work for you? How long would you expect OFFLINE->UP to take here?
If it is fast, the switch should be fast.
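
To make the blocking idea concrete, here is a rough sketch in plain Java. It
does not use the Helix API at all: JobPartition, drainAndMove and the method
names are hypothetical stand-ins I made up for illustration, and the latch
simply models "wait until the running jobs are done". In a real participant
the same waiting logic would go into your UP->OFFLINE transition handler.

```java
import java.util.concurrent.CountDownLatch;

public class DrainSketch {
    /** Hypothetical stand-in for one replica of a partition on a node (not a Helix class). */
    static class JobPartition {
        private final String node;
        private volatile String state = "OFFLINE";
        private volatile CountDownLatch runningJobs = new CountDownLatch(0);

        JobPartition(String node) { this.node = node; }

        /** OFFLINE->UP: start serving; jobCount is the number of jobs now running. */
        void onBecomeUpFromOffline(int jobCount) {
            runningJobs = new CountDownLatch(jobCount);
            state = "UP";
        }

        /** Called by a job when it completes. */
        void jobFinished() { runningJobs.countDown(); }

        /** UP->OFFLINE: block until every running job has finished (the drain). */
        void onBecomeOfflineFromUp() {
            try {
                runningJobs.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("drain interrupted on " + node, e);
            }
            state = "OFFLINE";
        }

        String state() { return state; }
    }

    /** Mimics the order described above: the old node drains first, then the new node comes up. */
    static String[] drainAndMove(JobPartition from, JobPartition to) {
        from.onBecomeOfflineFromUp();   // blocks while jobs are still running
        to.onBecomeUpFromOffline(0);    // new node starts with no jobs yet
        return new String[] {from.state(), to.state()};
    }

    public static void main(String[] args) throws InterruptedException {
        JobPartition n1 = new JobPartition("node-1");
        JobPartition n2 = new JobPartition("node-2");
        n1.onBecomeUpFromOffline(2);

        // Simulate two long running jobs finishing a little later.
        Thread jobs = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            n1.jobFinished();
            n1.jobFinished();
        });
        jobs.start();

        String[] states = drainAndMove(n1, n2);
        jobs.join();
        System.out.println("node-1=" + states[0] + ", node-2=" + states[1]);
    }
}
```

Note that in this sketch node-2 only comes up after node-1 has fully
drained, so it shows the simple two-state model, not the overlapping
DRAIN/STARTUP phase from your original proposal.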


Lei



On Mon, May 11, 2020 at 9:02 PM santosh gujar <[email protected]>
wrote:

> Yes, there would be a database.
> So far I have the following state model for the partition:
> OFFLINE->STARTUP->UP->DRAIN->OFFLINE. But I don't know how to express the
> following:
> 1. How to trigger a drain (for example, when we decide to take a node out
> for maintenance).
> 2. Once a drain has started, I expect the Helix rebalancer to kick in and
> bring up the partition simultaneously on another node in STARTUP state.
> 3. Once all jobs on node1 are done, I need a manual way to trigger it to
> go OFFLINE and move the other partition to the UP state.
>
> It might be that my thinking is entirely wrong about how to fit this into
> the Helix model, but the above is essentially the sequence I want to
> achieve. Any pointers will be of great help. The constraint is that these
> are long running jobs that cannot be moved immediately to another node.
>
> Regards,
> Santosh
>
> On Tue, May 12, 2020 at 1:25 AM kishore g <[email protected]> wrote:
>
>> I was thinking exactly in that direction - having two states is the right
>> thing to do. Before we get there, one more question -
>>
>> - when you get a request for a job, how do you know if that job is old or
>> new? Is there a database that provides the mapping between job and node?
>>
>> On Mon, May 11, 2020 at 12:44 PM santosh gujar <[email protected]>
>> wrote:
>>
>>> Thank You Kishore,
>>>
>>> During the drain process N2 will start new jobs. Requests related to old
>>> jobs need to go to N1 and requests for new jobs need to go to N2. Thus,
>>> during the drain on N1, the partition could be present on both nodes.
>>>
>>> My current thinking is that in Helix I somehow need to model it as
>>> partition P with two different states on these two nodes, e.g. N1
>>> could have partition P in DRAIN state and N2 could have partition P in
>>> START_UP state.
>>> I don't know if my thinking about states is correct, but I am looking for
>>> any pointers.
>>>
>>> Regards
>>> Santosh
>>>
>>> On Tue, May 12, 2020 at 1:01 AM kishore g <[email protected]> wrote:
>>>
>>>> What happens to requests during the drain process, i.e. when you put N1
>>>> out of service and while N2 is waiting for N1 to finish the jobs, where
>>>> will the requests for P go: N1 or N2?
>>>>
>>>> On Mon, May 11, 2020 at 12:19 PM santosh gujar <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am looking for some clues or inputs on how to achieve the following.
>>>>>
>>>>> I am working on a service that involves running stateful long
>>>>> running jobs on a node. These long running jobs cannot be preempted and
>>>>> continued on other nodes.
>>>>>
>>>>> Problem Requirements :
>>>>> 1. In Helix nomenclature, let's say there is a Helix partition P that
>>>>> involves J such jobs running on a node (N1).
>>>>> 2. When I put the node into drain, I want Helix to assign a new node
>>>>> (N2) to this partition, so that P is also started on N2.
>>>>>
>>>>> 3. N1 can be put out of service only when all running jobs (J) on it
>>>>> are over; at that point only N2 will serve requests for P.
>>>>>
>>>>> Questions :
>>>>> 1. Can the drain process be modeled using Helix?
>>>>> 2. If yes, is there any recipe / pointers for a Helix state model?
>>>>> 3. Is there any custom way to trigger state transitions? From the
>>>>> documentation, I gather that the Helix controller in full auto mode
>>>>> triggers state transitions only when the number of partitions changes or
>>>>> the cluster changes (node addition or deletion).
>>>>> 4. I guess a spectator will be needed for custom routing logic in such
>>>>> cases; any pointers for the same?
>>>>>
>>>>> Thank You
>>>>> Santosh
>>>>>
>>>>

-- 
Lei Xia
