One commonly used mode of operation at Livermore for long-running
jobs is to request more resources than needed and use SLURM's --no-kill
option. When a node failure is detected by SLURM or by some other
system monitoring tool, the node is set DOWN in SLURM and its
"reason" field is set, which may give a clear cause of the failure
or something as vague as "Not responding". SLURM then kills the
affected job step (typically an MPI job), and additional job steps can
be started on the job's resources that are still available. This
mechanism is pretty widely used here and seems to work well.
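
Roughly, a batch script for this mode might look like the following
(--no-kill is the real sbatch option; the node counts, restart loop,
and application name are just illustrative):

    #!/bin/sh
    # Ask for a couple more nodes than the application needs, and keep
    # the allocation alive if one of them fails.
    #SBATCH --nodes=66
    #SBATCH --no-kill
    #SBATCH --time=24:00:00

    # If a node goes DOWN, SLURM kills the running step but leaves the
    # allocation intact; the next srun uses whatever nodes remain.
    until srun -N 64 ./my_mpi_app; do
        echo "job step failed at $(date); restarting on remaining nodes" >&2
        sleep 60
    done

In practice the application would restart from its own checkpoint
rather than from scratch, but the allocation-level behavior is the same.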

The node's "reason" field can be queried by the user, but I doubt 
that is commonly done.
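
For reference, the reason can be seen with the standard commands
(the node name below is just a placeholder):

    # List nodes that are DOWN/DRAINED along with their reason field
    sinfo -R

    # Or inspect a specific node
    scontrol show node tux042 | grep -i reason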

In the long term, SLURM may support hot-spare nodes so individual 
jobs don't need to allocate extra resources, but that is not likely to 
happen for a year or more.
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Kenneth Yoshimoto [[email protected]]
Sent: Friday, March 04, 2011 11:27 AM
To: [email protected]
Subject: Re: [slurm-dev] SLURM and fault tolerance/resiliency

  I'm not a SLURM developer (rather I develop Catalina,
a Job Scheduler).  One scenario that we have considered
is failure of MPI_Init() due to a subset of bad nodes.
If this happens in a batch job, there's usually a failure,
requeue and queue wait.  It would be nice if the Job Scheduler
could subscribe to such events, then possibly allocate replacement
nodes to allow the job to run.  Going the other way, if the
Job Scheduler published a list of free nodes and associated
time windows, a modified MPI_Init() could detect failure and
request a reallocation with less wasted queue time.
  I think there was one SC paper on an approach like this,
but I don't have that reference handy.

Kenneth

On Fri, 4 Mar 2011, Paul H. Hargrove wrote:

> Date: Fri, 04 Mar 2011 11:15:42 -0800
> From: Paul H. Hargrove <[email protected]>
> Reply-To: [email protected]
> To: [email protected]
> Subject: [slurm-dev] SLURM and fault tolerance/resiliency
>
> SLURM developers,
>
> Some of you may know me as the lead developer of BLCR (Berkeley Lab
> Checkpoint Restart).  However, that project is part of a larger DoE-funded
> effort known as CIFTS (Coordinated Infrastructure for Fault Tolerant
> Systems).  I am writing this post as a representative of that effort.
>
> One of the main "products" of CIFTS is known as "FTB" (Fault Tolerance
> Backplane) which is a publish-subscribe infrastructure for system components
> to share "events" related to faults and error conditions. This information
> can then be used within the system to operate "better" in the presence of
> faults.
>
> Components in the FTB are not limited to a pre-defined set, but our
> expectations include at least the Job Scheduler (JS), Resource Manager (RM),
> MPI implementation, Global File System implementation, Numerical and I/O
> libraries linked into an application, and potentially even the application
> itself.  The users and administrators of the system are potential
> "components" as well, via monitoring scripts they might write.  This post is
> an effort to tap into your group's knowledge of Job Schedulers and Resource
> Managers.
>
> Initially I am interested in knowing what JS and RM components could IN
> THEORY do if/when connected to the FTB.  While it would be great if this
> could evolve into collaborations to add to SLURM, I am not looking for
> any development commitment, just ideas.
>
> So, there are two questions I am seeking responses to:
>
> 1) What "events" could the JS and/or RM publish to help other components?
> The example that came first to my mind was generating an event if file
> systems holding logs or spooling files are full.
>
> 2) What "events" could the JS and/or RM subscribe to in order to "behave
> better" in the presence of faults?
> The example that came first to my mind was information about anything that
> was "down" that might be expressed as a job requirement - such as failed
> license servers, full global filesystem(s), downed nodes, etc. The response
> would be to not start any job that required the failed component(s).
>
> I would appreciate any thoughts/feedback/questions you may have based on the
> 2 high-level questions above.
>
> Thanks,
> -Paul
>
>
