[slurm-dev] SLURM and fault tolerance/resiliency

Paul H. Hargrove Fri, 04 Mar 2011 11:16:29 -0800

SLURM developers,

Some of you may know me as the lead developer of BLCR (Berkeley Lab Checkpoint Restart). However, that project is part of a larger DoE-funded effort known as CIFTS (Coordinated Infrastructure for Fault Tolerant Systems). I am writing this post as a representative of that effort.

One of the main "products" of CIFTS is known as "FTB" (Fault Tolerance Backplane), which is a publish-subscribe infrastructure for system components to share "events" related to faults and error conditions. This information can then be used within the system to operate "better" in the presence of faults.

Components in the FTB are not limited to a pre-defined set, but our expectations include at least the Job Scheduler (JS), Resource Manager (RM), MPI implementation, Global File System implementation, Numerical and I/O libraries linked into an application, and potentially even the application itself. The users and administrators of the system are potential "components" as well, via monitoring scripts they might write. This post is an effort to tap into your group's knowledge of Job Schedulers and Resource Managers.

Initially I am interested in knowing what JS and RM components could IN THEORY do if/when connected to the FTB. While it would be great if this could evolve into collaborations to add to SLURM, I am not looking for any development commitment, just ideas.


So, there are two questions I am seeking responses to:

1) What "events" could the JS and/or RM publish to help other components?

The example that came first to my mind was generating an event if filesystems holding logs or spooling files are full.
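To make that example concrete, here is a minimal Python sketch of how an RM-side check might produce such an event. This is not the FTB API: the function name, the event name "FILESYSTEM_FULL", the payload fields, and the 95% threshold are all hypothetical, chosen only for illustration.

```python
import shutil
import time


def check_filesystem(path, threshold=0.95, usage=None):
    """Return a hypothetical 'filesystem full' event dict, or None.

    `usage` is a (total, used, free) tuple in bytes. If omitted it is
    read with shutil.disk_usage(path). The event layout is an assumption
    made for this sketch, not an FTB-defined schema.
    """
    if usage is None:
        usage = shutil.disk_usage(path)
    total, used = usage[0], usage[1]
    fraction = used / total if total else 0.0
    if fraction >= threshold:
        return {
            "name": "FILESYSTEM_FULL",   # hypothetical event name
            "path": path,
            "used_fraction": fraction,
            "time": time.time(),
        }
    return None
```

A monitoring loop could call check_filesystem() on each log or spool directory and hand any non-None result to whatever publish call the event infrastructure provides.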

2) What "events" could the JS and/or RM subscribe to in order to "behave better" in the presence of faults?

The example that came first to my mind was information about anything that was "down" that might be expressed as a job requirement, such as failed license servers, full global filesystem(s), downed nodes, etc. The response would be to not start any job that required the failed component(s).
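The subscribe-side behavior described here can be sketched in a few lines of Python. Everything in the sketch is an assumption made for illustration: EventBus is a toy in-process stand-in for a publish-subscribe layer (it is not the FTB client API), and the event names "RESOURCE_DOWN" / "RESOURCE_UP", the payload layout, and the job representation are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process publish-subscribe bus (stand-in, not FTB)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, callback):
        self._subscribers[event_name].append(callback)

    def publish(self, event_name, payload):
        # Deliver the payload to every callback registered for this event.
        for callback in list(self._subscribers[event_name]):
            callback(payload)


class Scheduler:
    """Tracks failed resources and holds jobs that require them."""

    def __init__(self, bus):
        self.down = set()
        bus.subscribe("RESOURCE_DOWN", self._on_down)
        bus.subscribe("RESOURCE_UP", self._on_up)

    def _on_down(self, payload):
        self.down.add(payload["resource"])

    def _on_up(self, payload):
        self.down.discard(payload["resource"])

    def runnable(self, jobs):
        # jobs: iterable of (job_name, set_of_required_resources).
        # A job is held while any of its requirements is marked down.
        return [name for name, requires in jobs
                if not (set(requires) & self.down)]
```

For example, with jobs [("jobA", {"license:matlab"}), ("jobB", {"node042"}), ("jobC", set())], publishing "RESOURCE_DOWN" with payload {"resource": "license:matlab"} causes runnable() to skip jobA until a matching "RESOURCE_UP" arrives.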

I would appreciate any thoughts/feedback/questions you may have based on the 2 high-level questions above.


Thanks,
-Paul

--
Paul H. Hargrove                          [email protected]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
