On our cluster, we've developed a command 'arrayrun' to simulate array
jobs as found in SGE or PBS.  It works with slurm 2.0, 2.1, 2.2 and 2.3.
Over the last year or so, more and more people have asked for a copy of it,
so I thought it might be a good idea to submit it for inclusion in the
slurm contribs directory, if you want it.

-*- text -*- $Id: README.arrayrun,v 1.2 2011/06/28 11:21:27 bhm Exp $

Overview
========

Arrayrun is an attempt to simulate array jobs as found in SGE and PBS.  It
is invoked much like mpirun:

   arrayrun [-r] taskids [sbatch arguments] YourCommand [arguments]

In principle, arrayrun does 

   TASK_ID=id sbatch [sbatch arguments] YourCommand [arguments]

for each id in the 'taskids' specification.  'taskids' is a comma-separated
list of integers, ranges of integers (first-last), or ranges with step size
(first-last:step).  If -r is specified, arrayrun will restart a job that has
failed.  To avoid endless loops, each job is restarted at most once, and at
most 10 (configurable) jobs will be restarted in total.
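
For illustration, the 'taskids' parsing can be sketched in plain shell.
This is a stand-alone re-implementation for the README, not the code
arrayrun itself uses (the real parsing is done in Perl in arrayrun_worker):

```shell
#!/bin/sh
# Expand a 'taskids' specification (e.g. "1-5,15,100-150:25") into a
# space-separated list of integers, the way arrayrun interprets it.
expand_taskids () {
    spec=$1
    out=""
    for part in $(printf '%s' "$spec" | tr ',' ' '); do
        case $part in
            *-*:*)  # range with step size: first-last:step
                first=${part%%-*}; rest=${part#*-}
                last=${rest%%:*};  step=${rest#*:}
                i=$first
                while [ "$i" -le "$last" ]; do
                    out="$out $i"; i=$((i + step))
                done ;;
            *-*)    # plain range: first-last
                first=${part%%-*}; last=${part#*-}
                out="$out $(seq "$first" "$last" | tr '\n' ' ')" ;;
            *)      # single integer
                out="$out $part" ;;
        esac
    done
    echo $out   # unquoted on purpose: collapses extra whitespace
}

expand_taskids "1-5,15,100-150:25"   # prints: 1 2 3 4 5 15 100 125 150
```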

The idea is to submit a master job that calls arrayrun to start the jobs,
for instance

   $ cat workerScript
   #!/bin/sh
   #SBATCH --account=YourProject
   #SBATCH --time=1:0:0
   #SBATCH --mem-per-cpu=1G

   DATASET=dataset.$TASK_ID
   OUTFILE=result.$TASK_ID
   cd $SCRATCH
   YourProgram $DATASET > $OUTFILE
   # end of workerScript

   $ cat submitScript
   #!/bin/sh
   #SBATCH --account=YourProject
   #SBATCH --time=50:0:0
   #SBATCH --mem-per-cpu=100M

   arrayrun 1-200 workerScript
   # end of submitScript

   $ sbatch submitScript

The --time specification in the master script must be long enough for all
jobs to finish.

Alternatively, arrayrun can be run on the command line of a login or master
node.

If the master job is cancelled or the arrayrun process is killed, arrayrun
tries to scancel all running or pending jobs before it exits.

Arrayrun tries not to flood the queue with jobs.  It works by submitting a
limited number of jobs, sleeping a while, checking the status of its jobs,
and iterating, until all jobs have finished.  All limits and times are
configurable (see below).  It also tries to handle all errors in a graceful
manner.
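
The throttling boils down to taking, each iteration, the minimum of
several limits.  Here is a minimal shell sketch of that computation, with
arrayrun_worker's default limits hard-coded (the real script additionally
handles submission retries and restarts):

```shell
#!/bin/sh
# How many jobs to submit in one burst: the minimum of
#  - the number of tasks still waiting to be submitted,
#  - the room left under the total-jobs limit,
#  - the room left under the pending-jobs limit,
#  - the per-burst limit.
maxJobs=100      # max total number of jobs in queue
maxIdleJobs=10   # max number of pending jobs in queue
maxBurst=10      # max number of jobs to submit at a time

n_to_submit () {
    remaining=$1; running=$2; pending=$3
    n=$remaining
    roomTotal=$((maxJobs - running - pending))
    roomIdle=$((maxIdleJobs - pending))
    [ "$roomTotal" -lt "$n" ] && n=$roomTotal
    [ "$roomIdle" -lt "$n" ] && n=$roomIdle
    [ "$maxBurst" -lt "$n" ] && n=$maxBurst
    echo "$n"
}

n_to_submit 200 0 0    # prints 10: capped by maxBurst
n_to_submit 200 95 5   # prints 0: the queue is full
n_to_submit 3 50 2     # prints 3: only 3 tasks left
```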


Installation and configuration
==============================

There are two files, arrayrun (to be called by users) and arrayrun_worker
(exec'ed or srun'ed by arrayrun, to make scancel work).

arrayrun should be placed somewhere on the $PATH.  arrayrun_worker can be
placed anywhere.  Both files should be accessible from all nodes.

There are quite a few configuration variables, so arrayrun can be tuned to
work under different policies and work loads.

Configuration variables in arrayrun:

- WORKER: the location of arrayrun_worker

Configuration variables in arrayrun_worker:

- $maxJobs:          The maximal number of jobs arrayrun will allow in the
                     queue at any time
- $maxIdleJobs:      The maximal number of _pending_ jobs arrayrun will allow
                     in the queue at any time
- $maxBurst:         The maximal number of jobs submitted at a time
- $pollSeconds:      How many seconds to sleep between each iteration
- $maxFails:         The maximal number of errors to accept when submitting a
                     job
- $retrySleep:       The number of seconds to sleep between each retry when
                     submitting a job
- $doubleCheckSleep: The number of seconds to sleep after a failed sbatch
                     before running squeue to double-check whether the job
                     was submitted or not
- $maxRestarts:      The maximal number of restarts in total
- $sbatch:           The full path of the sbatch command to use
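
Together, $maxFails and $retrySleep implement a retry loop around sbatch.
A simplified shell sketch of that pattern follows; 'submit_cmd' is a
hypothetical stand-in for the real sbatch call, and the sleep is shortened
to keep the example fast:

```shell
#!/bin/sh
# Retry a submission command until it succeeds or the error budget
# (corresponding to $maxFails in arrayrun_worker) is used up.
maxFails=3
retrySleep=0   # arrayrun_worker defaults to ~300 s; shortened here

submit_with_retry () {
    nFails=0
    until submit_cmd; do
        nFails=$((nFails + 1))
        if [ "$nFails" -gt "$maxFails" ]; then
            echo "Cannot submit job.  Giving up after $nFails errors." >&2
            return 1
        fi
        echo "Could not submit job.  Trying again in a while." >&2
        sleep "$retrySleep"
    done
    echo "submitted after $nFails failed attempt(s)"
}

# Demo: a stand-in submit_cmd that fails twice, then succeeds.
attempts=0
submit_cmd () {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}
submit_with_retry   # prints: submitted after 2 failed attempt(s)
```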


Notes and caveats
=================

Arrayrun is an attempt to simulate array jobs.  As such, it is not
perfect or foolproof.  Here are a couple of caveats.

- Sometimes, arrayrun fails to scancel all jobs when it is itself cancelled.

- When arrayrun is run as a master job, it consumes one CPU for the whole
  duration of the job.  Also, the --time limit must be long enough.  This can
  be avoided by running arrayrun interactively on a master/login node (in
  which case running it under screen is probably a good idea).

- Arrayrun does not (currently) checkpoint, so if an arrayrun is restarted,
  it starts from scratch with the first taskid.

We welcome any suggestions for improvements or additional functionality!


Copyright
=========

Copyright 2009,2010,2011 Bjørn-Helge Mevik <[email protected]>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License version 2 for more details.

A copy of the GPL v. 2 text is available here:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt
#!/bin/bash
### Simulate an array job
### $Id: arrayrun,v 1.6 2011/02/10 11:57:53 root Exp $

### Copyright 2009,2010 Bjørn-Helge Mevik <[email protected]>
###
### This program is free software; you can redistribute it and/or modify
### it under the terms of the GNU General Public License version 2 as
### published by the Free Software Foundation.
###
### This program is distributed in the hope that it will be useful,
### but WITHOUT ANY WARRANTY; without even the implied warranty of
### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### GNU General Public License version 2 for more details.
###
### A copy of the GPL v. 2 text is available here:
### http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt


## Debugging
#set -x

### Configuration:
## The work horse:
WORKER=/site/lib/arrayrun_worker

## Documentation:
function usage () {
    echo "Run many instances of the same job or command in the queue system.
The instances are submitted via sbatch, and each gets its own value
of the environment variable TASK_ID.  This can be used to select which
input or output file to use, etc.

Usage:
 arrayrun [-r] taskids [sbatch arguments] command [arguments]
 arrayrun [-h | --help]

Arguments:
 '-r':         Restart a job if it fails.  To avoid endless loops, each job is
               restarted only once, and no more than 10 jobs will be restarted.
 'taskids':    Run 'command' with TASK_ID set to the values specified in
               'taskids'.  'taskids' is a comma separated list of integers,
               ranges of integers (first-last) or ranges with step size
               (first-last:step).  For instance
                 1-5 means 1, 2, 3, 4, 5
                 1,4,6 means 1, 4, 6
                 10-20:5 means 10, 15, 20
                 1-5,15,100-150:25 means 1, 2, 3, 4, 5, 15, 100, 125, 150
               Note: spaces, negative numbers, and decimal numbers are not allowed.
 'sbatch arguments': Any command line arguments for the implied sbatch.  This
               is most useful when 'command' is not a job script.
 'command':    The command or job script to run.  If it is a job script, it can
               contain #SBATCH lines in addition to or instead of the 'sbatch
               arguments'.
 'arguments':  Any arguments for 'command'.
 '-h', '--help' (or no arguments): Display this help."
}

if [ $# == 0 -o "$1" == '--help' -o "$1" == '-h' ]; then
    usage
    exit 0
fi

if [ -n "$SLURM_JOB_ID" ]; then
    ## Started in a job script.  Run with srun to make "scancel" work
    exec srun --ntasks=1 "$WORKER" "$@"
else
    exec "$WORKER" "$@"
fi
#!/usr/bin/perl
### Simulate an array job -- work horse script
### $Id: arrayrun_worker,v 1.30 2011/04/27 08:58:25 root Exp $

### Copyright 2009,2010,2011 Bjørn-Helge Mevik <[email protected]>
###
### This program is free software; you can redistribute it and/or modify
### it under the terms of the GNU General Public License version 2 as
### published by the Free Software Foundation.
###
### This program is distributed in the hope that it will be useful,
### but WITHOUT ANY WARRANTY; without even the implied warranty of
### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
### GNU General Public License version 2 for more details.
###
### A copy of the GPL v. 2 text is available here:
### http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt


### Note: This script is meant to be run by 'arrayrun'; do not
### run this script directly.

use strict;
use List::Util qw/min/;
use Time::HiRes qw/sleep/;

## Debug:
use warnings;
use constant DEBUG => 1;
$| = 1 if DEBUG;

## Configuration:
my $maxJobs = 100;              # Max total number of jobs in queue
my $maxIdleJobs = 10;           # Max number of pending jobs in queue
my $maxBurst = 10;              # Max number of jobs to submit at a time
my $pollSeconds = 180;          # How many seconds to sleep between each poll
my $maxFails = 300;             # Max errors to accept when submitting a job
my $retrySleep = 300;           # Seconds to sleep between each retry
my $doubleCheckSleep = 30;      # Seconds to sleep before double checking
my $maxRestarts = 10;           # Max number of restarts all in all
my $sbatch = "/site/bin/sbatch";# Which sbatch command to use

## Parse command line
my $restart = 0;
if (@ARGV && $ARGV[0] eq '-r') {
    $restart = 1;
    shift @ARGV;
}
my $jobSpec = shift @ARGV or die "Too few arguments\n";
my @commandLine = @ARGV or die "Too few arguments\n";
my @jobArray;
foreach (split /,/, $jobSpec) {
    if (/^(\d+)$/) {
        push @jobArray, $1;
    } elsif (/^(\d+)[-:](\d+)$/) {
        push @jobArray, $1 .. $2;
    } elsif (/^(\d+)[-:](\d+):(\d+)$/) {
        for (my $i = $1; $i <= $2; $i += $3) {
            push @jobArray, $i;
        }
    } else {
        die "Unknown TASK_ID specification: '$_'\n";
    }
}
die "No TASK_IDs specified\n" unless (@jobArray);

print "TASK_IDs to submit: ", join(",", @jobArray), "\nCommand line: @commandLine\n" if DEBUG;
print "Will restart failed jobs\n" if DEBUG && $restart;

## Setup
my $mainid = $ENV{'SLURM_JOB_ID'} || $ENV{'SLURM_JOBID'} || 'null';
my $runids = [];                # List of IDs of running jobs
my $pendids = [];               # List of IDs of pending jobs
my $testids = [];               # List of IDs to test
my %taskid;                     # TASK_ID for all submitted jobs
my @restartedTasks;             # TASK_ID of all restarted jobs
my @tmp = (localtime())[5,4,3];
my $starttime = sprintf "%d-%02d-%02d", $tmp[0] + 1900, $tmp[1] + 1, $tmp[2];

print "Main job id: $mainid\nStart time: $starttime\n" if DEBUG;

## Trap signals such that any running sub jobs are cancelled if the
## main job is cancelled or times out.
sub clean_up {
    print "Caught signal.  Cleaning up...\n" if DEBUG;
    ## Cancel any subjobs:
    if (@{$runids} || @{$pendids} || @{$testids}) {
        print "Cancelling @{$runids} @{$pendids} @{$testids}\n" if DEBUG;
        system("echo scancel @{$runids} @{$pendids} @{$testids}");
        system("scancel @{$runids} @{$pendids} @{$testids}");
        print "Cancelled @{$runids} @{$pendids} @{$testids}\n" if DEBUG;
    }
    exit 0;
}
$SIG{'TERM'} = 'clean_up';      # scancel/timeout
$SIG{'INT'} = 'clean_up';       # ^C in interactive use


## Submit a job with fail resilience:
sub submit_job {
    my $jobName = shift;
    (my $commandLine = shift) || die "Job script not specified\n";
    my $id;
    my $nFails = 0;
    my $success = 0;
    until ($success) {
        my $fail = 0;
        $id = `$sbatch --job-name=$jobName $commandLine 2>&1`;
        if ($? == 0) {
            chomp($id);
            print "  Result from submit: $id" if DEBUG;
            if ($id =~ s/.*Submitted batch job //) {
                $success = 1;
            }
        } else {
            warn "  sbatch failed with error code '$?' (output: '",
                $id || '', "'): $!\n";
            $nFails++;
        }
        until ($success || $fail || $nFails > $maxFails) {
            ## Double check that the job did not start
            warn "  Problem with submitting/checking job.  Checking with squeue in a while.\n";
            sleep $doubleCheckSleep - 5 + int(rand(11));
            $id = `squeue -h -o '%i %j' -u $ENV{USER}`;
            if ($? == 0) {
                chomp($id);
                print "  Result from squeue: $id" if DEBUG;
                if ($id =~ s/ $jobName//) {
                    warn "Job '$jobName' seems to have been started as jobid '$id'.  Using that id.\n";
                    $success = 1;
                } else {
                    warn "Job '$jobName' did not start.\n";
                    $fail = 1;
                }
            } else {
                $nFails++;
            }
        }
        unless ($success) {
            if ($nFails <= $maxFails) {
                warn "  Could not submit job.  Trying again in a while.\n";
                sleep $retrySleep - 5 + int(rand(11));
            } else {
                die "  Cannot submit job.  Giving up after $nFails errors.\n";
            }
        }
    }
    print " => job ID $id\n" if DEBUG;
    $id;
}


## Check the given jobs, and return lists of the ones still running/waiting:
sub check_queue {
    print scalar localtime, ": Checking queue...\n" if DEBUG;
    my $queueids = `squeue -h -o '%i %t' 2>&1`;
    if ($? != 0) {
        print "squeue failed with error code '$?',\nmessage: $queueids\nI will assume all jobs are still running/waiting\n";
        return;
    }
    my $testids = [ @{$runids}, @{$pendids} ];
    print "Number of jobs to check: ", scalar @{$testids}, "\n" if DEBUG;
    sleep 10 + rand;            # Sleep to allow requeued jobs to get back
                                # in queue.
    $runids = [];
    $pendids = [];
    foreach my $id (@{$testids}) {
        if ($queueids =~ /$id (\w+)/) {
            if ($1 eq "PD") {
                print " Job $id is still waiting\n" if DEBUG;
                push @{$pendids}, $id;
            } else {
                print " Job $id is still running\n" if DEBUG;
                push @{$runids}, $id;
            }
        } else {
            print " Job $id has finished:\n" if DEBUG;
            my @sacctres = `sacct -o jobid,start,end,maxvmsize,maxrss,state,exitcode -S $starttime -j $id 2>&1`;
            if ($? != 0) {
                print "  sacct failed with error code '$?',\n  message: ",
                  @sacctres, "  I will assume job $id finished successfully\n";
            } else {
                print join("  ", @sacctres);
                if (grep /^[ ]*$id[ ]+.*RUNNING/, @sacctres) {
                    print "  Job seems to be still running, after all.\n" if DEBUG;
                    push @{$runids}, $id;
                } elsif ($restart && !grep /^[ ]*$id[ ]+.*COMPLETED[ ]+0:0/, @sacctres) {
                    print "  Job failed. ";
                    if (@restartedTasks >= $maxRestarts) {
                        print "Too many jobs have been restarted.  Will not restart TASK_ID $taskid{$id}\n";
                    } elsif (grep /^$taskid{$id}$/, @restartedTasks) {
                        print "TASK_ID $taskid{$id} has already been restarted once.  Will not restart it again\n";
                    } else {
                        print "Restarting TASK_ID $taskid{$id}\n";
                        $ENV{'TASK_ID'} = $taskid{$id};
                        my $newid = submit_job "$mainid.$taskid{$id}", "@commandLine";
                        push @{$runids}, $newid;
                        $taskid{$newid} = $taskid{$id};
                        push @restartedTasks, $taskid{$newid};
                        sleep 1.5 + rand;       # Sleep between 1.5 and 2.5 secs
                    }
                }
            }
        }
    }
}


## Make sure sub jobs do not inherit the main job TMPDIR or jobname:
delete $ENV{'TMPDIR'};
delete $ENV{'SLURM_JOB_NAME'};

while (@jobArray) {
    ## There is more to submit
    print scalar localtime, ": Submitting jobs...\n" if DEBUG;
    print scalar @jobArray, " more job(s) to submit\n" if DEBUG;
    ## Submit as many as possible:
    my $nToSubmit = min(scalar @jobArray,
                        $maxJobs - @{$runids} - @{$pendids},
                        $maxIdleJobs - @{$pendids},
                        $maxBurst);
    print scalar(@{$runids}), " job(s) are running, and ",
        scalar(@{$pendids}), " are waiting\n" if DEBUG;
    print "Submitting $nToSubmit job(s):\n" if DEBUG;
    for (my $i = 1; $i <= $nToSubmit; $i++) {
        my $currJob = shift @jobArray;
        print " TASK_ID $currJob:\n" if DEBUG;
        ## Set $TASK_ID for the job:
        $ENV{'TASK_ID'} = $currJob;
        my $id = submit_job "$mainid.$currJob", "@commandLine";
        push @{$pendids}, $id;
        $taskid{$id} = $currJob;
        sleep 1.5 + rand;       # Sleep between 1.5 and 2.5 secs
    }
    ## Wait a while:
    print "Sleeping...\n" if DEBUG;
    sleep $pollSeconds - 5 + int(rand(11));
    ## Find which are still running or waiting:
    check_queue();
}
print "All jobs have been submitted\n" if DEBUG;

while (@{$runids} || @{$pendids}) {
    ## Some jobs are still running or pending
    print scalar(@{$runids}), " job(s) are still running, and ",
        scalar(@{$pendids}), " are waiting\n" if DEBUG;
    ## Wait a while
    print "Sleeping...\n" if DEBUG;
    sleep $pollSeconds - 5 + int(rand(11));
    ## Find which are still running or waiting:
    check_queue();
}

print "Done.\n" if DEBUG;
-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
