Hello,

I can shrink a running job by running

$ scontrol update JobId=1234 NumNodes=4

If a job had 8 nodes allocated, it is correctly shrunk to 4 nodes, but the
job steps on the 4 nodes that are removed from the job are immediately
killed.
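
As an aside, if I read the Slurm documentation correctly, scontrol also
writes a small script into the working directory after such a shrink, e.g.
slurm_job_1234_resize.sh, which can be sourced to refresh the job's
environment (filename per the docs; untested by me):

$ scontrol update JobId=1234 NumNodes=4
$ . ./slurm_job_1234_resize.sh   # updates SLURM_NNODES, SLURM_NODELIST, ...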

Is there a way to let the running job steps finish, while scheduling no
further steps, so that the new node count is respected?
I tried running the job steps with the no-kill option ("-k"), but that did
not change the behaviour.
Alternatively, is it possible to automatically reschedule the killed job steps?
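
One workaround I could imagine is sketched below (untested; run_step is just
a hypothetical helper around the same srun_cr invocation as in the sample
script, and assumes SRUNOPTS is set as there):

run_step () {
    # relaunch a step until it exits cleanly; a step killed by the shrink
    # exits non-zero and is then restarted on the remaining nodes
    local name="$1"; shift
    until srun_cr $SRUNOPTS --job-name="$name" "$@"; do
        echo "step '$name' was killed, relaunching" >&2
        sleep 5
    done
}

run_step "Step 1" longlastingjob &

But that would also retry steps that fail for ordinary reasons, so a
built-in mechanism would be preferable.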

This is a sample batch file that is submitted with sbatch:

------------------------------------------------

#!/bin/bash
#SBATCH --output=/NAS/renderfarm/jobs/job.%J.out
#SBATCH -p render
#SBATCH --nodes=1-8
#SBATCH --job-name="Some Job"
#SBATCH --mem=12000
#SBATCH --time=01:00:00
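# step options: working dir /tmp, labelled output (-l), no-kill (-k),
# one task / one CPU / one node per step, BLCR checkpoint directory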
SRUNOPTS="--chdir=/tmp -l -k -c1 -n1 -N1 --checkpoint-dir=/NAS/checkpoints"


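# launch eight independent single-node steps and wait for all of them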
srun_cr $SRUNOPTS --job-name="Step 1" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 2" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 3" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 4" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 5" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 6" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 7" longlastingjob &
srun_cr $SRUNOPTS --job-name="Step 8" longlastingjob &
wait

------------------------------------------------

Thanks,

Lutz
