[slurm-dev] Dependencies: aftercorr for partially completed arrays

Sam Hawarden Thu, 03 Nov 2016 15:14:03 -0700

Hi there,


My current workflow uses a sequence of job arrays in which each array task 
depends only on the previous array's matching element: A[n] -> B[n] -> C[n]


The new aftercorr:job_id dependency class seems perfect for this with one 
possible edge case: Partial completion.


There are numerous steps and occasionally things go wrong. Because of this I've 
added recovery logic, so if things do go pear-shaped, the workflow completes as 
much as it can, then restarts itself as a whole and resumes from the last set 
of completed jobs on the matching data set.


Unfortunately this can result in a sequence of two arrays where one is only 
partially completed, and aftercorr implies that if a preceding element doesn't 
exist, the dependent task will fail. Say, in A[4] -> B[4] -> C[4], A[4] 
completed but B[4] died: the restarted B contains element 4 but the restarted 
A does not.


I start B with something like:
-a $incompleteTasksB $([ "$depA" != "" ] && printf -- "-d aftercorr:A")


If there are no incomplete A elements then that's fine, since A won't run at 
all and the -d argument can simply be dropped, as above.
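
A minimal bash sketch of that conditional argument building, for illustration 
only. The function name build_b_args, the example job ID and the script name 
stepB.sh are hypothetical; --array and --dependency are the long forms of the 
-a and -d options used above. It only prints the options, it does not submit 
anything.

```shell
#!/usr/bin/env bash
# Print the sbatch options for array B, one per line.
#   $1 = incomplete task indices of B, e.g. "2,4-6"
#   $2 = job ID of the resubmitted array A, or empty if A was not resubmitted
build_b_args() {
    local tasks=$1 dep_a=$2
    printf '%s\n' "--array=$tasks"
    # Only add the dependency when there is an A array to depend on.
    if [ -n "$dep_a" ]; then
        printf '%s\n' "--dependency=aftercorr:$dep_a"
    fi
}

# Possible use (not run here; stepB.sh is a placeholder):
#   mapfile -t args < <(build_b_args "2,4-6" "12345")
#   sbatch "${args[@]}" stepB.sh
```

Collecting the options in an array avoids relying on word splitting of a 
command substitution, which is what the one-liner above depends on.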


If parts of A failed as well, you're stuck: A exists but A[4] does not.


Currently I create B as dependent on A as a whole, if A exists. Then I use a 
script to scan through A's and B's lists of incomplete jobs: if there's a 
matching element, it sets B[n] to be dependent on A[n]; otherwise it removes 
B[n]'s dependency entirely, allowing it to run freely.
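
The scan described above could look roughly like the following bash sketch. 
This is not the actual tieArrayDeps function; the name tie_array_deps, the 
afterok dependency type and the jobid_index form used to address single array 
elements are all assumptions that should be checked against the installed 
Slurm version. The function only prints the scontrol commands (a dry run) 
rather than executing them.

```shell
#!/usr/bin/env bash
# Print one scontrol command per incomplete task of B (nothing is executed).
#   $1 = job ID of array A      $2 = job ID of array B
#   $3 = space-separated incomplete task indices of A
#   $4 = space-separated incomplete task indices of B
tie_array_deps() {
    local job_a=$1 job_b=$2 tasks_a=" $3 " n
    for n in $4; do
        case "$tasks_a" in
            *" $n "*)
                # A[n] is being re-run, so B[n] must wait for it.
                # (afterok and the jobid_index form are assumptions.)
                echo "scontrol update JobId=${job_b}_${n} Dependency=afterok:${job_a}_${n}"
                ;;
            *)
                # A[n] already completed: clear B[n]'s dependency.
                echo "scontrol update JobId=${job_b}_${n} Dependency="
                ;;
        esac
    done
}
```

For example, tie_array_deps 100 200 "2 4" "2 3 4" prints an afterok update 
for B[2] and B[4] and an empty Dependency= update for B[3].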


So it's kind of like a -d AfterCorrOrNotExist:A operation.


Does aftercorr function like this or, when we eventually update to a version 
that supports aftercorr, will I need to retain my messy, scontrol-spamming 
tieArrayDeps function?


Thanks,

  Sam

________________________________
Sam Hawarden
Assistant Research Fellow
Pathology Department
Dunedin School of Medicine
sam.hawarden(at)otago.ac.nz
