Hi everyone. I can't find much information on how the individual jobs in an array are tracked after completion. It also looks like individual job indexes are not stored in 'sacct'/accounting.
I expected the exit code of the main job in an array to be a logical combination of the exit codes of all the individual indexes in the array. That is, if any of the indexes returned non-zero, the main exit code would also be non-zero. But after some ad hoc testing (Slurm 14.03.4), it looks like the main exit status is just the exit status of the first index. As such, any dependency based on afterok/afternotok is more or less pointless for arrays. And since there's no accounting for each index, I'm at a loss here. Any comments?

And since we're discussing this, it would also make sense to have a policy for job array failures. A failure of a single index could:
- flag the job as FAILED, but still continue executing the remaining indexes
- flag the job as COMPLETED as long as at least one index was ok
- flag the job as FAILED, and cancel the rest of the job as well

For an array, I would guess the last mode makes the most sense as a default.
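
To make the idea concrete, here is a rough Python sketch of the exit code aggregation and the three failure policies described above. This is not Slurm code and does not use any Slurm API: the names `Policy`, `aggregate_exit_code` and `array_state` are made up for illustration, and the indexes are assumed to run one after another in order, which a real array would not necessarily do.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical failure policies for a job array (not a Slurm feature)."""
    FAIL_CONTINUE = 1  # flag FAILED, keep running the remaining indexes
    ANY_OK = 2         # COMPLETED as long as at least one index was ok
    FAIL_CANCEL = 3    # flag FAILED and cancel the remaining indexes


def aggregate_exit_code(exit_codes):
    """Return 0 only if every index exited 0, else the first non-zero code."""
    return next((code for code in exit_codes if code != 0), 0)


def array_state(exit_codes, policy):
    """Simulate an array whose indexes run sequentially in list order.

    Returns (state, executed), where executed is the number of indexes
    that actually ran before the array finished or was cancelled.
    """
    executed = 0
    any_ok = False
    any_failed = False
    for code in exit_codes:
        executed += 1
        if code == 0:
            any_ok = True
        else:
            any_failed = True
            if policy is Policy.FAIL_CANCEL:
                break  # cancel everything that has not run yet
    if policy is Policy.ANY_OK:
        state = "COMPLETED" if any_ok else "FAILED"
    else:
        state = "FAILED" if any_failed else "COMPLETED"
    return state, executed


if __name__ == "__main__":
    codes = [0, 1, 0]  # example: the second index fails
    print(aggregate_exit_code(codes))
    for policy in Policy:
        print(policy.name, array_state(codes, policy))
```

With an aggregate like this, an afterok/afternotok dependency on the array as a whole would reflect every index rather than only the first one.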
