We have a user who wants to run many instances of a single-process job
across a cluster, using a loop like:
for N in $nodelist; do
    srun -w "$N" program &
done
wait
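For reference, here is a minimal self-contained version of what the loop
amounts to (node names are placeholders, and a stand-in command replaces
srun so it runs anywhere; in production $nodelist would come from
something like "scontrol show hostnames"):

    # Placeholder node list; in practice this comes from the allocation.
    nodelist="node01 node02 node03"
    launched=0
    for N in $nodelist; do
        # Stand-in for the real step launch:  srun -w "$N" program &
        sleep 0.1 &
        launched=$((launched + 1))
    done
    wait        # block until every background launch has exited
    echo "launched $launched steps"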
This works up to a thousand nodes or so (jobs are allocated by node
here), but as the number of submitted jobs grows, we periodically see
a variety of error messages, such as:
* srun: error: Ignoring job_complete for job 100035 because our job ID
* srun: error: io_init_msg_read too small
* srun: error: task 0 launch failed: Unspecified error
* srun: error: Unable to allocate resources: Job/step already
completing or completed
* srun: error: Unable to allocate resources: No error
* srun: error: unpack error in io_init_msg_unpack
* srun: Job step 211042.0 aborted before step completely launched.
We have tried raising the limits with
ulimit -n 500000
ulimit -u 64000
but that wasn't sufficient.
* CentOS 7.3 (x86_64)
* Slurm 17.11.0
Does this ring any bells? Any thoughts about how we should proceed?
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!