We have had trouble configuring SLURM to cope with 25000 job steps, so we configured the SLURM control daemon to accept at most 600 job steps per job and split the job steps across multiple jobs. I have verified that no job has more than 600 job steps, but SLURM still outputs:
srun: error: Unable to create job step: Step limit reached for this job
srun: error: Unable to create job step: Step limit reached for this job
srun: error: Unable to create job step: Step limit reached for this job

How could this be happening?
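
For reference, here is a minimal sketch of the kind of split described above, assuming the steps are simply numbered 0 to 24999 and handed out in contiguous ranges. The function name `split_steps` is hypothetical and is not part of SLURM; it only illustrates the arithmetic and is not our actual submission script.

```python
def split_steps(total_steps, max_steps_per_job):
    """Split step indices 0..total_steps-1 into contiguous ranges.

    Returns a list of (start, end) pairs, end exclusive, one pair per job,
    so that no job receives more than max_steps_per_job steps.
    """
    if total_steps < 0 or max_steps_per_job <= 0:
        raise ValueError("total_steps must be >= 0 and max_steps_per_job > 0")
    return [
        (start, min(start + max_steps_per_job, total_steps))
        for start in range(0, total_steps, max_steps_per_job)
    ]


# Hypothetical numbers taken from the question: 25000 steps, 600 per job.
chunks = split_steps(25000, 600)
print(len(chunks))   # number of jobs needed
print(chunks[-1])    # step range handled by the last job
```

With these numbers the split needs 42 jobs: 41 full jobs of 600 steps and a final job of 400 steps. Note that this only counts the steps we intend to launch per job; it says nothing about how SLURM itself counts steps against the limit.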
