Hey guys, I've seen a few references to slurmctld being a multithreaded process, but it doesn't seem that way in practice.
We had a user submit 18000 jobs to our cluster (512 slots). Slurm shows all 512 slots fully loaded, shows those jobs running, and shows about 9800 jobs currently pending, but her submission started throwing errors around job 16500:

Submitted batch job 16589
Submitted batch job 16590
Submitted batch job 16591
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
sbatch: error: Batch job submission failed: Resource temporarily unavailable

What we noticed at the time on our master host is that slurmctld was regularly pegging one CPU at 100% and had paged 16GB of virtual memory, while all the other CPUs were completely idle. We wondered whether the control daemon maxing out is what led to the submission failure, since we haven't found any limits set anywhere for any specific job or user, and wondered if perhaps we missed a configure option when we did our original install.

Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.

AC
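As a stopgap while we dig into this, we've been considering a submission wrapper that caps how many jobs one user keeps queued at once, so slurmctld never sees the full 18000 in one burst. A minimal sketch of the gating logic is below; the cap of 1000 and the 30-second poll interval are made-up values for illustration, and "job.sh" is a placeholder for the real batch script:

```shell
# should_submit: succeed when the user's queued job count is below the cap.
should_submit() {
    queued=$1
    cap=$2
    [ "$queued" -lt "$cap" ]
}

# In a real wrapper loop you would feed it live numbers from squeue, e.g.:
#   queued=$(squeue -h -u "$USER" | wc -l)   # running + pending jobs
#   if should_submit "$queued" 1000; then
#       sbatch job.sh
#   else
#       sleep 30    # back off instead of hammering slurmctld
#   fi
```

This only rate-limits the client side; it doesn't explain why the controller pegs a single CPU, but it would at least keep submissions from failing outright.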
