Hey Guys,

I've seen a few references to slurmctld being a multithreaded process,
but it doesn't seem to behave that way.

We had a user submit 18000 jobs to our cluster (512 slots). Slurm shows
all 512 slots fully loaded with those jobs running and about 9800 jobs
currently pending, but her submissions started throwing errors at around
job 16500:

Submitted batch job 16589
Submitted batch job 16590
Submitted batch job 16591
sbatch: error: Slurm temporarily unable to accept job, sleeping and 
retrying.
sbatch: error: Batch job submission failed: Resource temporarily 
unavailable.

What we noticed on the master host at the time is that slurmctld was
regularly pegged at 100% on one CPU and had 16GB of virtual memory
paged, while all the other CPUs were completely idle.
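
For reference, something like the following should show whether slurmctld
actually has multiple threads and which of them is busy (assuming pidof is
available on the master host):

ps -o nlwp= -p $(pidof slurmctld)    # thread count for the daemon
top -H -p $(pidof slurmctld)         # per-thread CPU usage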

We're wondering whether the control daemon maxing out like this is what
led to the submission failures. We haven't found any limits set anywhere
for any specific job or user, so we also wondered whether we missed a
configure option for this when we did our original install.
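
One thing we're planning to check (just a guess at this point, so correct
me if this isn't the relevant knob) is MaxJobCount in slurm.conf, which I
believe defaults to 10000, roughly the number of jobs the controller was
tracking when the errors started:

scontrol show config | grep -i MaxJobCount   # what the controller is currently running with

# If it's at the default, we'd try raising it in slurm.conf, e.g.
#   MaxJobCount=50000
# and then restart slurmctld, since I don't think "scontrol reconfigure"
# picks that one up.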

Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.

AC
