I'm also interested in this, as I've only ever seen a single slurmctld 
process, and only ever pegged at 100% of one core. It would be good if 
making Slurm multithreaded were on the roadmap. I know we will have 
hundreds of thousands of jobs in flight for our config, so it would be 
good to have something that can take that load.
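
One knob I'd check first, though this is just a guess on my part and 
nothing in this thread confirms it, is MaxJobCount in slurm.conf. If 
memory serves it defaults to 10000, which lines up roughly with your 
512 running plus ~9800 pending jobs at the point sbatch started 
refusing work. Something along these lines (the values shown are 
hypothetical):

    # See what the controller is currently enforcing:
    $ scontrol show config | grep MaxJobCount
    MaxJobCount             = 10000

    # To raise it, set something like this in slurm.conf on the master
    # and restart slurmctld (I don't believe "scontrol reconfigure" is
    # enough for this particular parameter):
    MaxJobCount=100000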

-Paul Edmon-

On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
> Hey Guys,
>
> I've seen a few references to slurmctld being a multithreaded
> process, but it doesn't seem to behave that way.
>
> We had a user submit 18000 jobs to our cluster (512 slots). Slurm
> shows all 512 slots fully loaded, shows those jobs running and about
> 9800 currently pending, but her submission started throwing errors
> around job 16500:
>
> Submitted batch job 16589
> Submitted batch job 16590
> Submitted batch job 16591
> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> retrying.
> sbatch: error: Batch job submission failed: Resource temporarily
> unavailable.
>
> The thing we noticed at the time on our master host is that slurmctld
> was regularly pegging one CPU at 100% and had paged 16GB of virtual
> memory, while all the other CPUs were completely idle.
>
> We wondered whether the control daemon maxing out is what led to the
> submission failures. We haven't found any limits set anywhere for any
> specific job or user, and wondered if perhaps we missed a configure
> option when we did our original install.
>
> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>
> AC
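
For keeping an eye on slurmctld itself under this kind of load, sdiag 
may also be worth a look (it appeared around Slurm 2.4, if I remember 
right, so 2.5.4 should have it). It reports the controller's server 
thread count and RPC backlog; the output below is a hypothetical 
sample, not from this cluster:

    $ sdiag
    ...
    Server thread count: 3
    Agent queue size:    0
    ...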
