Yes, it seems the problem was the max jobs limit. However, if you are going to "suffer" this kind of workload, sdiag can give you some idea of what is going on.
We have installed 2.5.6. Danny Auble fixed a problem in 2.5.4 that was causing really long scheduling cycles. Another fix, related to how job priority is managed when a job is submitted to more than one partition, is not in 2.5.4.

For systems with more than one partition and tens of thousands of jobs, scheduling could take too much time simply because the scheduler needs to check all the partitions. In your case, with just two partitions, if one of them has no pending jobs the scheduler is forced to walk the whole job queue. That means calling the function that finds the highest-priority job as many times as the number of queued jobs minus 1. Say you have 100000 jobs waiting, and assume the worst case, with the queue ordered by priority in reverse (just the opposite of what the scheduler needs). That means up to 5000050000 comparisons (n*(n+1)/2). Each comparison takes two memory accesses, so say each comparison costs 10ns (probably more). That puts the scheduling cycle at close to a minute... Such a worst case is unlikely, but having more than 100000 waiting jobs in an HTC system is not.

The underlying problem is having just one queue, with the scheduler unaware of jobs per partition. Even a counter of waiting jobs per partition cannot avoid the problem, since job selection depends on job priority. With this scenario in mind, I have been working on making the scheduler more efficient by using queues per partition. The idea is to have a configurable number of jobs to be scheduled per cycle and per partition. When the scheduler starts, it walks the full job queue once and builds these per-partition queues; after that it just needs to take the highest-priority job from each of them. I have tested it and it works fine for our workload, but I am waiting to test it more deeply with more realistic cases. I know this solution may not be the best one for everyone, so I guess it could become a configuration option in the future.
Another option is to maintain an ordered job queue using more complex structures, but that would mean major code changes (Slurm 3.0?)

Best regards

On 06/13/2013 01:40 PM, Alan Cowles wrote:
> Alejandro,
>
> Thanks for that description.
>
> Right now we have 2 partitions but the user only submitted to one
> (lowmem @ 32 nodes 128GB/ram, highmem @ 8 nodes 256GB/ram). She
> submitted 18500 jobs, and it got around 16000 before it started throwing
> the error. We found an entry in slurm.conf that sets max jobs = 5000,
> but it was commented out and we believed this meant "unlimited"; however,
> we now believe that with the value commented out, the system default of
> 10,000 applies.
>
> Which version of Slurm are you running that better handles mass jobs?
> We're really anxious to see what 2.6 provides once it's out of RC so we
> can arrange array jobs and such.
>
> AC
>
> On 06/13/2013 04:07 AM, Alejandro Lucero Palau wrote:
>
>> Slurmctld is multithreaded, but that does not mean it will support any load.
>>
>> Each time someone connects to slurmctld (sbatch, srun, squeue, sinfo,
>> ...) a new thread is created. There are the main threads as well, which
>> live through the whole slurmctld execution. And there are the agent
>> threads which slurmctld uses to communicate with nodes.
>>
>> Under a heavy load you can see how many threads are active with sdiag:
>>
>> Server thread count: 14
>> Agent queue size: 10
>>
>> As you can see, information about agents is also given. Sdiag can tell
>> you whether the scheduling itself is the problem.
>>
>> Main schedule statistics (microseconds):
>> Last cycle: 78973
>> Max cycle: 1801057
>> Total cycles: 1526
>>
>> If you see a Max cycle larger than a couple of seconds, you have a
>> problem under heavy load, since the main scheduler runs with the job
>> queue locked.
>>
>> The main reason for the message
>>
>> "Slurm temporarily unable to accept job, sleeping and retrying."
>>
>> is (probably) a high number of jobs being submitted or a high number
>> completing. Any event like job submission or job completion creates a
>> new thread. By default, those new threads call the scheduler. Even with
>> the scheduler taking just a couple of seconds, hundreds or thousands of
>> threads can "lock" the system. You can avoid this behaviour by deferring
>> calls to the scheduler (the call is still made, but it only tries to
>> schedule the first job).
>>
>> Also, depending on your design in terms of partitions and queues, the
>> Slurm version you are using (2.5.4) could take too much time under some
>> circumstances.
>>
>> We have been tuning an HTC cluster with Slurm and it now supports heavy
>> loads like the one you describe. I have some tweaking (hardcoded) for
>> improving scheduling when several partitions are actively used and when
>> jobs can be submitted to more than one partition.
>>
>> On 06/12/2013 06:55 PM, Alan V. Cowles wrote:
>>
>>> Under the Data Objects section on the following page
>>> http://slurm.schedmd.com/selectplugins.html we find the statement:
>>>
>>> "Slurmctld is a multi-threaded program with independent read and write
>>> locks on each data structure type."
>>>
>>> Which is what led me to believe it's there, and that we perhaps missed
>>> a configuration option.
>>>
>>> AC
>>>
>>> On 06/12/2013 12:43 PM, Paul Edmon wrote:
>>>
>>>> I'm also interested in this, as I've only ever seen one slurmctld and
>>>> only at 100%. It would be good if making Slurm multithreaded was on
>>>> the path for the future. I know we will have 100,000s of jobs in
>>>> flight for our config, so it would be good to have something that can
>>>> take that load.
>>>>
>>>> -Paul Edmon-
>>>>
>>>> On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
>>>>
>>>>> Hey Guys,
>>>>>
>>>>> I've seen a few references to slurmctld as a multithreaded process,
>>>>> but it doesn't seem that way.
>>>>>
>>>>> We had a user submit 18000 jobs to our cluster (512 slots) and it
>>>>> shows 512 fully loaded, shows those jobs running, shows about 9800
>>>>> currently pending, but her submission threw errors around 16500.
>>>>>
>>>>> Submitted batch job 16589
>>>>> Submitted batch job 16590
>>>>> Submitted batch job 16591
>>>>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>>>>> retrying.
>>>>> sbatch: error: Batch job submission failed: Resource temporarily
>>>>> unavailable.
>>>>>
>>>>> The thing we noticed at this time on our master host is that
>>>>> slurmctld was pegged at 100% on one CPU quite regularly and had paged
>>>>> 16GB of virtual memory, while all other CPUs were completely idle.
>>>>>
>>>>> We wondered if the pegging of the control daemon is what led to the
>>>>> submission failure, as we haven't found any limits set anywhere for
>>>>> any specific job or user, and wondered if perhaps we missed a
>>>>> configure option when we did our original install.
>>>>>
>>>>> Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
>>>>>
>>>>> AC
>>>>>
>> WARNING / LEGAL TEXT: This message is intended only for the use of the
>> individual or entity to which it is addressed and may contain
>> information which is privileged, confidential, proprietary, or exempt
>> from disclosure under applicable law. If you are not the intended
>> recipient or the person responsible for delivering the message to the
>> intended recipient, you are strictly prohibited from disclosing,
>> distributing, copying, or in any way using this message. If you have
>> received this communication in error, please notify the sender and
>> destroy and delete any copies you may have received.
>>
>> http://www.bsc.es/disclaimer

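Two slurm.conf knobs relevant to this thread (the values below are examples, not recommendations): MaxJobCount bounds how many jobs slurmctld keeps in memory, with 10,000 as the documented default when the line is commented out, and SchedulerParameters=defer avoids running a full scheduling pass on every submission or completion event, which is the deferral behaviour described above.

```
# slurm.conf fragment (example values)

# Jobs beyond this count are rejected with
# "Resource temporarily unavailable"; the default is 10000.
MaxJobCount=50000

# Do not attempt to schedule on every job submission/completion event;
# helps when thousands of sbatch calls arrive in a burst.
SchedulerParameters=defer
```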