Ok, here is some general profiling info:

the slurmctld process doesn't seem too terribly out of whack.  There are
bursts of 100% CPU usage, but they last no longer than a couple of seconds.

I ran gstack on the slurmctld process while squeue was running.  Again, I
didn't see anything that looked strange:

Thread 5 (Thread 0x41fc6940 (LWP 21939)):
#0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
#1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x420c7940 (LWP 21941)):
#0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
#1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x421c8940 (LWP 21942)):
#0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
#1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
#2  0x000000000042b838 in _slurmctld_signal_hand ()
#3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x422c9940 (LWP 21943)):
#0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000046a9c1 in slurmctld_state_save ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
#0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000449cbf in _wr_wrlock ()
#2  0x000000000044a1d5 in lock_slurmctld ()
#3  0x000000000042af5e in _slurmctld_background ()
#4  0x000000000042cb8c in main ()

However, I did run gstack and strace on squeue, and there things did look a
bit weird.
strace:
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
.....
This keeps going indefinitely; might this be the sorting algorithm?


gstack:
#0  0xffffffffff600410 in ?? ()
#1  0x0000003cc868c40d in time () from /lib64/libc.so.6
#2  0x000000000042608b in _get_start_time ()
#3  0x00000000004260e1 in _sort_job_by_time_start ()
#4  0x0000000000437958 in slurm_list_sort ()
#5  0x00000000004232af in print_jobs_array ()
#6  0x0000000000420234 in _get_info ()
#7  0x00000000004205a1 in main ()

squeue is pegged at 100% CPU the entire time.  I noticed that even if I pass
the parameter "-S u" to squeue, the function _sort_job_by_time_start is still
executed.

Some interesting slurmctld log entries:
[2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
[2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
[2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841


[2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
[2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
[2011-11-04T19:46:01] debug2: Performing purge of old job records
[2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576

Thanks for all your help,
Clay

On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
<[email protected]> wrote:

> In my understanding of the internals, squeue does not query each node, only
> the controller. It is interesting to know that the performance issue seems
> to be due to the EC2 partition, but further information is needed to
> determine the reason for such poor performance.
>
> The best way to find the problem is to observe the internal states of the
> controller while the "squeue -p aws" is running :
> - You should first look at the consumption of the slurmctld process to see
> if the problem is due to overly complex processing in an internal function
> of SLURM.
>  The easiest way to get that information is to use top and watch the
> consumption of the slurmctld process.
> - After that, in case you see no abnormal consumption, you should try to
> capture the state of the slurmctld process during the hang. You can do that
> just by using the gstack command to capture the stacks of the slurmctld
> threads and see where they are waiting. If you can send the result of
> "gstack $(pgrep slurmctld)", it could bring interesting information.
> - In case you do see abnormal consumption, gstack can help a little bit,
> but a more advanced tool, such as a profiler like oprofile, will be
> necessary to see where most of the time is spent in the controller.
>
> Do not hesitate to send the results of the first two tests (top, gstack);
> they could help to find the bottleneck.
>
> Regards,
> Matthieu
>
> Clay Teeter wrote:
>
>> I did notice something.
>>
>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
>> job IDs.  However, if I run "squeue -p aws" against the EC2 (aws)
>> partition, I see the 2m+ delay.  Does squeue query each node in the
>> partition?
>>
>> Thanks,
>> Clay
>>
>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
>>
>>    First, thanks for your response!
>>
>>    Yeah, I figured that SLURM shouldn't have a problem with the job
>>    count.
>>    First, some background.  Our network is split into two parts: a
>>    local partition and an EC2 partition.  For each task, two jobs
>>    (one dependent on the other) are submitted.  The first job
>>    computes a large data file and uploads it to S3; the second job
>>    pulls the file from S3 and copies it to our SAN.  In total, a
>>    large run requires about 30K jobs to be submitted.
>>
>>    I'm not sure when squeue started slowing down.  Currently, via sattr
>>    (about a 2s query), there are a little more than 17000 jobs running.
>>
>>    I'm worried that latency is the problem because half the cluster is
>>    in a remote datacenter... which, unfortunately, would be a show
>>    stopper.
>>
>>    Version: slurm 2.3.0-2
>>    ControlMachine=master
>>    AuthType=auth/munge
>>    CacheGroups=0
>>    CryptoType=crypto/munge
>>    MaxJobCount=30000
>>    MpiDefault=none
>>    MpiParams=ports=12000-12999
>>    ProctrackType=proctrack/pgid
>>    ReturnToService=1
>>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>>    SlurmctldPort=6817
>>    SlurmdPidFile=/var/run/slurm/slurmd.pid
>>    SlurmdPort=6818
>>    SlurmdSpoolDir=/var/slurm/spool
>>    SlurmUser=slurm
>>    StateSaveLocation=/var/slurm/state
>>    SwitchType=switch/none
>>    TaskPlugin=task/none
>>    TrackWCKey=yes
>>    InactiveLimit=0
>>    KillWait=30
>>    MinJobAge=300
>>    SlurmctldTimeout=120
>>    SlurmdTimeout=300
>>    Waittime=0
>>    FastSchedule=1
>>    SchedulerType=sched/backfill
>>    SchedulerPort=7321
>>    SelectType=select/cons_res
>>    SelectTypeParameters=CR_Core
>>    AccountingStorageHost=master
>>    AccountingStorageLoc=slurm_acct_db
>>    AccountingStoragePass=
>>    AccountingStorageType=accounting_storage/mysql
>>    AccountingStorageUser=
>>    AccountingStoreJobComment=YES
>>    ClusterName=hydra
>>    JobCompHost=master
>>    JobCompLoc=slurm_acct_db
>>    JobCompPass=cluster_service
>>    JobCompType=jobcomp/mysql
>>    JobCompUser=cluster_service
>>    JobAcctGatherFrequency=30
>>    JobAcctGatherType=jobacct_gather/linux
>>    SlurmctldDebug=3
>>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>    SlurmdDebug=3
>>    SlurmdLogFile=/var/log/slurm/slurmd.log
>>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
>>    ThreadsPerCore=2 State=UNKNOWN
>>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
>>    ThreadsPerCore=1 State=UNKNOWN
>>
>>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
>>    State=UP Shared=NO
>>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
>>    MaxTime=INFINITE State=UP Shared=NO
>>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
>>    State=UP Shared=NO
>>
>>
>>
>>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>>    <[email protected]> wrote:
>>
>>        Hi,
>>
>>        what is the version of SLURM that you are using? Can you send
>>        your slurm.conf for details of the configuration?
>>
>>        With 2.3.1 you should be able to manage at least 10K jobs. We
>>        ran tests with 2.2.7 plus backported patches from 2.3.x that
>>        showed good results with 10K jobs.
>>        At what number of jobs does the system start to hang?
>>
>>        Matthieu
>>
>>
>>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
>>
>>>        Hi group!
>>>
>>>        I'm seeing that with about 30K (or so) jobs in the queue,
>>>        squeue hangs for around 2 minutes before returning values.
>>>        Does anyone know a way around this?  I expect that this isn't
>>>        standard behavior.
>>>
>>>        Cheers,
>>>        Clay
>>>
>>
>>
>>
>>
>
