On Fri, 4 Nov 2011 15:34:18 -0700, Matthieu Hautreux <[email protected]> 
wrote:
> Well, the gstack of the slurmctld shows that the rpc of the squeue is already 
> finished. As the log says it took usec=6155478, that seems to be a 
> reasonable amount of time to process such a large number of jobs. As you 
> noticed, the problem seems to be in the squeue internals, which take 
> too much time to sort the results.
> 
> In the _print_jobs_array function of src/squeue/print.c, it seems that
> sort_jobs_by_start_time(l) is always called regardless of the
> requested format or sort parameters. It may be interesting to see how
> this part of the code could be enhanced to optimize the sort of a
> large set of jobs (perhaps avoiding calling time() at each comparison
> could be a good start). Moe, Danny, what do you think about this
> problem?

I bet the sort would go *much* faster if _get_start_time()
were pre-computed before the sort. That function is being called
for each sort comparison, and since the sort is O(N^2), well, you
get the idea -- that function is probably being called
a million times more often than it needs to be.

mark

> 
> Regards,
> Matthieu
> 
> 
> On 11/04/2011 08:53 PM, Clay Teeter wrote:
> Ok, here is some general profiling info:
> 
> the slurmctld process doesn't seem too terribly out of whack.  There are 
> bursts of 100% CPU usage, but they lasted no longer than a couple of seconds.
> 
> I ran gstack on the slurmctld process while squeue was running.  Again, I 
> didn't see anything that looked strange:
> 
> Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> #0  0x00002aaaaacb9df8 in select_p_job_test () from 
> /usr/lib64/slurm/select_cons_res.so
> #1  0x00002aaaabad8cf3 in backfill_agent () from 
> /usr/lib64/slurm/sched_backfill.so
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 4 (Thread 0x420c7940 (LWP 21941)):
> #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 3 (Thread 0x421c8940 (LWP 21942)):
> #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> #2  0x000000000042b838 in _slurmctld_signal_hand ()
> #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x422c9940 (LWP 21943)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x000000000046a9c1 in slurmctld_state_save ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x0000000000449cbf in _wr_wrlock ()
> #2  0x000000000044a1d5 in lock_slurmctld ()
> #3  0x000000000042af5e in _slurmctld_background ()
> #4  0x000000000042cb8c in main ()
> 
> However, i did a gstack and strace on squeue and things did look a bit weird.
> strace:
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> .....
> keeps on going, this might be the sorting algo?
> 
> 
> gstack:
> #0  0xffffffffff600410 in ?? ()
> #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> #2  0x000000000042608b in _get_start_time ()
> #3  0x00000000004260e1 in _sort_job_by_time_start ()
> #4  0x0000000000437958 in slurm_list_sort ()
> #5  0x00000000004232af in print_jobs_array ()
> #6  0x0000000000420234 in _get_info ()
> #7  0x00000000004205a1 in main ()
> 
> squeue is pegged at 100% the entire time.  I noticed that if I pass the 
> param -S u to squeue, the function _sort_job_by_time_start is still executed.
> 
> Some interesting slurmctld log entries:
> [2011-11-04T19:50:37] Warning: Note very large processing time from 
> _slurm_rpc_dump_jobs: usec=6155478
> [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: 
> usec=6158333
> [2011-11-04T19:50:37] Warning: Note very large processing time from 
> _slurmctld_background: usec=15995841
> 
> 
> [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> [2011-11-04T19:46:01] debug2: Performing purge of old job records
> [2011-11-04T19:46:01] Warning: Note very large processing time from 
> _slurmctld_background: usec=9779576
> 
> Thanks for all your help,
> Clay
> 
> On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu 
> <[email protected]> wrote:
> In my understanding of the internals, squeue does not query each node, but 
> only the controller. It is interesting to know that the performance issue 
> seems to be due to the EC2 partition, but further information is needed to 
> understand the reason for such bad performance.
> 
> The best way to find the problem is to observe the internal state of the 
> controller while the "squeue -p aws" is running:
> - You should first look at the CPU consumption of the slurmctld process to 
> see whether the delay is due to overly complex processing in an internal 
> function of slurm.
>  The easiest way to get that information is to use top and look at the 
> consumption of the slurmctld process.
> - After that, in case you see no abnormal consumption, you should try to 
> capture the state of the slurmctld process during the hang. You can do that 
> just by using the gstack command to capture the stacks of the slurmctld 
> threads and see where they are waiting. If you can send the result of 
> "gstack $(pgrep slurmctld)", it could bring interesting information.
> - In case you see abnormal consumption, gstack can help a little bit, 
> but a more evolved tool, a profiler such as oprofile, will be 
> necessary to see where most of the time is spent in the controller.
> 
> Do not hesitate to send the results of the first two tests (top, gstack); 
> they could help to find the bottleneck.
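A minimal sketch of those two quick checks as a script. The daemon name is a
parameter (run it as e.g. "sh check.sh slurmctld"); gstack ships with the gdb
package and may not be installed everywhere, so the script degrades gracefully
and falls back to inspecting the current shell when no name is given.

```shell
#!/bin/sh
# check.sh -- snapshot a daemon's CPU use and thread stacks.
name=${1:-}
if [ -n "$name" ]; then
    pid=$(pgrep -o -x "$name") || { echo "no $name process found"; exit 1; }
else
    pid=$$   # fallback: inspect this shell so the script runs anywhere
fi
# 1) CPU consumption snapshot (the same information top shows interactively)
ps -o pid,pcpu,etime,comm -p "$pid"
# 2) Thread stacks, to see where the threads are waiting
if command -v gstack >/dev/null 2>&1; then
    gstack "$pid"
else
    echo "gstack not found; install gdb to capture thread stacks"
fi
```

Sampling the stacks a few times in a row (a loop around the gstack call) gives
a crude profile: whichever frame keeps appearing is where the time goes.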
> 
> Regards,
> Matthieu
> 
> Clay Teeter wrote:
> I did notice something.
> 
> Issuing squeue against the local partition is quite speedy, ~2s for 10K job 
> ids.  However, if I "squeue -p aws" the ec2 (aws) partition, I see the 2m+ 
> delay.  Does squeue query each node in the partition?
> 
> Thanks,
> Clay
> 
> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter 
> <[email protected]> wrote:
> 
>    First, thanks for your response!
> 
>    Yeah, I figured that slurm shouldn't have a problem with the job
>    count.
>    First, some background.   Our network is split into two parts: a
>    local partition and an ec2 partition.  For each task, two jobs
>    (one dependent on the other) are submitted.  The first job
>    computes a large datafile and submits it to s3; the second job pulls
>    the files from s3 and copies them to our san. In total, a large job
>    requires about 30K jobs to be submitted.
> 
>    I'm not sure when squeue started slowing down.  Currently, via sattr
>    (about a 2s query), there are a little more than 17000 jobs running.
> 
>    I'm worried that because half the cluster is in a remote datacenter,
>    latency is the problem.... which, unfortunately, would be a show
>    stopper.
> 
>    Version: slurm 2.3.0-2
>    ControlMachine=master
>    AuthType=auth/munge
>    CacheGroups=0
>    CryptoType=crypto/munge
>    MaxJobCount=30000
>    MpiDefault=none
>    MpiParams=ports=12000-12999
>    ProctrackType=proctrack/pgid
>    ReturnToService=1
>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>    SlurmctldPort=6817
>    SlurmdPidFile=/var/run/slurm/slurmd.pid
>    SlurmdPort=6818
>    SlurmdSpoolDir=/var/slurm/spool
>    SlurmUser=slurm
>    StateSaveLocation=/var/slurm/state
>    SwitchType=switch/none
>    TaskPlugin=task/none
>    TrackWCKey=yes
>    InactiveLimit=0
>    KillWait=30
>    MinJobAge=300
>    SlurmctldTimeout=120
>    SlurmdTimeout=300
>    Waittime=0
>    FastSchedule=1
>    SchedulerType=sched/backfill
>    SchedulerPort=7321
>    SelectType=select/cons_res
>    SelectTypeParameters=CR_Core
>    AccountingStorageHost=master
>    AccountingStorageLoc=slurm_acct_db
>    AccountingStoragePass=
>    AccountingStorageType=accounting_storage/mysql
>    AccountingStorageUser=
>    AccountingStoreJobComment=YES
>    ClusterName=hydra
>    JobCompHost=master
>    JobCompLoc=slurm_acct_db
>    JobCompPass=cluster_service
>    JobCompType=jobcomp/mysql
>    JobCompUser=cluster_service
>    JobAcctGatherFrequency=30
>    JobAcctGatherType=jobacct_gather/linux
>    SlurmctldDebug=3
>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
>    SlurmdDebug=3
>    SlurmdLogFile=/var/log/slurm/slurmd.log
>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
>    ThreadsPerCore=2 State=UNKNOWN
>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
>    ThreadsPerCore=1 State=UNKNOWN
> 
>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
>    State=UP Shared=NO
>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
>    MaxTime=INFINITE State=UP Shared=NO
>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
>    State=UP Shared=NO
> 
> 
> 
>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>    <[email protected]> wrote:
> 
>        Hi,
> 
>        What version of SLURM are you using? Can you send
>        your slurm.conf for details of the configuration?
> 
>        With 2.3.1 you should be able to manage at least 10k jobs. We
>        made tests with 2.2.7 plus backported patches from 2.3.x that
>        showed good results with 10K jobs.
>        At what number of jobs does the system start to hang?
> 
>        Matthieu
> 
> 
>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
>        Hi group!
> 
>        I'm seeing that with about 30K (or so) jobs in the queue,
>        squeue hangs
>        for around 2 minutes before returning values.  Does anyone
>        know a way
>        around this?  I expect that this isn't standard behavior.
> 
>        Cheers,
>        Clay
> 
> 
> 
> 
> 
> 