I'll see if I can recompile the sort out of squeue. While there would be a
speedup, I can't imagine that the entire 2m is spent sorting, especially
because 17K jobs returned from a local (not EC2) partition take only a
couple of seconds.
In any case, I'll get back to the list with my findings in a couple of days.

Have a great weekend!

Thanks,
Clay

On Fri, Nov 4, 2011 at 3:50 PM, Rayson Ho <[email protected]> wrote:
> On Fri, Nov 4, 2011 at 6:34 PM, Matthieu Hautreux
> <[email protected]> wrote:
> > In the _print_jobs_array function of src/squeue/print.c, it seems that
> > sort_jobs_by_start_time (l) is always called regardless of the requested
> > format or sort parameters. It may be interesting to see how this part of
> > the code could be enhanced to optimize the sort of a large set of jobs
> > (perhaps avoiding calling time() at each comparison could be a good start).
>
> Just a quick note - the overhead of time(2) is pretty low on modern
> Linux and many other modern operating systems. On modern processors,
> executing 30K time(2) calls would take less than 0.5 sec, as it is not
> a full system call anymore (Google: VDSO+gettimeofday).
>
> Rayson
>
> >
> > Moe, Danny, what do you think about that problem?
> >
> > Regards,
> > Matthieu
> >
> >
> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> >
> > Ok, here is some general profiling info:
> >
> > The slurmctld process doesn't seem too terribly out of whack. There are
> > bursts of 100% CPU usage, but they lasted no longer than a couple of
> > seconds. I ran gstack on the slurmctld process while squeue was running.
> > Again, I didn't see anything that looked strange:
> >
> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> > #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> > #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > #1  0x0000000000449cbf in _wr_wrlock ()
> > #2  0x000000000044a1d5 in lock_slurmctld ()
> > #3  0x000000000042af5e in _slurmctld_background ()
> > #4  0x000000000042cb8c in main ()
> >
> > However, I did a gstack and strace on squeue and things did look a
> > bit weird.
> > strace:
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > .....
> > keeps on going, this might be the sorting algo?
> >
> > gstack:
> > #0  0xffffffffff600410 in ?? ()
> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> > #2  0x000000000042608b in _get_start_time ()
> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> > #4  0x0000000000437958 in slurm_list_sort ()
> > #5  0x00000000004232af in print_jobs_array ()
> > #6  0x0000000000420234 in _get_info ()
> > #7  0x00000000004205a1 in main ()
> >
> > squeue is pegged at 100% the entire time. I noticed that if I pass the
> > param -S u to squeue, the function _sort_job_by_time_start is still
> > executed.
> >
> > Some interesting slurmctld log entries:
> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> > [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
> >
> > [2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding locks
> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> > [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
> >
> > Thanks for all your help,
> > Clay
> >
> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
> > <[email protected]> wrote:
> >>
> >> In my understanding of the internals, squeue does not query each node
> >> but only the controller. It is interesting to know that the performance
> >> issue seems to be tied to the EC2 partition, but further information is
> >> needed to find the reason for such bad performance.
> >>
> >> The best way to find the problem is to observe the internal state of
> >> the controller while "squeue -p aws" is running:
> >> - You should first look at the consumption of the slurmctld process to
> >> see if the problem is due to too high a processing complexity in an
> >> internal function of slurm. The easiest way to get that information is
> >> to use top and look at the consumption of the slurmctld process.
> >> - After that, in case you see no abnormal consumption, you should try
> >> to capture the state of the slurmctld process during the hang. You can
> >> do that just by using the gstack command to capture the stacks of the
> >> slurmctld threads and see where they are waiting. If you can send the
> >> result of "gstack $(pgrep slurmctld)", it could bring interesting
> >> information.
> >> - In case you see an abnormal consumption, gstack can help a little
> >> bit, but a more evolved tool, a profiler such as oprofile, will be
> >> necessary to see where most of the time is spent in the controller.
> >>
> >> Do not hesitate to send the results of the first 2 tests (top, gstack);
> >> they could help to find the bottleneck.
> >>
> >> Regards,
> >> Matthieu
> >>
> >> Clay Teeter wrote:
> >>>
> >>> I did notice something.
> >>>
> >>> Issuing squeue against the local partition is quite speedy, ~2s for
> >>> 10K job ids. However, if I "squeue -p aws" the ec2 (aws) partition, I
> >>> see the 2m+ delay. Does squeue query each node in the partition?
> >>>
> >>> Thanks,
> >>> Clay
> >>>
> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>>
> >>> First, thanks for your response!
> >>>
> >>> Yeah, I figured that slurm shouldn't have a problem with the job
> >>> count.
> >>>
> >>> First, some background. Our network is split into two parts: a
> >>> local partition and an ec2 partition. For each task, two jobs
> >>> (one dependent on the other) are submitted.
> >>> The first part
> >>> computes a large datafile and submits it to s3; a second job pulls
> >>> the files from s3 and copies them to our san. In total, a large job
> >>> requires about 30K jobs to be submitted.
> >>>
> >>> I'm not sure when squeue started slowing down. Currently, via sattr
> >>> (about a 2s query), there are a little more than 17000 jobs running.
> >>>
> >>> I'm worried that because half the cluster is in a remote datacenter,
> >>> latency is the problem... which, unfortunately, would be a show
> >>> stopper.
> >>>
> >>> Version: slurm 2.3.0-2
> >>> ControlMachine=master
> >>> AuthType=auth/munge
> >>> CacheGroups=0
> >>> CryptoType=crypto/munge
> >>> MaxJobCount=30000
> >>> MpiDefault=none
> >>> MpiParams=ports=12000-12999
> >>> ProctrackType=proctrack/pgid
> >>> ReturnToService=1
> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> >>> SlurmctldPort=6817
> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
> >>> SlurmdPort=6818
> >>> SlurmdSpoolDir=/var/slurm/spool
> >>> SlurmUser=slurm
> >>> StateSaveLocation=/var/slurm/state
> >>> SwitchType=switch/none
> >>> TaskPlugin=task/none
> >>> TrackWCKey=yes
> >>> InactiveLimit=0
> >>> KillWait=30
> >>> MinJobAge=300
> >>> SlurmctldTimeout=120
> >>> SlurmdTimeout=300
> >>> Waittime=0
> >>> FastSchedule=1
> >>> SchedulerType=sched/backfill
> >>> SchedulerPort=7321
> >>> SelectType=select/cons_res
> >>> SelectTypeParameters=CR_Core
> >>> AccountingStorageHost=master
> >>> AccountingStorageLoc=slurm_acct_db
> >>> AccountingStoragePass=
> >>> AccountingStorageType=accounting_storage/mysql
> >>> AccountingStorageUser=
> >>> AccountingStoreJobComment=YES
> >>> ClusterName=hydra
> >>> JobCompHost=master
> >>> JobCompLoc=slurm_acct_db
> >>> JobCompPass=cluster_service
> >>> JobCompType=jobcomp/mysql
> >>> JobCompUser=cluster_service
> >>> JobAcctGatherFrequency=30
> >>> JobAcctGatherType=jobacct_gather/linux
> >>> SlurmctldDebug=3
> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> >>> SlurmdDebug=3
> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
> >>>
> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
> >>> ThreadsPerCore=2 State=UNKNOWN
> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
> >>> ThreadsPerCore=1 State=UNKNOWN
> >>>
> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
> >>> State=UP Shared=NO
> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO
> >>> MaxTime=INFINITE State=UP Shared=NO
> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
> >>> State=UP Shared=NO
> >>>
> >>>
> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
> >>> <[email protected] <mailto:[email protected]>> wrote:
> >>>
> >>> Hi,
> >>>
> >>> What is the version of SLURM that you are using? Can you send
> >>> your slurm.conf for details of the configuration?
> >>>
> >>> With 2.3.1 you should be able to manage at least 10k jobs. We
> >>> ran tests with 2.2.7 plus backported patches from 2.3.x that
> >>> showed good results with 10K jobs.
> >>> What is the number of jobs at which the system starts to hang?
> >>>
> >>> Matthieu
> >>>
> >>>
> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
> >>>>
> >>>> Hi group!
> >>>>
> >>>> I'm seeing that with about 30K (or so) jobs in the queue, squeue
> >>>> hangs for around 2 minutes before returning values. Does anyone
> >>>> know a way around this? I expect that this isn't standard behavior.
> >>>>
> >>>> Cheers,
> >>>> Clay
> >>>
> >>
> >
>
> --
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
