It looks like the lag is definitely hardware/squeue related. I recompiled squeue on a modern OS and things became snappy (1-3s for the results vs. 2m on the cluster node).
A couple of things to mention about the cluster (slow) node:
- cat /proc/sys/kernel/vsyscall64 = 0
- the timing loop posted by Mark took >1m

I'll implement the suggestions by Mark and Rayson and see what happens!

Thanks,
Clay

On Sat, Nov 5, 2011 at 8:08 AM, Mark A. Grondona <[email protected]> wrote:
> On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]> wrote:
> > On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
> > > This is definitely true, but the fact that Clay's debugger caught
> > > squeue in the time(2) system call means something strange might be
> > > going on with that call.
> >
> > That's possible, if it is running on an old Linux kernel, or if
> > /proc/sys/kernel/vsyscall64 is 0.
>
> Also depends on the timer used by gettimeofday.
>
> > (And BTW, that's 100M calls)
>
> Oh, yeah, I increased it to get >1s runtime on my system. Sorry!
>
> > > Clay, you might try running the following
> > > benchmark on your system just to see how long 10M time(2) calls take:
> >
> > According to the comments in src/common/list.c, list_sort() has a
> > complexity of O(n^2). So 30K jobs with complexity of n^2, it's like
> > 900M calls to time(2)!!
>
> Yeah, many of the sort compare functions in squeue have this problem.
> Lots of code needlessly recomputes the same values over and over again
> (n^2 - n times, right?)
>
> > If that really is the case, then it would be slow to execute almost 1
> > billion time(2) calls even if the overhead is really low for each
> > call. (In the end, a VDSO call is not free)
>
> FWIW, I tried to reproduce Clay's problem on a test system
> (though I only got up to 20K jobs), and was not able to.
>
> > I am not sure how job->start_time gets assigned in the code (the
> > header says "time_t start_time; /* time execution begins, actual
> > or expected */", so I would expect it to be unset for pending jobs
> > without expected start time.
> > I don't have a SLURM cluster that I can
> > fool around with and attach a debugger to, yet there are too many
> > start_time variables in the code that are blocking me from digging
> > deeper!!)
> >
> > - If job->start_time really represents the expected start time for
> >   every job, then I guess calling time(2) once and passing the value to
> >   the comparison function would make sense.
> >
> > - But if in the common case start_time has a value of 0, then the
> >   following patch should help (I've checked with gcc 4.4.5; even at -O3,
> >   there's no sinking of time(NULL) below the first if statement, so we
> >   need to do it ourselves):
>
> Your patch is a good start I would think. However, the start_time
> is still needlessly recomputed for every pending job (I _think_
> pending jobs' start_time is set to submit time, until they actually
> start). So if you have 30K pending jobs you'd be in the same boat.
>
> Also, think about what happens when the sort takes >1s. The start
> time of these pending jobs starts to change! That seems wrong.
>
> The best fix for this one problem (if it is the problem) would be
> to set a global "now" that doesn't change, so the time(2) call
> only needs to be made once, and every pending job gets the same
> value for start_time.
>
> In general, the sorting and other code in squeue could be greatly
> improved by precomputing results or otherwise memoizing them
> (i.e. storing results in a hash that could be checked before
> recomputing the same result you had before). A good example
> is _sort_job_by_node_list(). That function does 'a lot' of work,
> and like Rayson said, it could be called up to 900M times
> needlessly for 30K jobs!
>
> BTW, in my testing, profiling showed that the slowest part of
> squeue was the uid_to_string() call, which is called at least
> once for every job (in my case, every job had the same uid,
> so that call should have only been made once). Removing that call
> more than halved the runtime of squeue.
>
> mark
>
> > [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
> > --- sort.c.original   2011-11-04 22:47:02.001064100 -0400
> > +++ src/squeue/sort.c 2011-11-04 22:47:55.187063769 -0400
> > @@ -582,13 +582,13 @@
> >  }
> >
> >  static uint32_t _get_start_time(job_info_t *job)
> >  {
> > -    time_t now = time(NULL);
> > +    time_t now;
> >
> >      if (job->start_time == (time_t) 0)
> >          return 0xffffffff;
> > -    if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> > +    if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
> >          return (uint32_t) now;
> >      return (uint32_t) job->start_time;
> >  }
> >
> > Rayson
> >
> > > #include <time.h>
> > > int main (int ac, char **av)
> > > {
> > >     unsigned long i;
> > >     time_t t;
> > >     for (i = 0; i < 100000000UL; i++)
> > >         t = time(NULL);
> > >     return 0;
> > > }
> > >
> > > On my system I get:
> > >
> > > $ time ./timebench
> > >
> > > real    0m3.794s
> > > user    0m3.791s
> > >
> > > mark
> > >
> > >> Rayson
> > >>
> > >> > Moe, Danny, what do you think about that problem?
> > >> >
> > >> > Regards,
> > >> > Matthieu
> > >> >
> > >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> > >> >
> > >> > Ok, here is some general profiling info:
> > >> > The slurmctld process doesn't seem too terribly out of whack. There
> > >> > are bursts of 100% CPU usage, but they lasted no longer than a couple
> > >> > of seconds.
> > >> > I ran gstack on the slurmctld process while squeue was running.
> > >> > Again, I didn't see anything that looked strange:
> > >> >
> > >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> > >> > #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> > >> > #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> > >> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> > >> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> > >> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> > >> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> > >> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> > >> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >> > #1  0x0000000000449cbf in _wr_wrlock ()
> > >> > #2  0x000000000044a1d5 in lock_slurmctld ()
> > >> > #3  0x000000000042af5e in _slurmctld_background ()
> > >> > #4  0x000000000042cb8c in main ()
> > >> >
> > >> > However, I did a gstack and strace on squeue, and things did look a
> > >> > bit weird.
> > >> >
> > >> > strace:
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > .....
> > >> > keeps on going; this might be the sorting algo?
> > >> >
> > >> > gstack:
> > >> > #0  0xffffffffff600410 in ?? ()
> > >> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> > >> > #2  0x000000000042608b in _get_start_time ()
> > >> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> > >> > #4  0x0000000000437958 in slurm_list_sort ()
> > >> > #5  0x00000000004232af in print_jobs_array ()
> > >> > #6  0x0000000000420234 in _get_info ()
> > >> > #7  0x00000000004205a1 in main ()
> > >> >
> > >> > squeue is pegged at 100% the entire time. I noticed that if I pass
> > >> > the param -S u to squeue, the function _sort_job_by_time_start is
> > >> > still executed.
> > >> >
> > >> > Some interesting slurmctld log entries:
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
> > >> >
> > >> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> > >> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> > >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> > >> > [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
> > >> >
> > >> > Thanks for all your help,
> > >> > Clay
> > >> >
> > >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:
> > >> >>
> > >> >> In my understanding of the internals, squeue does not query each node but
> > >> >> only the controller. It is interesting to know that the performance issue
> > >> >> seems to be due to the EC2 partition, but it requires further information
> > >> >> to know the reason for such bad performance.
> > >> >>
> > >> >> The best way to find the problem is to observe the internal state of the
> > >> >> controller while the "squeue -p aws" is running:
> > >> >> - You should first look at the consumption of the slurmctld process to see
> > >> >> if it is due to a too-high complexity of processing in an internal function
> > >> >> of slurm. The easiest way to get that information is by using top and looking
> > >> >> at the consumption of the slurmctld process.
> > >> >> - After that, in case you see no abnormal consumption, you should try to
> > >> >> capture the state of the slurmctld process during the hang. You can do that
> > >> >> just by using the gstack command to capture the stacks of the slurmctld
> > >> >> threads and see where they are waiting. If you can send the result of
> > >> >> "gstack $(pgrep slurmctld)", it could bring interesting information.
> > >> >> - In case you do see abnormal consumption, gstack can help a little bit,
> > >> >> but a more evolved tool, a profiler such as oprofile, will be necessary
> > >> >> to see where most of the time is spent in the controller.
> > >> >>
> > >> >> Do not hesitate to send the results of the first two tests (top, gstack);
> > >> >> they could help to find the bottleneck.
> > >> >>
> > >> >> Regards,
> > >> >> Matthieu
> > >> >>
> > >> >> Clay Teeter wrote:
> > >> >>>
> > >> >>> I did notice something.
> > >> >>>
> > >> >>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
> > >> >>> job ids. However, if I "squeue -p aws" the ec2 (aws) partition, I see
> > >> >>> the 2m+ delay. Does squeue query each node in the partition?
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Clay
> > >> >>>
> > >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
> > >> >>>
> > >> >>> First, thanks for your response!
> > >> >>>
> > >> >>> Yeah, I figured that slurm shouldn't have a problem with the job
> > >> >>> count.
> > >> >>> First, some background. Our network is split into two parts: a
> > >> >>> local partition and an ec2 partition. And, for each task, two jobs
> > >> >>> (one dependent on the other) are submitted. The first job
> > >> >>> computes a large datafile and submits it to s3; the second job pulls
> > >> >>> the files from s3 and copies them to our san. In total, a large job
> > >> >>> requires about 30K jobs to be submitted.
> > >> >>>
> > >> >>> I'm not sure when squeue started slowing down. Currently, via sattr
> > >> >>> (about a 2s query), there are a little more than 17000 jobs running.
> > >> >>>
> > >> >>> I'm worried that because half the cluster is in a remote datacenter,
> > >> >>> latency is the problem... which, unfortunately, would be a show
> > >> >>> stopper.
> > >> >>>
> > >> >>> Version: slurm 2.3.0-2
> > >> >>>
> > >> >>> ControlMachine=master
> > >> >>> AuthType=auth/munge
> > >> >>> CacheGroups=0
> > >> >>> CryptoType=crypto/munge
> > >> >>> MaxJobCount=30000
> > >> >>> MpiDefault=none
> > >> >>> MpiParams=ports=12000-12999
> > >> >>> ProctrackType=proctrack/pgid
> > >> >>> ReturnToService=1
> > >> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> > >> >>> SlurmctldPort=6817
> > >> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
> > >> >>> SlurmdPort=6818
> > >> >>> SlurmdSpoolDir=/var/slurm/spool
> > >> >>> SlurmUser=slurm
> > >> >>> StateSaveLocation=/var/slurm/state
> > >> >>> SwitchType=switch/none
> > >> >>> TaskPlugin=task/none
> > >> >>> TrackWCKey=yes
> > >> >>> InactiveLimit=0
> > >> >>> KillWait=30
> > >> >>> MinJobAge=300
> > >> >>> SlurmctldTimeout=120
> > >> >>> SlurmdTimeout=300
> > >> >>> Waittime=0
> > >> >>> FastSchedule=1
> > >> >>> SchedulerType=sched/backfill
> > >> >>> SchedulerPort=7321
> > >> >>> SelectType=select/cons_res
> > >> >>> SelectTypeParameters=CR_Core
> > >> >>> AccountingStorageHost=master
> > >> >>> AccountingStorageLoc=slurm_acct_db
> > >> >>> AccountingStoragePass=
> > >> >>> AccountingStorageType=accounting_storage/mysql
> > >> >>> AccountingStorageUser=
> > >> >>> AccountingStoreJobComment=YES
> > >> >>> ClusterName=hydra
> > >> >>> JobCompHost=master
> > >> >>> JobCompLoc=slurm_acct_db
> > >> >>> JobCompPass=cluster_service
> > >> >>> JobCompType=jobcomp/mysql
> > >> >>> JobCompUser=cluster_service
> > >> >>> JobAcctGatherFrequency=30
> > >> >>> JobAcctGatherType=jobacct_gather/linux
> > >> >>> SlurmctldDebug=3
> > >> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> > >> >>> SlurmdDebug=3
> > >> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
> > >> >>>
> > >> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
> > >> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> > >> >>>
> > >> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
> > >> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
> > >> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
> > >> >>>
> > >> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux <[email protected]> wrote:
> > >> >>>
> > >> >>> Hi,
> > >> >>>
> > >> >>> What is the version of SLURM that you are using? Can you send
> > >> >>> your slurm.conf for details of the configuration?
> > >> >>>
> > >> >>> With 2.3.1 you should be able to manage at least 10k jobs. We
> > >> >>> made tests with 2.2.7 and backported patches of 2.3.x that
> > >> >>> showed good results with 10K jobs.
> > >> >>> What is the number of jobs at which the system starts to hang?
> > >> >>>
> > >> >>> Matthieu
> > >> >>>
> > >> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
> > >> >>>>
> > >> >>>> Hi group!
> > >> >>>>
> > >> >>>> I'm seeing that with about 30K (or so) jobs in the queue, squeue
> > >> >>>> hangs for around 2 minutes before returning values. Does anyone
> > >> >>>> know a way around this? I expect that this isn't standard behavior.
> > >> >>>>
> > >> >>>> Cheers,
> > >> >>>> Clay
> > >>
> > >> --
> > >> ==================================================
> > >> Open Grid Scheduler - The Official Open Source Grid Engine
> > >> http://gridscheduler.sourceforge.net/
> >
> > --
> > ==================================================
> > Open Grid Scheduler - The Official Open Source Grid Engine
> > http://gridscheduler.sourceforge.net/
