On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]> wrote:
> On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
> > This is definitely true, but the fact that Clay's debugger caught
> > squeue in the time(2) system call means something strange might be
> > going on with that call.
> 
> That's possible, if it is running on an old Linux kernel, or if
> /proc/sys/kernel/vsyscall64 is 0.

Also depends on the timer used by gettimeofday.

> (And BTW, that's 100M calls)

Oh, yeah, I increased it to get >1s runtime on my system. Sorry!
 
> > Clay, you might try running the following
> > benchmark on your system just to see how long 10M time(2) calls take:
> 
> According to the comments in src/common/list.c, list_sort() has a
> complexity of O(n^2). So with 30K jobs and O(n^2) comparisons, that's
> roughly 900M calls to time(2)!!

Yeah, many of the sort compare functions in squeue have this problem.
Lots of code needlessly recomputes the same values over and over again
(n^2 - n times, right?)

> If that really is the case, then it would be slow to execute almost 1
> billion time(2) calls even if the overhead is really low for each
> call. (In the end, a VDSO call is not free)

FWIW, I tried to reproduce Clay's problem on a test system
(though I only got up to 20K jobs), and was not able to.


> I am not sure how job->start_time gets assigned in the code (the
> header says "time_t start_time;      /* time execution begins, actual
> or expected */", so I would expect it to be unset for pending jobs
> without expected start time. I don't have a SLURM cluster that I can
> fool around with and attach a debugger to, and there are too many
> start_time variables in the code, which is blocking me from digging
> deeper!!).
> 
> - If job->start_time really represents the expected start time for
> every job, then I guess calling time(2) once and passing the value to
> the comparison function would make sense.
> 
> - But if, in the common case, start_time has a value of 0, then the
> following patch should help (I've checked with gcc 4.4.5: even at -O3,
> there's no sinking of time(NULL) below the first if statement, so we
> need to do it ourselves):

Your patch is a good start, I would think. However, start_time
is still needlessly recomputed for every pending job (I _think_
a pending job's start_time is set to submit time, until it actually
starts). So if you have 30K pending jobs you'd be in the same boat.

Also, think about what happens when the sort takes >1s. The start
time of these pending jobs starts to change! That seems wrong.

The best fix for this one problem (if it is the problem) would be
to set a global "now" that doesn't change, so the time(2) call
only needs to be made once, and every pending job gets the same
value for start_time.

In general, the sorting and other code in squeue could be greatly
improved by precomputing values or otherwise memoizing them
(i.e. storing results in a hash that is checked before
recomputing a result you already have). A good example
is _sort_job_by_node_list(). That function does 'a lot' of work,
and as Rayson said, it could be called up to 900M times
needlessly for 30K jobs!

BTW, in my testing, profiling showed that the slowest part of
squeue was the uid_to_string() call, which is called at least
once for every job (in my case, every job had the same uid,
so that call should have only been made once). Removing that call
more than halved the runtime of squeue.

mark
 
> 
> [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
> --- sort.c.original     2011-11-04 22:47:02.001064100 -0400
> +++ src/squeue/sort.c   2011-11-04 22:47:55.187063769 -0400
> @@ -582,13 +582,13 @@
>  }
> 
>  static uint32_t _get_start_time(job_info_t *job)
>  {
> -       time_t now = time(NULL);
> +       time_t now;
> 
>         if (job->start_time == (time_t) 0)
>                 return 0xffffffff;
> -       if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> +       if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
>                 return (uint32_t) now;
>         return (uint32_t) job->start_time;
>  }
> 
> Rayson
> 
> 
> >
> >  #include <time.h>
> >  int main (int ac, char **av)
> >  {
> >        unsigned long i;
> >        time_t t;
> >        for (i = 0; i < 100000000UL; i++)
> >            t = time(NULL);
> >        return 0;
> >  }
> >
> > On my system I get:
> >
> >  $ time ./timebench
> >
> >  real    0m3.794s
> >  user    0m3.791s
> >
> >
> > mark
> >
> >
> >
> >> Rayson
> >>
> >>
> >>
> >> > Moe, Danny, what do you think about that problem ?
> >> >
> >> > Regards,
> >> > Matthieu
> >> >
> >> >
> >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> >> >
> >> > Ok, here is some general profiling info:
> >> > the slurmctld process doesn't seem too terribly out of whack.  There are
> >> > bursts of 100% CPU usage, but they lasted no longer than a couple of
> >> > seconds.
> >> > I ran gstack on the slurmctld process while squeue was running.  Again, I
> >> > didn't see anything that looked strange:
> >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> >> > #0  0x00002aaaaacb9df8 in select_p_job_test () from
> >> > /usr/lib64/slurm/select_cons_res.so
> >> > #1  0x00002aaaabad8cf3 in backfill_agent () from
> >> > /usr/lib64/slurm/sched_backfill.so
> >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> >> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> >> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> >> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> >> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> >> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> >> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >> > /lib64/libpthread.so.0
> >> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >> > /lib64/libpthread.so.0
> >> > #1  0x0000000000449cbf in _wr_wrlock ()
> >> > #2  0x000000000044a1d5 in lock_slurmctld ()
> >> > #3  0x000000000042af5e in _slurmctld_background ()
> >> > #4  0x000000000042cb8c in main ()
> >> > However, i did a gstack and strace on squeue and things did look a
> >> > bit weird.
> >> > strace:
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > time(NULL)                              = 1320435240
> >> > .....
> >> > keeps on going, this might be the sorting algo?
> >> >
> >> > gstack:
> >> > #0  0xffffffffff600410 in ?? ()
> >> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> >> > #2  0x000000000042608b in _get_start_time ()
> >> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> >> > #4  0x0000000000437958 in slurm_list_sort ()
> >> > #5  0x00000000004232af in print_jobs_array ()
> >> > #6  0x0000000000420234 in _get_info ()
> >> > #7  0x00000000004205a1 in main ()
> >> > squeue is pegged at 100% the entire time.  I noticed that if I pass
> >> > the param -S u to squeue, the function _sort_job_by_time_start is still
> >> > executed.
> >> > Some interesting slurmctld log entries:
> >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > _slurm_rpc_dump_jobs: usec=6155478
> >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > schedule: usec=6158333
> >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > _slurmctld_background: usec=15995841
> >> >
> >> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> >> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> >> > [2011-11-04T19:46:01] Warning: Note very large processing time from
> >> > _slurmctld_background: usec=9779576
> >> > Thanks for all your help,
> >> > Clay
> >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu 
> >> > <[email protected]>
> >> > wrote:
> >> >>
> >> >> In my understanding of the internals, squeue does not query each node,
> >> >> only the controller. It is interesting to know that the performance issue
> >> >> seems to be tied to the EC2 partition, but further information is needed
> >> >> to find the reason for such bad performance.
> >> >>
> >> >> The best way to find the problem is to observe the internal state of the
> >> >> controller while "squeue -p aws" is running:
> >> >> - You should first look at the CPU consumption of the slurmctld process to
> >> >> see if the slowness is due to overly complex processing in an internal
> >> >> function of SLURM. The easiest way to get that information is to run top
> >> >> and look at the consumption of the slurmctld process.
> >> >> - After that, in case you see no abnormal consumption, you should try to
> >> >> capture the state of the slurmctld process during the hang. You can do that
> >> >> just by using the gstack command to capture the stacks of the slurmctld
> >> >> threads and see where they are waiting. If you can send the result of
> >> >> "gstack $(pgrep slurmctld)", it could provide interesting information.
> >> >> - In case you do see abnormal consumption, gstack can help a little
> >> >> bit, but a more advanced tool, such as a profiler like oprofile, will be
> >> >> necessary to see where most of the time is spent in the controller.
> >> >>
> >> >> Do not hesitate to send the results of the first two tests (top, gstack);
> >> >> they could help to find the bottleneck.
> >> >>
> >> >> Regards,
> >> >> Matthieu
> >> >>
> >> >> Clay Teeter wrote:
> >> >>>
> >> >>> I did notice something.
> >> >>>
> >> >>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
> >> >>> job IDs.  However, if I "squeue -p aws" the EC2 (aws) partition I see the
> >> >>> 2m+ delay.  Does squeue query each node in the partition?
> >> >>>
> >> >>> Thanks,
> >> >>> Clay
> >> >>>
> >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
> >> >>>
> >> >>>    First, thanks for your response!
> >> >>>
> >> >>>    Yeah, I figured that slurm shouldn't have a problem with the job
> >> >>>    count.
> >> >>>    First, some background.   Our network is split into two parts: a
> >> >>>    local partition and an EC2 partition.  For each task, two jobs
> >> >>>    (one dependent on the other) are submitted.  The first job
> >> >>>    computes a large datafile and submits it to S3; the second job pulls
> >> >>>    the file from S3 and copies it to our SAN. In total a large job
> >> >>>    requires about 30K jobs to be submitted.
> >> >>>
> >> >>>    I'm not sure when squeue started slowing down.  Currently, via sattr
> >> >>>    (about a 2s query), there are a little more than 17000 jobs running.
> >> >>>
> >> >>>    I'm worried that because half the cluster is in a remote datacenter,
> >> >>>    latency is the problem... which, unfortunately, would be a show
> >> >>>    stopper.
> >> >>>
> >> >>>    Version: slurm 2.3.0-2
> >> >>>    ControlMachine=master
> >> >>>    AuthType=auth/munge
> >> >>>    CacheGroups=0
> >> >>>    CryptoType=crypto/munge
> >> >>>    MaxJobCount=30000
> >> >>>    MpiDefault=none
> >> >>>    MpiParams=ports=12000-12999
> >> >>>    ProctrackType=proctrack/pgid
> >> >>>    ReturnToService=1
> >> >>>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> >> >>>    SlurmctldPort=6817
> >> >>>    SlurmdPidFile=/var/run/slurm/slurmd.pid
> >> >>>    SlurmdPort=6818
> >> >>>    SlurmdSpoolDir=/var/slurm/spool
> >> >>>    SlurmUser=slurm
> >> >>>    StateSaveLocation=/var/slurm/state
> >> >>>    SwitchType=switch/none
> >> >>>    TaskPlugin=task/none
> >> >>>    TrackWCKey=yes
> >> >>>    InactiveLimit=0
> >> >>>    KillWait=30
> >> >>>    MinJobAge=300
> >> >>>    SlurmctldTimeout=120
> >> >>>    SlurmdTimeout=300
> >> >>>    Waittime=0
> >> >>>    FastSchedule=1
> >> >>>    SchedulerType=sched/backfill
> >> >>>    SchedulerPort=7321
> >> >>>    SelectType=select/cons_res
> >> >>>    SelectTypeParameters=CR_Core
> >> >>>    AccountingStorageHost=master
> >> >>>    AccountingStorageLoc=slurm_acct_db
> >> >>>    AccountingStoragePass=
> >> >>>    AccountingStorageType=accounting_storage/mysql
> >> >>>    AccountingStorageUser=
> >> >>>    AccountingStoreJobComment=YES
> >> >>>    ClusterName=hydra
> >> >>>    JobCompHost=master
> >> >>>    JobCompLoc=slurm_acct_db
> >> >>>    JobCompPass=cluster_service
> >> >>>    JobCompType=jobcomp/mysql
> >> >>>    JobCompUser=cluster_service
> >> >>>    JobAcctGatherFrequency=30
> >> >>>    JobAcctGatherType=jobacct_gather/linux
> >> >>>    SlurmctldDebug=3
> >> >>>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
> >> >>>    SlurmdDebug=3
> >> >>>    SlurmdLogFile=/var/log/slurm/slurmd.log
> >> >>>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
> >> >>>    ThreadsPerCore=2 State=UNKNOWN
> >> >>>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
> >> >>>    ThreadsPerCore=1 State=UNKNOWN
> >> >>>
> >> >>>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
> >> >>>    State=UP Shared=NO
> >> >>>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
> >> >>>    MaxTime=INFINITE State=UP Shared=NO
> >> >>>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
> >> >>>    State=UP Shared=NO
> >> >>>
> >> >>>
> >> >>>
> >> >>>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
> >> >>>        <[email protected]> wrote:
> >> >>>
> >> >>>        Hi,
> >> >>>
> >> >>>        What is the version of SLURM that you are using? Can you send
> >> >>>        your slurm.conf for details of the configuration?
> >> >>>
> >> >>>        With 2.3.1 you should be able to manage at least 10k jobs. We
> >> >>>        made tests with 2.2.7 plus backported patches from 2.3.x that
> >> >>>        showed good results with 10K jobs.
> >> >>>        At what number of jobs does the system start to hang?
> >> >>>
> >> >>>        Matthieu
> >> >>>
> >> >>>
> >> >>>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
> >> >>>>
> >> >>>>        Hi group!
> >> >>>>
> >> >>>>        I'm seeing that with about 30K (or so) jobs in the queue,
> >> >>>>        squeue hangs
> >> >>>>        for around 2 minutes before returning values.  Does anyone
> >> >>>>        know a way
> >> >>>>        around this?  I expect that this isn't standard behavior.
> >> >>>>
> >> >>>>        Cheers,
> >> >>>>        Clay
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >
> 
> 
