On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
> This is definitely true, but the fact that Clay's debugger caught
> squeue in the time(2) system call means something strange might be
> going on with that call.

That's possible, if it is running on an old Linux kernel, or if
/proc/sys/kernel/vsyscall64 is 0.
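A quick way to check (note this knob only exists on some older x86_64 kernels, so the fallback message is expected elsewhere):

```shell
# Hypothetical quick check: 0 means the vsyscall64 fast path is disabled,
# so time(2)/gettimeofday(2) fall back to real syscalls.  The file does
# not exist on kernels without this knob, hence the fallback message.
cat /proc/sys/kernel/vsyscall64 2>/dev/null || echo "no vsyscall64 knob"
```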

(And BTW, that's 100M calls)

> Clay, you might try running the following
> benchmark on your system just to see how long 10M time(2) calls take:

According to the comments in src/common/list.c, list_sort() has a
complexity of O(n^2). So sorting 30K jobs can take on the order of
30,000^2 = 900M comparisons, i.e. up to 900M calls to time(2)!!

If that really is the case, then executing almost 1 billion time(2)
calls would be slow even if the per-call overhead is very low.
(In the end, a VDSO call is not free.)

I am not sure how job->start_time gets assigned in the code (the
header says "time_t start_time;      /* time execution begins, actual
or expected */", so I would expect it to be unset for pending jobs
without an expected start time. I don't have a SLURM cluster that I
can experiment with and attach a debugger to, and there are too many
start_time variables in the code for me to dig deeper!!).

- If job->start_time really represents the expected start time for
every job, then I guess calling time(2) once and passing the value to
the comparison function would make sense.

- But if, in the common case, start_time has a value of 0, then the
following patch should help (I've checked with gcc 4.4.5: even at -O3
there is no sinking of time(NULL) below the first if statement, so we
need to do it ourselves):


[rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
--- sort.c.original     2011-11-04 22:47:02.001064100 -0400
+++ src/squeue/sort.c   2011-11-04 22:47:55.187063769 -0400
@@ -582,13 +582,13 @@
 }

 static uint32_t _get_start_time(job_info_t *job)
 {
-       time_t now = time(NULL);
+       time_t now;

        if (job->start_time == (time_t) 0)
                return 0xffffffff;
-       if ((job->job_state == JOB_PENDING) && (job->start_time < now))
+       if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
                return (uint32_t) now;
        return (uint32_t) job->start_time;
 }

Rayson


>
>  #include <time.h>
>  int main (int ac, char **av)
>  {
>        unsigned long i;
>        time_t t;
>        for (i = 0; i < 100000000UL; i++)
>            t = time(NULL);
>        return 0;
>  }
>
> On my system I get:
>
>  $ time ./timebench
>
>  real    0m3.794s
>  user    0m3.791s
>
>
> mark
>
>
>
>> Rayson
>>
>>
>>
>> > Moe, Danny, what do you think about that problem ?
>> >
>> > Regards,
>> > Matthieu
>> >
>> >
>> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
>> >
>> > Ok, here is some general profiling info:
>> > the slurmctld process doesn't seem too terribly out of whack.  There are
>> > bursts of 100% cpu usage, but they lasted no longer than a couple of
>> > seconds. I ran gstack on the slurmctld process while squeue was running.
>> > Again, I didn't see anything that looked strange:
>> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
>> > #0  0x00002aaaaacb9df8 in select_p_job_test () from
>> > /usr/lib64/slurm/select_cons_res.so
>> > #1  0x00002aaaabad8cf3 in backfill_agent () from
>> > /usr/lib64/slurm/sched_backfill.so
>> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
>> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
>> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
>> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
>> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
>> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
>> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
>> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
>> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
>> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
>> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
>> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > /lib64/libpthread.so.0
>> > #1  0x000000000046a9c1 in slurmctld_state_save ()
>> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
>> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
>> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > /lib64/libpthread.so.0
>> > #1  0x0000000000449cbf in _wr_wrlock ()
>> > #2  0x000000000044a1d5 in lock_slurmctld ()
>> > #3  0x000000000042af5e in _slurmctld_background ()
>> > #4  0x000000000042cb8c in main ()
>> > However, I did a gstack and strace on squeue and things did look a
>> > bit weird.
>> > strace:
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > time(NULL)                              = 1320435240
>> > .....
>> > keeps on going, this might be the sorting algo?
>> >
>> > gstack:
>> > #0  0xffffffffff600410 in ?? ()
>> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
>> > #2  0x000000000042608b in _get_start_time ()
>> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
>> > #4  0x0000000000437958 in slurm_list_sort ()
>> > #5  0x00000000004232af in print_jobs_array ()
>> > #6  0x0000000000420234 in _get_info ()
>> > #7  0x00000000004205a1 in main ()
>> > squeue is pegged at 100% the entire time.  I noticed that if I pass the
>> > param -S u to squeue, the function _sort_job_by_time_start is still
>> > executed.
>> > Some interesting slurmctld log entries:
>> > [2011-11-04T19:50:37] Warning: Note very large processing time from
>> > _slurm_rpc_dump_jobs: usec=6155478
>> > [2011-11-04T19:50:37] Warning: Note very large processing time from
>> > schedule: usec=6158333
>> > [2011-11-04T19:50:37] Warning: Note very large processing time from
>> > _slurmctld_background: usec=15995841
>> >
>> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
>> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
>> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
>> > [2011-11-04T19:46:01] Warning: Note very large processing time from
>> > _slurmctld_background: usec=9779576
>> > Thanks for all your help,
>> > Clay
>> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu 
>> > <[email protected]>
>> > wrote:
>> >>
>> >> In my understanding of the internals, squeue does not query each node,
>> >> only the controller. It is interesting to know that the performance issue
>> >> seems to be tied to the EC2 partition, but further information is needed
>> >> to find the reason for such bad performance.
>> >>
>> >> The best way to find the problem is to observe the internal state of the
>> >> controller while "squeue -p aws" is running:
>> >> - You should first look at the CPU consumption of the slurmctld process
>> >> to see if the hang is due to overly complex processing in an internal
>> >> function of slurm. The easiest way to get that information is to run top
>> >> and watch the slurmctld process.
>> >> - After that, in case you see no abnormal consumption, you should try to
>> >> capture the state of the slurmctld process during the hang. You can do
>> >> that just by using the gstack command to capture the stacks of the
>> >> slurmctld threads and see where they are waiting. If you can send the
>> >> result of "gstack $(pgrep slurmctld)" it could bring interesting
>> >> information.
>> >> - In case you do see an abnormal consumption, gstack can help a little
>> >> bit, but a more advanced tool, a profiler such as oprofile, will be
>> >> needed to see where most of the time is spent in the controller.
>> >>
>> >> Do not hesitate to send the result of the first two tests (top, gstack);
>> >> they could help to find the bottleneck.
>> >>
>> >> Regards,
>> >> Matthieu
>> >>
>> >> Clay Teeter wrote:
>> >>>
>> >>> I did notice something.
>> >>>
>> >>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
>> >>> job ids.  However, if I "squeue -p aws" the ec2 (aws) partition, I see
>> >>> the 2m+ delay.  Does squeue query each node in the partition?
>> >>>
>> >>> Thanks,
>> >>> Clay
>> >>>
>> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter
>> >>> <[email protected]> wrote:
>> >>>
>> >>>    First, thanks for your response!
>> >>>
>> >>>    Yeah, I figured that slurm shouldn't have a problem with the job
>> >>>    count.
>> >>>    First some background.   Our network is split into two parts: a
>> >>>    local partition and an ec2 partition.  And, for each task two jobs
>> >>>    (one is dependent on the other) are submitted.  The first part
>> >>>    computes a large datafile and submits it to s3, and a second job
>> >>>    pulls the files from s3 and copies them to our san. In total a
>> >>>    large job requires about 30K jobs to be submitted.
>> >>>
>> >>>    I'm not sure when squeue started slowing down.  Currently, via sattr
>> >>>    (about a 2s query), there are a little more than 17000 jobs running.
>> >>>
>> >>>    I'm worried that because half the cluster is in a remote datacenter,
>> >>>    latency is the problem.... which, unfortunately, would be a show
>> >>>    stopper.
>> >>>
>> >>>    Version: slurm 2.3.0-2
>> >>>    ControlMachine=master
>> >>>    AuthType=auth/munge
>> >>>    CacheGroups=0
>> >>>    CryptoType=crypto/munge
>> >>>    MaxJobCount=30000
>> >>>    MpiDefault=none
>> >>>    MpiParams=ports=12000-12999
>> >>>    ProctrackType=proctrack/pgid
>> >>>    ReturnToService=1
>> >>>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>> >>>    SlurmctldPort=6817
>> >>>    SlurmdPidFile=/var/run/slurm/slurmd.pid
>> >>>    SlurmdPort=6818
>> >>>    SlurmdSpoolDir=/var/slurm/spool
>> >>>    SlurmUser=slurm
>> >>>    StateSaveLocation=/var/slurm/state
>> >>>    SwitchType=switch/none
>> >>>    TaskPlugin=task/none
>> >>>    TrackWCKey=yes
>> >>>    InactiveLimit=0
>> >>>    KillWait=30
>> >>>    MinJobAge=300
>> >>>    SlurmctldTimeout=120
>> >>>    SlurmdTimeout=300
>> >>>    Waittime=0
>> >>>    FastSchedule=1
>> >>>    SchedulerType=sched/backfill
>> >>>    SchedulerPort=7321
>> >>>    SelectType=select/cons_res
>> >>>    SelectTypeParameters=CR_Core
>> >>>    AccountingStorageHost=master
>> >>>    AccountingStorageLoc=slurm_acct_db
>> >>>    AccountingStoragePass=
>> >>>    AccountingStorageType=accounting_storage/mysql
>> >>>    AccountingStorageUser=
>> >>>    AccountingStoreJobComment=YES
>> >>>    ClusterName=hydra
>> >>>    JobCompHost=master
>> >>>    JobCompLoc=slurm_acct_db
>> >>>    JobCompPass=cluster_service
>> >>>    JobCompType=jobcomp/mysql
>> >>>    JobCompUser=cluster_service
>> >>>    JobAcctGatherFrequency=30
>> >>>    JobAcctGatherType=jobacct_gather/linux
>> >>>    SlurmctldDebug=3
>> >>>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> >>>    SlurmdDebug=3
>> >>>    SlurmdLogFile=/var/log/slurm/slurmd.log
>> >>>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
>> >>>    ThreadsPerCore=2 State=UNKNOWN
>> >>>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
>> >>>    ThreadsPerCore=1 State=UNKNOWN
>> >>>
>> >>>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
>> >>>    State=UP Shared=NO
>> >>>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
>> >>>    MaxTime=INFINITE State=UP Shared=NO
>> >>>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
>> >>>    State=UP Shared=NO
>> >>>
>> >>>
>> >>>
>> >>>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>> >>>    <[email protected]> wrote:
>> >>>
>> >>>        Hi,
>> >>>
>> >>>        what is the version of SLURM that you are using? Can you send
>> >>>        your slurm.conf for details of the configuration?
>> >>>
>> >>>        With 2.3.1 you should be able to manage at least 10k jobs. We
>> >>>        made tests with 2.2.7 plus backported patches from 2.3.x that
>> >>>        showed good results with 10K jobs.
>> >>>        At what number of jobs does the system start to hang?
>> >>>
>> >>>        Matthieu
>> >>>
>> >>>
>> >>>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
>> >>>>
>> >>>>        Hi group!
>> >>>>
>> >>>>        I'm seeing that with about 30K (or so) jobs in the queue,
>> >>>>        squeue hangs for around 2 minutes before returning values.
>> >>>>        Does anyone know a way around this?  I expect that this isn't
>> >>>>        standard behavior.
>> >>>>
>> >>>>        Cheers,
>> >>>>        Clay
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>> >



--

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
