On Mon, 7 Nov 2011 13:00:00 -0800, Rayson Ho <[email protected]> wrote:
> On Mon, Nov 7, 2011 at 3:33 PM, Clay Teeter <[email protected]> wrote:
> > A couple of things to mention about the cluster (slow) node.
> > cat /proc/sys/kernel/vsyscall64 = 0
> 
> Yup, if /proc/sys/kernel/vsyscall64 is 0, then gettimeofday() and
> friends are real system calls...
> 
> So doing ~1B real system calls to get the same time(2) data is really
> bad for performance!
> 
> 
> > the timing loop posted by Mark took > 1m
> > I'll implement the suggestions by Mark and Rayson and see what happens!
> 
> And replying to Mark's previous email: I agree with Mark that we need
> to call time(2) once and save it somewhere, but it is a bigger change
> and I am not sure if we will need a newer SLURM version to ship the
> change with. The patch I sent is a second choice; my first suggestion
> in the email was to call time(2) once and pass the time data into the
> comparison function (though that would still require modifications in
> various places).

The following patch is a stopgap that initializes "now" only
once, preventing unnecessary calls to time(2) and ensuring
the value of "now" doesn't change during a sort:

It isn't ideal, since there is an extraneous comparison
for each call to _get_start_time(), but it should reduce the
number of calls to time(2) to one.


From 318bc74372b7a73b0936e47a7441448b4a693ed7 Mon Sep 17 00:00:00 2001
From: Mark A. Grondona <[email protected]>
Date: Mon, 7 Nov 2011 13:19:16 -0800
Subject: [PATCH] squeue: Only call time(2) once when sorting by job start time

Since the list_sort algorithm is O(n^2), we want to avoid any
unnecessary work in sort compare functions, especially work
that is duplicated. In the specific case of _get_start_time(),
we also want to avoid returning a different start time for
PENDING jobs during a single sort (which may happen if the
sort takes >1s).

So, as a stopgap measure, define "now" as static in _get_start_time
so that we only initialize it once.
---
 src/squeue/sort.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/squeue/sort.c b/src/squeue/sort.c
index af194ef..b75dc99 100644
--- a/src/squeue/sort.c
+++ b/src/squeue/sort.c
@@ -583,12 +583,16 @@ static int _sort_job_by_time_limit(void *void1, void *void2)
 
 static uint32_t _get_start_time(job_info_t *job)
 {
-       time_t now = time(NULL);
+       static time_t now = (time_t) 0;
 
        if (job->start_time == (time_t) 0)
                return 0xffffffff;
-       if ((job->job_state == JOB_PENDING) && (job->start_time < now))
-               return (uint32_t) now;
+       if (job->job_state == JOB_PENDING) {
+               if (now == (time_t) 0)
+                       now = time(NULL);
+               if (job->start_time < now)
+                       return (uint32_t) now;
+       }
        return (uint32_t) job->start_time;
 }
 
-- 
1.7.1

> Rayson
> 
> 
> 
> > Thanks,
> > Clay
> > On Sat, Nov 5, 2011 at 8:08 AM, Mark A. Grondona <[email protected]> wrote:
> >>
> >> On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]>
> >> wrote:
> >> > On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]>
> >> > wrote:
> >> > > This is  definitely true, but the fact that Clay's debugger caught
> >> > > squeue in the time(2) system call means something strange might be
> >> > > going on with that call.
> >> >
> >> > That's possible, if it is running on an old Linux kernel, or if
> >> > /proc/sys/kernel/vsyscall64 is 0.
> >>
> >> Also depends on the timer used by gettimeofday.
> >>
> >> > (And BTW, that's 100M calls)
> >>
> >> Oh, yeah, I increased it to get >1s runtime on my system. sorry!
> >>
> >> > > Clay, you might try running the following
> >> > > benchmark on your system just to see how long 10M time(2) calls take:
> >> >
> >> > According to the comments in src/common/list.c, list_sort() has a
> >> > complexity of O(n^2). So with 30K jobs and O(n^2) complexity, that's
> >> > like 900M calls to time(2)!!
> >>
> >> Yeah, many of the sort compare functions in squeue have this problem.
> >> Lots of code needlessly recomputes the same values over and over again
> >> (n^2 - n times, right?)
> >>
> >> > If that really is the case, then it would be slow to execute almost 1
> >> > billion time(2) calls even if the overhead is really low for each
> >> > call. (In the end, a VDSO call is not free)
> >>
> >> FWIW, I tried to reproduce Clay's problem on a test system
> >> (though I only got up to 20K jobs), and was not able to.
> >>
> >>
> >> > I am not sure how job->start_time gets assigned in the code (the
> >> > header says "time_t start_time;      /* time execution begins, actual
> >> > or expected */", so I would expect it to be unset for pending jobs
> >> > without an expected start time. I don't have a SLURM cluster that I
> >> > can fool around with and attach a debugger to, and there are too many
> >> > start_time variables in the code, which is blocking me from digging
> >> > deeper!!).
> >> >
> >> > - If job->start_time really represents the expected start time for
> >> > every job, then I guess calling time(2) once and passing the value to
> >> > the comparison function would make sense.
> >> >
> >> > - But if the common case is that start_time has a value of 0, then the
> >> > following patch should help (I've checked with gcc 4.4.5: even at -O3,
> >> > there's no sinking of time(NULL) below the first if statement, so we
> >> > need to do it ourselves):
> >>
> >> Your patch is a good start, I would think. However, the start_time
> >> is still needlessly recomputed for every pending job (I _think_
> >> pending jobs' start_time is set to submit time, until they actually
> >> start). So if you have 30K pending jobs you'd be in the same boat.
> >>
> >> Also, think about what happens when the sort takes >1s. The start
> >> time of these pending jobs starts to change! That seems wrong.
> >>
> >> The best fix for this one problem (if it is the problem) would be
> >> to set a global "now" that doesn't change, so the time(2) call
> >> only needs to be made once, and every pending job gets the same
> >> value for start_time.
> >>
> >> In general, the sorting and other code in squeue could be greatly
> >> improved by precomputing results or otherwise memoizing results
> >> (i.e. store results in a hash that could be checked before
> >> recomputing the same result you had before). A good example
> >> is _sort_job_by_node_list(). That function does 'a lot' of work,
> >> and like Rayson said, it could be called up to 900M times
> >> needlessly for 30K jobs!
> >>
> >> BTW, in my testing, profiling showed that the slowest part of
> >> squeue was the uid_to_string() call, which is called at least
> >> once for every job (in my case, every job had the same uid,
> >> so that call should have only been made once). Removing that call
> >> more than halved the runtime of squeue.
> >>
> >> mark
> >>
> >> >
> >> > [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original
> >> > src/squeue/sort.c
> >> > --- sort.c.original     2011-11-04 22:47:02.001064100 -0400
> >> > +++ src/squeue/sort.c   2011-11-04 22:47:55.187063769 -0400
> >> > @@ -582,13 +582,13 @@
> >> >  }
> >> >
> >> >  static uint32_t _get_start_time(job_info_t *job)
> >> >  {
> >> > -       time_t now = time(NULL);
> >> > +       time_t now;
> >> >
> >> >         if (job->start_time == (time_t) 0)
> >> >                 return 0xffffffff;
> >> > -       if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> >> > +       if ((job->job_state == JOB_PENDING) && (job->start_time < (now =
> >> > time(NULL))))
> >> >                 return (uint32_t) now;
> >> >         return (uint32_t) job->start_time;
> >> >  }
> >> >
> >> > Rayson
> >> >
> >> >
> >> > >
> >> > >  #include <time.h>
> >> > >  int main (int ac, char **av)
> >> > >  {
> >> > >        unsigned long i;
> >> > >        time_t t;
> >> > >        for (i = 0; i < 100000000UL; i++)
> >> > >            t = time(NULL);
> >> > >        return 0;
> >> > >  }
> >> > >
> >> > > On my system I get:
> >> > >
> >> > >  $ time ./timebench
> >> > >
> >> > >  real    0m3.794s
> >> > >  user    0m3.791s
> >> > >
> >> > >
> >> > > mark
> >> > >
> >> > >
> >> > >
> >> > >> Rayson
> >> > >>
> >> > >>
> >> > >>
> >> > >> > Moe, Danny, what do you think about that problem ?
> >> > >> >
> >> > >> > Regards,
> >> > >> > Matthieu
> >> > >> >
> >> > >> >
> >> > >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> >> > >> >
> >> > >> > Ok, here is some general profiling info:
> >> > >> > the slurmctld process doesn't seem too terribly out of whack.  There
> >> > >> > are
> >> > >> > bursts of 100% cpu usage, but they lasted no longer than a couple
> >> > >> > seconds.
> >> > >> > I ran gstack on the slurmctld process while squeue was running.
> >> > >> >  Again, I
> >> > >> > didn't see anything that looked strange:
> >> > >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> >> > >> > #0  0x00002aaaaacb9df8 in select_p_job_test () from
> >> > >> > /usr/lib64/slurm/select_cons_res.so
> >> > >> > #1  0x00002aaaabad8cf3 in backfill_agent () from
> >> > >> > /usr/lib64/slurm/sched_backfill.so
> >> > >> > #2  0x0000003cc8e0673d in start_thread () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> >> > >> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> >> > >> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> >> > >> > #2  0x0000003cc8e0673d in start_thread () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> >> > >> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> >> > >> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> >> > >> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> >> > >> > #3  0x0000003cc8e0673d in start_thread () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> >> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> >> > >> > #2  0x0000003cc8e0673d in start_thread () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> > >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> >> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >> > >> > /lib64/libpthread.so.0
> >> > >> > #1  0x0000000000449cbf in _wr_wrlock ()
> >> > >> > #2  0x000000000044a1d5 in lock_slurmctld ()
> >> > >> > #3  0x000000000042af5e in _slurmctld_background ()
> >> > >> > #4  0x000000000042cb8c in main ()
> >> > >> > However, I did a gstack and strace on squeue and things did look a
> >> > >> > bit weird.
> >> > >> > strace:
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > time(NULL)                              = 1320435240
> >> > >> > .....
> >> > >> > keeps on going, this might be the sorting algo?
> >> > >> >
> >> > >> > gstack:
> >> > >> > #0  0xffffffffff600410 in ?? ()
> >> > >> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> >> > >> > #2  0x000000000042608b in _get_start_time ()
> >> > >> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> >> > >> > #4  0x0000000000437958 in slurm_list_sort ()
> >> > >> > #5  0x00000000004232af in print_jobs_array ()
> >> > >> > #6  0x0000000000420234 in _get_info ()
> >> > >> > #7  0x00000000004205a1 in main ()
> >> > >> > squeue is pegged at 100% the entire time.  I noticed that if I
> >> > >> > pass the
> >> > >> > param -S u to squeue, the function _sort_job_by_time_start is still
> >> > >> > executed.
> >> > >> > Some interesting slurmctld log entries:
> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > >> > _slurm_rpc_dump_jobs: usec=6155478
> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > >> > schedule: usec=6158333
> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from
> >> > >> > _slurmctld_background: usec=15995841
> >> > >> >
> >> > >> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long,
> >> > >> > yielding locks
> >> > >> > [2011-11-04T19:46:01] debug2: Testing job time limits and
> >> > >> > checkpoints
> >> > >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> >> > >> > [2011-11-04T19:46:01] Warning: Note very large processing time from
> >> > >> > _slurmctld_background: usec=9779576
> >> > >> > Thanks for all your help,
> >> > >> > Clay
> >> > >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
> >> > >> > <[email protected]>
> >> > >> > wrote:
> >> > >> >>
> >> > >> >> In my understanding of the internals, squeue does not query each
> >> > >> >> node but
> >> > >> >> only the controller. It is interesting to know that the
> >> > >> >> performance issue
> >> > >> >> seems to be due to the EC2 partition, but it requires further
> >> > >> >> information to
> >> > >> >> determine the reason for such bad performance.
> >> > >> >>
> >> > >> >> The best way to find the problem is to observe the internal states
> >> > >> >> of the
> >> > >> >> controller while the "squeue -p aws" is running :
> >> > >> >> - You should first look at the consumption of the slurmctld
> >> > >> >> process to see
> >> > >> >> if it is due to overly complex processing in an internal
> >> > >> >> function
> >> > >> >> of slurm.
> >> > >> >>  The easiest way to get that information is to use top and look
> >> > >> >> at the
> >> > >> >> consumption of the slurmctld process.
> >> > >> >> - After that, in case you see no abnormal consumption, you should
> >> > >> >> try to
> >> > >> >> capture the state of the slurmctld process during the hang. You
> >> > >> >> can do that
> >> > >> >> just by using the gstack command to capture the stacks of
> >> > >> >> slurmctld threads
> >> > >> >> and see where they are waiting. If you can send the result of
> >> > >> >> "gstack
> >> > >> >> $(pgrep slurmctld)", it could bring interesting information.
> >> > >> >> - In case you see abnormal consumption, gstack can help a
> >> > >> >> little
> >> > >> >> bit, but a more advanced tool, a profiler such as
> >> > >> >> oprofile, will be
> >> > >> >> necessary to see where most of the time is spent in the controller.
> >> > >> >>
> >> > >> >> Do not hesitate to send the results of the first 2 tests (top,
> >> > >> >> gstack); they
> >> > >> >> could help to find the bottleneck.
> >> > >> >>
> >> > >> >> Regards,
> >> > >> >> Matthieu
> >> > >> >>
> >> > >> >> Clay Teeter a écrit :
> >> > >> >>>
> >> > >> >>> I did notice something.
> >> > >> >>>
> >> > >> >>> Issuing squeue against the local partition is quite speedy, ~2s
> >> > >> >>> for 10K
> >> > >> >>> job ids.  However, if I "squeue -p aws" the ec2(aws) partition I
> >> > >> >>> see the 2m+
> >> > >> >>> delay.  Does squeue query each node in the partition?
> >> > >> >>>
> >> > >> >>> Thanks,
> >> > >> >>> Clay
> >> > >> >>>
> >> > >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]
> >> > >> >>> <mailto:[email protected]>> wrote:
> >> > >> >>>
> >> > >> >>>    First, thanks for your response!
> >> > >> >>>
> >> > >> >>>    Yeah, I figured that slurm shouldn't have a problem with the
> >> > >> >>> job
> >> > >> >>>    count.
> >> > >> >>>    First some background.   Our network is split into two parts:
> >> > >> >>> a
> >> > >> >>>    local partition and an ec2 partition.  And, for each task two
> >> > >> >>> jobs
> >> > >> >>>    (one dependent on the other) are submitted.  The first part
> >> > >> >>>    computes a large datafile and submits it to s3; a second job
> >> > >> >>> pulls
> >> > >> >>>    the files from s3 and copies them to our SAN. In total a large
> >> > >> >>> job
> >> > >> >>>    requires about 30K jobs to be submitted.
> >> > >> >>>
> >> > >> >>>    I'm not sure when squeue started slowing down.  Currently, via
> >> > >> >>> sattr
> >> > >> >>>    (about a 2s query), there are a little more than 17000 jobs
> >> > >> >>> running.
> >> > >> >>>
> >> > >> >>>    I'm worried that, because half the cluster is in a remote
> >> > >> >>> datacenter,
> >> > >> >>>    latency is the problem.... which, unfortunately, would be a
> >> > >> >>> show
> >> > >> >>>    stopper.
> >> > >> >>>
> >> > >> >>>    Version: slurm 2.3.0-2
> >> > >> >>>    ControlMachine=master
> >> > >> >>>    AuthType=auth/munge
> >> > >> >>>    CacheGroups=0
> >> > >> >>>    CryptoType=crypto/munge
> >> > >> >>>    MaxJobCount=30000
> >> > >> >>>    MpiDefault=none
> >> > >> >>>    MpiParams=ports=12000-12999
> >> > >> >>>    ProctrackType=proctrack/pgid
> >> > >> >>>    ReturnToService=1
> >> > >> >>>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> >> > >> >>>    SlurmctldPort=6817
> >> > >> >>>    SlurmdPidFile=/var/run/slurm/slurmd.pid
> >> > >> >>>    SlurmdPort=6818
> >> > >> >>>    SlurmdSpoolDir=/var/slurm/spool
> >> > >> >>>    SlurmUser=slurm
> >> > >> >>>    StateSaveLocation=/var/slurm/state
> >> > >> >>>    SwitchType=switch/none
> >> > >> >>>    TaskPlugin=task/none
> >> > >> >>>    TrackWCKey=yes
> >> > >> >>>    InactiveLimit=0
> >> > >> >>>    KillWait=30
> >> > >> >>>    MinJobAge=300
> >> > >> >>>    SlurmctldTimeout=120
> >> > >> >>>    SlurmdTimeout=300
> >> > >> >>>    Waittime=0
> >> > >> >>>    FastSchedule=1
> >> > >> >>>    SchedulerType=sched/backfill
> >> > >> >>>    SchedulerPort=7321
> >> > >> >>>    SelectType=select/cons_res
> >> > >> >>>    SelectTypeParameters=CR_Core
> >> > >> >>>    AccountingStorageHost=master
> >> > >> >>>    AccountingStorageLoc=slurm_acct_db
> >> > >> >>>    AccountingStoragePass=
> >> > >> >>>    AccountingStorageType=accounting_storage/mysql
> >> > >> >>>    AccountingStorageUser=
> >> > >> >>>    AccountingStoreJobComment=YES
> >> > >> >>>    ClusterName=hydra
> >> > >> >>>    JobCompHost=master
> >> > >> >>>    JobCompLoc=slurm_acct_db
> >> > >> >>>    JobCompPass=cluster_service
> >> > >> >>>    JobCompType=jobcomp/mysql
> >> > >> >>>    JobCompUser=cluster_service
> >> > >> >>>    JobAcctGatherFrequency=30
> >> > >> >>>    JobAcctGatherType=jobacct_gather/linux
> >> > >> >>>    SlurmctldDebug=3
> >> > >> >>>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
> >> > >> >>>    SlurmdDebug=3
> >> > >> >>>    SlurmdLogFile=/var/log/slurm/slurmd.log
> >> > >> >>>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
> >> > >> >>>    ThreadsPerCore=2 State=UNKNOWN
> >> > >> >>>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
> >> > >> >>>    ThreadsPerCore=1 State=UNKNOWN
> >> > >> >>>
> >> > >> >>>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
> >> > >> >>>    State=UP Shared=NO
> >> > >> >>>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
> >> > >> >>>    MaxTime=INFINITE State=UP Shared=NO
> >> > >> >>>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
> >> > >> >>>    State=UP Shared=NO
> >> > >> >>>
> >> > >> >>>
> >> > >> >>>
> >> > >> >>>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
> >> > >> >>>    <[email protected] <mailto:[email protected]>>
> >> > >> >>> wrote:
> >> > >> >>>
> >> > >> >>>        Hi,
> >> > >> >>>
> >> > >> >>>        what is the version of SLURM that you are using ? Can you
> >> > >> >>> send
> >> > >> >>>        your slurm.conf for details of configuration ?
> >> > >> >>>
> >> > >> >>>        With 2.3.1 you should be able to manage at least 10k jobs.
> >> > >> >>> We
> >> > >> >>>        made tests with 2.2.7 and backported patches of 2.3.x that
> >> > >> >>>        showed good results with 10K jobs.
> >> > >> >>>        What is the number of jobs for which the system starts to
> >> > >> >>> hang ?
> >> > >> >>>
> >> > >> >>>        Matthieu
> >> > >> >>>
> >> > >> >>>
> >> > >> >>>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
> >> > >> >>>>
> >> > >> >>>>        Hi group!
> >> > >> >>>>
> >> > >> >>>>        I'm seeing that with about 30K (or so) jobs in the queue,
> >> > >> >>>>        squeue hangs
> >> > >> >>>>        for around 2 minutes before returning values.  Does
> >> > >> >>>> anyone
> >> > >> >>>>        know a way
> >> > >> >>>>        around this?  I expect that this isn't standard behavior.
> >> > >> >>>>
> >> > >> >>>>        Cheers,
> >> > >> >>>>        Clay
> >> > >> >>>
> >> > >> >>>
> >> > >> >>>
> >> > >> >>
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >>
> >> > >> ==================================================
> >> > >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> > >> http://gridscheduler.sourceforge.net/
> >> > >>
> >> > >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> 
> 
> 
> 
