> Thanks,
> Clay
> On Sat, Nov 5, 2011 at 8:08 AM, Mark A. Grondona <[email protected]> wrote:
>>
>> On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]>
>> wrote:
>> > On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
>> > > This is definitely true, but the fact that Clay's debugger caught
>> > > squeue in the time(2) system call means something strange might be
>> > > going on with that call.
>> >
>> > That's possible, if it is running on an old Linux kernel, or if
>> > /proc/sys/kernel/vsyscall64 is 0.
>>
>> Also depends on the timer used by gettimeofday.
>>
>> > (And BTW, that's 100M calls)
>>
>> Oh, yeah, I increased it to get >1s runtime on my system. sorry!
>>
>> > > Clay, you might try running the following benchmark on your system
>> > > just to see how long 10M time(2) calls take:
>> >
>> > According to the comments in src/common/list.c, list_sort() has a
>> > complexity of O(n^2). So with 30K jobs and O(n^2) comparisons, that's
>> > on the order of 900M calls to time(2)!!
>>
>> Yeah, many of the sort compare functions in squeue have this problem.
>> Lots of code needlessly recomputes the same values over and over again
>> (n^2 - n times, right?)
>>
>> > If that really is the case, then it would be slow to execute almost
>> > 1 billion time(2) calls even if the overhead of each call is really
>> > low. (In the end, a vDSO call is not free.)
>>
>> FWIW, I tried to reproduce Clay's problem on a test system
>> (though I only got up to 20K jobs), and was not able to.
>>
>>
>> > I am not sure how job->start_time gets assigned in the code (the
>> > header says "time_t start_time; /* time execution begins, actual
>> > or expected */", so I would expect it to be unset for pending jobs
>> > without an expected start time. I don't have a SLURM cluster that I
>> > can fool around with and attach a debugger to, and there are too many
>> > start_time variables in the code blocking me from digging deeper!!).
>> >
>> > - If job->start_time really represents the expected start time for
>> > every job, then I guess calling time(2) once and passing the value
>> > to the comparison function would make sense.
>> >
>> > - But if in the common case start_time has a value of 0, then the
>> > following patch should help (I've checked with gcc 4.4.5: even at
>> > -O3, there is no sinking of time(NULL) below the first if statement,
>> > so we need to do it ourselves):
>>
>> Your patch is a good start, I would think. However, the start_time
>> is still needlessly recomputed for every pending job (I _think_ a
>> pending job's start_time is set to its submit time until it actually
>> starts). So if you have 30K pending jobs you'd be in the same boat.
>>
>> Also, think about what happens when the sort takes >1s. The start
>> time of these pending jobs starts to change! That seems wrong.
>>
>> The best fix for this one problem (if it is the problem) would be
>> to set a global "now" that doesn't change, so the time(2) call
>> only needs to be made once, and every pending job gets the same
>> value for start_time.
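>> A rough sketch of that idea (job_info_t here is a minimal stand-in I
>> made up to keep the example self-contained, not the real struct from
>> slurm.h; _begin_sort() is likewise a hypothetical hook you'd call once
>> before the sort starts):

```c
#include <stdint.h>
#include <time.h>

#define JOB_PENDING 0

/* Stand-in for SLURM's job_info_t; the real struct has many more fields. */
typedef struct {
	uint16_t job_state;
	time_t   start_time;
} job_info_t;

/* One snapshot of "now", taken before the sort begins, so every pending
 * job compares against the same instant and time(2) is called once
 * instead of up to O(n^2) times inside the comparator. */
static time_t sort_now;

static void _begin_sort(void)
{
	sort_now = time(NULL);
}

static uint32_t _get_start_time(job_info_t *job)
{
	if (job->start_time == (time_t) 0)
		return 0xffffffff;
	if ((job->job_state == JOB_PENDING) && (job->start_time < sort_now))
		return (uint32_t) sort_now;
	return (uint32_t) job->start_time;
}
```

>> Besides removing the syscalls, this also means the sort key can no
>> longer drift while a long sort is in progress.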
>>
>> In general, the sorting and other code in squeue could be greatly
>> improved by precomputing results or otherwise memoizing results
>> (i.e. store results in a hash that could be checked before
>> recomputing the same result you had before). A good example
>> is _sort_job_by_node_list(). That function does 'a lot' of work,
>> and like Rayson said, it could be called up to 900M times
>> needlessly for 30K jobs!
>>
>> BTW, in my testing, profiling showed that the slowest part of
>> squeue was the uid_to_string() call, which is called at least
>> once for every job (in my case, every job had the same uid,
>> so that call should have only been made once). Removing that call
>> more than halved the runtime of squeue.
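>> As an illustration of the memoization idea (names are stand-ins of
>> mine, not SLURM's actual helpers), even a one-entry cache in front of
>> the uid lookup would collapse 30K getpwuid() calls into one when all
>> jobs share a uid:

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical cached wrapper around the uid -> username lookup.
 * Most jobs in a listing share a handful of uids, so remembering the
 * last translation avoids one getpwuid() call per job. */
static char *cached_uid_to_string(uid_t uid)
{
	static uid_t last_uid = (uid_t) -1;
	static char  last_name[64];

	if (uid != last_uid) {
		struct passwd *pw = getpwuid(uid);
		snprintf(last_name, sizeof(last_name), "%s",
			 pw ? pw->pw_name : "nobody");
		last_uid = uid;
	}
	return last_name;
}
```

>> A hash table keyed by uid would generalize this to many users; the
>> single-entry version already covers the common "one submitter, 30K
>> jobs" case.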
>>
>> mark
>>
>> >
>> > [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
>> > --- sort.c.original	2011-11-04 22:47:02.001064100 -0400
>> > +++ src/squeue/sort.c	2011-11-04 22:47:55.187063769 -0400
>> > @@ -582,13 +582,13 @@
>> >  }
>> >
>> >  static uint32_t _get_start_time(job_info_t *job)
>> >  {
>> > -	time_t now = time(NULL);
>> > +	time_t now;
>> >
>> >  	if (job->start_time == (time_t) 0)
>> >  		return 0xffffffff;
>> > -	if ((job->job_state == JOB_PENDING) && (job->start_time < now))
>> > +	if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
>> >  		return (uint32_t) now;
>> >  	return (uint32_t) job->start_time;
>> >  }
>> >
>> > Rayson
>> >
>> >
>> > >
>> > > #include <time.h>
>> > > int main (int ac, char **av)
>> > > {
>> > > unsigned long i;
>> > > time_t t;
>> > > for (i = 0; i < 100000000UL; i++)
>> > > t = time(NULL);
>> > > return 0;
>> > > }
>> > >
>> > > On my system I get:
>> > >
>> > > $ time ./timebench
>> > >
>> > > real 0m3.794s
>> > > user 0m3.791s
>> > >
>> > >
>> > > mark
>> > >
>> > >
>> > >
>> > >> Rayson
>> > >>
>> > >>
>> > >>
>> > >> > Moe, Danny, what do you think about that problem ?
>> > >> >
>> > >> > Regards,
>> > >> > Matthieu
>> > >> >
>> > >> >
>> > >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
>> > >> >
>> > >> > Ok, here is some general profiling info:
>> > >> > The slurmctld process doesn't seem too terribly out of whack.
>> > >> > There are bursts of 100% CPU usage, but they lasted no longer
>> > >> > than a couple of seconds.
>> > >> > I ran gstack on the slurmctld process while squeue was running.
>> > >> > Again, I didn't see anything that looked strange:
>> > >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
>> > >> > #0 0x00002aaaaacb9df8 in select_p_job_test () from
>> > >> > /usr/lib64/slurm/select_cons_res.so
>> > >> > #1 0x00002aaaabad8cf3 in backfill_agent () from
>> > >> > /usr/lib64/slurm/sched_backfill.so
>> > >> > #2 0x0000003cc8e0673d in start_thread () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
>> > >> > #0 0x0000003cc86cd722 in select () from /lib64/libc.so.6
>> > >> > #1 0x000000000042daa2 in _slurmctld_rpc_mgr ()
>> > >> > #2 0x0000003cc8e0673d in start_thread () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
>> > >> > #0 0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
>> > >> > #1 0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
>> > >> > #2 0x000000000042b838 in _slurmctld_signal_hand ()
>> > >> > #3 0x0000003cc8e0673d in start_thread () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #4 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
>> > >> > #0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #1 0x000000000046a9c1 in slurmctld_state_save ()
>> > >> > #2 0x0000003cc8e0673d in start_thread () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
>> > >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
>> > >> > #0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> > >> > /lib64/libpthread.so.0
>> > >> > #1 0x0000000000449cbf in _wr_wrlock ()
>> > >> > #2 0x000000000044a1d5 in lock_slurmctld ()
>> > >> > #3 0x000000000042af5e in _slurmctld_background ()
>> > >> > #4 0x000000000042cb8c in main ()
>> > >> > However, I did a gstack and strace on squeue and things did
>> > >> > look a bit weird.
>> > >> > strace:
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > time(NULL) = 1320435240
>> > >> > .....
>> > >> > keeps on going, this might be the sorting algo?
>> > >> >
>> > >> > gstack:
>> > >> > #0 0xffffffffff600410 in ?? ()
>> > >> > #1 0x0000003cc868c40d in time () from /lib64/libc.so.6
>> > >> > #2 0x000000000042608b in _get_start_time ()
>> > >> > #3 0x00000000004260e1 in _sort_job_by_time_start ()
>> > >> > #4 0x0000000000437958 in slurm_list_sort ()
>> > >> > #5 0x00000000004232af in print_jobs_array ()
>> > >> > #6 0x0000000000420234 in _get_info ()
>> > >> > #7 0x00000000004205a1 in main ()
>> > >> > squeue is pegged at 100% the entire time. I noticed that if I
>> > >> > pass the param -S u to squeue, the function
>> > >> > _sort_job_by_time_start is still executed.
>> > >> > Some interesting slurmctld log entries:
>> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time
>> > >> > from _slurm_rpc_dump_jobs: usec=6155478
>> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time
>> > >> > from schedule: usec=6158333
>> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time
>> > >> > from _slurmctld_background: usec=15995841
>> > >> >
>> > >> > [2011-11-04T19:46:01] debug: backfill: loop taking too long,
>> > >> > yielding locks
>> > >> > [2011-11-04T19:46:01] debug2: Testing job time limits and
>> > >> > checkpoints
>> > >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
>> > >> > [2011-11-04T19:46:01] Warning: Note very large processing time
>> > >> > from _slurmctld_background: usec=9779576
>> > >> > Thanks for all your help,
>> > >> > Clay
>> > >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
>> > >> > <[email protected]> wrote:
>> > >> >>
>> > >> >> In my understanding of the internals, squeue does not query
>> > >> >> each node but only the controller. It is interesting to know
>> > >> >> that the performance issue seems to be due to the EC2 partition,
>> > >> >> but further information is needed to find the reason for such
>> > >> >> bad performance.
>> > >> >>
>> > >> >> The best way to find the problem is to observe the internal
>> > >> >> state of the controller while "squeue -p aws" is running:
>> > >> >> - You should first look at the CPU consumption of the slurmctld
>> > >> >> process to see if the slowness is due to too high a processing
>> > >> >> complexity in an internal function of slurm. The easiest way to
>> > >> >> get that information is to use top and look at the consumption
>> > >> >> of the slurmctld process.
>> > >> >> - After that, in case you see no abnormal consumption, you
>> > >> >> should try to capture the state of the slurmctld process during
>> > >> >> the hang. You can do that just by using the gstack command to
>> > >> >> capture the stacks of the slurmctld threads and see where they
>> > >> >> are waiting. If you can send the result of "gstack
>> > >> >> $(pgrep slurmctld)" it could bring interesting information.
>> > >> >> - In case you see abnormal consumption, gstack can help a
>> > >> >> little, but a more evolved tool, a profiler such as oprofile,
>> > >> >> will be necessary to see where most of the time is spent in the
>> > >> >> controller.
>> > >> >>
>> > >> >> Do not hesitate to send the results of the first 2 tests (top,
>> > >> >> gstack); they could help to find the bottleneck.
>> > >> >>
>> > >> >> Regards,
>> > >> >> Matthieu
>> > >> >>
>> > >> >> Clay Teeter wrote:
>> > >> >>>
>> > >> >>> I did notice something.
>> > >> >>>
>> > >> >>> Issuing squeue against the local partition is quite speedy,
>> > >> >>> ~2s for 10K job ids. However, if I "squeue -p aws" the ec2
>> > >> >>> (aws) partition I see the 2m+ delay. Does squeue query each
>> > >> >>> node in the partition?
>> > >> >>>
>> > >> >>> Thanks,
>> > >> >>> Clay
>> > >> >>>
>> > >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter
>> > >> >>> <[email protected]> wrote:
>> > >> >>>
>> > >> >>> First, thanks for your response!
>> > >> >>>
>> > >> >>> Yeah, I figured that slurm shouldn't have a problem with the
>> > >> >>> job count.
>> > >> >>> First, some background. Our network is split into two parts:
>> > >> >>> a local partition and an ec2 partition. For each task, two
>> > >> >>> jobs (one dependent on the other) are submitted. The first
>> > >> >>> job computes a large datafile and submits it to s3; the
>> > >> >>> second job pulls the files from s3 and copies them to our
>> > >> >>> san. In total, a large job requires about 30K jobs to be
>> > >> >>> submitted.
>> > >> >>>
>> > >> >>> I'm not sure when squeue started slowing down. Currently,
>> > >> >>> via sattr (about a 2s query), there are a little more than
>> > >> >>> 17000 jobs running.
>> > >> >>>
>> > >> >>> I'm worried that because half the cluster is in a remote
>> > >> >>> datacenter, latency is the problem... which, unfortunately,
>> > >> >>> would be a show stopper.
>> > >> >>>
>> > >> >>> Version: slurm 2.3.0-2
>> > >> >>> ControlMachine=master
>> > >> >>> AuthType=auth/munge
>> > >> >>> CacheGroups=0
>> > >> >>> CryptoType=crypto/munge
>> > >> >>> MaxJobCount=30000
>> > >> >>> MpiDefault=none
>> > >> >>> MpiParams=ports=12000-12999
>> > >> >>> ProctrackType=proctrack/pgid
>> > >> >>> ReturnToService=1
>> > >> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>> > >> >>> SlurmctldPort=6817
>> > >> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
>> > >> >>> SlurmdPort=6818
>> > >> >>> SlurmdSpoolDir=/var/slurm/spool
>> > >> >>> SlurmUser=slurm
>> > >> >>> StateSaveLocation=/var/slurm/state
>> > >> >>> SwitchType=switch/none
>> > >> >>> TaskPlugin=task/none
>> > >> >>> TrackWCKey=yes
>> > >> >>> InactiveLimit=0
>> > >> >>> KillWait=30
>> > >> >>> MinJobAge=300
>> > >> >>> SlurmctldTimeout=120
>> > >> >>> SlurmdTimeout=300
>> > >> >>> Waittime=0
>> > >> >>> FastSchedule=1
>> > >> >>> SchedulerType=sched/backfill
>> > >> >>> SchedulerPort=7321
>> > >> >>> SelectType=select/cons_res
>> > >> >>> SelectTypeParameters=CR_Core
>> > >> >>> AccountingStorageHost=master
>> > >> >>> AccountingStorageLoc=slurm_acct_db
>> > >> >>> AccountingStoragePass=
>> > >> >>> AccountingStorageType=accounting_storage/mysql
>> > >> >>> AccountingStorageUser=
>> > >> >>> AccountingStoreJobComment=YES
>> > >> >>> ClusterName=hydra
>> > >> >>> JobCompHost=master
>> > >> >>> JobCompLoc=slurm_acct_db
>> > >> >>> JobCompPass=cluster_service
>> > >> >>> JobCompType=jobcomp/mysql
>> > >> >>> JobCompUser=cluster_service
>> > >> >>> JobAcctGatherFrequency=30
>> > >> >>> JobAcctGatherType=jobacct_gather/linux
>> > >> >>> SlurmctldDebug=3
>> > >> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> > >> >>> SlurmdDebug=3
>> > >> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> > >> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
>> > >> >>> ThreadsPerCore=2 State=UNKNOWN
>> > >> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
>> > >> >>> ThreadsPerCore=1 State=UNKNOWN
>> > >> >>>
>> > >> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
>> > >> >>> State=UP Shared=NO
>> > >> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO
>> > >> >>> MaxTime=INFINITE State=UP Shared=NO
>> > >> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
>> > >> >>> State=UP Shared=NO
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>> > >> >>> <[email protected]> wrote:
>> > >> >>>
>> > >> >>> Hi,
>> > >> >>>
>> > >> >>> What version of SLURM are you using? Can you send your
>> > >> >>> slurm.conf for details of the configuration?
>> > >> >>>
>> > >> >>> With 2.3.1 you should be able to manage at least 10k jobs.
>> > >> >>> We made tests with 2.2.7 plus backported patches from 2.3.x
>> > >> >>> that showed good results with 10K jobs.
>> > >> >>> At what number of jobs does the system start to hang?
>> > >> >>>
>> > >> >>> Matthieu
>> > >> >>>
>> > >> >>>
>> > >> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
>> > >> >>>>
>> > >> >>>> Hi group!
>> > >> >>>>
>> > >> >>>> I'm seeing that with about 30K (or so) jobs in the queue,
>> > >> >>>> squeue hangs for around 2 minutes before returning values.
>> > >> >>>> Does anyone know a way around this? I expect that this
>> > >> >>>> isn't standard behavior.
>> > >> >>>>
>> > >> >>>> Cheers,
>> > >> >>>> Clay
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >>
>> > >> ==================================================
>> > >> Open Grid Scheduler - The Official Open Source Grid Engine
>> > >> http://gridscheduler.sourceforge.net/
>> > >>
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > ==================================================
>> > Open Grid Scheduler - The Official Open Source Grid Engine
>> > http://gridscheduler.sourceforge.net/
>> >
>>
>
>
--
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/