It looks like the lag is definitely hardware/squeue related. I recompiled squeue on a modern OS and things became snappy (1-3s for the results vs. 2m on the cluster node).
A couple of things to mention about the cluster (slow) node:
- cat /proc/sys/kernel/vsyscall64 = 0
- the timing loop posted by Mark took >1m

I'll implement the suggestions by Mark and Rayson and see what happens!

Thanks,
Clay

On Sat, Nov 5, 2011 at 8:08 AM, Mark A. Grondona <[email protected]> wrote:
> On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]> wrote:
> > On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
> > > This is definitely true, but the fact that Clay's debugger caught
> > > squeue in the time(2) system call means something strange might be
> > > going on with that call.
> >
> > That's possible, if it is running on an old Linux kernel, or if
> > /proc/sys/kernel/vsyscall64 is 0.
>
> Also depends on the timer used by gettimeofday.
>
> > (And BTW, that's 100M calls)
>
> Oh, yeah, I increased it to get >1s runtime on my system. Sorry!
>
> > > Clay, you might try running the following
> > > benchmark on your system just to see how long 10M time(2) calls take:
> >
> > According to the comments in src/common/list.c, list_sort() has a
> > complexity of O(n^2). So 30K jobs with complexity of n^2, it's like
> > 900M calls to time(2)!!
>
> Yeah, many of the sort compare functions in squeue have this problem.
> Lots of code needlessly recomputes the same values over and over again
> (n^2 - n times, right?)
>
> > If that really is the case, then it would be slow to execute almost 1
> > billion time(2) calls even if the overhead is really low for each
> > call. (In the end, a VDSO call is not free)
>
> FWIW, I tried to reproduce Clay's problem on a test system
> (though I only got up to 20K jobs), and was not able to.
>
> > I am not sure how job->start_time gets assigned in the code (the
> > header says "time_t start_time; /* time execution begins, actual
> > or expected */", so I would expect it to be unset for pending jobs
> > without expected start time.
> > I don't have a SLURM cluster that I can
> > fool around with and attach a debugger to, yet there are too many
> > start_time variables in the code that are blocking me from digging
> > deeper!!)
> >
> > - If job->start_time really represents the expected start time for
> >   every job, then I guess calling time(2) once and passing the value to
> >   the comparison function would make sense.
> >
> > - But if in the common case start_time has a value of 0, then the
> >   following patch should help (I've checked with gcc 4.4.5; even at -O3,
> >   there's no sinking of time(NULL) below the first if statement, so we
> >   need to do it ourselves):
>
> Your patch is a good start I would think. However, the start_time
> is still needlessly recomputed for every pending job (I _think_
> pending jobs' start_time is set to submit time, until they actually
> start). So if you have 30K pending jobs you'd be in the same boat.
>
> Also, think about what happens when the sort takes >1s. The start
> time of these pending jobs starts to change! That seems wrong.
>
> The best fix for this one problem (if it is the problem) would be
> to set a global "now" that doesn't change, so the time(2) call
> only needs to be made once, and every pending job gets the same
> value for start_time.
>
> In general, the sorting and other code in squeue could be greatly
> improved by precomputing results or otherwise memoizing them
> (i.e. storing results in a hash that could be checked before
> recomputing the same result you had before). A good example
> is _sort_job_by_node_list(). That function does 'a lot' of work,
> and like Rayson said, it could be called up to 900M times
> needlessly for 30K jobs!
>
> BTW, in my testing, profiling showed that the slowest part of
> squeue was the uid_to_string() call, which is called at least
> once for every job (in my case, every job had the same uid,
> so that call should have only been made once). Removing that call
> more than halved the runtime of squeue.
>
> mark
>
> > [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
> > --- sort.c.original   2011-11-04 22:47:02.001064100 -0400
> > +++ src/squeue/sort.c 2011-11-04 22:47:55.187063769 -0400
> > @@ -582,13 +582,13 @@
> >  }
> >
> >  static uint32_t _get_start_time(job_info_t *job)
> >  {
> > -    time_t now = time(NULL);
> > +    time_t now;
> >
> >      if (job->start_time == (time_t) 0)
> >          return 0xffffffff;
> > -    if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> > +    if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
> >          return (uint32_t) now;
> >      return (uint32_t) job->start_time;
> >  }
> >
> > Rayson
> >
> > > #include <time.h>
> > > int main (int ac, char **av)
> > > {
> > >     unsigned long i;
> > >     time_t t;
> > >     for (i = 0; i < 100000000UL; i++)
> > >         t = time(NULL);
> > >     return 0;
> > > }
> > >
> > > On my system I get:
> > >
> > > $ time ./timebench
> > >
> > > real    0m3.794s
> > > user    0m3.791s
> > >
> > > mark
> > >
> > >> Rayson
> > >>
> > >> > Moe, Danny, what do you think about that problem?
> > >> >
> > >> > Regards,
> > >> > Matthieu
> > >> >
> > >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> > >> >
> > >> > Ok, here is some general profiling info:
> > >> > The slurmctld process doesn't seem too terribly out of whack. There
> > >> > are bursts of 100% CPU usage, but they lasted no longer than a couple
> > >> > of seconds.
> > >> > I ran gstack on the slurmctld process while squeue was running.
> > >> > Again, I didn't see anything that looked strange:
> > >> >
> > >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> > >> > #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> > >> > #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> > >> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> > >> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> > >> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> > >> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> > >> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> > >> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >> > #1  0x0000000000449cbf in _wr_wrlock ()
> > >> > #2  0x000000000044a1d5 in lock_slurmctld ()
> > >> > #3  0x000000000042af5e in _slurmctld_background ()
> > >> > #4  0x000000000042cb8c in main ()
> > >> >
> > >> > However, I did a gstack and strace on squeue, and things did look a
> > >> > bit weird.
> > >> >
> > >> > strace:
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > time(NULL) = 1320435240
> > >> > .....
> > >> > keeps on going; this might be the sorting algo?
> > >> >
> > >> > gstack:
> > >> > #0  0xffffffffff600410 in ?? ()
> > >> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> > >> > #2  0x000000000042608b in _get_start_time ()
> > >> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> > >> > #4  0x0000000000437958 in slurm_list_sort ()
> > >> > #5  0x00000000004232af in print_jobs_array ()
> > >> > #6  0x0000000000420234 in _get_info ()
> > >> > #7  0x00000000004205a1 in main ()
> > >> >
> > >> > squeue is pegged at 100% the entire time. I noticed that if I pass
> > >> > the param -S u to squeue, the function _sort_job_by_time_start is
> > >> > still executed.
> > >> >
> > >> > Some interesting slurmctld log entries:
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
> > >> >
> > >> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> > >> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> > >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> > >> > [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
> > >> >
> > >> > Thanks for all your help,
> > >> > Clay
> > >> >
> > >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:
> > >> >>
> > >> >> In my understanding of the internals, squeue does not query each node but
> > >> >> only the controller. It is interesting to know that the performance issue
> > >> >> seems to be due to the EC2 partition, but it requires further information
> > >> >> to know the reason for such bad performance.
> > >> >>
> > >> >> The best way to find the problem is to observe the internal state of the
> > >> >> controller while the "squeue -p aws" is running:
> > >> >> - You should first look at the consumption of the slurmctld process to see
> > >> >> if it is due to a too-high complexity of processing in an internal function
> > >> >> of slurm. The easiest way to get that information is by using top and looking
> > >> >> at the consumption of the slurmctld process.
> > >> >> - After that, in case you see no abnormal consumption, you should try to
> > >> >> capture the state of the slurmctld process during the hang. You can do that
> > >> >> just by using the gstack command to capture the stacks of the slurmctld
> > >> >> threads and see where they are waiting. If you can send the result of
> > >> >> "gstack $(pgrep slurmctld)", it could bring interesting information.
> > >> >> - In case you do see abnormal consumption, gstack can help a little bit,
> > >> >> but a more evolved tool, a profiler such as oprofile, will be necessary
> > >> >> to see where most of the time is spent in the controller.
> > >> >>
> > >> >> Do not hesitate to send the results of the first two tests (top, gstack);
> > >> >> they could help to find the bottleneck.
> > >> >>
> > >> >> Regards,
> > >> >> Matthieu
> > >> >>
> > >> >> Clay Teeter wrote:
> > >> >>>
> > >> >>> I did notice something.
> > >> >>>
> > >> >>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
> > >> >>> job ids. However, if I "squeue -p aws" the ec2 (aws) partition, I see
> > >> >>> the 2m+ delay. Does squeue query each node in the partition?
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Clay
> > >> >>>
> > >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
> > >> >>>
> > >> >>> First, thanks for your response!
> > >> >>>
> > >> >>> Yeah, I figured that slurm shouldn't have a problem with the job
> > >> >>> count.
> > >> >>> First, some background. Our network is split into two parts: a
> > >> >>> local partition and an ec2 partition. And, for each task, two jobs
> > >> >>> (one dependent on the other) are submitted. The first job
> > >> >>> computes a large datafile and submits it to s3; the second job pulls
> > >> >>> the files from s3 and copies them to our san. In total, a large job
> > >> >>> requires about 30K jobs to be submitted.
> > >> >>>
> > >> >>> I'm not sure when squeue started slowing down. Currently, via sattr
> > >> >>> (about a 2s query), there are a little more than 17000 jobs running.
> > >> >>>
> > >> >>> I'm worried that because half the cluster is in a remote datacenter,
> > >> >>> latency is the problem... which, unfortunately, would be a show
> > >> >>> stopper.
> > >> >>>
> > >> >>> Version: slurm 2.3.0-2
> > >> >>>
> > >> >>> ControlMachine=master
> > >> >>> AuthType=auth/munge
> > >> >>> CacheGroups=0
> > >> >>> CryptoType=crypto/munge
> > >> >>> MaxJobCount=30000
> > >> >>> MpiDefault=none
> > >> >>> MpiParams=ports=12000-12999
> > >> >>> ProctrackType=proctrack/pgid
> > >> >>> ReturnToService=1
> > >> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> > >> >>> SlurmctldPort=6817
> > >> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
> > >> >>> SlurmdPort=6818
> > >> >>> SlurmdSpoolDir=/var/slurm/spool
> > >> >>> SlurmUser=slurm
> > >> >>> StateSaveLocation=/var/slurm/state
> > >> >>> SwitchType=switch/none
> > >> >>> TaskPlugin=task/none
> > >> >>> TrackWCKey=yes
> > >> >>> InactiveLimit=0
> > >> >>> KillWait=30
> > >> >>> MinJobAge=300
> > >> >>> SlurmctldTimeout=120
> > >> >>> SlurmdTimeout=300
> > >> >>> Waittime=0
> > >> >>> FastSchedule=1
> > >> >>> SchedulerType=sched/backfill
> > >> >>> SchedulerPort=7321
> > >> >>> SelectType=select/cons_res
> > >> >>> SelectTypeParameters=CR_Core
> > >> >>> AccountingStorageHost=master
> > >> >>> AccountingStorageLoc=slurm_acct_db
> > >> >>> AccountingStoragePass=
> > >> >>> AccountingStorageType=accounting_storage/mysql
> > >> >>> AccountingStorageUser=
> > >> >>> AccountingStoreJobComment=YES
> > >> >>> ClusterName=hydra
> > >> >>> JobCompHost=master
> > >> >>> JobCompLoc=slurm_acct_db
> > >> >>> JobCompPass=cluster_service
> > >> >>> JobCompType=jobcomp/mysql
> > >> >>> JobCompUser=cluster_service
> > >> >>> JobAcctGatherFrequency=30
> > >> >>> JobAcctGatherType=jobacct_gather/linux
> > >> >>> SlurmctldDebug=3
> > >> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> > >> >>> SlurmdDebug=3
> > >> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
> > >> >>>
> > >> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
> > >> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> > >> >>>
> > >> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
> > >> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
> > >> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
> > >> >>>
> > >> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux <[email protected]> wrote:
> > >> >>>
> > >> >>> Hi,
> > >> >>>
> > >> >>> What is the version of SLURM that you are using? Can you send
> > >> >>> your slurm.conf for details of the configuration?
> > >> >>>
> > >> >>> With 2.3.1 you should be able to manage at least 10k jobs. We
> > >> >>> made tests with 2.2.7 and backported patches of 2.3.x that
> > >> >>> showed good results with 10K jobs.
> > >> >>> What is the number of jobs at which the system starts to hang?
> > >> >>>
> > >> >>> Matthieu
> > >> >>>
> > >> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
> > >> >>>>
> > >> >>>> Hi group!
> > >> >>>>
> > >> >>>> I'm seeing that with about 30K (or so) jobs in the queue, squeue
> > >> >>>> hangs for around 2 minutes before returning values. Does anyone
> > >> >>>> know a way around this? I expect that this isn't standard behavior.
> > >> >>>>
> > >> >>>> Cheers,
> > >> >>>> Clay
> > >>
> > >> --
> > >> ==================================================
> > >> Open Grid Scheduler - The Official Open Source Grid Engine
> > >> http://gridscheduler.sourceforge.net/
> >
> > --
> > ==================================================
> > Open Grid Scheduler - The Official Open Source Grid Engine
> > http://gridscheduler.sourceforge.net/
