I'll see if I can recompile the sort out of squeue. While there would be a
speedup, I can't imagine that the entire 2m is spent sorting, especially
because 17K jobs returned from a local (not EC2) partition take only a
couple of seconds.
In any case, I'll get back to the list with my findings in a couple of days.

Have a great weekend!

Thanks,
Clay

On Fri, Nov 4, 2011 at 3:50 PM, Rayson Ho <[email protected]> wrote:
> On Fri, Nov 4, 2011 at 6:34 PM, Matthieu Hautreux
> <[email protected]> wrote:
> > In the _print_jobs_array function of src/squeue/print.c, it seems that
> > sort_jobs_by_start_time (l) is always called regardless of the requested
> > format or sort parameters. It may be interesting to see how this part of
> > the code could be enhanced to optimize the sort of a large set of jobs
> > (perhaps avoiding calling time() at each comparison could be a good start).
>
> Just a quick note - the overhead of time(2) is pretty low on modern
> Linux and many other modern operating systems. On modern processors,
> executing 30K time(2) calls would take less than 0.5 sec, as it is not
> a full system call anymore (Google: VDSO+gettimeofday).
>
> Rayson
>
> >
> > Moe, Danny, what do you think about that problem?
> >
> > Regards,
> > Matthieu
> >
> >
> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> >
> > Ok, here is some general profiling info:
> >
> > The slurmctld process doesn't seem too terribly out of whack. There are
> > bursts of 100% CPU usage, but they lasted no longer than a couple of
> > seconds. I ran gstack on the slurmctld process while squeue was running.
> > Again, I didn't see anything that looked strange:
> >
> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> > #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> > #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > #1  0x0000000000449cbf in _wr_wrlock ()
> > #2  0x000000000044a1d5 in lock_slurmctld ()
> > #3  0x000000000042af5e in _slurmctld_background ()
> > #4  0x000000000042cb8c in main ()
> >
> > However, I did a gstack and strace on squeue and things did look a
> > bit weird.
> > strace:
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > time(NULL) = 1320435240
> > .....
> > keeps on going, this might be the sorting algo?
> >
> > gstack:
> > #0  0xffffffffff600410 in ?? ()
> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> > #2  0x000000000042608b in _get_start_time ()
> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> > #4  0x0000000000437958 in slurm_list_sort ()
> > #5  0x00000000004232af in print_jobs_array ()
> > #6  0x0000000000420234 in _get_info ()
> > #7  0x00000000004205a1 in main ()
> >
> > squeue is pegged at 100% the entire time. I noticed that if I pass the
> > param -S u to squeue, the function _sort_job_by_time_start is still
> > executed.
> >
> > Some interesting slurmctld log entries:
> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> > [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
> >
> > [2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding locks
> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> > [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
> >
> > Thanks for all your help,
> > Clay
> >
> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
> > <[email protected]> wrote:
> >>
> >> In my understanding of the internals, squeue does not query each node
> >> but only the controller. It is interesting to know that the performance
> >> issue seems to be tied to the EC2 partition, but further information is
> >> needed to find the reason for such bad performance.
> >>
> >> The best way to find the problem is to observe the internal state of
> >> the controller while "squeue -p aws" is running:
> >> - You should first look at the consumption of the slurmctld process to
> >> see if the problem is due to too high a processing complexity in an
> >> internal function of slurm. The easiest way to get that information is
> >> to use top and look at the consumption of the slurmctld process.
> >> - After that, in case you see no abnormal consumption, you should try
> >> to capture the state of the slurmctld process during the hang. You can
> >> do that just by using the gstack command to capture the stacks of the
> >> slurmctld threads and see where they are waiting. If you can send the
> >> result of "gstack $(pgrep slurmctld)", it could bring interesting
> >> information.
> >> - In case you see an abnormal consumption, gstack can help a little
> >> bit, but a more evolved tool, a profiler such as oprofile, will be
> >> necessary to see where most of the time is spent in the controller.
> >>
> >> Do not hesitate to send the results of the first 2 tests (top, gstack);
> >> they could help to find the bottleneck.
> >>
> >> Regards,
> >> Matthieu
> >>
> >> Clay Teeter wrote:
> >>>
> >>> I did notice something.
> >>>
> >>> Issuing squeue against the local partition is quite speedy, ~2s for
> >>> 10K job ids. However, if I "squeue -p aws" the ec2 (aws) partition, I
> >>> see the 2m+ delay. Does squeue query each node in the partition?
> >>>
> >>> Thanks,
> >>> Clay
> >>>
> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>>
> >>> First, thanks for your response!
> >>>
> >>> Yeah, I figured that slurm shouldn't have a problem with the job
> >>> count.
> >>>
> >>> First, some background. Our network is split into two parts: a
> >>> local partition and an ec2 partition. For each task, two jobs
> >>> (one dependent on the other) are submitted.
> >>> The first part
> >>> computes a large datafile and submits it to s3; a second job pulls
> >>> the files from s3 and copies them to our san. In total, a large job
> >>> requires about 30K jobs to be submitted.
> >>>
> >>> I'm not sure when squeue started slowing down. Currently, via sattr
> >>> (about a 2s query), there are a little more than 17000 jobs running.
> >>>
> >>> I'm worried that because half the cluster is in a remote datacenter,
> >>> latency is the problem... which, unfortunately, would be a show
> >>> stopper.
> >>>
> >>> Version: slurm 2.3.0-2
> >>> ControlMachine=master
> >>> AuthType=auth/munge
> >>> CacheGroups=0
> >>> CryptoType=crypto/munge
> >>> MaxJobCount=30000
> >>> MpiDefault=none
> >>> MpiParams=ports=12000-12999
> >>> ProctrackType=proctrack/pgid
> >>> ReturnToService=1
> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> >>> SlurmctldPort=6817
> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
> >>> SlurmdPort=6818
> >>> SlurmdSpoolDir=/var/slurm/spool
> >>> SlurmUser=slurm
> >>> StateSaveLocation=/var/slurm/state
> >>> SwitchType=switch/none
> >>> TaskPlugin=task/none
> >>> TrackWCKey=yes
> >>> InactiveLimit=0
> >>> KillWait=30
> >>> MinJobAge=300
> >>> SlurmctldTimeout=120
> >>> SlurmdTimeout=300
> >>> Waittime=0
> >>> FastSchedule=1
> >>> SchedulerType=sched/backfill
> >>> SchedulerPort=7321
> >>> SelectType=select/cons_res
> >>> SelectTypeParameters=CR_Core
> >>> AccountingStorageHost=master
> >>> AccountingStorageLoc=slurm_acct_db
> >>> AccountingStoragePass=
> >>> AccountingStorageType=accounting_storage/mysql
> >>> AccountingStorageUser=
> >>> AccountingStoreJobComment=YES
> >>> ClusterName=hydra
> >>> JobCompHost=master
> >>> JobCompLoc=slurm_acct_db
> >>> JobCompPass=cluster_service
> >>> JobCompType=jobcomp/mysql
> >>> JobCompUser=cluster_service
> >>> JobAcctGatherFrequency=30
> >>> JobAcctGatherType=jobacct_gather/linux
> >>> SlurmctldDebug=3
> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> >>> SlurmdDebug=3
> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
> >>>
> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
> >>> ThreadsPerCore=2 State=UNKNOWN
> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
> >>> ThreadsPerCore=1 State=UNKNOWN
> >>>
> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
> >>> State=UP Shared=NO
> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO
> >>> MaxTime=INFINITE State=UP Shared=NO
> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
> >>> State=UP Shared=NO
> >>>
> >>>
> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
> >>> <[email protected] <mailto:[email protected]>> wrote:
> >>>
> >>> Hi,
> >>>
> >>> What is the version of SLURM that you are using? Can you send
> >>> your slurm.conf for details of the configuration?
> >>>
> >>> With 2.3.1 you should be able to manage at least 10k jobs. We
> >>> ran tests with 2.2.7 plus backported patches from 2.3.x that
> >>> showed good results with 10K jobs.
> >>> What is the number of jobs at which the system starts to hang?
> >>>
> >>> Matthieu
> >>>
> >>>
> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
> >>>>
> >>>> Hi group!
> >>>>
> >>>> I'm seeing that with about 30K (or so) jobs in the queue, squeue
> >>>> hangs for around 2 minutes before returning values. Does anyone
> >>>> know a way around this? I expect that this isn't standard behavior.
> >>>>
> >>>> Cheers,
> >>>> Clay
> >>>
> >>
> >
>
> --
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
