On Fri, 4 Nov 2011 15:34:18 -0700, Matthieu Hautreux <[email protected]> wrote:

> Well, the gstack of the slurmctld shows that the RPC of the squeue is
> already finished. As the log says it took usec=6155478, that seems to be a
> reasonable amount of time to process such a large number of jobs. As you
> noticed, the problem seems to be in the squeue internals, which take too
> much time to sort the results.
>
> In the _print_jobs_array function of src/squeue/print.c, it seems that
> sort_jobs_by_start_time(l) is always called regardless of the requested
> format or sort parameters. It may be interesting to see how this part of
> the code could be enhanced to optimize the sort of a large set of jobs
> (perhaps avoiding a call to time() at each comparison could be a good
> start). Moe, Danny, what do you think about that problem?
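[Editorial note: the unconditional call Matthieu describes could be guarded so the start-time sort only runs when no explicit sort was requested. A minimal sketch of that guard; need_start_time_sort and sort_spec are illustrative names, not the actual symbols in src/squeue/print.c:]

```c
#include <stddef.h>

/* Hypothetical predicate: the start-time sort is only needed for the
 * default output, i.e. when no explicit -S/--sort spec was given.
 * The real squeue code keeps this state elsewhere; this only sketches
 * the guard itself, not the surrounding print logic. */
static int need_start_time_sort(const char *sort_spec)
{
    return sort_spec == NULL || sort_spec[0] == '\0';
}
```

With such a guard, a run like "squeue -S u" would skip the start-time sort entirely instead of paying for it and then sorting again by user.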
I bet the sort would go *much* faster if _get_start_time() were pre-computed
before the sort. That function is being called for each sort comparison, and
since the sort is O(N^2), well, you get the idea: that function is probably
being called a million times more often than it needs to be.

mark

> Regards,
> Matthieu
>
> On 11/04/2011 08:53 PM, Clay Teeter wrote:
> Ok, here is some general profiling info:
>
> The slurmctld process doesn't seem too terribly out of whack. There are
> bursts of 100% CPU usage, but they lasted no longer than a couple of
> seconds.
>
> I ran gstack on the slurmctld process while squeue was running. Again, I
> didn't see anything that looked strange:
>
> Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 4 (Thread 0x420c7940 (LWP 21941)):
> #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 3 (Thread 0x421c8940 (LWP 21942)):
> #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> #2  0x000000000042b838 in _slurmctld_signal_hand ()
> #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x422c9940 (LWP 21943)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000000000046a9c1 in slurmctld_state_save ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x0000000000449cbf in _wr_wrlock ()
> #2  0x000000000044a1d5 in lock_slurmctld ()
> #3  0x000000000042af5e in _slurmctld_background ()
> #4  0x000000000042cb8c in main ()
>
> However, I did a gstack and strace on squeue, and things did look a bit
> weird.
>
> strace:
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> time(NULL) = 1320435240
> .....
> This keeps on going; this might be the sorting algo?
>
> gstack:
> #0  0xffffffffff600410 in ?? ()
> #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> #2  0x000000000042608b in _get_start_time ()
> #3  0x00000000004260e1 in _sort_job_by_time_start ()
> #4  0x0000000000437958 in slurm_list_sort ()
> #5  0x00000000004232af in print_jobs_array ()
> #6  0x0000000000420234 in _get_info ()
> #7  0x00000000004205a1 in main ()
>
> squeue is pegged at 100% the entire time. I noticed that if I pass the
> parameter -S u to squeue, the function _sort_job_by_time_start is still
> executed.
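[Editorial note: Mark's precompute suggestion, together with Matthieu's point about time() in the comparator, can be sketched as below. This is an illustrative stand-in, not the real slurm code: job_t and its fields are invented names, not the actual job_info_t layout. The current time is read exactly once, each job's effective start time is cached once, and the comparator then does a plain field compare with no time() call at all.]

```c
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for slurm's job record; job_t and its fields
 * are invented names, not the real job_info_t layout. */
typedef struct {
    time_t start_time;  /* 0 when the job has not started yet */
    time_t sort_key;    /* filled in once, before sorting */
} job_t;

/* Call time() exactly once for the whole array and cache each job's
 * effective start time, instead of recomputing it per comparison. */
static void precompute_sort_keys(job_t *jobs, size_t n)
{
    time_t now = time(NULL);
    for (size_t i = 0; i < n; i++)
        jobs[i].sort_key = jobs[i].start_time ? jobs[i].start_time : now;
}

/* The comparator makes no library calls: it only compares cached keys. */
static int cmp_start_time(const void *a, const void *b)
{
    const job_t *ja = a, *jb = b;
    if (ja->sort_key < jb->sort_key) return -1;
    if (ja->sort_key > jb->sort_key) return 1;
    return 0;
}
```

With ~17,000 jobs, even an O(N log N) sort performs a few hundred thousand comparisons, so hoisting the time() call (and any per-job start-time derivation) out of the comparator removes hundreds of thousands of time() calls from a single squeue run, which matches the repeated time(NULL) lines in Clay's strace.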
> Some interesting slurmctld log entries:
>
> [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
>
> [2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding locks
> [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> [2011-11-04T19:46:01] debug2: Performing purge of old job records
> [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
>
> Thanks for all your help,
> Clay
>
> On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:
>
> In my understanding of the internals, squeue does not query each node, only
> the controller. It is interesting to know that the performance issue seems
> to be due to the EC2 partition, but further information is needed to find
> the reason for such bad performance.
>
> The best way to find the problem is to observe the internal state of the
> controller while "squeue -p aws" is running:
> - You should first look at the CPU consumption of the slurmctld process to
> see if the issue is due to overly complex processing in an internal slurm
> function. The easiest way to get that information is to use top and watch
> the slurmctld process.
> - After that, in case you see no abnormal consumption, try to capture the
> state of the slurmctld process during the hang. You can do that with the
> gstack command, capturing the stacks of the slurmctld threads to see where
> they are waiting. If you can send the result of "gstack $(pgrep slurmctld)",
> it could provide interesting information.
> - In case you do see abnormal consumption, gstack can help a little, but a
> more evolved tool, such as a profiler like oprofile, will be necessary to
> see where most of the time is spent in the controller.
>
> Do not hesitate to send the results of the first two tests (top, gstack);
> they could help to find the bottleneck.
>
> Regards,
> Matthieu
>
> Clay Teeter wrote:
>
> I did notice something.
>
> Issuing squeue against the local partition is quite speedy, ~2s for 10K job
> ids. However, if I run "squeue -p aws" against the EC2 (aws) partition, I
> see the 2m+ delay. Does squeue query each node in the partition?
>
> Thanks,
> Clay
>
> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
>
> First, thanks for your response!
>
> Yeah, I figured that slurm shouldn't have a problem with the job count.
> First, some background. Our network is split into two parts: a local
> partition and an EC2 partition. For each task, two jobs (one dependent on
> the other) are submitted. The first job computes a large data file and
> uploads it to S3; the second job pulls the file from S3 and copies it to
> our SAN. In total, a large run requires about 30K jobs to be submitted.
>
> I'm not sure when squeue started slowing down. Currently, via sattr (about
> a 2s query), there are a little more than 17,000 jobs running.
>
> I'm worried that latency is the problem because half the cluster is in a
> remote datacenter... which, unfortunately, would be a show stopper.
> Version: slurm 2.3.0-2
>
> ControlMachine=master
> AuthType=auth/munge
> CacheGroups=0
> CryptoType=crypto/munge
> MaxJobCount=30000
> MpiDefault=none
> MpiParams=ports=12000-12999
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/slurm/spool
> SlurmUser=slurm
> StateSaveLocation=/var/slurm/state
> SwitchType=switch/none
> TaskPlugin=task/none
> TrackWCKey=yes
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> FastSchedule=1
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> AccountingStorageHost=master
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=
> AccountingStorageType=accounting_storage/mysql
> AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=hydra
> JobCompHost=master
> JobCompLoc=slurm_acct_db
> JobCompPass=cluster_service
> JobCompType=jobcomp/mysql
> JobCompUser=cluster_service
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm/slurmd.log
>
> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
>
> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
> PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
>
> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux <[email protected]> wrote:
>
> Hi,
>
> What is the version of SLURM that you are using?
> Can you send your slurm.conf for details of the configuration?
>
> With 2.3.1 you should be able to manage at least 10k jobs. We made tests
> with 2.2.7 plus backported patches from 2.3.x that showed good results
> with 10K jobs. At what number of jobs does the system start to hang?
>
> Matthieu
>
> On 11/03/2011 07:03 PM, Clay Teeter wrote:
>
> Hi group!
>
> I'm seeing that with about 30K (or so) jobs in the queue, squeue hangs for
> around 2 minutes before returning values. Does anyone know a way around
> this? I expect that this isn't standard behavior.
>
> Cheers,
> Clay
