Ok, here is some general profiling info: the slurmctld process doesn't seem too terribly out of whack. There are bursts of 100% CPU usage, but they lasted no longer than a couple of seconds.
I ran gstack on the slurmctld process while squeue was running. Again, I didn't see anything that looked strange:

Thread 5 (Thread 0x41fc6940 (LWP 21939)):
#0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
#1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x420c7940 (LWP 21941)):
#0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
#1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x421c8940 (LWP 21942)):
#0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
#1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
#2  0x000000000042b838 in _slurmctld_signal_hand ()
#3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x422c9940 (LWP 21943)):
#0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000046a9c1 in slurmctld_state_save ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
#0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000449cbf in _wr_wrlock ()
#2  0x000000000044a1d5 in lock_slurmctld ()
#3  0x000000000042af5e in _slurmctld_background ()
#4  0x000000000042cb8c in main ()

However, I did a gstack and strace on squeue, and things did look a bit weird.

strace:
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
.....
This keeps on going; could this be the sorting algorithm?

gstack:
#0  0xffffffffff600410 in ?? ()
#1  0x0000003cc868c40d in time () from /lib64/libc.so.6
#2  0x000000000042608b in _get_start_time ()
#3  0x00000000004260e1 in _sort_job_by_time_start ()
#4  0x0000000000437958 in slurm_list_sort ()
#5  0x00000000004232af in print_jobs_array ()
#6  0x0000000000420234 in _get_info ()
#7  0x00000000004205a1 in main ()

squeue is pegged at 100% CPU the entire time. I noticed that even if I pass "-S u" to squeue, the function _sort_job_by_time_start is still executed.

Some interesting slurmctld log entries:

[2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
[2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
[2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
[2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding locks
[2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
[2011-11-04T19:46:01] debug2: Performing purge of old job records
[2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576

Thanks for all your help,
Clay

On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:

> In my understanding of the internals, squeue does not query each node, but
> only the controller. It is interesting to know that the performance issue
> seems to be due to the EC2 partition, but further information is needed to
> understand the reason for such bad performance.
>
> The best way to find the problem is to observe the internal state of the
> controller while "squeue -p aws" is running:
> - You should first look at the consumption of the slurmctld process to see
> if it is due to too high a processing complexity in an internal function
> of SLURM.
> The easiest way to get that information is to use top and look at the
> consumption of the slurmctld process.
> - After that, in case you see no abnormal consumption, you should try to
> capture the state of the slurmctld process during the hang. You can do
> that just by using the gstack command to capture the stacks of the
> slurmctld threads and see where they are waiting. If you can send the
> result of "gstack $(pgrep slurmctld)", it could bring interesting
> information.
> - In case you do see abnormal consumption, gstack can help a little bit,
> but a more advanced tool, a profiler such as oprofile, will be necessary
> to see where most of the time is spent in the controller.
>
> Do not hesitate to send the results of the first 2 tests (top, gstack);
> they could help to find the bottleneck.
>
> Regards,
> Matthieu
>
> Clay Teeter a écrit :
>
>> I did notice something.
>>
>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
>> job ids. However, if I "squeue -p aws" the EC2 (aws) partition, I see
>> the 2m+ delay. Does squeue query each node in the partition?
>>
>> Thanks,
>> Clay
>>
>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
>>
>> First, thanks for your response!
>>
>> Yeah, I figured that SLURM shouldn't have a problem with the job
>> count.
>> First, some background. Our network is split into two parts: a
>> local partition and an EC2 partition. For each task, two jobs
>> (one dependent on the other) are submitted. The first job
>> computes a large data file and uploads it to S3; the second job pulls
>> the file from S3 and copies it to our SAN. In total, a large job
>> requires about 30K jobs to be submitted.
>>
>> I'm not sure when squeue started slowing down. Currently, via sattr
>> (about a 2s query), there are a little more than 17,000 jobs running.
>>
>> I'm worried that because half the cluster is in a remote datacenter,
>> latency is the problem...
>> which, unfortunately, would be a show stopper.
>>
>> Version: slurm 2.3.0-2
>>
>> ControlMachine=master
>> AuthType=auth/munge
>> CacheGroups=0
>> CryptoType=crypto/munge
>> MaxJobCount=30000
>> MpiDefault=none
>> MpiParams=ports=12000-12999
>> ProctrackType=proctrack/pgid
>> ReturnToService=1
>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurm/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/slurm/spool
>> SlurmUser=slurm
>> StateSaveLocation=/var/slurm/state
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> TrackWCKey=yes
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> SchedulerPort=7321
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core
>> AccountingStorageHost=master
>> AccountingStorageLoc=slurm_acct_db
>> AccountingStoragePass=
>> AccountingStorageType=accounting_storage/mysql
>> AccountingStorageUser=
>> AccountingStoreJobComment=YES
>> ClusterName=hydra
>> JobCompHost=master
>> JobCompLoc=slurm_acct_db
>> JobCompPass=cluster_service
>> JobCompType=jobcomp/mysql
>> JobCompUser=cluster_service
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>
>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
>>
>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
>>
>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> What is the version of SLURM that you are using? Can you send
>> your slurm.conf for the details of the configuration?
>>
>> With 2.3.1 you should be able to manage at least 10K jobs. We
>> ran tests with 2.2.7 plus backported patches from 2.3.x that
>> showed good results with 10K jobs.
>> What is the number of jobs at which the system starts to hang?
>>
>> Matthieu
>>
>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
>>
>>> Hi group!
>>>
>>> I'm seeing that with about 30K (or so) jobs in the queue, squeue
>>> hangs for around 2 minutes before returning values. Does anyone
>>> know a way around this? I expect that this isn't standard behavior.
>>>
>>> Cheers,
>>> Clay
