OK, here is some general profiling info:
The slurmctld process doesn't seem too terribly out of whack. There
are bursts of 100% CPU usage, but they lasted no longer than a couple
of seconds.
I ran gstack on the slurmctld process while squeue was running. Again,
I didn't see anything that looked strange:
Thread 5 (Thread 0x41fc6940 (LWP 21939)):
#0 0x00002aaaaacb9df8 in select_p_job_test () from
/usr/lib64/slurm/select_cons_res.so
#1 0x00002aaaabad8cf3 in backfill_agent () from
/usr/lib64/slurm/sched_backfill.so
#2 0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x420c7940 (LWP 21941)):
#0 0x0000003cc86cd722 in select () from /lib64/libc.so.6
#1 0x000000000042daa2 in _slurmctld_rpc_mgr ()
#2 0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x421c8940 (LWP 21942)):
#0 0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
#1 0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
#2 0x000000000042b838 in _slurmctld_signal_hand ()
#3 0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x422c9940 (LWP 21943)):
#0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x000000000046a9c1 in slurmctld_state_save ()
#2 0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3 0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
#0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x0000000000449cbf in _wr_wrlock ()
#2 0x000000000044a1d5 in lock_slurmctld ()
#3 0x000000000042af5e in _slurmctld_background ()
#4 0x000000000042cb8c in main ()
However, I did a gstack and strace on squeue, and things did look a
bit weird.
strace:
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
time(NULL) = 1320435240
.....
This keeps on going; might this be the sorting algorithm?
gstack:
#0 0xffffffffff600410 in ?? ()
#1 0x0000003cc868c40d in time () from /lib64/libc.so.6
#2 0x000000000042608b in _get_start_time ()
#3 0x00000000004260e1 in _sort_job_by_time_start ()
#4 0x0000000000437958 in slurm_list_sort ()
#5 0x00000000004232af in print_jobs_array ()
#6 0x0000000000420234 in _get_info ()
#7 0x00000000004205a1 in main ()
squeue is pegged at 100% CPU the entire time. I noticed that even if
I pass the parameter -S u to squeue, the function
_sort_job_by_time_start is still executed.
Some interesting slurmctld log entries:
[2011-11-04T19:50:37] Warning: Note very large processing time from
_slurm_rpc_dump_jobs: usec=6155478
[2011-11-04T19:50:37] Warning: Note very large processing time from
schedule: usec=6158333
[2011-11-04T19:50:37] Warning: Note very large processing time from
_slurmctld_background: usec=15995841
[2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding
locks
[2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
[2011-11-04T19:46:01] debug2: Performing purge of old job records
[2011-11-04T19:46:01] Warning: Note very large processing time from
_slurmctld_background: usec=9779576
Thanks for all your help,
Clay
On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu
<[email protected]> wrote:
In my understanding of the internals, squeue does not query each
node but only the controller. It is interesting to know that the
performance issue seems to be tied to the EC2 partition, but
further information is needed to determine the reason for such
bad performance.
The best way to find the problem is to observe the internal state
of the controller while "squeue -p aws" is running:
- You should first look at the CPU consumption of the slurmctld
process to see if the problem is due to an overly complex
computation in an internal function of SLURM. The easiest way to
get that information is to use top and watch the consumption of
the slurmctld process.
- After that, in case you see no abnormal consumption, you should
try to capture the state of the slurmctld process during the hang.
You can do that just by using the gstack command to capture the
stacks of the slurmctld threads and see where they are waiting. If
you can send the result of "gstack $(pgrep slurmctld)", it could
bring interesting information.
- In case you do see abnormal consumption, gstack can help a
little, but a more advanced tool, a profiler such as oprofile,
will be necessary to see where most of the time is spent in the
controller.
Do not hesitate to send the results of the first two tests (top,
gstack); they could help to find the bottleneck.
Regards,
Matthieu
Clay Teeter wrote:
I did notice something.
Issuing squeue against the local partition is quite speedy,
~2s for 10K job IDs. However, if I run "squeue -p aws" against
the EC2 (aws) partition, I see the 2m+ delay. Does squeue query
each node in the partition?
Thanks,
Clay
On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter
<[email protected]> wrote:
First, thanks for your response!
Yeah, I figured that SLURM shouldn't have a problem with the job
count.
First, some background. Our network is split into two parts: a
local partition and an EC2 partition. For each task, two jobs
(one dependent on the other) are submitted. The first job
computes a large data file and uploads it to S3; the second job
pulls the files from S3 and copies them to our SAN. In total, a
large run requires about 30K jobs to be submitted.
I'm not sure when squeue started slowing down. Currently, via
sattr (about a 2s query), there are a little more than 17000
jobs running.
I'm worried that, because half the cluster is in a remote
datacenter, latency is the problem... which, unfortunately,
would be a show stopper.
Version: slurm 2.3.0-2
ControlMachine=master
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MaxJobCount=30000
MpiDefault=none
MpiParams=ports=12000-12999
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/slurm/spool
SlurmUser=slurm
StateSaveLocation=/var/slurm/state
SwitchType=switch/none
TaskPlugin=task/none
TrackWCKey=yes
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageHost=master
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=
AccountingStorageType=accounting_storage/mysql
AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=hydra
JobCompHost=master
JobCompLoc=slurm_acct_db
JobCompPass=cluster_service
JobCompType=jobcomp/mysql
JobCompUser=cluster_service
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
<[email protected]> wrote:
Hi,
What is the version of SLURM that you are using? Can you send
your slurm.conf for details of the configuration?
With 2.3.1 you should be able to manage at least 10k jobs. We
made tests with 2.2.7 plus backported patches from 2.3.x that
showed good results with 10K jobs.
What is the number of jobs at which the system starts to hang?
Matthieu
On 11/03/2011 07:03 PM, Clay Teeter wrote:
Hi group!
I'm seeing that with about 30K (or so) jobs in the queue, squeue
hangs for around 2 minutes before returning values. Does anyone
know a way around this? I expect that this isn't standard
behavior.
Cheers,
Clay