On Fri, Nov 4, 2011 at 6:34 PM, Matthieu Hautreux
<[email protected]> wrote:
> In the _print_jobs_array function of src/squeue/print.c, it seems that
> sort_jobs_by_start_time (l) is always called, regardless of the requested
> format or sort parameters. It may be interesting to see how this part of
> the code could be enhanced to optimize the sort of a large set of jobs
> (perhaps avoiding a call to time() at each comparison would be a good
> start).

Just a quick note - the overhead of time(2) is pretty low on modern
Linux and many other operating systems: it is no longer a full system
call (Google: vDSO + gettimeofday), so executing 30K time(2) calls
would take well under 0.5 sec on a modern processor.

Rayson



> Moe, Danny, what do you think about this problem?
>
> Regards,
> Matthieu
>
>
> On 11/04/2011 08:53 PM, Clay Teeter wrote:
>
> Ok, here is some general profiling info:
> The slurmctld process doesn't seem too terribly out of whack.  There are
> bursts of 100% CPU usage, but they lasted no longer than a couple of
> seconds.  I ran gstack on the slurmctld process while squeue was running.
> Again, I didn't see anything that looked strange:
> Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> #0  0x00002aaaaacb9df8 in select_p_job_test () from
> /usr/lib64/slurm/select_cons_res.so
> #1  0x00002aaaabad8cf3 in backfill_agent () from
> /usr/lib64/slurm/sched_backfill.so
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 4 (Thread 0x420c7940 (LWP 21941)):
> #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 3 (Thread 0x421c8940 (LWP 21942)):
> #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> #2  0x000000000042b838 in _slurmctld_signal_hand ()
> #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x422c9940 (LWP 21943)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000046a9c1 in slurmctld_state_save ()
> #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x0000000000449cbf in _wr_wrlock ()
> #2  0x000000000044a1d5 in lock_slurmctld ()
> #3  0x000000000042af5e in _slurmctld_background ()
> #4  0x000000000042cb8c in main ()
> However, I did a gstack and strace on squeue and things did look a
> bit weird.
> strace:
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> time(NULL)                              = 1320435240
> .....
> This keeps on going -- might this be the sorting algorithm?
>
> gstack:
> #0  0xffffffffff600410 in ?? ()
> #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> #2  0x000000000042608b in _get_start_time ()
> #3  0x00000000004260e1 in _sort_job_by_time_start ()
> #4  0x0000000000437958 in slurm_list_sort ()
> #5  0x00000000004232af in print_jobs_array ()
> #6  0x0000000000420234 in _get_info ()
> #7  0x00000000004205a1 in main ()
> squeue is pegged at 100% the entire time.  I noticed that if I pass the
> param -S u to squeue, the function _sort_job_by_time_start is still
> executed.
> Some interesting slurmctld log entries:
> [2011-11-04T19:50:37] Warning: Note very large processing time from
> _slurm_rpc_dump_jobs: usec=6155478
> [2011-11-04T19:50:37] Warning: Note very large processing time from
> schedule: usec=6158333
> [2011-11-04T19:50:37] Warning: Note very large processing time from
> _slurmctld_background: usec=15995841
>
> [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> [2011-11-04T19:46:01] debug2: Performing purge of old job records
> [2011-11-04T19:46:01] Warning: Note very large processing time from
> _slurmctld_background: usec=9779576
> Thanks for all your help,
> Clay
> On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]>
> wrote:
>>
>> In my understanding of the internals, squeue does not query each node but
>> only the controller. It is interesting to know that the performance issue
>> seems to be due to the EC2 partition, but further information is needed
>> to know the reason for such bad performance.
>>
>> The best way to find the problem is to observe the internal state of the
>> controller while the "squeue -p aws" is running:
>> - You should first look at the CPU consumption of the slurmctld process
>> to see if it is due to too high a processing complexity in an internal
>> function of slurm.
>>  The easiest way to get that information is to use top and look at the
>> consumption of the slurmctld process.
>> - After that, in case you see no abnormal consumption, you should try to
>> capture the state of the slurmctld process during the hang. You can do
>> that just by using the gstack command to capture the stacks of the
>> slurmctld threads and see where they are waiting. If you can send the
>> result of "gstack $(pgrep slurmctld)", it could bring interesting
>> information.
>> - In case you do see abnormal consumption, gstack can help a little bit,
>> but a more advanced tool, a profiler such as oprofile, will be necessary
>> to see where most of the time is spent in the controller.
>>
>> Do not hesitate to send the result of the first two tests (top, gstack);
>> it could help to find the bottleneck.
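Condensed into a copy-paste form, the first two checks could look like this (a sketch assuming slurmctld runs on the host where you execute it; gstack is the gdb wrapper shipped on RHEL-like systems):

```shell
#!/bin/sh
# Quick checks on the controller host while "squeue -p aws" runs.
pid=$(pgrep -o slurmctld || true)
if [ -n "$pid" ]; then
    top -b -n 1 -p "$pid"   # one batch snapshot of CPU consumption
    gstack "$pid"           # stacks of all slurmctld threads
else
    echo "slurmctld not running on this host"
fi
```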
>>
>> Regards,
>> Matthieu
>>
>> Clay Teeter wrote:
>>>
>>> I did notice something.
>>>
>>> Issuing squeue against the local partition is quite speedy, ~2s for 10K
>>> job ids.  However, if I "squeue -p aws" the ec2 (aws) partition, I see the
>>> 2m+ delay.  Does squeue query each node in the partition?
>>>
>>> Thanks,
>>> Clay
>>>
>>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
>>>
>>>    First, thanks for your response!
>>>
>>>    Yeah, I figured that slurm shouldn't have a problem with the job
>>>    count.
>>>    First, some background.  Our network is split into two parts: a
>>>    local partition and an ec2 partition.  For each task, two jobs
>>>    (one dependent on the other) are submitted.  The first job
>>>    computes a large datafile and uploads it to S3; the second job
>>>    pulls the files from S3 and copies them to our SAN.  In total, a
>>>    large run requires about 30K jobs to be submitted.
>>>
>>>    I'm not sure when squeue started slowing down.  Currently, via sattr
>>>    (about a 2s query), there are a little more than 17000 jobs running.
>>>
>>>    I'm worried that because half the cluster is in a remote datacenter,
>>>    latency is the problem... which, unfortunately, would be a show
>>>    stopper.
>>>
>>>    Version: slurm 2.3.0-2
>>>    ControlMachine=master
>>>    AuthType=auth/munge
>>>    CacheGroups=0
>>>    CryptoType=crypto/munge
>>>    MaxJobCount=30000
>>>    MpiDefault=none
>>>    MpiParams=ports=12000-12999
>>>    ProctrackType=proctrack/pgid
>>>    ReturnToService=1
>>>    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>>>    SlurmctldPort=6817
>>>    SlurmdPidFile=/var/run/slurm/slurmd.pid
>>>    SlurmdPort=6818
>>>    SlurmdSpoolDir=/var/slurm/spool
>>>    SlurmUser=slurm
>>>    StateSaveLocation=/var/slurm/state
>>>    SwitchType=switch/none
>>>    TaskPlugin=task/none
>>>    TrackWCKey=yes
>>>    InactiveLimit=0
>>>    KillWait=30
>>>    MinJobAge=300
>>>    SlurmctldTimeout=120
>>>    SlurmdTimeout=300
>>>    Waittime=0
>>>    FastSchedule=1
>>>    SchedulerType=sched/backfill
>>>    SchedulerPort=7321
>>>    SelectType=select/cons_res
>>>    SelectTypeParameters=CR_Core
>>>    AccountingStorageHost=master
>>>    AccountingStorageLoc=slurm_acct_db
>>>    AccountingStoragePass=
>>>    AccountingStorageType=accounting_storage/mysql
>>>    AccountingStorageUser=
>>>    AccountingStoreJobComment=YES
>>>    ClusterName=hydra
>>>    JobCompHost=master
>>>    JobCompLoc=slurm_acct_db
>>>    JobCompPass=cluster_service
>>>    JobCompType=jobcomp/mysql
>>>    JobCompUser=cluster_service
>>>    JobAcctGatherFrequency=30
>>>    JobAcctGatherType=jobacct_gather/linux
>>>    SlurmctldDebug=3
>>>    SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>    SlurmdDebug=3
>>>    SlurmdLogFile=/var/log/slurm/slurmd.log
>>>    NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4
>>>    ThreadsPerCore=2 State=UNKNOWN
>>>    NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2
>>>    ThreadsPerCore=1 State=UNKNOWN
>>>
>>>    PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
>>>    State=UP Shared=NO
>>>    PartitionName=all Nodes=aws[3-12],centos1 Default=NO
>>>    MaxTime=INFINITE State=UP Shared=NO
>>>    PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
>>>    State=UP Shared=NO
>>>
>>>
>>>
>>>    On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
>>>    <[email protected]> wrote:
>>>
>>>        Hi,
>>>
>>>        What version of SLURM are you using? Can you send your
>>>        slurm.conf for details of the configuration?
>>>
>>>        With 2.3.1 you should be able to manage at least 10k jobs. We
>>>        made tests with 2.2.7 plus backported patches from 2.3.x that
>>>        showed good results with 10K jobs.
>>>        At what number of jobs does the system start to hang?
>>>
>>>        Matthieu
>>>
>>>
>>>        On 11/03/2011 07:03 PM, Clay Teeter wrote:
>>>>
>>>>        Hi group!
>>>>
>>>>        I'm seeing that with about 30K (or so) jobs in the queue,
>>>>        squeue hangs
>>>>        for around 2 minutes before returning values.  Does anyone
>>>>        know a way
>>>>        around this?  I expect that this isn't standard behavior.
>>>>
>>>>        Cheers,
>>>>        Clay
>>>
>>>
>>>
>>
>
>
>



--

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
