Well, the gstack of the slurmctld shows that the RPC of the squeue is already finished. As the log says it took usec=6155478, that seems to be a reasonable amount of time to process such a large number of jobs. As you noticed, the problem seems to be in the squeue internals, which appear to take too much time to sort the results.

In the _print_jobs_array function of src/squeue/print.c, it seems that sort_jobs_by_start_time(l) is always called regardless of the requested format or sort parameters. It may be interesting to see how this part of the code could be enhanced to optimize the sort of a large set of jobs (perhaps avoiding calling time() at each comparison could be a good start). Moe, Danny, what do you think about that problem?
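To illustrate the idea, here is a hypothetical sketch (not the actual squeue code; the fake_job_t type and function names are invented for the example) of hoisting the time() call out of the comparator and caching it once for the whole sort:

```c
#include <stdlib.h>
#include <time.h>

/* Invented stand-in for squeue's job record; only the fields
 * needed by the example are present. */
typedef struct {
    long   job_id;
    time_t start_time;   /* 0 means "not started yet" */
} fake_job_t;

/* "now" is captured once per sort instead of calling time(NULL)
 * inside the comparator, which qsort invokes O(n log n) times. */
static time_t sort_now;

static int cmp_by_start_time(const void *a, const void *b)
{
    time_t ta = ((const fake_job_t *) a)->start_time;
    time_t tb = ((const fake_job_t *) b)->start_time;

    /* Jobs without a start time sort as if they started "now". */
    if (ta == 0) ta = sort_now;
    if (tb == 0) tb = sort_now;

    return (ta > tb) - (ta < tb);
}

void sort_jobs_cached(fake_job_t *jobs, size_t n)
{
    sort_now = time(NULL);   /* one syscall for the whole sort */
    qsort(jobs, n, sizeof(fake_job_t), cmp_by_start_time);
}
```

With 30K jobs, a comparison sort performs on the order of 30000 * log2(30000), i.e. several hundred thousand comparisons, so moving time(NULL) out of the comparator would remove hundreds of thousands of syscalls, which matches the endless time(NULL) lines in Clay's strace below.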

Regards,
Matthieu


On 11/04/2011 08:53 PM, Clay Teeter wrote:
Ok, here is some general profiling info:

The slurmctld process doesn't seem too terribly out of whack. There are bursts of 100% CPU usage, but they lasted no longer than a couple of seconds.

I ran gstack on the slurmctld process while squeue was running. Again, I didn't see anything that looked strange:

Thread 5 (Thread 0x41fc6940 (LWP 21939)):
#0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
#1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x420c7940 (LWP 21941)):
#0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
#1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x421c8940 (LWP 21942)):
#0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
#1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
#2  0x000000000042b838 in _slurmctld_signal_hand ()
#3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x422c9940 (LWP 21943)):
#0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000046a9c1 in slurmctld_state_save ()
#2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
#0 0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000449cbf in _wr_wrlock ()
#2  0x000000000044a1d5 in lock_slurmctld ()
#3  0x000000000042af5e in _slurmctld_background ()
#4  0x000000000042cb8c in main ()

However, I did a gstack and strace on squeue, and things did look a bit weird.
strace:
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
time(NULL)                              = 1320435240
.....
It keeps on going like this; could this be the sorting algorithm?


gstack:
#0  0xffffffffff600410 in ?? ()
#1  0x0000003cc868c40d in time () from /lib64/libc.so.6
#2  0x000000000042608b in _get_start_time ()
#3  0x00000000004260e1 in _sort_job_by_time_start ()
#4  0x0000000000437958 in slurm_list_sort ()
#5  0x00000000004232af in print_jobs_array ()
#6  0x0000000000420234 in _get_info ()
#7  0x00000000004205a1 in main ()

squeue is pegged at 100% the entire time. I noticed that if I pass the param -S u to squeue, the function _sort_job_by_time_start is still executed.
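If the fix ends up being a guard around the default sort, it might look something like this sketch (hypothetical: the pick_sort name and the returned labels are invented for illustration based on the observed behaviour, not taken from the real squeue source):

```c
#include <string.h>

/* Hypothetical guard: fall back to the start-time sort only when the
 * user supplied no -S sort specification.  The function name and the
 * labels it returns are invented for this example. */
const char *pick_sort(const char *sort_spec)
{
    if (sort_spec == NULL || sort_spec[0] == '\0')
        return "start_time";   /* default ordering */
    return sort_spec;          /* honor -S (e.g. "u") and skip the time sort */
}
```

The point is simply that when -S is given, the expensive start-time comparator never needs to run at all.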

Some interesting slurmctld log entries:
[2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
[2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
[2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841


[2011-11-04T19:46:01] debug: backfill: loop taking too long, yielding locks
[2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
[2011-11-04T19:46:01] debug2: Performing purge of old job records
[2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576

Thanks for all your help,
Clay

On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:

    In my understanding of the internals, squeue does not query each
    node but only the controller. It is interesting to know that the
    performance issue seems to be due to the EC2 partition, but
    further information is needed to find the reason for such bad
    performance.

    The best way to find the problem is to observe the internal state
    of the controller while the "squeue -p aws" is running:
    - You should first look at the CPU consumption of the slurmctld
    process to see if the issue is due to overly complex processing
    in an internal function of SLURM. The easiest way to get that
    information is to run top and watch the slurmctld process.
    - After that, in case you see no abnormal consumption, you should
    try to capture the state of the slurmctld process during the hang.
    You can do that just by using the gstack command to capture the
    stacks of the slurmctld threads and see where they are waiting. If
    you can send the result of "gstack $(pgrep slurmctld)", it could
    bring interesting information.
    - In case you see abnormal consumption, gstack can help a little
    bit, but a more evolved tool, a profiler such as oprofile, will be
    necessary to see where most of the time is spent in the
    controller.

    Do not hesitate to send the results of the first two tests (top,
    gstack); they could help to find the bottleneck.

    Regards,
    Matthieu

    Clay Teeter wrote:

        I did notice something.

        Issuing squeue against the local partition is quite speedy,
        ~2s for 10K job ids.  However, if I "squeue -p aws" the
        EC2 (aws) partition, I see the 2m+ delay.  Does squeue query
        each node in the partition?

        Thanks,
        Clay

        On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter
        <[email protected]> wrote:

           First, thanks for your response!

           Yeah, I figured that slurm shouldn't have a problem with
           the job count.
           First, some background. Our network is split into two
           parts: a local partition and an EC2 partition. For each
           task, two jobs (one dependent on the other) are submitted.
           The first job computes a large data file and submits it to
           S3; the second job pulls the files from S3 and copies them
           to our SAN. In total, a large job requires about 30K jobs
           to be submitted.

           I'm not sure when squeue started slowing down. Currently,
           via sattr (about a 2s query), there are a little more than
           17000 jobs running.

           I'm worried that, because half the cluster is in a remote
           datacenter, latency is the problem... which, unfortunately,
           would be a show stopper.

           Version: slurm 2.3.0-2
           ControlMachine=master
           AuthType=auth/munge
           CacheGroups=0
           CryptoType=crypto/munge
           MaxJobCount=30000
           MpiDefault=none
           MpiParams=ports=12000-12999
           ProctrackType=proctrack/pgid
           ReturnToService=1
           SlurmctldPidFile=/var/run/slurm/slurmctld.pid
           SlurmctldPort=6817
           SlurmdPidFile=/var/run/slurm/slurmd.pid
           SlurmdPort=6818
           SlurmdSpoolDir=/var/slurm/spool
           SlurmUser=slurm
           StateSaveLocation=/var/slurm/state
           SwitchType=switch/none
           TaskPlugin=task/none
           TrackWCKey=yes
           InactiveLimit=0
           KillWait=30
           MinJobAge=300
           SlurmctldTimeout=120
           SlurmdTimeout=300
           Waittime=0
           FastSchedule=1
           SchedulerType=sched/backfill
           SchedulerPort=7321
           SelectType=select/cons_res
           SelectTypeParameters=CR_Core
           AccountingStorageHost=master
           AccountingStorageLoc=slurm_acct_db
           AccountingStoragePass=
           AccountingStorageType=accounting_storage/mysql
           AccountingStorageUser=
           AccountingStoreJobComment=YES
           ClusterName=hydra
           JobCompHost=master
           JobCompLoc=slurm_acct_db
           JobCompPass=cluster_service
           JobCompType=jobcomp/mysql
           JobCompUser=cluster_service
           JobAcctGatherFrequency=30
           JobAcctGatherType=jobacct_gather/linux
           SlurmctldDebug=3
           SlurmctldLogFile=/var/log/slurm/slurmctld.log
           SlurmdDebug=3
           SlurmdLogFile=/var/log/slurm/slurmd.log
           NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
           NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN

           PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE
           State=UP Shared=NO
           PartitionName=all Nodes=aws[3-12],centos1 Default=NO
           MaxTime=INFINITE State=UP Shared=NO
           PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE
           State=UP Shared=NO



           On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux
           <[email protected]> wrote:

               Hi,

               What is the version of SLURM that you are using? Can
               you send your slurm.conf for details of the
               configuration?

               With 2.3.1 you should be able to manage at least 10K
               jobs. We made tests with 2.2.7 plus backported patches
               from 2.3.x that showed good results with 10K jobs.
               At what number of jobs does the system start to hang?

               Matthieu


               On 11/03/2011 07:03 PM, Clay Teeter wrote:

                   Hi group!

                    I'm seeing that with about 30K (or so) jobs in
                    the queue, squeue hangs for around 2 minutes
                    before returning values. Does anyone know a way
                    around this? I expect that this isn't standard
                    behavior.

                   Cheers,
                   Clay






