On Mon, 7 Nov 2011 13:29:27 -0800, Moe Jette <[email protected]> wrote:
> Note: This patch will not work with the "-i" or "--iterate" option as
> "now" would need to be reset for each iteration.
That is a good point. That patch should not be used. Instead a global
time_t now should be set and initialized at each iteration.

Thanks,
mark

> Quoting "Mark A. Grondona" <[email protected]>:
>
> > On Mon, 7 Nov 2011 13:00:00 -0800, Rayson Ho <[email protected]> wrote:
> >> On Mon, Nov 7, 2011 at 3:33 PM, Clay Teeter <[email protected]> wrote:
> >> > A couple of things to mention about the cluster (slow) node.
> >> > cat /proc/sys/kernel/vsyscall64 = 0
> >>
> >> Yup, if /proc/sys/kernel/vsyscall64 is 0, then gettimeofday() and
> >> friends are real system calls...
> >>
> >> So doing ~1B real system calls to get the same time(2) data is really
> >> bad for performance!
> >>
> >> > the timing loop posted by Mark took > 1m
> >> > I'll implement the suggestions by Mark and Rayson and see what happens!
> >>
> >> And replying to Mark's previous email: I agree with Mark that we need
> >> to call time(2) once and save it somewhere, but it is a bigger change
> >> and I am not sure if we will need a newer SLURM version to ship the
> >> change with. The patch I sent is a second choice; my first suggestion
> >> in the email was to call time(2) once and pass the time data into the
> >> comparison function (but it would still require some modifications in
> >> various places).
> >
> > The following patch is a stopgap that should only initialize
> > "now" once, preventing unnecessary calls to time(2), and ensuring
> > the value for "now" doesn't change during a sort.
> >
> > It isn't ideal since there will be an extraneous comparison
> > for each call to _get_start_time, but it should reduce the calls
> > to time(2) to 1.
> >
> >
> > From 318bc74372b7a73b0936e47a7441448b4a693ed7 Mon Sep 17 00:00:00 2001
> > From: Mark A. Grondona <[email protected]>
> > Date: Mon, 7 Nov 2011 13:19:16 -0800
> > Subject: [PATCH] squeue: Only call time(2) once when sorting by job
> >  start time
> >
> > Since the list_sort algorithm is O(n^2), we want to avoid any
> > unnecessary work in sort compare functions, especially work
> > that is duplicated. In the specific case of _get_start_time(),
> > we also want to avoid returning a different start time for
> > PENDING jobs during a single sort (which may happen if the
> > sort takes >1s).
> >
> > So, as a stopgap measure, define "now" as static in _get_start_time
> > so that we only initialize it once.
> > ---
> >  src/squeue/sort.c |   10 +++++++---
> >  1 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/src/squeue/sort.c b/src/squeue/sort.c
> > index af194ef..b75dc99 100644
> > --- a/src/squeue/sort.c
> > +++ b/src/squeue/sort.c
> > @@ -583,12 +583,16 @@ static int _sort_job_by_time_limit(void *void1, void *void2)
> >
> >  static uint32_t _get_start_time(job_info_t *job)
> >  {
> > -	time_t now = time(NULL);
> > +	static time_t now = (time_t) 0;
> >
> >  	if (job->start_time == (time_t) 0)
> >  		return 0xffffffff;
> > -	if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> > -		return (uint32_t) now;
> > +	if (job->job_state == JOB_PENDING) {
> > +		if (now == (time_t) 0)
> > +			now = time(NULL);
> > +		if (job->start_time < now)
> > +			return (uint32_t) now;
> > +	}
> >  	return (uint32_t) job->start_time;
> >  }
> >
> > --
> > 1.7.1
>
> >> Rayson
> >>
> >> > Thanks,
> >> > Clay
> >> >
> >> > On Sat, Nov 5, 2011 at 8:08 AM, Mark A. Grondona <[email protected]> wrote:
> >> >>
> >> >> On Fri, 4 Nov 2011 20:33:31 -0700, Rayson Ho <[email protected]> wrote:
> >> >> > On Fri, Nov 4, 2011 at 9:58 PM, Mark A. Grondona <[email protected]> wrote:
> >> >> > > This is definitely true, but the fact that Clay's debugger caught
> >> >> > > squeue in the time(2) system call means something strange might be
> >> >> > > going on with that call.
> >> >> >
> >> >> > That's possible, if it is running on an old Linux kernel, or if
> >> >> > /proc/sys/kernel/vsyscall64 is 0.
> >> >>
> >> >> Also depends on the timer used by gettimeofday.
> >> >>
> >> >> > (And BTW, that's 100M calls)
> >> >>
> >> >> Oh, yeah, I increased it to get >1s runtime on my system. Sorry!
> >> >>
> >> >> > > Clay, you might try running the following
> >> >> > > benchmark on your system just to see how long 10M time(2) calls
> >> >> > > take:
> >> >> >
> >> >> > According to the comments in src/common/list.c, list_sort() has a
> >> >> > complexity of O(n^2). So for 30K jobs with complexity of n^2, it's
> >> >> > like 900M calls to time(2)!!
> >> >>
> >> >> Yeah, many of the sort compare functions in squeue have this problem.
> >> >> Lots of code needlessly recomputes the same values over and over again
> >> >> (n^2 - n times, right?)
> >> >>
> >> >> > If that really is the case, then it would be slow to execute almost 1
> >> >> > billion time(2) calls even if the overhead is really low for each
> >> >> > call. (In the end, a VDSO call is not free)
> >> >>
> >> >> FWIW, I tried to reproduce Clay's problem on a test system
> >> >> (though I only got up to 20K jobs), and was not able to.
> >> >>
> >> >> > I am not sure how job->start_time gets assigned in the code (the
> >> >> > header says "time_t start_time; /* time execution begins, actual
> >> >> > or expected */", so I would expect it to be unset for pending jobs
> >> >> > without an expected start time. I don't have a SLURM cluster that I
> >> >> > can fool around with and attach a debugger to, and there are too many
> >> >> > start_time variables in the code blocking me from digging deeper!!).
> >> >> >
> >> >> > - If job->start_time really represents the expected start time for
> >> >> > every job, then I guess calling time(2) once and passing the value to
> >> >> > the comparison function would make sense.
> >> >> >
> >> >> > - But if in the common case start_time has a value of 0, then the
> >> >> > following patch should help (I've checked with gcc 4.4.5; even at -O3,
> >> >> > there's no sinking of time(NULL) below the first if statement, so we
> >> >> > need to do it ourselves):
> >> >>
> >> >> Your patch is a good start, I would think. However, the start_time
> >> >> is still needlessly recomputed for every pending job (I _think_
> >> >> pending jobs' start_time is set to submit time, until they actually
> >> >> start). So if you have 30K pending jobs you'd be in the same boat.
> >> >>
> >> >> Also, think about what happens when the sort takes >1s. The start
> >> >> time of these pending jobs starts to change! That seems wrong.
> >> >>
> >> >> The best fix for this one problem (if it is the problem) would be
> >> >> to set a global "now" that doesn't change, so the time(2) call
> >> >> only needs to be made once, and every pending job gets the same
> >> >> value for start_time.
> >> >>
> >> >> In general, the sorting and other code in squeue could be greatly
> >> >> improved by precomputing results or otherwise memoizing results
> >> >> (i.e. store results in a hash that could be checked before
> >> >> recomputing the same result you had before). A good example
> >> >> is _sort_job_by_node_list(). That function does 'a lot' of work,
> >> >> and like Rayson said, it could be called up to 900M times
> >> >> needlessly for 30K jobs!
> >> >>
> >> >> BTW, in my testing, profiling showed that the slowest part of
> >> >> squeue was the uid_to_string() call, which is called at least
> >> >> once for every job (in my case, every job had the same uid,
> >> >> so that call should have only been made once). Removing that call
> >> >> more than halved the runtime of squeue.
> >> >>
> >> >> mark
> >> >>
> >> >> >
> >> >> > [rayson@computer slurm-2.3.0-2]$ diff -U4 sort.c.original src/squeue/sort.c
> >> >> > --- sort.c.original	2011-11-04 22:47:02.001064100 -0400
> >> >> > +++ src/squeue/sort.c	2011-11-04 22:47:55.187063769 -0400
> >> >> > @@ -582,13 +582,13 @@
> >> >> >  }
> >> >> >
> >> >> >  static uint32_t _get_start_time(job_info_t *job)
> >> >> >  {
> >> >> > -	time_t now = time(NULL);
> >> >> > +	time_t now;
> >> >> >
> >> >> >  	if (job->start_time == (time_t) 0)
> >> >> >  		return 0xffffffff;
> >> >> > -	if ((job->job_state == JOB_PENDING) && (job->start_time < now))
> >> >> > +	if ((job->job_state == JOB_PENDING) && (job->start_time < (now = time(NULL))))
> >> >> >  		return (uint32_t) now;
> >> >> >  	return (uint32_t) job->start_time;
> >> >> >  }
> >> >> >
> >> >> > Rayson
> >> >> >
> >> >> > >
> >> >> > > #include <time.h>
> >> >> > > int main (int ac, char **av)
> >> >> > > {
> >> >> > >     unsigned long i;
> >> >> > >     time_t t;
> >> >> > >     for (i = 0; i < 100000000UL; i++)
> >> >> > >         t = time(NULL);
> >> >> > >     return 0;
> >> >> > > }
> >> >> > >
> >> >> > > On my system I get:
> >> >> > >
> >> >> > > $ time ./timebench
> >> >> > >
> >> >> > > real    0m3.794s
> >> >> > > user    0m3.791s
> >> >> > >
> >> >> > > mark
> >> >> > >
> >> >> > >> Rayson
> >> >> > >>
> >> >> > >> > Moe, Danny, what do you think about that problem ?
> >> >> > >> >
> >> >> > >> > Regards,
> >> >> > >> > Matthieu
> >> >> > >> >
> >> >> > >> > On 11/04/2011 08:53 PM, Clay Teeter wrote:
> >> >> > >> >
> >> >> > >> > Ok, here is some general profiling info:
> >> >> > >> > the slurmctl process doesn't seem too terribly out of whack. There
> >> >> > >> > are bursts of 100% cpu usage, but they lasted no longer than a
> >> >> > >> > couple seconds.
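[Editor's note: Mark's uid_to_string() observation above — one slow lookup repeated for jobs that mostly share a handful of uids — is a textbook case for the memoization he suggests. A minimal sketch, assuming a toy direct-mapped cache; `uid_to_string_cached` and the cache layout are hypothetical, not SLURM's implementation:]

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/* Toy direct-mapped cache for uid -> name lookups. A real
 * implementation would handle collisions/eviction more carefully;
 * this just shows that repeated lookups for the same uid hit the
 * cache instead of calling getpwuid() again. */
#define UID_CACHE_SIZE 64

struct uid_entry {
	uid_t uid;
	char name[64];
	int used;
};

static struct uid_entry uid_cache[UID_CACHE_SIZE];

const char *uid_to_string_cached(uid_t uid)
{
	struct uid_entry *e = &uid_cache[uid % UID_CACHE_SIZE];

	if (!e->used || e->uid != uid) {	/* miss: do the slow lookup once */
		struct passwd *pw = getpwuid(uid);
		e->uid = uid;
		e->used = 1;
		if (pw)
			snprintf(e->name, sizeof(e->name), "%s", pw->pw_name);
		else
			snprintf(e->name, sizeof(e->name), "%u", (unsigned) uid);
	}
	return e->name;
}
```

In the single-uid case Mark describes, this reduces 30K getpwuid() round trips to one.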
> >> >> > >> > I ran gstack on the slurmctl process while squeue was running.
> >> >> > >> > Again, I didn't see anything that looked strange:
> >> >> > >> >
> >> >> > >> > Thread 5 (Thread 0x41fc6940 (LWP 21939)):
> >> >> > >> > #0  0x00002aaaaacb9df8 in select_p_job_test () from /usr/lib64/slurm/select_cons_res.so
> >> >> > >> > #1  0x00002aaaabad8cf3 in backfill_agent () from /usr/lib64/slurm/sched_backfill.so
> >> >> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> >> > >> > Thread 4 (Thread 0x420c7940 (LWP 21941)):
> >> >> > >> > #0  0x0000003cc86cd722 in select () from /lib64/libc.so.6
> >> >> > >> > #1  0x000000000042daa2 in _slurmctld_rpc_mgr ()
> >> >> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> >> > >> > Thread 3 (Thread 0x421c8940 (LWP 21942)):
> >> >> > >> > #0  0x0000003cc8e0e898 in do_sigwait () from /lib64/libpthread.so.0
> >> >> > >> > #1  0x0000003cc8e0e93d in sigwait () from /lib64/libpthread.so.0
> >> >> > >> > #2  0x000000000042b838 in _slurmctld_signal_hand ()
> >> >> > >> > #3  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> >> > >> > #4  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> >> > >> > Thread 2 (Thread 0x422c9940 (LWP 21943)):
> >> >> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >> >> > >> > #1  0x000000000046a9c1 in slurmctld_state_save ()
> >> >> > >> > #2  0x0000003cc8e0673d in start_thread () from /lib64/libpthread.so.0
> >> >> > >> > #3  0x0000003cc86d44bd in clone () from /lib64/libc.so.6
> >> >> > >> > Thread 1 (Thread 0x2aae5df000d0 (LWP 21933)):
> >> >> > >> > #0  0x0000003cc8e0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> >> >> > >> > #1  0x0000000000449cbf in _wr_wrlock ()
> >> >> > >> > #2  0x000000000044a1d5 in lock_slurmctld ()
> >> >> > >> > #3  0x000000000042af5e in _slurmctld_background ()
> >> >> > >> > #4  0x000000000042cb8c in main ()
> >> >> > >> >
> >> >> > >> > However, I did a gstack and strace on squeue and things did look
> >> >> > >> > a bit weird.
> >> >> > >> >
> >> >> > >> > strace:
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > time(NULL) = 1320435240
> >> >> > >> > .....
> >> >> > >> > keeps on going; this might be the sorting algo?
> >> >> > >> >
> >> >> > >> > gstack:
> >> >> > >> > #0  0xffffffffff600410 in ?? ()
> >> >> > >> > #1  0x0000003cc868c40d in time () from /lib64/libc.so.6
> >> >> > >> > #2  0x000000000042608b in _get_start_time ()
> >> >> > >> > #3  0x00000000004260e1 in _sort_job_by_time_start ()
> >> >> > >> > #4  0x0000000000437958 in slurm_list_sort ()
> >> >> > >> > #5  0x00000000004232af in print_jobs_array ()
> >> >> > >> > #6  0x0000000000420234 in _get_info ()
> >> >> > >> > #7  0x00000000004205a1 in main ()
> >> >> > >> >
> >> >> > >> > squeue is pegged at 100% the entire time. I noticed that if I
> >> >> > >> > pass the param -S u to squeue, the function _sort_job_by_time_start
> >> >> > >> > is still executed.
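[Editor's note: the stack above shows time(2) being hit from inside the sort comparator on every comparison. The "precompute results" idea discussed earlier in the thread can be sketched as a decorate-sort pattern: compute each job's key once, O(n), then sort on plain integers. The types and names here are illustrative, not SLURM's, and qsort stands in for slurm_list_sort:]

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Minimal stand-ins for the job record and its precomputed key. */
struct job {
	int pending;		/* 1 if the job is still PENDING */
	time_t start_time;	/* 0 if unset */
};

struct keyed_job {
	uint32_t key;		/* computed once, before the sort */
	const struct job *job;
};

/* The comparator now only compares cached integers. */
static int cmp_key(const void *a, const void *b)
{
	uint32_t ka = ((const struct keyed_job *) a)->key;
	uint32_t kb = ((const struct keyed_job *) b)->key;
	return (ka > kb) - (ka < kb);
}

void sort_jobs_by_start_time(const struct job *jobs, struct keyed_job *out, size_t n)
{
	time_t now = time(NULL);	/* one call for the whole sort */
	for (size_t i = 0; i < n; i++) {
		const struct job *j = &jobs[i];
		out[i].job = j;
		if (j->start_time == (time_t) 0)
			out[i].key = 0xffffffff;	/* unset sorts last */
		else if (j->pending && j->start_time < now)
			out[i].key = (uint32_t) now;
		else
			out[i].key = (uint32_t) j->start_time;
	}
	qsort(out, n, sizeof(*out), cmp_key);
}
```

Even with an O(n^2) sort, time(2) is called once and the key derivation runs n times instead of up to n^2 - n times.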
> >> >> > >> >
> >> >> > >> > Some interesting slurmctld log entries:
> >> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6155478
> >> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from schedule: usec=6158333
> >> >> > >> > [2011-11-04T19:50:37] Warning: Note very large processing time from _slurmctld_background: usec=15995841
> >> >> > >> >
> >> >> > >> > [2011-11-04T19:46:01] debug:  backfill: loop taking too long, yielding locks
> >> >> > >> > [2011-11-04T19:46:01] debug2: Testing job time limits and checkpoints
> >> >> > >> > [2011-11-04T19:46:01] debug2: Performing purge of old job records
> >> >> > >> > [2011-11-04T19:46:01] Warning: Note very large processing time from _slurmctld_background: usec=9779576
> >> >> > >> >
> >> >> > >> > Thanks for all your help,
> >> >> > >> > Clay
> >> >> > >> >
> >> >> > >> > On Fri, Nov 4, 2011 at 2:16 AM, HAUTREUX Matthieu <[email protected]> wrote:
> >> >> > >> >>
> >> >> > >> >> In my understanding of the internals, squeue does not query each
> >> >> > >> >> node but only the controller. It is interesting to know that the
> >> >> > >> >> performance issue seems to be due to the EC2 partition, but it
> >> >> > >> >> requires further information to know the reason for such bad
> >> >> > >> >> performance.
> >> >> > >> >>
> >> >> > >> >> The best way to find the problem is to observe the internal states
> >> >> > >> >> of the controller while the "squeue -p aws" is running:
> >> >> > >> >> - You should first look at the consumption of the slurmctld
> >> >> > >> >> process to see if it is due to a too high complexity of
> >> >> > >> >> processing in an internal function of slurm.
> >> >> > >> >> The easiest way to get that information is by using top and
> >> >> > >> >> looking at the consumption of the slurmctld process.
> >> >> > >> >> - After that, in case you see no abnormal consumption, you should
> >> >> > >> >> try to capture the state of the slurmctld process during the
> >> >> > >> >> hang. You can do that just by using the gstack command to capture
> >> >> > >> >> the stacks of slurmctld threads and see where they are waiting.
> >> >> > >> >> If you can send the result of "gstack $(pgrep slurmctld)" it
> >> >> > >> >> could bring interesting information.
> >> >> > >> >> - In case you see abnormal consumption, the gstack can help a
> >> >> > >> >> little bit, but a more evolved tool, a profiler like for example
> >> >> > >> >> oprofile, will be necessary to see where most of the time is
> >> >> > >> >> spent in the controller.
> >> >> > >> >>
> >> >> > >> >> Do not hesitate to send the result of the 2 first tests (top,
> >> >> > >> >> gstack); it could help to find the bottleneck.
> >> >> > >> >>
> >> >> > >> >> Regards,
> >> >> > >> >> Matthieu
> >> >> > >> >>
> >> >> > >> >> Clay Teeter a écrit :
> >> >> > >> >>>
> >> >> > >> >>> I did notice something.
> >> >> > >> >>>
> >> >> > >> >>> Issuing squeue against the local partition is quite speedy, ~2s
> >> >> > >> >>> for 10K job ids. However, if I "squeue -p aws" the ec2 (aws)
> >> >> > >> >>> partition I see the 2m+ delay. Does squeue query each node in
> >> >> > >> >>> the partition?
> >> >> > >> >>>
> >> >> > >> >>> Thanks,
> >> >> > >> >>> Clay
> >> >> > >> >>>
> >> >> > >> >>> On Thu, Nov 3, 2011 at 2:42 PM, Clay Teeter <[email protected]> wrote:
> >> >> > >> >>>
> >> >> > >> >>> First, thanks for your response!
> >> >> > >> >>> Yeah, I figured that slurm shouldn't have a problem with the
> >> >> > >> >>> job count. First, some background. Our network is split into
> >> >> > >> >>> two parts: a local partition and an ec2 partition. For each
> >> >> > >> >>> task, two jobs (one dependent on the other) are submitted. The
> >> >> > >> >>> first computes a large datafile and submits it to s3; the
> >> >> > >> >>> second pulls the files from s3 and copies them to our san. In
> >> >> > >> >>> total a large job requires about 30K jobs to be submitted.
> >> >> > >> >>>
> >> >> > >> >>> I'm not sure when squeue started slowing down. Currently, via
> >> >> > >> >>> sattr (about a 2s query), there are a little more than 17000
> >> >> > >> >>> jobs running.
> >> >> > >> >>>
> >> >> > >> >>> I'm worried that because half the cluster is in a remote
> >> >> > >> >>> datacenter, latency is the problem... which, unfortunately,
> >> >> > >> >>> would be a show stopper.
> >> >> > >> >>>
> >> >> > >> >>> Version: slurm 2.3.0-2
> >> >> > >> >>>
> >> >> > >> >>> ControlMachine=master
> >> >> > >> >>> AuthType=auth/munge
> >> >> > >> >>> CacheGroups=0
> >> >> > >> >>> CryptoType=crypto/munge
> >> >> > >> >>> MaxJobCount=30000
> >> >> > >> >>> MpiDefault=none
> >> >> > >> >>> MpiParams=ports=12000-12999
> >> >> > >> >>> ProctrackType=proctrack/pgid
> >> >> > >> >>> ReturnToService=1
> >> >> > >> >>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> >> >> > >> >>> SlurmctldPort=6817
> >> >> > >> >>> SlurmdPidFile=/var/run/slurm/slurmd.pid
> >> >> > >> >>> SlurmdPort=6818
> >> >> > >> >>> SlurmdSpoolDir=/var/slurm/spool
> >> >> > >> >>> SlurmUser=slurm
> >> >> > >> >>> StateSaveLocation=/var/slurm/state
> >> >> > >> >>> SwitchType=switch/none
> >> >> > >> >>> TaskPlugin=task/none
> >> >> > >> >>> TrackWCKey=yes
> >> >> > >> >>> InactiveLimit=0
> >> >> > >> >>> KillWait=30
> >> >> > >> >>> MinJobAge=300
> >> >> > >> >>> SlurmctldTimeout=120
> >> >> > >> >>> SlurmdTimeout=300
> >> >> > >> >>> Waittime=0
> >> >> > >> >>> FastSchedule=1
> >> >> > >> >>> SchedulerType=sched/backfill
> >> >> > >> >>> SchedulerPort=7321
> >> >> > >> >>> SelectType=select/cons_res
> >> >> > >> >>> SelectTypeParameters=CR_Core
> >> >> > >> >>> AccountingStorageHost=master
> >> >> > >> >>> AccountingStorageLoc=slurm_acct_db
> >> >> > >> >>> AccountingStoragePass=
> >> >> > >> >>> AccountingStorageType=accounting_storage/mysql
> >> >> > >> >>> AccountingStorageUser=
> >> >> > >> >>> AccountingStoreJobComment=YES
> >> >> > >> >>> ClusterName=hydra
> >> >> > >> >>> JobCompHost=master
> >> >> > >> >>> JobCompLoc=slurm_acct_db
> >> >> > >> >>> JobCompPass=cluster_service
> >> >> > >> >>> JobCompType=jobcomp/mysql
> >> >> > >> >>> JobCompUser=cluster_service
> >> >> > >> >>> JobAcctGatherFrequency=30
> >> >> > >> >>> JobAcctGatherType=jobacct_gather/linux
> >> >> > >> >>> SlurmctldDebug=3
> >> >> > >> >>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> >> >> > >> >>> SlurmdDebug=3
> >> >> > >> >>> SlurmdLogFile=/var/log/slurm/slurmd.log
> >> >> > >> >>>
> >> >> > >> >>> NodeName=aws[1-12] RealMemory=22528 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
> >> >> > >> >>> NodeName=centos1 RealMemory=3000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> >> >> > >> >>>
> >> >> > >> >>> PartitionName=local Nodes=centos1 Default=YES MaxTime=INFINITE State=UP Shared=NO
> >> >> > >> >>> PartitionName=all Nodes=aws[3-12],centos1 Default=NO MaxTime=INFINITE State=UP Shared=NO
> >> >> > >> >>> PartitionName=aws Nodes=aws[2-12] Default=NO MaxTime=INFINITE State=UP Shared=NO
> >> >> > >> >>>
> >> >> > >> >>> On Thu, Nov 3, 2011 at 2:03 PM, Matthieu Hautreux <[email protected]> wrote:
> >> >> > >> >>>
> >> >> > >> >>> Hi,
> >> >> > >> >>>
> >> >> > >> >>> what is the version of SLURM that you are using? Can you send
> >> >> > >> >>> your slurm.conf for details of configuration?
> >> >> > >> >>>
> >> >> > >> >>> With 2.3.1 you should be able to manage at least 10k jobs. We
> >> >> > >> >>> made tests with 2.2.7 and backported patches of 2.3.x that
> >> >> > >> >>> showed good results with 10K jobs. What is the number of jobs
> >> >> > >> >>> for which the system starts to hang?
> >> >> > >> >>>
> >> >> > >> >>> Matthieu
> >> >> > >> >>>
> >> >> > >> >>> On 11/03/2011 07:03 PM, Clay Teeter wrote:
> >> >> > >> >>>>
> >> >> > >> >>>> Hi group!
> >> >> > >> >>>>
> >> >> > >> >>>> I'm seeing that with about 30K (or so) jobs in the queue,
> >> >> > >> >>>> squeue hangs for around 2 minutes before returning values.
> >> >> > >> >>>> Does anyone know a way around this?
> >> >> > >> >>>> I expect that this isn't standard behavior.
> >> >> > >> >>>>
> >> >> > >> >>>> Cheers,
> >> >> > >> >>>> Clay
> >>
> >> --
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
