[gridengine users] Array job running more tasks than allowed by "-tc"

2020-04-20 Thread Joshua Baker-LePain
252,30254-30259:1,30261,30262,30264-30267:1,30269-30275:1,30277,30278,30280-30284:1,30287,30288,30290-30294:1,30297,30299,30300-30304:1,30307-30314:1,30316-30319:1,30322-30325:1,30328-30330:1,30332 but obviously that shouldn't matter. Any hints as to where we should look? Thanks. -- Joshua Baker-L
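
The -tc flag caps how many tasks of an array job may run concurrently. A minimal sketch of the kind of submission under discussion (script name and counts are illustrative, not from the thread):

    # 1000 tasks, but no more than 20 running at once
    qsub -t 1-1000 -tc 20 job.sh
    qstat -u "$USER" -s r    # verify the count of running tasks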

Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Joshua Baker-LePain
ening so often that we had to change our approach, and defined a queue on each GPU node with the same number of slots as GPUs. It's a far from perfect system, but it's working for now. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___
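
A sketch of the per-node-queue workaround described above (queue name, host name, and GPU count are illustrative):

    # one queue per GPU node, with slots equal to that node's GPU count
    qconf -aq gpu0.q
    #   qname      gpu0.q
    #   hostlist   gpu-node-0
    #   slots      4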

Re: [gridengine users] Removing 1.4 BILLION tasks job array

2019-08-07 Thread Joshua Baker-LePain
files spooling, that entry will be a directory under the "jobs" directory in the spool. If the job ID is 8027327, e.g., then the directory is jobs/00/0802/7327. Stop SGE, 'rm -rf jobs/00/0802/7327', then start SGE up again and the job should be gone. -- Joshua
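
The directory name follows from a 2/4/4 split of the zero-padded job ID; a small bash sketch of the mapping described above, assuming classic file spooling:

    jid=8027327
    padded=$(printf '%010d' "$jid")                        # -> 0008027327
    echo "jobs/${padded:0:2}/${padded:2:4}/${padded:6:4}"  # -> jobs/00/0802/7327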

Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Joshua Baker-LePain
the files. The relevant bits of the job script would include (assuming bash): targzs=(0 file1.tar.gz file2.tar.gz ... file1200.tar.gz) tar xzf ${targzs[$SGE_TASK_ID]} To submit it: qsub -t 1-1200 job.sh -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF __
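
Spelled out as a complete job script (a sketch: the "..." in the snippet is generated here with a loop, and the fileN.tar.gz naming pattern is an assumption):

    #!/bin/bash
    #$ -t 1-1200
    targzs=(0)    # dummy slot 0, since SGE_TASK_ID is 1-based
    for i in $(seq 1 1200); do targzs+=("file$i.tar.gz"); done
    tar xzf "${targzs[$SGE_TASK_ID]}"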

Re: [gridengine users] Debugging a commlib error following reboot of exec host

2018-07-03 Thread Joshua Baker-LePain
at's leading to your problem. I don't see this, but I'm running on CentOS-7, which may lead to some different behavior. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Possible opportunity for development work

2018-05-14 Thread Joshua Baker-LePain
On Mon, 14 May 2018 at 10:37am, Daniel Povey wrote I wonder if deleting that execution host and adding it back again might work around your issue. This issue showed up on multiple hosts, so I don't think that would help. The issue also survived a restart of SGE. -- Joshua Baker-LePai

Re: [gridengine users] Possible opportunity for development work

2018-05-14 Thread Joshua Baker-LePain
ll the queues the job can't run in (all for legitimate reasons). It's also notable that 'qalter -w p' always said "verification: found possible assignment with 5 slots" when jobs got stuck in this state. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _

[gridengine users] Possible opportunity for development work

2018-05-13 Thread Joshua Baker-LePain
aintaining SoGE). Thanks! -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-17 Thread Joshua Baker-LePain
n the gpu complex, but that seems like it would run afoul of USE_CGROUPS for GPU jobs that want to use more CPU cores than GPUs. So, what are other people doing for GPUs? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread Joshua Baker-LePain
there are available "gpu" complex slots on hosts where there aren't, and another part of the scheduler realizes this and keeps the jobs requesting those slots from starting. But it also won't try different hosts. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread Joshua Baker-LePain
n this happens there are empty queues that are *not* disabled or alarmed. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-12 Thread Joshua Baker-LePain
On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've recently added GPU nodes to the cluster. On each GPU node, a consumable complex named 'gpu' is defined with the number of GPUs in the node. Th
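
The setup described translates to a consumable complex plus per-host values; a sketch (host name and GPU count are illustrative):

    # complex definition, one line in 'qconf -mc':
    #   name  shortcut  type  relop  requestable  consumable  default  urgency
    #   gpu   gpu       INT   <=     YES          YES         0        0
    qconf -me gpu-node-1    # then set: complex_values gpu=4
    qsub -l gpu=1 job.sh    # each running job decrements the host's count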

[gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-12 Thread Joshua Baker-LePain
ur users to set "-R y" for these jobs -- is this a reservation issue? Where else should I look for clues? Any ideas? I'm a bit flummoxed on this one... Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Issues compiling gpu-loadsensor

2018-03-01 Thread Joshua Baker-LePain
s lib into LD_LIBRARY_PATH . Bah. Thanks for the extra eyes. I had "-lnvidia-ml" in the LIBS line in the Makefile, but never noticed that the parameter doesn't get used. I added it to the compile command and out popped the binary. Thanks. -- Joshua Baker-LePain QB3 Sha
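
Roughly the fix being described: the NVML library has to make it onto the actual link command, not just sit in a Makefile variable that is never referenced. Paths here are illustrative:

    gcc -o gpu-loadsensor gpu-loadsensor.c \
        -L/usr/lib64/nvidia -lnvidia-ml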

[gridengine users] Issues compiling gpu-loadsensor

2018-02-28 Thread Joshua Baker-LePain
: error: ld returned 1 exit status make: *** [gpu-loadsensor] Error 1 The same occurs if I use nvcc and/or if I install the drivers and do '-L/usr/lib64/nvidia'. I'm using cuda-9.1, but the same issue was there with 8.0. And I should confirm that libnvi

Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-09 Thread Joshua Baker-LePain
On Fri, 9 Feb 2018 at 1:29am, William Hay wrote On Thu, Feb 08, 2018 at 03:42:03PM -0800, Joshua Baker-LePain wrote: 153758 0.51149 tomography USER1 qw 02/08/2018 14:03:05 192 153759 0.0 qss_svk_ge USER2 qw 02/08/2018 14:15:06

Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-08 Thread Joshua Baker-LePain
On Thu, 8 Feb 2018 at 3:42pm, Joshua Baker-LePain wrote Now, the relevant bits of the RQSes referenced in the above look like this: limitprojects {USER1lab,OTHERlab} queues member.q to slots=315 . limitusers {*} queues ondemand.q to slots=0 So why is it trying to give the
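
Laid out in normal 'qconf -srqs' form, the two rules quoted above look roughly like this (a reconstruction from the flattened snippet, so treat the field layout as approximate):

    {
       name    limitprojects
       enabled TRUE
       limit   projects {USER1lab,OTHERlab} queues member.q to slots=315
    }
    {
       name    limitusers
       enabled TRUE
       limit   users {*} queues ondemand.q to slots=0
    }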

Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-08 Thread Joshua Baker-LePain
On Wed, 7 Feb 2018 at 12:46am, William Hay wrote On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote: I'm back again -- is it obvious that my new cluster just went into production? Again, we're running SoGE 8.1.9 on a cluster with nodes of several different siz

Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-07 Thread Joshua Baker-LePain
ations a bug or simply a side effect of how these 2 systems operate? Thanks a bunch for getting back to me. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

[gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-06 Thread Joshua Baker-LePain
in the error is actually defined as having 0 slots. So it's not tied to the RQS. Can anyone give me some pointers on how to debug this? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

[gridengine users] Best practices for hybrid MPI/OpenMP jobs

2018-02-02 Thread Joshua Baker-LePain
and their early attempts are confusing our current setup. How are folks enabling these types of jobs in a way that SGE can keep track of? If it matters, our nodes are running CentOS-7 and the default MPI is the bundled OpenMPI-1.10.6. Thanks! -- Joshua Baker-LePain QB3 Shared Cluster Sys
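
One common pattern for this (a sketch of an approach, not the thread's settled answer): a parallel environment with a fixed allocation_rule so each MPI rank owns a whole node's worth of cores, matched by OMP_NUM_THREADS. PE name and core counts are illustrative:

    # PE 'mpi8' with allocation_rule 8: slots are handed out 8 per host
    qsub -pe mpi8 32 hybrid.sh    # 32 slots = 4 MPI ranks x 8 OpenMP threads
    # inside hybrid.sh:
    #   export OMP_NUM_THREADS=8
    #   mpirun -np $((NSLOTS / OMP_NUM_THREADS)) ./hybrid_app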

[gridengine users] Random issues with SoGE

2017-03-27 Thread Joshua Baker-LePain
the IDs bounce around in the range of 11-15. Google doesn't yield much -- any ideas? Thanks! -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Some commands not generating any output when USE_CGROUPS set

2017-03-24 Thread Joshua Baker-LePain
TE|MAP_ANONYMOUS, -1, 0) = 0x2adced199000 write(1, "qb3-id2\n", 8) = 8 exit_group(0) = ? I'm getting progressively more confused. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ user

[gridengine users] Some commands not generating any output when USE_CGROUPS set

2017-03-24 Thread Joshua Baker-LePain
s, then I get the normal output of those commands. Is this expected behavior? Or is something wonky with the cgroups here? Thanks for any insights. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] load_thresholds, load_scaling, and hyperthreading

2016-11-03 Thread Joshua Baker-LePain
rs as to why that is. I'm going to stick with the previous solution, but I'll file a bug to try to get things fixed up. Thanks again for all your help, Reuti. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing l

Re: [gridengine users] load_thresholds, load_scaling, and hyperthreading

2016-11-02 Thread Joshua Baker-LePain
't get the syntax working. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] load_thresholds, load_scaling, and hyperthreading

2016-11-02 Thread Joshua Baker-LePain
On Wed, 2 Nov 2016 at 11:13am, Reuti wrote On 02.11.2016 at 18:36, Joshua Baker-LePain wrote: On our cluster, we have three queues per host, each with as many slots as the host has physical cores. The queues are configured as follows: o lab.q (high priority queue for cluster "o

[gridengine users] load_thresholds, load_scaling, and hyperthreading

2016-11-02 Thread Joshua Baker-LePain
"load_avg=2" and/or "np_load_avg=2", but none of these configurations seem to have any effect. What am I doing wrong? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users
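
For context, the attempts described live in the queue configuration; a sketch of the relevant line ('qconf -mq <queue>'), with the values quoted in the snippet:

    load_thresholds   np_load_avg=2
    # or, using the unscaled load average:
    load_thresholds   load_avg=2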

Re: [gridengine users] 8.1.9, CSP, and CentOS 7

2016-08-11 Thread Joshua Baker-LePain
On Wed, 10 Aug 2016 at 2:37pm, Joshua Baker-LePain wrote 08/10/2016 14:01:12|listen|head7|E|commlib error: ssl accept error (ssl accept error for client "head7") 08/10/2016 14:01:12|listen|head7|E|commlib error: ssl error ([ID=d0c50a1] in module "asn1 encoding routines"

[gridengine users] 8.1.9, CSP, and CentOS 7

2016-08-10 Thread Joshua Baker-LePain
"asn1 encoding routines": "unknown message digest algorithm") I see a few other reports of this in the list archives, but no solution. Can this be made to work? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___

Re: [gridengine users] May I build hybrid SGE?

2015-12-14 Thread Joshua Baker-LePain
On Mon, 14 Dec 2015 at 8:45pm, Steven Du wrote Thank you very much! Does it mean there is no issue on the SGE master managing x64 and x32 clients? Is that right? Yes. My queue master is 64bit, my submit hosts are 32bit, and my exec nodes are both 32 and 64bit. -- Joshua Baker-LePain QB3

Re: [gridengine users] May I build hybrid SGE?

2015-12-14 Thread Joshua Baker-LePain
problem. I run such an environment. Be sure that users submit jobs with the proper "-l arch=" request so that they go to the right architecture. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine
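
The submit-side request mentioned above, sketched (architecture strings vary by SGE build; these are the common Linux ones):

    qsub -l arch=lx-amd64 job64.sh   # route to 64-bit exec hosts
    qsub -l arch=lx-x86   job32.sh   # route to 32-bit exec hosts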

Re: [gridengine users] Incorrect share tree usage for task array jobs

2015-11-18 Thread Joshua Baker-LePain
On Wed, 18 Nov 2015 at 6:48am, Mark Dixon wrote On Fri, 4 Apr 2014, Joshua Baker-LePain wrote: On Fri, 4 Apr 2014 at 8:45am, Mark Dixon wrote > I think we've been bitten by something that others have seen and brought > up on this list over the years, where the amount of usage

Re: [gridengine users] Open Grid Scheduler abandoned?

2014-08-19 Thread Joshua Baker-LePain
the same thing back in March -- you can see their response here <http://gridengine.org/pipermail/users/2014-March/007266.html>. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengi

Re: [gridengine users] wiki is back, now hopefully with far more resiliency

2014-07-25 Thread Joshua Baker-LePain
ning CentOS 6.5 on brand new networking, storage and hypervisor kit. Allow me to say a very public "Thanks!" for maintaining such a great resource. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list u

Re: [gridengine users] Fwd: qstat job info

2014-06-26 Thread Joshua Baker-LePain
qconf -msconf' and set schedd_job_info=true and is there a big overhead hit for this? There can be, and it seems to tickle a few bugs that can lead to the qmaster blowing up. Another option is to have users run 'qalter -w p' on their jobs, which gives a lot of the same informatio
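
The two diagnostics discussed, side by side (a sketch):

    qconf -msconf            # set schedd_job_info true (cluster-wide cost)
    qalter -w p <job_id>     # per-job "why is it pending" report, no global overhead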

Re: [gridengine users] Incorrect share tree usage for task array jobs

2014-04-04 Thread Joshua Baker-LePain
should always have approximately the same number of tasks running. In practice, after a while the array job user would have fewer tasks running than the individual job user. If you have any issues recreating this, let me know and I'll see if I can still do so. -- Joshua

Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Joshua Baker-LePain
ne to submit jobs with slot ranges in the PE request? I'm also a big fan of schedd_job_info, but I'm a bigger fan of my scheduler not blowing up. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine

Re: [gridengine users] Project status

2014-03-19 Thread Joshua Baker-LePain
to hear that y'all are busy and developing. Thanks again. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Project status

2014-03-18 Thread Joshua Baker-LePain
On Mon, 10 Mar 2014 at 1:58pm, Joshua Baker-LePain wrote I'm hoping to get some idea as to the status and future of OGS/GE. As background, we're a moderately sized (4000+ cores) academic cluster and have been running SGE for several years (we've been through versions 6, 6.1, 6.

[gridengine users] Project status

2014-03-10 Thread Joshua Baker-LePain
eful and/or critical of any of the developers (and I certainly don't want to get into the politics of the various forks). I'm simply trying to assess the statuses of the various projects so that I can make some plans going forward. Any and all info is much appreciated. Thanks! -

Re: [gridengine users] Parallel jobs with flexible slot requests cause huge memory use

2014-03-10 Thread Joshua Baker-LePain
On Mon, 10 Mar 2014 at 10:17am, Joshua Baker-LePain wrote I found the section of the code allocating the memory and as far as I can tell commenting it does nothing. If you look through the past emails on the list you will see me writing about it this time (almost exactly + 2 weeks) 2 years

Re: [gridengine users] Parallel jobs with flexible slot requests cause huge memory use

2014-03-10 Thread Joshua Baker-LePain
'12, but that's it. In any case, it's good to hear I'm not alone in hitting this. will send my patch for an earlier grid on Monday I'm definitely interested in hearing more about this patch. Thanks. -- Joshua Baker-LePain QB

Re: [gridengine users] Parallel jobs with flexible slot requests cause huge memory use

2014-03-03 Thread Joshua Baker-LePain
o if the queues were more full). And I'd rather not limit the throughput of jobs to get around what really smells like a bug. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

[gridengine users] Parallel jobs with flexible slot requests cause huge memory use

2014-03-03 Thread Joshua Baker-LePain
ue of N doesn't really matter, just somewhere within those limits), then the scheduler could handle the original array jobs just fine. According to the man page, the flexible slot request is valid. Has anyone else seen this before? Any idea what kind of configs could be triggering thi
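
The "flexible slot request" at issue is a PE slot range; a sketch (PE name and bounds are illustrative):

    qsub -pe mpi 4-16 job.sh   # any N with 4 <= N <= 16 is acceptable to the job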

Re: [gridengine users] (Seemingly) Random failures of OpenMPI jobs

2014-01-08 Thread Joshua Baker-LePain
On Wed, 8 Jan 2014 at 1:59am, Mark Dixon wrote On Tue, 7 Jan 2014, Joshua Baker-LePain wrote: ... We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a cluster with ~650 nodes. Spool directories are local to the nodes. Our jobs are primarily serial, but with some par

Re: [gridengine users] (Seemingly) Random failures of OpenMPI jobs

2014-01-07 Thread Joshua Baker-LePain
On Tue, 7 Jan 2014 at 3:09pm, Skylar Thompson wrote Quick question - are you limiting memory usage for the job (i.e. h_vmem)? No. We have mem_free set to consumable (and the jobs include a request), but (obviously) that's not a hard limit. -- Joshua Baker-LePain QB3 Shared Cl
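
The distinction behind the question: mem_free as a consumable only steers scheduling, while h_vmem (when configured as a limit) is actually enforced on the job. A sketch, values illustrative:

    qsub -l mem_free=4G job.sh   # scheduler bookkeeping only; nothing kills the job
    qsub -l h_vmem=4G   job.sh   # hard limit; the job dies if it exceeds it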

[gridengine users] (Seemingly) Random failures of OpenMPI jobs

2014-01-07 Thread Joshua Baker-LePain
this down? I'm a bit stumped... Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Debugging *really* long scheduling runs

2013-11-04 Thread Joshua Baker-LePain
On Fri, 1 Nov 2013 at 10:44am, Joshua Baker-LePain wrote I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic spooling to a local disk, local $SGE_ROOT (except for $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of which there are mor

Re: [gridengine users] Debugging *really* long scheduling runs

2013-11-02 Thread Joshua Baker-LePain
~9GB of RAM). And yet I can't find any job submitted around that time that looks like it would start to utterly confuse the scheduler. Just some points to help you in identifying the issues Thanks -- it's appreciated. -- Joshua Baker-LePain QB3

[gridengine users] Debugging *really* long scheduling runs

2013-11-01 Thread Joshua Baker-LePain
ng? Are there any other options I can set to get more info? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] hold_jid not working as expected

2013-04-03 Thread Joshua Baker-LePain
it -- thanks. I thought that might be the case, but it was unclear to me since 'man qsub' references the "exit code" and the qacct field is labeled "exit_status". -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF

[gridengine users] hold_jid not working as expected

2013-04-03 Thread Joshua Baker-LePain
it, the first job gets killed and qacct shows: failed 100 : assumedly after job However the second job ends up running anyway. Am I correct in thinking that it shouldn't do so? Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___
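
The dependency setup under discussion, sketched (script names illustrative); per 'man qsub', release of the hold hinges on how the predecessor finishes, which is a different field than qacct's "failed":

    qsub -N first first.sh
    qsub -hold_jid first second.sh   # released when 'first' finishes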

Re: [gridengine users] finished / aborted jobs mail spam control

2013-03-20 Thread Joshua Baker-LePain
real addresses? -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] how to restrict sge_execd to subnet

2012-06-07 Thread Joshua Baker-LePain
On Thu, 7 Jun 2012 at 12:33pm, Miro Drahos wrote can anyone advise how could I restrict the sge daemons to listen only on certain subnet? I don't know of any way to do it within SGE, but it should be easy enough with a host-based firewall (e.g. iptables on Linux). -- Joshua Baker-L
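
The host-firewall route suggested above, sketched for Linux iptables (subnet illustrative; 6444/6445 are the default sge_qmaster/sge_execd ports):

    iptables -A INPUT -p tcp --dport 6445 -s 10.0.0.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 6445 -j DROP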

Re: [gridengine users] final maxvmem of a job

2012-05-21 Thread Joshua Baker-LePain
d qacct takes a long while. The easier way, in my opinion, is to simply include 'qstat -j $JOB_ID' as the last line in your job scripts. Then the usage will get written into the standard SGE output file for each job. -- Joshua Baker-LePain QB
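
The trick described, as the tail end of a job script (a sketch):

    #!/bin/bash
    ./real_work
    qstat -j $JOB_ID   # usage, incl. maxvmem, lands in the job's stdout file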

Re: [gridengine users] Installing GE in CSP mode?

2012-05-18 Thread Joshua Baker-LePain
side that group. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Installing GE in CSP mode?

2012-05-18 Thread Joshua Baker-LePain
that I've run into as well) is SGE complaining about certificate issues despite the fact that the certificates have *not* expired. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.o

Re: [gridengine users] Installing GE in CSP mode?

2012-05-17 Thread Joshua Baker-LePain
ded. I will say that I had it happen twice relatively quickly, and it hasn't happened for a while since. This was also with 6.1. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Installing GE in CSP mode?

2012-05-14 Thread Joshua Baker-LePain
xity of it, I'm assuming that not many people use CSP mode. Does this assumption seem reasonable? I'd say so, given that the documentation is pretty incomplete and, as you say, questions are rare. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _

Re: [gridengine users] RQS and scheduler performance (max-slots-on-all-hosts)

2012-04-26 Thread Joshua Baker-LePain
On Thu, 26 Apr 2012 at 2:01pm, Stuart Barkley wrote However, with scheduler run time now < 10 seconds, I don't see need for any more performance. Turning off the specific RQS made a 90% improvement which was the big win. Do you have other RQSs, or was that your only one? -- Josh

Re: [gridengine users] RQS and scheduler performance (max-slots-on-all-hosts)

2012-04-26 Thread Joshua Baker-LePain
lling hardware for a new server on which I'll be running a much newer SGE version. I was rather hoping that this would allow me to get reservations, RQSs, and "schedd_job_info true" (since it is rather handy) all working together. Apparently not.

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-11 Thread Joshua Baker-LePain
blems in our shared-everything environment, although the main NFS server is Solaris, and most of the nodes are actually RH5. Yeah, we use NFS minimally and it really isn't anywhere in my differential. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-11 Thread Joshua Baker-LePain
e same failures. Thanks for having a look, though. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Joshua Baker-LePain
On Wed, 4 Apr 2012 at 6:33pm, Tru Huynh wrote On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote: Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial" errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
On Tue, 3 Apr 2012 at 9:36pm, Hung-Sheng Tsao (LaoTsao) Ph.D wrote is SElinux on or off? selinux is off on the nodes, but active on the queue master. That being said, as I mentioned, the queue master config (including OS) hasn't changed at all recently. -- Joshua Baker-LePain QB3 S

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
On Tue, 3 Apr 2012 at 7:43pm, Rayson Ho wrote Is it possible that some nodes have a firewall running while some don't?? Unfortunately, no. All the nodes are kickstarted from the same template. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
On Wed, 4 Apr 2012 at 12:30am, Reuti wrote On 04.04.2012 at 00:19, Joshua Baker-LePain wrote: On Wed, 4 Apr 2012 at 12:12am, Reuti wrote Are you running your jobs across more than one queue? There was an issue recently when the hostfile contains more than one queue per machine on the Open

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
n the slaves - right? Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial" errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix the MPI issue. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF __

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
On Tue, 3 Apr 2012 at 10:19pm, Reuti wrote On 03.04.2012 at 21:49, Joshua Baker-LePain wrote: error: commlib error: can't connect to service (Connection timed out) ethtool shows the correct speed for the network interface? Yes indeed -- 1000Mb/s across the board. Sometimes a job

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
*will* get a shiny new version of SGE. But getting that up, running, and well tested will take longer than the patience of those suffering these failing jobs can take. of course you need to compile openmpi with sge Oh, it most definitely is. -- Joshua Baker-LePain QB3 Shared Cluster

[gridengine users] Parallel jobs failure after OS upgrade

2012-04-03 Thread Joshua Baker-LePain
re no corresponding messages in $SGE_ROOT/spool/qmaster/messages. Does anyone have any ideas as to why I would be seeing this error (and why it would be so much more frequent after the exec node OS upgrade)? Any ideas on how to track it down? I'm admittedly at a bit of a loss here.

Re: [gridengine users] sge_schedd exhausts all memory

2011-10-25 Thread Joshua Baker-LePain
Good luck. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

Re: [gridengine users] "Packing" jobs on nodes

2011-05-19 Thread Joshua Baker-LePain
On Thu, 19 May 2011 at 6:03pm, Dave Love wrote Joshua Baker-LePain writes: All you need to avoid over-subscription, whatever the queue defs, is $ qconf -srqs host-slots { name host-slots description "restrict slots to core count" enabled TRUE limitho
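
Reconstructed past the snippet's truncation at "limitho", the RQS quoted above would look roughly like this; the limit line is inferred from the follow-up messages in this thread, so treat it as a best guess:

    {
       name         host-slots
       description  "restrict slots to core count"
       enabled      TRUE
       limit        hosts {*} to slots=$num_proc
    }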

Re: [gridengine users] "Packing" jobs on nodes

2011-05-19 Thread Joshua Baker-LePain
On Wed, 18 May 2011 at 8:30pm, Hung-Sheng Tsao (Lao Tsao) Ph.D. wrote if num_proc = number of cores, maybe with HT you just do limit hosts {*} to slots=$num_proc*2 I think it should be $num_proc/2, but yeah, that would work. Thanks. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin

Re: [gridengine users] "Packing" jobs on nodes

2011-05-18 Thread Joshua Baker-LePain
ng the default slots definition If you define over-subscription as more jobs than *physical* cores, won't the above RQS fail to prevent it on nodes with hyperthreading active? -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing

Re: [gridengine users] Berkeley DB (was building RHEL5)

2011-04-11 Thread Joshua Baker-LePain
n that the BDB RPC server is not a good option, spooling over NFS is required if you want to run a shadow master (which I do). -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailma

Re: [gridengine users] Berkeley DB (was building RHEL5)

2011-04-11 Thread Joshua Baker-LePain
moving our install to BDB. Are there other tunables I should be looking at? -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users