but obviously that shouldn't matter. Any hints as to where we should
look? Thanks.
--
Joshua Baker-LePain
ening so often that we had to change our
approach, and defined a queue on each GPU node with the same
number of slots as GPUs. It's a far from perfect system, but it's working
for now.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
files spooling, that entry will be a directory under the
"jobs" directory in the spool. If the job ID is 8027327, e.g., then the
directory is jobs/00/0802/7327. Stop SGE, 'rm -rf jobs/00/0802/7327',
then start SGE up again and the job should be gone.
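For reference, the cleanup boils down to something like this (a sketch only -- it assumes classic spooling in the default location and the example job ID 8027327 from above; init script names and spool paths vary by install):
# stop the qmaster (script name varies; may be plain "sgemaster" or a systemd unit)
/etc/init.d/sgemaster.$SGE_CELL stop
# remove the stuck job's spool entry (job 8027327 splits into 00/0802/7327)
rm -rf $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs/00/0802/7327
# start the qmaster again; the job should no longer show up in qstat
/etc/init.d/sgemaster.$SGE_CELL start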
--
Joshua
the files.
The relevant bits of the job script would include (assuming bash):
targzs=(0 file1.tar.gz file2.tar.gz ... file1200.tar.gz)
tar xzf ${targzs[$SGE_TASK_ID]}
To submit it:
qsub -t 1-1200 job.sh
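If typing out all 1200 names is impractical, a variation is to build the array from a glob inside the job script (a sketch, assuming the tarballs sit in the job's working directory and glob order is the order you want):
targzs=(0 *.tar.gz)                  # element 0 is a dummy; $SGE_TASK_ID starts at 1
tar xzf "${targzs[$SGE_TASK_ID]}"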
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
__
at's leading to your problem. I don't see this, but I'm running on
CentOS-7, which may lead to some different behavior.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Mon, 14 May 2018 at 10:37am, Daniel Povey wrote
I wonder if deleting that execution host and adding it back again
might work around your issue.
This issue showed up on multiple hosts, so I don't think that would help.
The issue also survived a restart of SGE.
--
Joshua Baker-LePain
ll the queues the job can't run in (all for
legitimate reasons). It's also notable that 'qalter -w p' always said
"verification: found possible assignment with 5 slots" when jobs got stuck
in this state.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_
aintaining
SoGE). Thanks!
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
n
the gpu complex, but that seems like it would run afoul of USE_CGROUPS for
GPU jobs that want to use more CPU cores than GPUs.
So, what are other people doing for GPUs? Thanks.
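For reference, a per-host consumable along the lines described in this thread is usually set up roughly like this (a sketch only; the host name and GPU count are examples):
# add a consumable complex via 'qconf -mc':
#   gpu   gpu   INT   <=   YES   YES   0   0
# set the capacity on each GPU node, e.g. via 'qconf -me gpu-node-01':
#   complex_values   gpu=4
# jobs then request GPUs with:
qsub -l gpu=1 job.sh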
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing
there are available "gpu" complex slots on hosts where there aren't, and
another part of the scheduler realizes this and keeps the jobs requesting
those slots from starting. But it also won't try different hosts.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
n
this happens there are empty queues that are *not* disabled or alarmed.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote
We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've recently
added GPU nodes to the cluster. On each GPU node, a consumable complex named
'gpu' is defined with the number of GPUs in the node. Th
ur users to set "-R y" for these
jobs -- is this a reservation issue? Where else should I look for clues?
Any ideas? I'm a bit flummoxed on this one...
Thanks.
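For what it's worth, the pieces involved in reservation are roughly these (a sketch; the numbers and PE name are examples):
qconf -msconf                      # set e.g.:  max_reservation  32
qsub -R y -pe mpi 64 big_job.sh    # ask the scheduler to reserve slots for this job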
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
s lib into LD_LIBRARY_PATH .
Bah. Thanks for the extra eyes. I had "-lnvidia-ml" in the LIBS line in
the Makefile, but never noticed that the parameter doesn't get used. I
added it to the compile command and out popped the binary. Thanks.
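For anyone else hitting this, the working compile line looks roughly like the following (a sketch; the source file name and include/lib paths are examples for a typical CUDA install):
gcc -o gpu-loadsensor gpu-loadsensor.c \
    -I/usr/local/cuda/include -L/usr/lib64/nvidia -lnvidia-ml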
--
Joshua Baker-LePain
QB3 Sha
: error: ld returned 1 exit status
make: *** [gpu-loadsensor] Error 1
The same occurs if I use nvcc and/or if I install the drivers and do
'-L/usr/lib64/nvidia'. I'm using cuda-9.1, but the same issue was there
with 8.0. And I should confirm that libnvi
On Fri, 9 Feb 2018 at 1:29am, William Hay wrote
On Thu, Feb 08, 2018 at 03:42:03PM -0800, Joshua Baker-LePain wrote:
153758 0.51149 tomography   USER1   qw   02/08/2018 14:03:05       192
153759 0.0     qss_svk_ge   USER2   qw   02/08/2018 14:15:06
On Thu, 8 Feb 2018 at 3:42pm, Joshua Baker-LePain wrote
Now, the relevant bits of the RQSes referenced in the above look like this:
limit projects {USER1lab,OTHERlab} queues member.q to slots=315
.
limit users {*} queues ondemand.q to slots=0
So why is it trying to give the
On Wed, 7 Feb 2018 at 12:46am, William Hay wrote
On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote:
I'm back again -- is it obvious that my new cluster just went into
production? Again, we're running SoGE 8.1.9 on a cluster with nodes of
several different siz
ations a bug or simply a
side effect of how these 2 systems operate?
Thanks a bunch for getting back to me.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
in the error is actually defined as having 0 slots. So it's not
tied to the RQS.
Can anyone give me some pointers on how to debug this? Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
and their
early attempts are confusing our current setup. How are folks enabling
these types of jobs in a way that SGE can keep track of?
If it matters, our nodes are running CentOS-7 and the default MPI is
the bundled OpenMPI-1.10.6. Thanks!
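For comparison, the tight-integration PE usually paired with SGE-aware Open MPI looks something like this (a sketch; the PE name 'ompi' and slot count are examples):
$ qconf -sp ompi
pe_name            ompi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE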
--
Joshua Baker-LePain
QB3 Shared Cluster Sys
the IDs bounce
around in the range of 11-15. Google doesn't yield much -- any ideas?
Thanks!
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
TE|MAP_ANONYMOUS, -1, 0) = 0x2adced199000
write(1, "qb3-id2\n", 8) = 8
exit_group(0) = ?
I'm getting progressively more confused.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
user
s, then I get the normal output of those commands.
Is this expected behavior? Or is something wonky with the cgroups here?
Thanks for any insights.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
rs as to why that is. I'm going to
stick with the previous solution, but I'll file a bug to try to get things
fixed up.
Thanks again for all your help, Reuti.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing l
't get the syntax working.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Wed, 2 Nov 2016 at 11:13am, Reuti wrote
On 02.11.2016 at 18:36, Joshua Baker-LePain wrote:
On our cluster, we have three queues per host, each with as many slots
as the host has physical cores. The queues are configured as follows:
o lab.q (high priority queue for cluster "o
"load_avg=2" and/or "np_load_avg=2", but none of these configurations seem
to have any effect. What am I doing wrong?
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Wed, 10 Aug 2016 at 2:37pm, Joshua Baker-LePain wrote
08/10/2016 14:01:12|listen|head7|E|commlib error: ssl accept error (ssl
accept error for client "head7")
08/10/2016 14:01:12|listen|head7|E|commlib error: ssl error ([ID=d0c50a1] in
module "asn1 encoding routines&quo
"asn1
encoding routines": "unknown message digest algorithm")
I see a few other reports of this in the list archives, but no solution.
Can this be made to work? Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
On Mon, 14 Dec 2015 at 8:45pm, Steven Du wrote
Thank you very much!
Does that mean there is no issue with the SGE master managing x64 and x32
clients? Is that right?
Yes. My queue master is 64bit, my submit hosts are 32bit, and my exec
nodes are both 32 and 64bit.
--
Joshua Baker-LePain
QB3
problem. I run such an environment. Be sure that
users submit jobs with the proper "-l arch=" request so that they go to
the right architecture.
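A concrete example (the arch strings below are the usual SoGE values for Linux; older releases report e.g. lx26-amd64/lx26-x86, so check what 'qhost' shows on your cluster):
qsub -l arch=lx-amd64 job_64bit.sh    # 64-bit Linux nodes only
qsub -l arch=lx-x86   job_32bit.sh    # 32-bit Linux nodes only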
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine
On Wed, 18 Nov 2015 at 6:48am, Mark Dixon wrote
On Fri, 4 Apr 2014, Joshua Baker-LePain wrote:
On Fri, 4 Apr 2014 at 8:45am, Mark Dixon wrote
> I think we've been bitten by something that others have seen and brought
> up on this list over the years, where the amount of usage
the same thing back in March -- you can see their response here
<http://gridengine.org/pipermail/users/2014-March/007266.html>.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengi
ning CentOS 6.5 on brand new
networking, storage and hypervisor kit.
Allow me to say a very public "Thanks!" for maintaining such a great
resource.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
u
'qconf -msconf' and set schedd_job_info=true
and is there a big overhead hit from this?
There can be, and it seems to tickle a few bugs that can lead to the
qmaster blowing up. Another option is to have users run 'qalter -w p' on
their jobs, which gives a lot of the same informatio
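In other words, the two options boil down to something like this (the job ID is an example):
# option 1: global scheduler diagnostics (can be costly on a busy qmaster)
qconf -msconf              # set:  schedd_job_info  true
qstat -j 1234567           # then shows a "scheduling info:" section
# option 2: per-job and on demand, with schedd_job_info left false
qalter -w p 1234567        # reports whether/where the job could be scheduled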
should always have approximately the same
number of tasks running. In practice, after a while the array job user
would have fewer tasks running than the individual job user.
If you have any issues recreating this, let me know and I'll see if I can
still do so.
--
Joshua
ne to submit jobs with slot ranges in the PE request?
I'm also a big fan of schedd_job_info, but I'm a bigger fan of my
scheduler not blowing up.
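By slot ranges I mean submissions like the following (a sketch, with an example PE name), where the scheduler is free to pick any slot count in the range:
qsub -pe ompi 4-16 job.sh    # anywhere from 4 to 16 slots
qsub -pe ompi 8- job.sh      # at least 8 slots, as many as available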
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine
to hear that y'all are busy and developing. Thanks again.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Mon, 10 Mar 2014 at 1:58pm, Joshua Baker-LePain wrote
I'm hoping to get some idea as to the status and future of OGS/GE. As
background, we're a moderately sized (4000+ cores) academic cluster and have
been running SGE for several years (we've been through versions 6, 6.1, 6.
eful and/or critical of any of the
developers (and I certainly don't want to get into the politics of the
various forks). I'm simply trying to assess the statuses of the various
projects so that I can make some plans going forward. Any and all info is
much appreciated. Thanks!
-
On Mon, 10 Mar 2014 at 10:17am, Joshua Baker-LePain wrote
I found the section of the code allocating the memory and as far as I can
tell commenting it does nothing. If you look through the past emails on
the
list you will see me writing about it this time (almost exactly + 2 weeks)
2 years
'12, but
that's it. In any case, it's good to hear I'm not alone in hitting this.
will send my patch for an earlier grid on monday
I'm definitely interested in hearing more about this patch. Thanks.
--
Joshua Baker-LePain
QB
o if the queues were more full). And
I'd rather not limit the throughput of jobs to get around what really
smells like a bug.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
ue of N doesn't really matter, just somewhere within those limits),
then the scheduler could handle the original array jobs just fine.
According to the man page, the flexible slot request is valid. Has anyone
else seen this before? Any idea what kind of configs could be triggering
thi
On Wed, 8 Jan 2014 at 1:59am, Mark Dixon wrote
On Tue, 7 Jan 2014, Joshua Baker-LePain wrote:
...
We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a
cluster with ~650 nodes. Spool directories are local to the nodes. Our
jobs are primarily serial, but with some par
On Tue, 7 Jan 2014 at 3:09pm, Skylar Thompson wrote
Quick question - are you limiting memory usage for the job (i.e. h_vmem)?
No. We have mem_free set to consumable (and the jobs include a request),
but (obviously) that's not a hard limit.
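For reference, the setup being described is roughly this (a sketch; the 4G request is an example):
# in 'qconf -mc', mem_free is marked consumable (CONSUMABLE column = YES):
#   mem_free   mf   MEMORY   <=   YES   YES   0   0
# each job then carries a request the scheduler books against the host:
qsub -l mem_free=4G job.sh
# this only reserves the amount at schedule time -- unlike h_vmem, nothing
# enforces it while the job runs.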
--
Joshua Baker-LePain
QB3 Shared Cl
this down? I'm a bit stumped...
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Fri, 1 Nov 2013 at 10:44am, Joshua Baker-LePain wrote
I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic
spooling to a local disk, local $SGE_ROOT (except for
$SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of
which there are mor
~9GB of RAM). And yet I
can't find any job submitted around that time that looks like it would
start to utterly confuse the scheduler.
Just some points to help you in identifying the issues
Thanks -- it's appreciated.
--
Joshua Baker-LePain
QB3
ng? Are there any other options I can set
to get more info?
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
it -- thanks. I thought that might be the case, but it was unclear to
me since 'man qsub' references the "exit code" and the qacct field is
labeled "exit_status".
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
it, the first job gets killed and qacct shows:
failed 100 : assumedly after job
However the second job ends up running anyway. Am I correct in thinking
that it shouldn't do so?
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
real addresses?
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Thu, 7 Jun 2012 at 12:33pm, Miro Drahos wrote
can anyone advise how could I restrict the sge daemons to listen only on
certain subnet?
I don't know of any way to do it within SGE, but it should be easy enough
with a host-based firewall (e.g. iptables on Linux).
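A minimal sketch of the firewall approach (assuming the default sge_qmaster/sge_execd ports 6444/6445 and an example cluster subnet of 10.0.0.0/24):
# allow SGE traffic from the cluster subnet only, drop it from everywhere else
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 6444:6445 -j ACCEPT
iptables -A INPUT -p tcp --dport 6444:6445 -j DROP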
--
Joshua Baker-LePain
d qacct takes a long
while. The easier way, in my opinion, is to simply include 'qstat -j
$JOB_ID' as the last line in your job scripts. Then the usage will get
written into the standard SGE output file for each job.
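That is, something along these lines at the bottom of the job script (a sketch; the payload command is hypothetical):
#!/bin/bash
#$ -cwd
./my_analysis input.dat      # hypothetical payload
qstat -j $JOB_ID             # usage summary lands in the job's standard output file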
--
Joshua Baker-LePain
QB
side that group.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
that I've run into as well) is SGE complaining about certificate issues
despite the fact that the certificates have *not* expired.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.o
ded. I will say that I had it happen
twice relatively quickly, and it hasn't happened for a while since. This
was also with 6.1.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
xity of it, I'm assuming that not many people use CSP mode.
Does this assumption seem reasonable?
I'd say so, given that the documentation is pretty incomplete and, as you
say, questions are rare.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_
On Thu, 26 Apr 2012 at 2:01pm, Stuart Barkley wrote
However, with scheduler run time now < 10 seconds, I don't see need
for any more performance. Turning off the specific RQS made a 90%
improvement which was the big win.
Do you have other RQSs, or was that your only one?
--
Josh
lling hardware for a new server on which I'll be
running a much newer SGE version. I was rather hoping that this would
allow me to get reservations, RQSs, and "schedd_job_info true" (since it
is rather handy) all working together. Apparently not.
blems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.
Yeah, we use NFS minimally and it really isn't anywhere in my differential.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
e same failures. Thanks for having a
look, though.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Wed, 4 Apr 2012 at 6:33pm, Tru Huynh wrote
On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote:
Yes. We have the SGE commlib errors, and the Open MPI
"routed:binomial" errors. I'm mainly focusing on the SGE problem
right now, as I think (hope) that fixing
On Tue, 3 Apr 2012 at 9:36pm, Hung-Sheng Tsao (LaoTsao) Ph.D wrote
is SElinux on or off?
selinux is off on the nodes, but active on the queue master. That being
said, as I mentioned, the queue master config (including OS) hasn't
changed at all recently.
--
Joshua Baker-LePain
QB3 S
On Tue, 3 Apr 2012 at 7:43pm, Rayson Ho wrote
Is it possible that some nodes have a firewall running while some don't??
Unfortunately, no. All the nodes are kickstarted from the same template.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
On Wed, 4 Apr 2012 at 12:30am, Reuti wrote
On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:
On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
Are you running your jobs across more than one queue? There was an
issue recently when the hostfile contains more than one queue per
machine on the Open
n the slaves - right?
Yes. We have the SGE commlib errors, and the Open MPI "routed:binomial"
errors. I'm mainly focusing on the SGE problem right now, as I think
(hope) that fixing that will also fix the MPI issue.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
__
On Tue, 3 Apr 2012 at 10:19pm, Reuti wrote
On 03.04.2012 at 21:49, Joshua Baker-LePain wrote:
error: commlib error: can't connect to service (Connection timed out)
ethtool shows the correct speed for the network interface?
Yes indeed -- 1000Mb/s across the board.
Sometimes a job
*will* get a shiny new version of
SGE. But getting that up, running, and well tested will take longer than
the patience of those suffering these failing jobs can take.
of course you need to compile openmpi with sge
Oh, it most definitely is.
--
Joshua Baker-LePain
QB3 Shared Cluster
re no
corresponding messages in $SGE_ROOT/spool/qmaster/messages.
Does anyone have any ideas as to why I would be seeing this error (and why
it would be so much more frequent after the exec node OS upgrade)? Any
ideas on how to track it down? I'm admittedly at a bit of a loss here.
Good luck.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
On Thu, 19 May 2011 at 6:03pm, Dave Love wrote
Joshua Baker-LePain writes:
All you need to avoid over-subscription, whatever the queue defs, is
$ qconf -srqs host-slots
{
name host-slots
description "restrict slots to core count"
enabled TRUE
limit hosts {*} to slots=$num_proc
}
On Wed, 18 May 2011 at 8:30pm, Hung-ShengTsao (Lao Tsao) Ph.D. wrote
if num_proc = number of cores, maybe with HT you just do
limit hosts {*} to slots=$num_proc*2
I think it should be $num_proc/2, but yeah, that would work. Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
ng the default slots definition
If you define over-subscription as more jobs than *physical* cores, won't
the above RQS fail to prevent it on nodes with hyperthreading active?
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing
n that the BDB RPC server is not a good option, spooling over
NFS is required if you want to run a shadow master (which I do).
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailma
moving our install to BDB. Are there other tunables I should be
looking at?
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users