In the message dated: Fri, 07 Aug 2020 16:24:16 -,
The pithy ruminations from Ondrej Valousek on
[Re: [gridengine users] m_mem_free and cgroups] were:
=> Short answer: Use a different tool than stress.
=> Long answer: the Linux kernel is too clever for tests like stress, because allocating a memory
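A sketch of the kind of alternative being hinted at (not from the thread): under Linux overcommit, memory that is merely allocated but never written does not consume real pages, so a memory test has to dirty the pages it claims. Piping zeroes into `tail` forces the whole stream to be buffered in RAM:

```shell
# Hold ~256 MiB of memory that has actually been written (dirtied).
# tail buffers the entire newline-free stream before printing it, so,
# unlike a bare malloc, these pages count against cgroup memory limits.
head -c $((256*1024*1024)) /dev/zero | tail >/dev/null
```

The size is arbitrary; scale it to whatever limit (m_mem_free, cgroup) is being exercised.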
In the message dated: Fri, 18 Oct 2019 15:34:02 -,
The pithy ruminations from WALLIS Michael on
[[gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened] were:
=> Hi folks,
=>
=> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
at the start of the
In the message dated: Fri, 27 Sep 2019 23:32:43 +0200,
The pithy ruminations from Reuti on
[Re: [gridengine users] jobs stuck in transitioning state] were:
=> Hi,
=>
=> Am 27.09.2019 um 22:21 schrieb berg...@merctech.com:
=>
=> > We're having a problem with submit scripts not being transferred
We're having a problem with submit scripts not being transferred to exec
nodes and jobs being stuck in the [t]ransitioning state.
The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7.
We are using classic spooling. On the compute nodes, the spool directory
Our SoGE (8.1.6) configuration has essentially two queues: one for "all"
jobs and one for "short jobs". The all.q is subordinate to the short.q,
and short jobs can suspend a job in the general queue. At the moment, the
all.q has nodes with & without GPU resources (not ideal, not permanent,
==\n" ; nvidia-smi ; printf "\n\nSGE status on this node:\n===\n" ;
  qstat -u \* -s rs -l hostname=`hostname` ) 2>&1 |
  Mail -s "unexpected: no free GPUs on `hostname -s`" root
exit 1
fi
echo $free
exit 0
#######
We're setting up SoGE (8.1.6) to establish qlogin sessions to interactive
nodes running CentOS7, with SELinux enabled.
A long-standing issue with qlogin is that it launches the listening ssh
daemon on random ports. The question about restricting qlogin to use
a specific range of ports for ssh has
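Not an answer from the archive, but for context: SoGE 8.x can carry interactive sessions over its builtin communication layer instead of spawning a per-session sshd on a random port, which sidesteps both the port-range question and the SELinux sshd policy. A sketch of the relevant global entries (`qconf -mconf`), assuming the builtin method's limitations (e.g. around X forwarding) are acceptable:

```
qlogin_command   builtin
qlogin_daemon    builtin
rlogin_command   builtin
rlogin_daemon    builtin
rsh_command      builtin
rsh_daemon       builtin
```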
We're using SoGE 8.1.6 in an environment where users may login to the
cluster from a Linux workstation (typically using a lower-case login
name) or a Windows desktop, where their login name (as supplied by the
enterprise Active Directory) is usually mixed-case.
On the cluster, we've created two
In the message dated: Tue, 02 Jan 2018 09:11:51 +,
The pithy ruminations from William Hay on
were:
=> On Fri, Dec 22, 2017 at 05:55:26PM -0500, berg...@merctech.com wrote:
=> > True, but even with that info, there doesn't seem to be any universal
=> > way to tell an arbitrary GPU job which
In the message dated: Thu, 21 Dec 2017 23:17:52 +0100,
The pithy ruminations from Reuti on
were:
=> Hi,
=>
=> Am 21.12.2017 um 22:46 schrieb berg...@merctech.com:
=>
=> >
=> > I'm considering changing "gpu" to an INT (set to the number of GPUs/node),
=> > making it a consumable resource, and
In our cluster, we've got several different types of GPUs.
Some jobs simply need any GPU, while others require a specific type.
Previously, we had "gpu" declared as a BOOLEAN attribute on each GPU-node
and had the GPU type (i.e., TITANX, P100, etc.) declared as an INT attribute
with the count of
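A sketch of the consumable layout being considered, in `qconf -sc` format (shortcut names, defaults, and the RESTRING type for the GPU model are assumptions, not from the thread):

```
#name      shortcut  type      relop  requestable  consumable  default  urgency
gpu        gpu       INT       <=     YES          YES         0        0
gpu_type   gput      RESTRING  ==     YES          NO          NONE     0
```

Each GPU node would then carry something like `complex_values gpu=4,gpu_type=P100` (`qconf -me <node>`), so jobs request `-l gpu=1` for any GPU, or add `-l gpu_type=P100` for a specific model.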
In the message dated: Fri, 01 Dec 2017 11:41:11 +0100,
The pithy ruminations from Reuti on
were:
=> Hi,
=>
=> >
=> > And is it possible to have the load of the gpus ?
=>
=> Sure. But you will have to set up a load_sensor specific for your
=> GPUs. And it would only allow a feedback to see
Our cluster is used in academic research on medical imaging. Our research
group writes and distributes software for use on clusters & desktops. As
part of determining what OSs (and for Linux distributions, GLIBC release
levels) to target in our software, I would highly appreciate your feedback
Some of our compute nodes have multiple versions of specific resources and
I'm looking for an easy way to let users match their job requirements
against the per-node list.
For example software package "foobar" might exist on the following nodes
with these versions:
node_a:
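One way to express per-node version lists (a sketch; the complex name, the version values, and reliance on RESTRING pattern matching per the sge_types expression syntax are all assumptions):

```
# hypothetical complex entry (qconf -sc):
#name        shortcut  type      relop  requestable  consumable  default  urgency
foobar_ver   fbv       RESTRING  ==     YES          NO          NONE     0

# per-node value listing the installed versions (qconf -me node_a):
complex_values   foobar_ver=4.2:5.1

# job that matches any node with some 4.x version installed:
qsub -l foobar_ver="*4.*" job.sh
```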
In the message dated: Tue, 21 Jun 2016 10:09:11 -0400,
The pithy ruminations from Jesse Becker on
were:
=> On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
=> >On Tue, Jun 21, 2016 at 08:16:25AM +, Yuri Burmachenko wrote:
=> >>
=> >>We have noticed that our sge_qmaster
We're running SoGE 8.1.6 under CentOS6 and had successfully been using
a JSV to set core binding. This has been extremely helpful in
controlling some very 'greedy' multi-threaded processes.
Recently, I've noticed that the core binding is no longer working.
There have been no related changes to
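For reference, a minimal JSV of the kind described might look like this (a sketch only: the `binding_*` pseudo-parameter names follow sge_jsv(5), and the include path is the one shipped with SGE/SoGE):

```shell
#!/bin/sh
# Hypothetical JSV sketch: force linear core binding on every job.
# Assumes the shell JSV library shipped under $SGE_ROOT.
. $SGE_ROOT/util/resources/jsv/jsv_include.sh

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   # bind each job to as many cores as it has slots (assumes an smp-style PE)
   slots=`jsv_get_param pe_max`
   [ -z "$slots" ] && slots=1
   jsv_set_param binding_strategy "linear_automatic"
   jsv_set_param binding_type     "set"
   jsv_set_param binding_amount   "$slots"
   jsv_correct "Job was modified to add core binding"
   return
}

jsv_main
```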
In the message dated: Thu, 25 Feb 2016 13:38:28 -0800,
The pithy ruminations from Skylar Thompson on
were:
=> We have pipelines that are driven by a qsub at the end of a batch script.
We have several pipelines like that. It's not uncommon to have the initial job
submit 5~20 other jobs.
I don't
Is anyone monitoring cluster utilization with a higher-level view than
simply job (qacct) statistics and CPU-seconds used/available?
I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
200~350K jobs/month and I'm seeking ways to understand the utilization &
resource constraints in
On December 19, 2014 6:19:58 AM EST, Reuti re...@staff.uni-marburg.de wrote:
= Am 18.12.2014 um 22:21 schrieb berg...@merctech.com:
=
= We've got a job that was suspended via:
=
= qmod -sj $jobid
=
= that's continuing to run. The job consists of a BASH script, which
= in
= turn
We've got a job that was suspended via:
qmod -sj $jobid
that's continuing to run. The job consists of a BASH script, which in
turn submits other jobs in a loop, sleeping for 30 seconds after each loop.
When I examine the job status on the node where it is executing via:
ps -e f
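A quick way to check from the node whether the job's processes were actually stopped (a sketch, not from the thread; `$JOB_PID` is assumed to hold the PID of the job script): suspended processes show state `T` in ps, while anything still in `R`/`S` never received, or is ignoring, the SIGSTOP.

```shell
# show PID, state, and command for the job script and its direct children;
# a properly suspended job should show "T" in the STAT column for every process
ps -o pid,stat,cmd --ppid "$JOB_PID" -p "$JOB_PID"
```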
We're running SoGE 8.1.6, and I wanted to understand how SoGE manages
CPU resources for jobs that are both multi-threaded and MPI-parallel.
We have slots configured as a consumable resource, with the number of
slots per-node equal to the number of CPU-cores.
We use OpenMPI with tight SGE
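For reference, a tightly-integrated OpenMPI PE usually looks something like this, in `qconf -sp` format (the slot total is an assumption; the key fields are `control_slaves TRUE`, so remote ranks are started via qrsh and accounted per job, and `job_is_first_task FALSE`, since mpirun itself does no computation):

```
pe_name            orte
slots              1400
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
```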
= ___
= users mailing list
= users@gridengine.org
= https://gridengine.org/mailman/listinfo/users
--
Mark Bergman
In the message dated: Wed, 02 Apr 2014 16:19:14 -0400,
The pithy ruminations from berg...@merctech.com on
[gridengine users] SGE starter_method breaks OpenMPI were:
=
= We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a
= 'starter_method' that's affecting OpenMPI jobs.
In the message dated: Thu, 17 Apr 2014 09:55:42 -0700,
The pithy ruminations from Joseph Farran on
[gridengine users] The state of a queue were:
= Howdy.
=
= I am able to disable / enable a queue @ a compute node with:
=
= $ qmod -d bio@compute-1-1
= me@sys changed state of
as if the
starter_method is somehow corrupting the environment passed to mpirun.
For example, I submitted a job when I was in the directory
/lab/home/bergman/sge_job_output.
The starter_method script reports:
Debugging: About to:
exec OPAL_PREFIX=/lab/bin/openmpi/sge; export OPAL_PREFIX
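The debug output above suggests the starter is trying to exec the variable assignment itself. A starter_method receives the job command as its arguments, so the usual shape is to set up the environment first and then exec `"$@"` untouched (a sketch only; the OPAL_PREFIX path is taken from the output above):

```shell
#!/bin/sh
# starter_method sketch: export the environment first, then exec the
# job command (passed in as "$@") so mpirun inherits it unchanged
OPAL_PREFIX=/lab/bin/openmpi/sge
export OPAL_PREFIX
exec "$@"
```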
In the message dated: Mon, 27 Jan 2014 17:50:58 +0100,
The pithy ruminations from Reuti on
Re: [gridengine users] Using modules form the compute nodes were:
= Hi,
=
= Am 27.01.2014 um 17:26 schrieb Txema Heredia:
=
= I have been trying to use modulefiles from my compute nodes with no avail.
--
Mark Bergman    Biker, Rock Climber, Unix mechanic, IATSE #1 Stagehand
'94 Yamaha GTS1000A^2
berg...@panix.com
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40panix.com
I want a newsgroup with an infinite S/N ratio! Now taking CFV
=
=
= --
= Dr Chris Jewell
= Lecturer in Biostatistics
= Institute of Fundamental Sciences
= Massey University
= Private Bag 11222
= Palmerston North 4442
= New Zealand
= Tel: +64 (0) 6 350 5701 Extn: 3586
=
=
In the message dated: Thu, 01 Aug 2013 03:09:56 -,
The pithy ruminations from Jewell, Chris on
[gridengine users] Random queue errors, and suspect pe_hostfiles were:
= Hello all,
=
= A while since I posted here, so good to be back!
=
= My installation of GE 8.1.3 from the Scientific Linux
I'm looking for suggestions for dealing with h_vmem requirements for
multi-slot jobs.
We use memory as a consumable and a required complex.
I understand that SGE multiplies the h_vmem request by the number of slots
in order to determine the job memory requirement.
In our environment, there are
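For concreteness, a worked sketch (the numbers are invented): since the h_vmem request is multiplied by the slot count, a 4-slot job that needs 16 GiB in total should request the per-slot share, not the total.

```shell
# h_vmem is a per-slot limit: per-slot request = total job memory / slots
total_mb=16384
slots=4
per_slot_mb=$(( total_mb / slots ))
echo "qsub -pe smp $slots -l h_vmem=${per_slot_mb}M job.sh"
```

Requesting the full 16 GiB per slot instead would make the scheduler look for 64 GiB, which is the usual way multi-slot jobs end up unschedulable.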
We're running SGE 6.2u5 and we've got a short queue, which assigns a
higher priority but imposes run-time and CPU-time limits.
We also have a short-on-interactive queue to allow short jobs to
run on a subset of the slots on our interactive nodes, with the goal
of allowing short, high-priority
I recently reconfigured our SGE (6.2u5) environment to better handle MPI jobs
on a heterogeneous cluster. This seems to have caused a problem with the
threaded (SMP) PE.
Our PEs are:
qconf -spl
make(unused)
openmpi-AMD
In the message dated: Fri, 11 Jan 2013 23:45:05 +0100,
The pithy ruminations from Reuti on
Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)
were:
= Am 11.01.2013 um 23:16 schrieb berg...@merctech.com:
=
=
[SNIP!]
=
= qconf -sq all.q | grep pe_list
In the message dated: Tue, 01 Jan 2013 17:38:23 +0100,
The pithy ruminations from Reuti on
Re: [gridengine users] How prevent abnormal nodes load using qsub? were:
= Am 31.12.2012 um 09:34 schrieb Semi:
=
= The memory is not a problem, the problem CPU load,
= every python process runs 2
\
--mca btl_base_verbose 30 \
--mca routed direct \
--prefix $OPENMPI -np $NSLOTS ~/hello_openmpi
- STDOUT from ~/hello_openmpi below this line -
Command: ~/hello_openmpi
Arguments:
Executing in: /acme/home/bergman/sge_job_output
Executing on: acme-c5-8.example.com
Executing
In the message dated: Tue, 18 Dec 2012 11:24:22 +0100,
The pithy ruminations from Reuti on
Re: [gridengine users] JSV unexpectedly alters resource request from boolean to float were:
= Am 17.12.2012 um 09:19 schrieb William Hay:
=
= On 14 December 2012 22:30, berg...@merctech.com
In the message dated: Mon, 17 Dec 2012 12:26:31 PST,
The pithy ruminations from Joseph Farran on
Re: [gridengine users] Restarting Grid Engine makes qstat forget display order
were:
= On 12/16/2012 10:15 AM, Dave Love wrote:
= I think the answer is not to do that. Why restart it?
=
=
=
Following up on the suggestion from William Hay about using a
JSV to wildcard a PE request in order to deal with MPI jobs on a
multi-architecture cluster, I've found that the JSV unexpectedly alters
a resource request from booleans to float.
In ~/.sge_request, I've got:
-l centos5
I've got a question that's very similar to Joseph Farran's query How do I
request the CPU type in qrsh / qsub with SGE 8.1.2? [1], but which is a
problem specifically with MPI jobs.
I think we're running into a chipset-architecture issue (AMD vs Intel)
in OpenMPI jobs. We're using SGE 6.2u5 and
In the message dated: Wed, 12 Sep 2012 12:56:56 -,
The pithy ruminations from MacMullan, Hugh on
Re: [gridengine users] How possible to use matlab in multithreading mode on SGE? were:
= Hi Semi:
=
= The real question is: how to NOT use multithreading, eh? :)
That was my first reaction as
In the message dated: Wed, 29 Aug 2012 12:09:40 EDT,
The pithy ruminations from Brian Smith on
Re: [gridengine users] Linux OOM killer oom_adj were:
= We found h_vmem to be highly unpredictable, especially with java-based
= applications. Stack settings were screwed up, certain applications
=
In the message dated: Wed, 23 May 2012 09:09:23 BST,
The pithy ruminations from Mark Dixon on
Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1 were:
= On Tue, 22 May 2012, Rayson Ho wrote:
=
= For those who missed the Gompute User Group Meeting:
=
=
We're running SGE 6.2u5 (Sun courtesy binaries) under Linux (CentOS 5.8),
and have discovered that SGE consistently segfaults depending on the
command-line argument passed to the command to be run as an SGE job. For
example, these produce a segfault:
qsub -e stderr -o stdout -b y
In the message dated: Thu, 29 Mar 2012 08:30:35 BST,
The pithy ruminations from William Hay on
Re: [gridengine users] how to override boolean attributes set in .sge_request
were:
= On 29 March 2012 02:07, berg...@merctech.com berg...@merctech.com wrote:
= We're running SGE 6.2u5 and seem to be
We're running SGE 6.2u5 and seem to be having a problem with using
command-line options to disable boolean attributes set in the .sge_request
file.
We've got both CentOS4 and CentOS5 nodes. There are complex entries
for each:
qconf -sc | egrep '#|centos'
#name shortcut type relop
In the message dated: Tue, 22 Nov 2011 15:05:43 EST,
The pithy ruminations from Chris Dagdigian on
[gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active
directory integration? were:
=
= Hi folks,
In the spirit of supporting the community of SGE users, rather than
Last week, our cluster (SGE 6.2u5) was working fine. We've got 4
machines designated for interactive use and batch jobs, in a queue that
can be subordinated when needed, and many batch-only nodes.
We're using SSH for qlogin, with the qlogin command set to:
-