Re: [gridengine users] m_mem_free and cgroups

2020-08-07 Thread bergman
In the message dated: Fri, 07 Aug 2020 16:24:16 -, The pithy ruminations from Ondrej Valousek on [Re: [gridengine users] m_mem_free and cgroups] were: => Short answer: Use a different tool than stress. Long answer: linux kernel => is too clever for tests like stress because allocating a memory
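
A hedged sketch of a memory test that actually writes the pages it allocates, so a cgroup limit is genuinely exercised (stress-ng is only one possible stand-in tool, not necessarily the one recommended in the reply; whether m_mem_free is consumable/enforced on a given site is an assumption):

    #!/bin/sh
    # memtest.sh: dirty 2 GiB for 60 s so the job's RSS really grows
    stress-ng --vm 1 --vm-bytes 2G --vm-keep --timeout 60s
    # submit with e.g.:  qsub -l m_mem_free=1G memtest.sh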

Re: [gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened

2019-10-28 Thread bergman
In the message dated: Fri, 18 Oct 2019 15:34:02 -, The pithy ruminations from WALLIS Michael on [[gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened] were: => Hi folks, => => Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs at the start of the

Re: [gridengine users] jobs stuck in transitioning state

2019-09-27 Thread bergman
In the message dated: Fri, 27 Sep 2019 23:32:43 +0200, The pithy ruminations from Reuti on [Re: [gridengine users] jobs stuck in transitioning state] were: => Hi, => => On 27.09.2019 at 22:21, berg...@merctech.com wrote: => => > We're having a problem with submit scripts not being transferred

[gridengine users] jobs stuck in transitioning state

2019-09-27 Thread bergman
We're having a problem with submit scripts not being transferred to exec nodes and jobs being stuck in the [t]ransitioning state. The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7. We are using classic spooling. On the compute nodes, the spool directory

[gridengine users] preventing certain jobs from being suspended (subordinated)

2019-09-04 Thread bergman
Our SoGE (8.1.6) configuration has essentially two queues: one for "all" jobs and one for "short jobs". The all.q is subordinate to the short.q, and short jobs can suspend a job in the general queue. At the moment, the all.q has nodes with & without GPU resources (not ideal, not permanent,
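
For context, a minimal sketch of how that suspend-on-subordination relationship is normally expressed (queue names from the excerpt, threshold illustrative): the superior queue lists the queues it may suspend in its subordinate_list.

    # short.q suspends all.q on a host once at least one short.q slot there is busy:
    qconf -sq short.q | grep subordinate_list
    #   subordinate_list   all.q=1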

Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread bergman
==\n" ; nvidia=smi ; printf "\n\nSGE status on this node:\n===\n" ; qstat -u \* -s rs -l hostname=`hostname` ) 2>&1 | Mail -s "unexpected: no free GPUs on `hostname -s`" root exit 1 fi echo $free exit 0 #######

[gridengine users] SOLVED: selinux and qlogin

2018-06-01 Thread bergman
We're setting up SoGE (8.1.6) to establish qlogin sessions to interactive nodes running CentOS7, with SELinux enabled. A long-standing issue with qlogin is that it launches the listening ssh daemon on random ports. The question about restricting qlogin to use a specific range of ports for ssh has
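
A hedged sketch of the SELinux side of such a fix (the thread's actual solution is not shown in this excerpt): confirm the denial, then either label the chosen port range for sshd or build a local policy module from the audit log.

    ausearch -m avc -c sshd --start recent              # inspect the denials
    semanage port -a -t ssh_port_t -p tcp 50000-50100   # port range is illustrative
    # or: ausearch -m avc -c sshd | audit2allow -M qlogin_sshd && semodule -i qlogin_sshd.pp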

[gridengine users] case-insensitive user names?

2018-04-12 Thread bergman
We're using SoGE 8.1.6 in an environment where users may login to the cluster from a Linux workstation (typically using a lower-case login name) or a Windows desktop, where their login name (as supplied by the enterprise Active Directory) is usually mixed-case. On the cluster, we've created two

Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

2018-01-18 Thread bergman
In the message dated: Tue, 02 Jan 2018 09:11:51 +, The pithy ruminations from William Hay on were: => On Fri, Dec 22, 2017 at 05:55:26PM -0500, berg...@merctech.com wrote: => > True, but even with that info, there doesn't seem to be any universal => > way to tell an arbitrary GPU job which

Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

2017-12-22 Thread bergman
In the message dated: Thu, 21 Dec 2017 23:17:52 +0100, The pithy ruminations from Reuti on were: => Hi, => => On 21.12.2017 at 22:46, berg...@merctech.com wrote: => => > => > I'm considering changing "gpu" to an INT (set to the number of GPUs/node), => > making it a consumable resource, and

[gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

2017-12-21 Thread bergman
In our cluster, we've got several different types of GPUs. Some jobs simply need any GPU, while others require a specific type. Previously, we had "gpu" declared as a BOOLEAN attribute on each GPU-node and had the GPU type (ie., TITANX, P100, etc) declared as an INT attribute with the count of
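
A minimal sketch of the INT/consumable form being considered (qconf -mc line plus per-host counts; host name and counts are illustrative):

    #name  shortcut  type  relop  requestable  consumable  default  urgency
    gpu    gpu       INT   <=     YES          YES         0        0
    # per GPU node:  qconf -me gpunode01  ->  complex_values gpu=4,TITANX=4
    # jobs then request, e.g.:  qsub -l gpu=1 job.sh   (optionally also -l TITANX=1)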

Re: [gridengine users] display GPU with qstat

2017-12-01 Thread bergman
In the message dated: Fri, 01 Dec 2017 11:41:11 +0100, The pithy ruminations from Reuti on were: => Hi, => => > => > And is it possible to have the load of the gpus ? => => Sure. But you will have to set up a load_sensor specific for your => GPUs. And it would only allow a feedback to see
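
A minimal load sensor sketch along the lines described (assumes nvidia-smi is present, a gpu_util complex has been added with qconf -mc, and the script is listed as load_sensor for the GPU hosts):

    #!/bin/sh
    HOST=$(hostname)
    while read -r line; do                    # execd prompts once per load interval, "quit" to stop
       [ "$line" = "quit" ] && exit 0
       UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -1)
       echo "begin"
       echo "$HOST:gpu_util:$UTIL"
       echo "end"
    done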

[gridengine users] cluster OS survey

2017-11-06 Thread bergman
Our cluster is used in academic research on medical imaging. Our research group writes and distributes software for use on clusters & desktops. As part of determining what OSs (and for Linux distributions, GLIBC release levels) to target in our software, I would highly appreciate your feedback

[gridengine users] possible to match resource request against list of values in a complex?

2016-10-28 Thread bergman
Some of our compute nodes have multiple versions of specific resources and I'm looking for an easy way to enable users to match their job requirements against the per-node list. For example, software package "foobar" might exist on the following nodes with these versions: node_a:
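
One common way to get that matching (a sketch, not necessarily what was adopted): declare the package as a RESTRING complex, publish the installed versions per host, and let users request a wildcard pattern.

    #name    shortcut  type      relop  requestable  consumable  default  urgency
    foobar   fb        RESTRING  ==     YES          NO          NONE     0
    # on each node:   qconf -me node_a  ->  complex_values foobar="1.2 1.4 2.0"
    # users request:  qsub -l 'foobar=*1.4*' job.sh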

Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently and fail-over occurs quite often - best practices debugging and resolving the issue.

2016-06-21 Thread bergman
In the message dated: Tue, 21 Jun 2016 10:09:11 -0400, The pithy ruminations from Jesse Becker on were: => On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote: => >On Tue, Jun 21, 2016 at 08:16:25AM +, Yuri Burmachenko wrote: => >> => >>We have noticed that our sge_qmaster

[gridengine users] core binding stopped working

2016-05-24 Thread bergman
We're running SoGE 8.1.6 under CentOS6 and had successfully been using a JSV to set core binding. This has been extremely helpful in controlling some very 'greedy' multi-threaded processes. Recently, I've noticed that the core binding is no longer working. There have been no related changes to
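
A quick way to separate a JSV problem from an execd/OS problem (a diagnostic sketch; PE name, job id and pid are placeholders):

    qsub -b y -binding linear:2 -pe smp 2 sleep 600   # request binding explicitly, independent of the JSV
    qstat -j <jobid> | grep -i binding                # <jobid>: placeholder
    # on the execution node, check what was actually applied to the job process:
    grep Cpus_allowed_list /proc/<pid>/status         # <pid>: placeholder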

Re: [gridengine users] Fwd: dispatching sge task from an sge task - is that a reasonable practice?

2016-02-25 Thread bergman
In the message dated: Thu, 25 Feb 2016 13:38:28 -0800, The pithy ruminations from Skylar Thompson on were: => We have pipelines that are driven by a qsub at the end of a batch script. We have several pipelines like that. It's not uncommon to have the initial job submit 5~20 other jobs. I don't

[gridengine users] cluster utilization

2016-02-24 Thread bergman
Is anyone monitoring cluster utilization with a higher-level view than simply job (qacct) statistics and CPU-seconds used/available? I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and 200~350K jobs/month and I'm seeking ways to understand the utilization & resource constraints in

Re: [gridengine users] suspended jobs continue to run

2014-12-19 Thread Bergman
On December 19, 2014 6:19:58 AM EST, Reuti re...@staff.uni-marburg.de wrote: = On 18.12.2014 at 22:21, berg...@merctech.com wrote: = = We've got a job that was suspended via: = = qmod -sj $jobid = = that's continuing to run. The job consists of a BASH script, which = in = turn

[gridengine users] suspended jobs continue to run

2014-12-18 Thread bergman
We've got a job that was suspended via: qmod -sj $jobid that's continuing to run. The job consists of a BASH script, which in turn submits other jobs in a loop, sleeping for 30 seconds after each loop. When I examine the job status on the node where it is executing via: ps -e f
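
A hedged sketch of how to confirm on the execution node whether the SIGSTOP reached every process in the job (stopped processes show state 'T'; the pgid is a placeholder):

    ps -eo pid,ppid,pgid,stat,cmd --forest | grep -A10 sge_shepherd   # children should show STAT 'T'
    # if some children are still running, they can be stopped by process group:
    kill -STOP -- -<pgid>     # <pgid>: the job shell's process group, placeholder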

[gridengine users] does SGE do smart core assignment for jobs that are multi-threaded and parallel?

2014-06-13 Thread bergman
We're running SoGE 8.1.6, and I wanted to understand how SoGE manages CPU resources for jobs that are both multi-threaded and MPI-parallel. We have slots configured as a consumable resource, with the number of slots per-node equal to the number of CPU-cores. We use OpenMPI with tight SGE
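
A common convention for such hybrid jobs (a sketch; PE name and numbers illustrative): request ranks x threads slots, derive the MPI rank count from $NSLOTS inside the job, and leave placement to the PE's allocation_rule.

    # e.g. 4 MPI ranks, each 4-way threaded  ->  16 slots
    qconf -sp openmpi | grep allocation_rule      # controls how the 16 slots are spread over hosts
    qsub -pe openmpi 16 -v OMP_NUM_THREADS=4 job.sh
    # inside job.sh:  mpirun -np $((NSLOTS / OMP_NUM_THREADS)) ./app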

Re: [gridengine users] Core Binding and Node Load

2014-05-05 Thread bergman

[gridengine users] SGE_STARTER_SHELL_PATH unset (Was: Re: SGE starter_method breaks OpenMPI)

2014-04-21 Thread bergman
In the message dated: Wed, 02 Apr 2014 16:19:14 -0400, The pithy ruminations from berg...@merctech.com on [gridengine users] SGE starter_method breaks OpenMPI were: = = We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a = 'starter_method' that's affecting OpenMPI jobs.

Re: [gridengine users] The state of a queue

2014-04-17 Thread bergman
In the message dated: Thu, 17 Apr 2014 09:55:42 -0700, The pithy ruminations from Joseph Farran on [gridengine users] The state of a queue were: = Howdy. = = I am able to disable / enable a queue @ a compute node with: = = $ qmod -d bio@compute-1-1 = me@sys changed state of

[gridengine users] SGE starter_method breaks OpenMPI

2014-04-02 Thread bergman
as if the starter_method is somehow corrupting the environment passed to mpirun. For example, I submitted a job when I was in the directory /lab/home/bergman/sge_job_output. The starter_method script reports: Debugging: About to: exec OPAL_PREFIX=/lab/bin/openmpi/sge; export OPAL_PREFIX
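
For reference, the least intrusive shape for a starter_method is to export what it needs and then exec the original command untouched; a minimal sketch using the OPAL_PREFIX path quoted above:

    #!/bin/sh
    # starter_method: SGE passes the job (command plus arguments) as "$@"
    OPAL_PREFIX=/lab/bin/openmpi/sge
    export OPAL_PREFIX
    exec "$@"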

Re: [gridengine users] Using modules from the compute nodes

2014-01-27 Thread bergman
In the message dated: Mon, 27 Jan 2014 17:50:58 +0100, The pithy ruminations from Reuti on Re: [gridengine users] Using modules from the compute nodes were: = Hi, = = On 27.01.2014 at 17:26, Txema Heredia wrote: = = I have been trying to use modulefiles from my compute nodes to no avail.

Re: [gridengine users] qacct and resource requests

2014-01-21 Thread bergman

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-27 Thread bergman

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-23 Thread bergman

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-22 Thread bergman
In the message dated: Thu, 01 Aug 2013 03:09:56 -, The pithy ruminations from Jewell, Chris on [gridengine users] Random queue errors, and suspect pe_hostfiles were: = Hello all, = = A while since I posted here, so good to be back! = = My installation of GE 8.1.3 from the Scientific Linux

[gridengine users] strategy for h_vmem and multi-slot jobs?

2013-02-01 Thread bergman
I'm looking for suggestions for dealing with h_vmem requirements for multi-slot jobs. We use memory as a consumable and a required complex. I understand that SGE multiplies the h_vmem request by the number of slots in order to determine the job memory requirement. In our environment, there are
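
Since h_vmem is multiplied by the slot count, the working rule is to request the per-slot share rather than the whole-job figure; a small worked example (PE name and numbers illustrative):

    # job needs ~32G in total across 8 slots  ->  request 32G / 8 = 4G per slot
    qsub -pe smp 8 -l h_vmem=4G job.sh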

[gridengine users] h_rt and suspended jobs?

2013-01-30 Thread bergman
We're running SGE 6.2u5 and we've got a short queue, which assigns a higher priority but imposes run-time and CPU-time limits. We also have a short-on-interactive queue to allow short jobs to run on a subset of the slots on our interactive nodes, with the goal of allowing short, high-priority

[gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-11 Thread bergman
I recently reconfigured our SGE (6.2u5) environment to better handle MPI jobs on a heterogeneous cluster. This seems to have caused a problem with the threaded (SMP) PE. Our PEs are: qconf -spl make(unused) openmpi-AMD
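
A frequent cause of this symptom (a sketch; hostgroup and PE names illustrative): a bracketed per-hostgroup entry in the queue's pe_list replaces the default list for those hosts rather than adding to it, so any PE left out of the bracketed value offers 0 slots there.

    qconf -sq all.q | grep pe_list
    #   pe_list   make threaded,[@amd=openmpi-AMD],[@intel=openmpi-Intel]
    # on @amd/@intel hosts only the openmpi-* PE is available; "threaded" must be repeated:
    #   pe_list   make threaded,[@amd=threaded openmpi-AMD],[@intel=threaded openmpi-Intel]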

Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-11 Thread bergman
In the message dated: Fri, 11 Jan 2013 23:45:05 +0100, The pithy ruminations from Reuti on Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?) were: = On 11.01.2013 at 23:16, berg...@merctech.com wrote: = = [SNIP!] = = qconf -sq all.q | grep pe_list

Re: [gridengine users] How prevent abnormal nodes load using qsub?

2013-01-09 Thread bergman
In the message dated: Tue, 01 Jan 2013 17:38:23 +0100, The pithy ruminations from Reuti on Re: [gridengine users] How prevent abnormal nodes load using qsub? were: = On 31.12.2012 at 09:34, Semi wrote: = = The memory is not a problem, the problem is CPU load, = every python process runs 2

[gridengine users] problem launching MPI jobs under SGE with architecture-specific PEs

2012-12-21 Thread bergman
\ --mca btl_base_verbose 30 \ --mca routed direct \ --prefix $OPENMPI -np $NSLOTS ~/hello_openmpi - STDOUT from ~/hello_openmpi below this line - Command: ~/hello_openmpi Arguments: Executing in: /acme/home/bergman/sge_job_output Executing on: acme-c5-8.example.com Executing

[gridengine users] jsv_accept vs jsv_correct (Was: Re: JSV unexpectedly alters resource request from boolean to float)

2012-12-18 Thread bergman
In the message dated: Tue, 18 Dec 2012 11:24:22 +0100, The pithy ruminations from Reuti on Re: [gridengine users] JSV unexpectedly alters resource request from boolean to float were: = On 17.12.2012 at 09:19, William Hay wrote: = = On 14 December 2012 22:30, berg...@merctech.com
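
The distinction that thread turns on, sketched as a Bash JSV skeleton (the helper functions come from $SGE_ROOT/util/resources/jsv/jsv_include.sh): jsv_correct asks the qmaster to rebuild the job from the JSV's view of the parameters, which is where an unqualified boolean request can come back re-encoded, while jsv_accept leaves the submission untouched.

    #!/bin/sh
    jsv_on_start()
    {
       return
    }
    jsv_on_verify()
    {
       if [ "$modified" = "true" ]; then   # $modified: stand-in for the site's own checks
          jsv_correct "Job was modified"
       else
          jsv_accept "Job is unchanged"    # no rewrite, so "-l centos5" stays a plain boolean
       fi
       return
    }
    . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
    jsv_main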

Re: [gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-17 Thread bergman
In the message dated: Mon, 17 Dec 2012 12:26:31 PST, The pithy ruminations from Joseph Farran on Re: [gridengine users] Restarting Grid Engine makes qstat forget display order were: = On 12/16/2012 10:15 AM, Dave Love wrote: = I think the answer is not to do that. Why restart it? = = =

[gridengine users] JSV unexpectedly alters resource request from boolean to float

2012-12-14 Thread bergman
Following up on the suggestion from from William Hay about using a JSV to wildcard a PE request in order to deal with MPI jobs on a multi-architecture cluster, I've found that the JSV unexpectedly alters a resource request from booleans to float. In ~/.sge_request, I've got: -l centos5

[gridengine users] MPI jobs on a multi-architecture cluster?

2012-12-12 Thread bergman
I've got a question that's very similar to Joseph Farran's query How do I request the CPU type in qrsh / qsub with SGE 8.1.2? [1], but which is a problem specifically with MPI jobs. I think we're running into a chipset-architecture issue (AMD vs Intel) in OpenMPI jobs. We're using SGE 6.2u5 and

Re: [gridengine users] How possible to use matlab in multithreading mode on SGE?

2012-09-12 Thread bergman
In the message dated: Wed, 12 Sep 2012 12:56:56 -, The pithy ruminations from MacMullan, Hugh on Re: [gridengine users] How possible to use matlab in multithreading mode on SGE? were: = Hi Semi: = = The real question is: how to NOT use multithreading, eh? :) That was my first reaction as

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-30 Thread bergman
In the message dated: Wed, 29 Aug 2012 12:09:40 EDT, The pithy ruminations from Brian Smith on Re: [gridengine users] Linux OOM killer oom_adj were: = We found h_vmem to be highly unpredictable, especially with java-based = applications. Stack settings were screwed up, certain applications =

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-24 Thread bergman
In the message dated: Wed, 23 May 2012 09:09:23 BST, The pithy ruminations from Mark Dixon on Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1 were: = On Tue, 22 May 2012, Rayson Ho wrote: = = For those who missed the Gompute User Group Meeting: = =

[gridengine users] SGE 6.2u5 segfault due to format strings on command line

2012-04-25 Thread bergman
We're running SGE 6.2u5 (Sun courtesy binaries) under Linux (CentOS 5.8), and have discovered that SGE consistently segfaults depending on the command-line argument passed to the command to be run as an SGE job. For example, these produce a segfault: qsub -e stderr -o stdout -b y
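
A hedged illustration of the class of failing command (the actual arguments are truncated above): per the subject line, a printf-style conversion such as '%s' in the arguments of a binary job is the kind of input that triggers the crash.

    # hypothetical reproduction built only from the subject line:
    qsub -e stderr -o stdout -b y /bin/echo 'value %s'
    # workaround sketch: wrap the command in a script so a literal '%'
    # never appears on the qsub command line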

Re: [gridengine users] how to override boolean attributes set in .sge_request

2012-03-29 Thread bergman
In the message dated: Thu, 29 Mar 2012 08:30:35 BST, The pithy ruminations from William Hay on Re: [gridengine users] how to override boolean attributes set in .sge_request were: = On 29 March 2012 02:07, berg...@merctech.com berg...@merctech.com wrote: = We're running SGE 6.2u5 and seem to be

[gridengine users] how to override boolean attributes set in .sge_request

2012-03-28 Thread bergman
We're running SGE 6.2u5 and seem to be having a problem with using command-line options to disable boolean attributes set in the .sge_request file. We've got both CentOS4 and CentOS5 nodes. There are complex entries for each: qconf -sc|egrep #|centos #name shortcut type relop
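
For reference, a sketch of the attempted override (complex names from the excerpt; whether an explicit =FALSE on the command line actually wins over the ~/.sge_request default is exactly what the thread is asking):

    # ~/.sge_request contains, e.g.:   -l centos5
    # attempted command-line override:
    qsub -l centos5=FALSE -l centos4=TRUE job.sh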

Re: [gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration?

2011-11-22 Thread bergman
In the message dated: Tue, 22 Nov 2011 15:05:43 EST, The pithy ruminations from Chris Dagdigian on [gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration? were: = = Hi folks, In the spirit of supporting the community of SGE users, rather than

[gridengine users] cluster IP change, now qlogin timeout (4 s) expired while waiting on socket fd 4

2011-08-22 Thread bergman
Last week, our cluster (SGE 6.2u5) was working fine. We've got 4 machines designated for interactive use and batch jobs, in a queue that can be subordinated when needed, and many batch-only nodes. We're using SSH for qlogin, with the qlogin command set to: -
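
The qlogin command value itself is truncated above; for orientation, the classic SSH-based arrangement (a sketch, not necessarily this site's exact setup) pairs "qlogin_daemon /usr/sbin/sshd -i" with a small client-side wrapper named as qlogin_command:

    #!/bin/sh
    # qlogin_wrapper: qlogin invokes it as "qlogin_wrapper <host> <port>"
    HOST=$1
    PORT=$2
    exec /usr/bin/ssh -X -p "$PORT" "$HOST"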