Re: [gridengine users] Job in error states

2020-03-08 Thread MacMullan IV, Hugh
Or if it’s an NFS share, perhaps it’s become unmounted on one or more exec nodes. -Hugh > On Mar 7, 2020, at 10:55, Reuti wrote: > > Hi, > > is it always failing on one and the same node? Or are several nodes affected? > One guess could be that the file system is full. > > -- Reuti > >

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-29 Thread MacMullan IV, Hugh
What’s the output of ‘qconf -sq long.q’? Are you sure it doesn’t still reference the old hostname, maybe within a hostgroup? -Hugh From: users-boun...@gridengine.org On Behalf Of Mun Johl Sent: Tuesday, October 29, 2019 1:38 PM To: dpo...@gmail.com Cc: users@gridengine.org Subject: Re:

Re: [gridengine users] Sorting qhost and choosing qstat columns

2019-08-01 Thread MacMullan IV, Hugh
David, Best for qhost sort would be to change your 'cluster' names to zero-padded, if you really want that kind of sorting. Or you could create an alias like 'qhost | sort -nk 1.8', assuming 'clusterX' is always true (the 8th character is where you start the sort). As Skylar says, if you want
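A minimal sketch of the two approaches mentioned above, run on made-up host names rather than real qhost output. The names (cluster1, cluster2, cluster10) and the helper functions are assumptions for illustration. 'sort -nk 1.8' starts the numeric key at the 8th character of the first field, so it only works when every name has a 7-character prefix such as 'cluster'. Real qhost output also has header lines, which this sketch ignores.

```shell
#!/bin/bash
# Hypothetical host names standing in for the first column of qhost output.
hosts='cluster10
cluster2
cluster1'

# Numeric sort starting at character 8 of field 1 (just past "cluster").
sort_by_suffix() {
  sort -nk 1.8
}

# Alternative: zero-pad the numeric suffix so a plain sort orders correctly.
zero_pad() {
  local h n
  while read -r h; do
    n=${h#cluster}
    # 10# forces base 10 so suffixes like "08" are not read as octal.
    printf 'cluster%02d\n' "$((10#$n))"
  done
}

printf '%s\n' "$hosts" | sort_by_suffix
printf '%s\n' "$hosts" | zero_pad | sort
```

Zero-padding the names at the source avoids the need for a custom sort key in every tool that lists hosts, which is why it is suggested first.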

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
cpu 19305.760s mem 7.463TBs io 70.435GB iow 0.000s maxvmem 532.004MB arid undefined ar_sub_time undefined category -l hostname=karun10 -pe make 1 Thanks, ulrich On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote: > It's a limit being reac

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
It's a limit being reached, of some sort. Do you have an RQS of any kind (qconf -srqs)? We see this for job-requested or system-set RAM exhaustion (OOM killer; as mentioned, 'dmesg -T' on compute nodes is often useful), as well as time limits reached. What is the whole output from 'qacct -j JOBID'?
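When reading 'qacct -j JOBID' output for signs of a limit, a few fields tend to be the most telling. The sketch below filters them out of "key value" lines on stdin. The helper name limit_hints and the sample record are invented for illustration. An exit_status of 137 is 128 + 9, i.e. the job was ended by SIGKILL, which is consistent with (but does not prove) an OOM kill or an enforced hard limit.

```shell
#!/bin/bash
# Print the accounting fields that usually hint at a limit being hit.
# Reads 'qacct -j JOBID' style "key value" lines on stdin.
limit_hints() {
  awk '$1 == "failed" || $1 == "exit_status" || $1 == "maxvmem" || $1 == "ru_wallclock" {
         key = $1
         $1 = ""            # drop the key, keep the rest of the line
         sub(/^ +/, "")     # trim the gap left behind
         print key "=" $0
       }'
}

# Invented sample record; on a real cluster use: qacct -j JOBID | limit_hints
sample='jobname      test.sh
failed       100 : assumedly after job
exit_status  137
ru_wallclock 86400
maxvmem      532.004MB'

printf '%s\n' "$sample" | limit_hints
```

Comparing maxvmem and ru_wallclock against the job's requested memory and run-time limits is a quick way to tell which limit, if any, was reached.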

Re: [gridengine users] batch array jobs are executed on interactive queue

2019-01-09 Thread MacMullan IV, Hugh
Iyad: As part of our JSV (jsv.sh), we force any non-qsub CLIENT to our 'interactive' queue, like:
    jsv_on_verify() {
      MYQ=$(jsv_get_param 'q_hard')
      # force non-qsub to interactive queue
      if [[ $(jsv_get_param 'CLIENT') != 'qsub' ]]; then
        NEWQ=$(echo $MYQ | sed
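The excerpt above is cut off in the archive, so the following is not the poster's script. It is a simplified sketch of the same idea using the stock JSV shell helpers (jsv_get_param, jsv_set_param, jsv_correct, jsv_accept, jsv_main from jsv_include.sh). Instead of rewriting the requested queue list with sed, it simply replaces it. The queue name 'interactive.q' is a placeholder assumption.

```shell
#!/bin/bash
# Load the stock JSV helper functions when running under Grid Engine.
jsv_lib="${SGE_ROOT:-}/util/resources/jsv/jsv_include.sh"
if [ -r "$jsv_lib" ]; then
  . "$jsv_lib"
fi

jsv_on_start() {
  return
}

jsv_on_verify() {
  # Assumption: 'interactive.q' is a placeholder queue name.
  if [ "$(jsv_get_param CLIENT)" != "qsub" ]; then
    # Replace the hard queue request for qrsh, qlogin, and other clients.
    jsv_set_param q_hard "interactive.q"
    jsv_correct "non-qsub client routed to interactive.q"
  else
    jsv_accept "job accepted unchanged"
  fi
  return
}

# Start the JSV protocol loop only if the helper library was loaded.
if declare -F jsv_main >/dev/null; then
  jsv_main
fi
```

Replacing q_hard outright discards whatever queue the user asked for. A real script may prefer to rewrite the request, as the original appears to do.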

Re: [gridengine users] $TMPDIR With MPI Jobs

2018-12-06 Thread MacMullan IV, Hugh
Perhaps you can 'wrap' the 'work' in a small script (work.sh), like:
    #!/bin/bash
    ## pre-work
    echo TMP: $TMP
    OUT=$TMP/env.$JOB_ID.$OMPI_COMM_WORLD_RANK.txt
    ## WORK
    env > $OUT 2>&1
    ## report and clean up?
    ls -la $TMP
    rsync -av $OUT $SGE_CWD_PATH
Then use a wrapper.sh job script to 'mpiexec

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread MacMullan IV, Hugh
Sweet! Glad to see you’re up and running. -H From: Dimar Jaime González Soto Sent: Thursday, December 6, 2018 11:57 AM To: MacMullan IV, Hugh Subject: Re: [gridengine users] problem with concurrent jobs I disabled that quota and now I can see 60 processes running. Thanks El jue., 6 dic. 2018

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread MacMullan IV, Hugh
Yes, as Reuti said: What’s the output from ‘qconf -srqs’ (line 1 of the max_slots rule)? Looks like you’re being blocked there (RQS). -H From: Dimar Jaime González Soto Sent: Thursday, December 6, 2018 11:22 AM To: MacMullan IV, Hugh Subject: Re: [gridengine users] problem with concurrent jobs

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread MacMullan IV, Hugh
Also: I'm surprised 'qalter -w p ' doesn't show any output. Did you forget the JOBID? -H -Original Message- From: users-boun...@gridengine.org On Behalf Of Reuti Sent: Thursday, December 6, 2018 11:04 AM To: Dimar Jaime González Soto Cc: users@gridengine.org Subject: Re: [gridengine

Re: [gridengine users] Enabling an execd host after a particular job has run

2017-11-06 Thread MacMullan IV, Hugh
You could put that code in the exec startup script, above the line that starts execd, and error out if that part doesn’t complete successfully. Seems weird, though. Shouldn’t the node build include all necessary bits to set up your execd nodes to ‘run jobs’? Are these on your Cygwin cluster?

Re: [gridengine users] sge_shepherd failing under Cygwin

2017-10-31 Thread MacMullan IV, Hugh
You could try ‘qstat -j 11 -explain E’ to see if there’s an explanation for the error. On Oct 31, 2017, at 19:56, Simon Matthews wrote: I have finally got a small test grid running with an execd running under Cygwin on a Windows

Re: [gridengine users] Port range when using ssh for qrsh

2017-10-26 Thread MacMullan IV, Hugh
You can turn on system firewalls and allow inbound TCP traffic on all ports from cluster nodes only. Then open ssh ports to on-site, or some other restricted set of subnets. Perhaps that will satisfy your InfoSec team. If you use Univa GridEngine, you can specify the ‘port_range’ option

Re: [gridengine users] Automate SGE Tasks

2017-04-20 Thread MacMullan IV, Hugh
Absolutely! Take a look at the qconf man page for details on using files (-[AMRD]attr) and / or the -[amrd]attr options (for specific objects). Hope that's useful! -Hugh On Apr 20, 2017, at 14:13, Douglas Duckworth wrote: Hey! Is
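As a rough illustration of scripting with the -[amrd]attr options, the sketch below wraps two common changes in functions and prints the qconf commands instead of running them unless DRY_RUN=0 is set. The host, hostgroup, and queue names (node42, @allhosts, all.q) and the helper names are assumptions. Check the qconf man page for the objects and attributes your version supports.

```shell
#!/bin/bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Add a host to a hostgroup's hostlist (-aattr appends to a list attribute).
add_exec_host() {
  local host=$1 hgroup=$2
  run qconf -aattr hostgroup hostlist "$host" "$hgroup"
}

# Set the slots attribute on a queue (-mattr modifies an attribute).
set_queue_slots() {
  local queue=$1 slots=$2
  run qconf -mattr queue slots "$slots" "$queue"
}

# Hypothetical names, shown in dry-run mode.
add_exec_host node42 @allhosts
set_queue_slots all.q 8
```

Keeping a dry-run default makes it easier to review the generated commands before they touch a live cluster configuration.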

Re: [gridengine users] How can i make gridengine not to use ssh?

2016-08-10 Thread MacMullan IV, Hugh
Good call, Reuti. Thanks for the expansion and details! > On Aug 10, 2016, at 16:25, Reuti <re...@staff.uni-marburg.de> wrote: > > >> Am 10.08.2016 um 21:46 schrieb MacMullan IV, Hugh: >> >> Hi Ulrich: >> >> I haven't gone past op

Re: [gridengine users] How can i make gridengine not to use ssh?

2016-08-10 Thread MacMullan IV, Hugh
Hi Ulrich: I haven't gone past openmpi v1.10, but you'll likely want to change 'control_slaves' in your PE conf to 'TRUE', to signal that you have tight integration. https://www.open-mpi.org/faq/?category=sge (and probably 'job_is_first_task' to 'FALSE') Does that help? -Hugh -Original

Re: [gridengine users] core binding stopped working

2016-06-10 Thread MacMullan IV, Hugh
Another angle to consider is from the MATLAB side: using the '-singleCompThread' option in your matlab and worker scripts ($MATLAB_ROOT/bin/(matlab|worker)):
    $ diff matlab.dist matlab
    497c497
    < arglist=""
    ---
    > arglist="-singleCompThread"
    $ diff worker.dist worker
    19c19
    < exec

Re: [gridengine users] All queues dropped because of overload or full

2016-05-25 Thread MacMullan IV, Hugh
I'm no Rocks guy, but maybe this thread will help: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2013-January/060918.html -Original Message- From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of Pat Haley Sent: Wednesday, May 25, 2016 10:59 AM

Re: [gridengine users] SoGE: hyperthreading on or off?

2016-03-18 Thread MacMullan IV, Hugh
We use HT in AWS when running a large set of single-thread jobs across a cluster. While the individual jobs run more slowly, the total set of jobs completes more quickly, saving us money. Generally HT gets about a 15% speed increase (and corresponding cost savings) over 2x ... in other words 2x

Re: [gridengine users] alias for queue

2016-03-10 Thread MacMullan IV, Hugh
Tested queue re-write within a $SGE_ROOT/$SGE_CELL/sge_request launched JSV ... it works! Hooray! That eliminates 4 dummy queues for us. Good, good stuff. We're very JSV now. -Original Message- From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of Reuti

Re: [gridengine users] rebooting nodes nicely - what happened?

2016-03-02 Thread MacMullan IV, Hugh
Limiting the slots on exec hosts (some options): http://www.softpanorama.org/HPC/Grid_engine/Resources/slot_limits.shtml Exclusive complex configuration (2010 post): https://web.archive.org/web/20101027190030/http://wikis.sun.com/display/gridengine62u3/Configuring+Exclusive+Scheduling exclusive