Re: [gridengine users] Job getting killed in the middle

2016-01-05 Thread MacMullan, Hugh
Hi Sudha:

You seem to have an h_vmem limit of 8GB set for the job, and you're going over 
that, so the job is getting terminated.
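
For scale, the two byte counts in the execd log line (quoted below) convert to GiB like this; the job used roughly 10.7 GiB against an 8 GiB h_vmem limit, hence the SIGKILL:

```shell
# Convert the byte counts from the execd message to GiB: the job used
# ~10.7 GiB against an 8 GiB h_vmem limit, which is why it was killed.
awk 'BEGIN {
  used  = 11510681600
  limit = 8589934592
  printf "used:  %.2f GiB\nlimit: %.2f GiB\n", used / 1073741824, limit / 1073741824
}'
```

Raising the request (e.g. 'qsub -l h_vmem=12G ...', limits permitting) or reducing the job's memory footprint avoids the kill.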

Cheers,
-Hugh

From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of sudha.penme...@wipro.com
Sent: Tuesday, January 5, 2016 10:05 AM
To: users@gridengine.org
Subject: [gridengine users] Job getting killed in the middle

Hi,

Can you please help me understand why the job gets killed for the reason
below.

01/02/2016 06:18:10|execd|host1|W|job 267713 exceeds job hard limit "h_vmem" of 
queue "test.q@host1" (11510681600.0 > 
limit:8589934592.0) - sending SIGKILL

Regards,
Sudha
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] SoGE 8.1.8 - Qsub issue when using requestable variable and parallel environment - need your help.

2015-10-30 Thread MacMullan, Hugh
I don't know why, but when we do the same, we need to specify '-soft' for the 
'-l hostname=' request. Like:

qsub -b y -N test -j y -pe somepe 2 -soft -l hostname=hpcc001 hostname

I hope that helps!

-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Reuti
Sent: Friday, October 30, 2015 6:17 AM
To: Yuri Burmachenko 
Cc: users@gridengine.org; EdaIt ; Leior Varon 

Subject: Re: [gridengine users] SoGE 8.1.8 - Qsub issue when using requestable 
variable and parallel environment - need your help.


> On 30.10.2015 at 10:40, Yuri Burmachenko wrote:
> 
> Hello distinguished forum members,
>  
> Recently we have a need to submit jobs in a way that qsub requests both the 
> requestable variable hostname and a parallel environment.
>  
> For example, if we submit an ‘xterm’ job:
> $SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -l hostname=host_in_grid 
> -pe somePe 1 xterm
>  
> This kind of request results in strange scheduler behavior – the submission 
> ends up in one of the states below:
>  
> 1.  The xterm job opens as expected.
> 2.  There is a very long delay and then the xterm opens.
> 3.  The job enters ‘qw’ state with an error similar to:
> cannot run because it exceeds limit "/" in rule "some_rule/1" 
>   
> cannot run in PE "somePe" because it only offers 0 slots
>  
> In all of the above states the “host_in_grid” host has enough free slots, and 
> the quota rule “some_rule” is not related in any way to the 
> consumable/requestable variable in the job submission request.
> If we remove the “some_rule” quota from the SGE quotas, the error picks up 
> another rule and again states that its limit was exceeded.
> NOTE: somePe parallel environment has enough free slots – it is initially 
> defined with 999 slots.
>  
> Basically these “cannot run” messages do not reflect the real reason why the 
> job can’t run, since all conditions are actually met – this is very 
> confusing. Why does this happen?
>  
> We also found a workaround without the requestable variable “hostname”; the 
> following ALWAYS works:
> $SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -q host_in_grid -pe testpe 1 xterm
>  
> Any ideas why this strange behavior occurs? Is this some kind of bug? 
> How can it be resolved?

Unfortunately I have no idea, but I already observed in former versions that 
instead of:

-l h=foo -q bar

it's better to request:

-q bar@foo

Maybe it is similar to the issue you faced.

-- Reuti


Re: [gridengine users] Monitoring slot usage

2015-07-30 Thread MacMullan, Hugh
Hi Simon:

We use 'Core Binding' to restrict users to the same number of cores as slots 
requested.

http://www.gridengine.eu/grid-engine-internals/87-exploiting-the-grid-engine-core-binding-feature

We use a jsv to assign the binding value (force compliance) based on the other 
job inputs: single slot and MPI jobs are bound to 1 core (for each slot 
requested), OpenMP jobs are bound to the number of slots requested in the pe 
option.
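
For reference, a minimal shell JSV of the kind described above might look like the sketch below. This is NOT our production script: the include path, the 'openmp' PE name, and the binding_* parameter names and values are assumptions that should be checked against the jsv(1)/sge_jsv(5) man pages for your version.

```shell
#!/bin/sh
# Hypothetical JSV sketch: force '-binding linear:N', where N is the PE
# slot maximum for openmp jobs and 1 for everything else.
. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   if [ "$(jsv_get_param pe_name)" = "openmp" ]; then
      slots=$(jsv_get_param pe_max)   # bind as many cores as slots
   else
      slots=1                         # single-slot and MPI: one core each
   fi
   jsv_set_param binding_strategy "linear_automatic"
   jsv_set_param binding_amount "$slots"
   jsv_correct "binding set to linear:$slots"
   return
}

jsv_main
```

A script like this would be wired in via a client-side '-jsv' option or the jsv_url in the cluster configuration.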

Or you might be able to just put '-binding linear:1' in 
$SGE_ROOT/default/common/sge_request, and then have users specify '-binding 
linear:#' if they're doing a SMP job.

Test carefully! :)

-Hugh

From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Simon Andrews
Sent: Thursday, July 30, 2015 11:01 AM
To: users@gridengine.org
Subject: [gridengine users] Monitoring slot usage

What is the recommended way of identifying jobs which are consuming more CPU 
than they've requested?  I have an environment set up where people mostly 
submit SMP jobs through a parallel environment and we can use this information 
to schedule them appropriately.  We've had several cases though where the jobs 
have used significantly more cores on the machine they're assigned to than they 
requested, so the nodes become overloaded and go into an alarm state.

What options do I have for monitoring the number of cores simultaneously used 
by a job and comparing this to the number which were requested so I can find 
cases where the actual usage is way above the request and kill them?
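
One building block for such a monitor, runnable on any Linux node: the NLWP field of ps reports a process's thread count. A watchdog could apply this to the processes under each job's sge_shepherd and compare the total to the granted slots; the shepherd lookup is left out here, so the sketch just inspects its own shell:

```shell
# Report the thread count (NLWP) of a process; a job-level monitor would
# sum this over the job's process tree and compare it to the slot count.
pid=$$                                          # stand-in for a job PID
nthreads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "process $pid is running $nthreads threads"
```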

Thanks

Simon.
The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered 
Charity No. 1053902.
The information transmitted in this email is directed only to the addressee. If 
you received this in error, please contact the sender and delete this email 
from your system. The contents of this e-mail are the views of the sender and 
do not necessarily represent the views of the Babraham Institute. Full 
conditions at: www.babraham.ac.uk/terms


Re: [gridengine users] frequent errors from the GRID messages

2015-05-28 Thread MacMullan, Hugh
Do users have permission to write to /tmp? 1777 is the 'normal' permissions for 
that directory:

drwxrwxrwt.  12 root root 4096 May 28 11:10 tmp
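
The expected mode can be demonstrated on a scratch directory, so the sketch is safe to run as a normal user:

```shell
# Mode 1777 = world-writable plus the sticky bit: anyone may create
# files, but only owners may delete them -- the convention for /tmp.
tmpdir=$(mktemp -d)
chmod 1777 "$tmpdir"
ls -ld "$tmpdir"        # mode column shows drwxrwxrwt
rmdir "$tmpdir"
```

If /tmp on the execution hosts lacks these bits, 'chmod 1777 /tmp' (as root) restores them.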


From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of sudha.penme...@wipro.com
Sent: Thursday, May 28, 2015 11:02 AM
To: users@gridengine.org
Subject: [gridengine users] frequent errors from the GRID messages

Hi,

The below errors appear quite often in the GRID messages file. Although the 
/tmp directory on the hosts has sufficient permissions, the messages show the 
jobs failed because of this error.

Can you please help in understanding the reason for these errors.

05/27/2015 09:58:25|qmaster|gridmaster1|W|job 8318993.1 failed on host host1 
general before job because: 05/27/2015 09:58:25 [899:22217]: can't open file 
/tmp/8318993.1.rhel6.q/pid: Permission denied
05/27/2015 10:55:37|qmaster|gridmaster1|W|job 8319636.1 failed on host host2 
general before job because: 05/27/2015 10:55:37 [899:11272]: can't open file 
/tmp/8319636.1.rhel6.q/pid: Permission denied
05/27/2015 10:58:05|qmaster|gridmaster1|W|job 8319689.1 failed on host host3 
general before job because: 05/27/2015 10:58:04 [899:48764]: can't open file 
/tmp/8319689.1.rhel6.q/pid: Permission denied
05/27/2015 10:58:05|qmaster|gridmaster1|W|job 8319691.1 failed on host host6 
general before job because: 05/27/2015 10:58:04 [899:46484]: can't open file 
/tmp/8319691.1.rhel6.q/pid: Permission denied
05/27/2015 11:00:51|qmaster|gridmaster1|W|job 8319752.1 failed on host host4 
general before job because: 05/27/2015 11:00:51 [899:14950]: can't open file 
/tmp/8319752.1.rhel6.q/pid: Permission denied
05/27/2015 11:00:51|qmaster|gridmaster1|W|job 8319750.1 failed on host host7 
general before job because: 05/27/2015 11:00:51 [899:50509]: can't open file 
/tmp/8319750.1.rhel6.q/pid: Permission denied
05/27/2015 11:02:14|qmaster|gridmaster1|W|job 8319760.1 failed on host host5 
general before job because: 05/27/2015 11:02:14 [899:17507]: can't open file 
/tmp/8319760.1.rhel6.q/pid: Permission denied

Regards,
Sudha


Re: [gridengine users] unable to contact qmaster using port 536 on host

2015-05-08 Thread MacMullan, Hugh
Hi Sudha:

What exactly do you mean by "the grid environment didn't load"? Was the qmaster 
server down, restarted, patched, rebooted, etc.? Can you run 'q' commands 
(qhost, qconf, etc.) on the master itself?

-Hugh
P.S. SGE 6.1 (N1) is really old!

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of sudha.penme...@wipro.com
Sent: Friday, May 08, 2015 7:08 AM
To: re...@staff.uni-marburg.de
Cc: users@gridengine.org
Subject: Re: [gridengine users] unable to contact qmaster using port 536 on host

The installed version is N1GE 6.1

All the machines have same version

Regards,
Sudha

-Original Message-
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Friday, May 08, 2015 4:34 PM
To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
Cc: users@gridengine.org
Subject: Re: [gridengine users] unable to contact qmaster using port 536 on host

Was the installed SGE updated? Are all machines at the same version of SGE? - 
Reuti

On 08.05.2015 at 11:37, sudha.penme...@wipro.com wrote:

> Hi,
>
> We have an issue with our grid env today.
>
> The grid environment didn't load, and while running grid commands we got the
> below errors:
>
> error: unable to contact qmaster using port 536 on host (grid master server)
> error: can't unpack gdi request
> error: error unpacking gdi request: bad argument
>
> Can anyone please let me know what the issue could be.
>
> Regards,
> Sudha




Re: [gridengine users] limit the number of jobs users can put on queue (pending)

2014-11-12 Thread MacMullan, Hugh
Hi Robert:

The array's tasks are all submitted at the same time, and they also have the 
SGE_TASK_ID environment variable set, which is VERY useful! Give it a try:

echo 'HOSTNAME=`hostname`; echo this is task $SGE_TASK_ID on $HOSTNAME' | 
qsub -N arraytest -t 1-4 -j y

Use that SGE_TASK_ID to import options or data from a file or files, set a 
seed, etc.
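
The pattern looks like the sketch below; the parameter file and its contents are made up, and SGE_TASK_ID is set by hand here, whereas inside a '-t 1-3' array job grid engine sets it per task:

```shell
# Each task picks its own line from a shared parameter file.
printf 'alpha\nbeta\ngamma\n' > params.txt
SGE_TASK_ID=2                                  # set by SGE in a real task
param=$(sed -n "${SGE_TASK_ID}p" params.txt)
echo "task $SGE_TASK_ID uses parameter: $param"   # task 2 uses parameter: beta
rm -f params.txt
```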

Task array jobs rule! :)

-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Roberto Nunnari
Sent: Wednesday, November 12, 2014 8:56 AM
To: William Hay
Cc: users@gridengine.org
Subject: Re: [gridengine users] limit the number of jobs users can put on queue 
(pending)

On 12.11.2014 14:51, William Hay wrote:
> On Wed, 12 Nov 2014 13:42:32 +, Roberto Nunnari roberto.nunn...@supsi.ch wrote:
>
>> On 12.11.2014 14:33, William Hay wrote:
>>> On Wed, 12 Nov 2014 13:14:25 +, Roberto Nunnari roberto.nunn...@supsi.ch wrote:
>>>
>>>> humm.. answering myself..
>>>>
>>>> from the man pages it seems that
>>>>
>>>> maxujobs is for only running jobs per user
>>>>
>>>> while
>>>>
>>>> max_u_jobs is for both running and pending jobs per user
>>>>
>>>> is that correct?
>>>>
>>>> Thank you and best regards
>>>> Robi
>>>
>>> More or less.  There is one other slight difference when it comes to array
>>> jobs.
>>> Maxujobs counts each running task while max_u_jobs considers the whole
>>> array to be a single job.
>>
>> hehehe.. I still have to understand what an array job is.. and I don't
>> believe any of my users have ever used it. I'll try to find some doc
>> about it. :-)
>
> It's a way to submit a bunch of jobs that are identical from grid engine's
> POV as a single job.  This lightens the load on the scheduler
> and means qstat normally only reports a single queued job.  Probably what the
> user who caused your original issue should have submitted.

Nice! :-)

And do the tasks in an array job run in parallel, or do they always run 
serially?

Robi


Re: [gridengine users] Son of grid engine RPMs version 8.1.8 check for XML::Simple problem

2014-11-06 Thread MacMullan, Hugh
To close the loop for those who are curious why this DIDN'T happen, Nikolai's 
RHEL 6.5 box wasn't subscribed to the rhel-x86_64-server-optional-6 Repository. 
Once he subscribed, the install worked as expected.

Cheers,
-Hugh

From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Simon Andrews
Sent: Thursday, November 06, 2014 11:37 AM
To: users@gridengine.org
Subject: Re: [gridengine users] Son of grid engine RPMs version 8.1.8 check for 
XML::Simple problem

Since your XML::Simple is under /usr/local, it wasn't installed as part of an 
RPM, so that's why it's not being seen.

What I don't understand is why your install command didn't just install that as 
a dependency (the docs for yum install on CentOS6 say that it should).

You could try manually doing:

yum install perl-XML-Simple

and see what that does.
Simon.

From: Nikolai N Bezroukov nikolai.bezrou...@basf.com
Date: Wednesday, 5 November 2014 20:52
To: users@gridengine.org
Subject: [gridengine users] Son of grid engine RPMs version 8.1.8 check for 
XML::Simple problem

Hi All,

I am trying to install version 8.1.8 but ran into a problem with XML::Simple. 
This Perl module is installed on my system in 
/usr/local/share/perl5/XML/Simple.pm, but the rpm check can't find it and complains.

[0]root@sandbox: # yum install gridengine-8.1.8-1.el6.x86_64.rpm
Loaded plugins: product-id, refresh-packagekit, rhnplugin, security, 
subscription-manager
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Install Process
Examining gridengine-8.1.8-1.el6.x86_64.rpm: gridengine-8.1.8-1.el6.x86_64
Marking gridengine-8.1.8-1.el6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package gridengine.x86_64 0:8.1.8-1.el6 will be installed
--> Processing Dependency: perl(XML::Simple) for package: 
gridengine-8.1.8-1.el6.x86_64
--> Processing Dependency: libhwloc.so.5()(64bit) for package: 
gridengine-8.1.8-1.el6.x86_64
--> Processing Dependency: libjemalloc.so.1()(64bit) for package: 
gridengine-8.1.8-1.el6.x86_64
--> Running transaction check
---> Package gridengine.x86_64 0:8.1.8-1.el6 will be installed
--> Processing Dependency: perl(XML::Simple) for package: 
gridengine-8.1.8-1.el6.x86_64
---> Package hwloc.x86_64 0:1.5-3.el6_5 will be installed
---> Package jemalloc.x86_64 0:3.6.0-1.el6 will be installed
--> Finished Dependency Resolution
Error: Package: gridengine-8.1.8-1.el6.x86_64 (/gridengine-8.1.8-1.el6.x86_64)
   Requires: perl(XML::Simple)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

[1] root@sandbox: # perldoc -l XML::Simple
/usr/local/share/perl5/XML/Simple.pm

What is the best way to resolve this problem?

Regards,

kievite


Re: [gridengine users] GUI or web Interface?

2014-09-18 Thread MacMullan, Hugh
Hi Jose:

I look forward to hearing about any good open-source projects myself! I wrote 
some terrible PHP in the distant past that was too scary to ever release 
publicly (even to our users), but it's definitely doable.

I don't believe Univa has a web interface/portal.

The only commercial product I know of is EnginFrame, from Nice. Nice (hah) 
product, but it had too many bells-and-whistles for our use (and commensurate 
cost) at the time of our last review (a couple of years ago).

http://www.nice-software.com/products/enginframe

Cheers,
-Hugh


From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of José Román Bilbao
Sent: Thursday, September 18, 2014 9:21 AM
To: users@gridengine.org
Subject: [gridengine users] GUI or web Interface?

Dear community members,

As a prospective grid portal user I am comparing different alternatives. As I 
have known Grid Engine since its early days at Sun, I would like to give it a 
try. Nevertheless, one of my main concerns is that my users are not 
linux-skilled and would need something graphical. I know of qmon, but they are 
usually Windows users. Therefore, I would like to hear any suggestions on 
available Grid Engine web interfaces (if any), or even a totally different 
platform if you are familiar with one. Basically I would like my users to be 
able to submit/monitor/cancel their jobs as well as (if possible) upload 
input files when needed and download the resulting outputs. If no open source 
alternative exists, I would also like to hear about commercial alternatives. So 
far I have found Univa Grid Engine, but don't know where to find information 
about any existing web interface. I have written to them but have had no response yet.
Thanks in advance,

Jose


Re: [gridengine users] Migrating from sge 6.2u5 to 8.1.7

2014-09-18 Thread MacMullan, Hugh
Hi Norbert:

As Reuti says here: 
http://gridengine.org/pipermail/users/2012-October/004927.html

There are scripts to save and load the actual configuration to text files:

$ ls /usr/sge/util/upgrade_modules
inst_upgrade.sh  load_sge_config.sh  save_sge_config.sh

We've used these a few times over the years to good effect.

Good luck with the upgrade!

Cheers,
-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Norbert Crettol
Sent: Thursday, September 18, 2014 10:59 AM
To: users@gridengine.org
Subject: [gridengine users] Migrating from sge 6.2u5 to 8.1.7

Hello,

I've compiled and tested SGE 8.1.7. It looks like
it works well in our environment (Linux Ubuntu
workstations and virtual nodes).

Now, I'd like to upgrade our current 6.2u5 cluster
to this 8.1.7 version.

I've seen that there's an inst_sge -upd that should
do the job, but I cannot find documentation on how
to use it or from which version to which one it's
able to correctly upgrade. I'd like to keep the
current configuration (complexes, user groups,
host groups, policies, spool etc...).

Can someone point me to good documentation?

Best regards
Norbert


Re: [gridengine users] jobs are going to servers which I believe are not in the queue

2014-09-05 Thread MacMullan, Hugh
Hi Dan:

A1: I prefer 'qmod -d queuename@hostname' to disable a node or nodes entirely 
(for maintenance, or whatever). Wildcards work, too, so 'qmod -d *@*' will 
disable the cluster, while 'qmod -d *@hostname' will disable a host, and 'qmod 
-d all.q@*' will disable the all.q queue. To permanently remove an exec host 
from the cluster, do 'qconf -de hostname_list', but I don't think that's what 
you're trying to accomplish.

A2: I would think so. Is the server actually 'down'? Powered off? sge_execd not 
running? I don't have much for you here, I'm sure someone else will. ;)

Good luck! Let us know what you discover.

-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Dan Hyatt
Sent: Friday, September 05, 2014 12:14 PM
To: grid engine users list
Subject: [gridengine users] jobs are going to servers which I believe are not 
in the queue

Hello,
Question 1: I do not seem to be removing servers from the queue list 
correctly. What is the best way to do it?
Question 2: Shouldn't grid engine stop sending jobs to a server if it 
cannot talk to it, such as when the server is down?

I have 3 blades which should not be accepting jobs (I am tracking this 
using qmon)... I know, go command line like I do for everything else. 
Why is the queue still sending jobs to the blades which are down?

Under the cluster queue control HOSTS tab, 
loadAvg/CPU/MemUsed/Swap used show dashes, which I expect because 
the blades are not online.
Queue instances show "AU" under states, which I thought indicated not 
accepting jobs.

One of the blades was actually removed from the all.q, which is used by 
normal queues to schedule jobs.




Re: [gridengine users] Resolvable host problem when install GE on Redhat

2014-07-14 Thread MacMullan, Hugh
Hi:

Use the real name with the 192 address only:

127.0.0.1 localhost.localdomain localhost
192.x.x.x icserver02.icdrec.com.vn icserver02
::1 localhost6.localdomain6 localhost6

That should fix you up.

On Jul 14, 2014, at 6:22, Man Nguyen The man.nguyen...@icdrec.edu.vn wrote:

Hi every body,

I tried to install GE2011.11 on a Redhat host with the GUI installer. But the 
installer reported a "Resolvable host" problem when I tried to add my local hostname.

My /etc/hosts file content:

--
127.0.0.1 icserver02.icdrec.com.vn icserver02 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
192.168.x.x icserver02
--

192.168.x.x is IP address of host icserver02.

Please help me!

Thanks so much!

--
Nguyen The Man
Verification Team
Phone Number: 0935.678.703
Email: man.nguyen...@icdrec.edu.vn


Re: [gridengine users] pbsdsh

2014-07-02 Thread MacMullan, Hugh
Take a look at the '-t' option to qsub; I believe that is the closest 
equivalent to pbsdsh in OGS/SGE.

echo hostname | qsub -t 1-8:1

See the qsub man page for further details and options!

Cheers,
-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of HUMMEL Michel
Sent: Wednesday, July 02, 2014 10:18 AM
To: users@gridengine.org
Subject: [gridengine users] pbsdsh

I wonder if there is, in OGS, an equivalent of the pbsdsh command from torque.
This command spawns a program on all nodes allocated to the PBS job. The spawns 
take place concurrently - all execute at (about) the same time.

Regards,
 




Re: [gridengine users] Core Binding and Node Load

2014-04-29 Thread MacMullan, Hugh
This is a hassle for us too.

In general, what we do is:

1. Set binding by default in launcher scripts to '-binding linear:1', to force 
users to use single threads
2. allow them to override by unaliasing qsub, qrsh, and setting manually to use 
openmp pe
3. for MATLAB this doesn't work, because MATLAB doesn't honor any env vars; it 
just greedily looks at the number of threads available and launches that many 
processes. HOWEVER, you can force it to use only one computational thread 
(even though it launches many!) with '-singleCompThread' in 
$MATLABROOT/bin/worker and $MATLABROOT/bin/matlab:

# diff worker.dist worker
20c20
<  exec ${bindir}/matlab -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
---
>  exec ${bindir}/matlab -dmlworker -logfile /dev/null -singleCompThread 
>  -nodisplay -r distcomp_evaluate_filetask $*

# diff matlab.dist matlab
164c164
<  arglist=
---
>  arglist=-singleCompThread
490c490
<  arglist=
---
>  arglist=-singleCompThread

Then users who want more than one thread in MATLAB MUST use a parallel MPI 
environment with matlabpool, which requires further OGS/SGE/SoGE integration 
and licensing, as described in toolbox/distcomp/examples/integration/sge. I 
can send you our setup if you're interested and have the Dist_Comp_Engine 
toolbox available (you don't need to install the engine, just have the license).

Make sense? Yukk!

For other software, you need to try to find equivalent ways to set them to use 
only single threads, and then parallelize with MPI, OR respect an environment 
variable and use the openmp way with the '-binding XXX:X' set correctly.

For CPLEX, set single thread like so:
Envar across cluster: 
ILOG_CPLEX_PARAMETER_FILE=/usr/local/cplex/CPLEX_Studio/cplex.prm
And in that file:
CPX_PARAM_THREADS 1

Bleh! And that's not (or wasn't six months ago) honored by Rcplex, but Hector 
was working on it I think.

I hope some of that is useful. It's been the way that works with the least 
number of questions from users. It only works for us because we have a site 
license for Dist Comp Engine, so can have a license server on each host to 
serve out the threads needed there. Bleh.

If others have novel ways to approach this problem, PLEASE let us all know. 
It's certainly one of the more difficult aspects of user education and cluster 
use for us.

Cheers,
-Hugh

From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf of 
Joseph Farran [jfar...@uci.edu]
Sent: Tuesday, April 29, 2014 5:31 PM
To: users@gridengine.org
Subject: [gridengine users] Core Binding and Node Load

Howdy.

We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on
for our 64-core nodes.

$ qconf -sconf | grep BINDING
  ENABLE_BINDING=TRUE


When I submit an OpenMP job with:

#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:

The job stays locked to 16 cores out of 64-cores which is great and what
is expected.

Many of our jobs, like MATLAB, try to use as many cores as are available on a 
node, and we cannot control MATLAB's core usage. So binding is great when we 
need to allow only, say, 16 cores per job.

The issue is that MATLAB has 64 threads locked to 16-cores and thus when
you have 4 of these MATLAB jobs running on a 64-core node, the load on
the node is through the roof because there are more workers than cores.

We have Threshold setup on all of our queues to 110%:

$ qconf -sq q64 | grep np
suspend_thresholdsnp_load_avg=1.1

So jobs begin to suspend because the load is over 70 on a node as expected.

My question is, does it make sense to turn OFF np_load_avg
cluster-wide and turn ON core-binding cluster wide?

What we want to achieve is that jobs only use as many cores as are requested 
on a node. With the above scenario we will see nodes with a HUGE load (past 
64), but each job will only be using its assigned cores.
Thank you,
Joseph



Re: [gridengine users] The state of a queue

2014-04-17 Thread MacMullan, Hugh
Hi Joseph:

# qstat -f
queuename  qtype resv/used/tot. load_avg arch  
states
-
all.q@node001  BP0/229/256  230  linux-x64 d
...

The 'states' are there in right-most column. And you can query the same way you 
can enable/disable in qmod with wildcards: '-q queue@node', or '-q *@node' or 
'-q queue@*'.

-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Joseph Farran
Sent: Thursday, April 17, 2014 12:56 PM
To: users@gridengine.org
Subject: [gridengine users] The state of a queue

Howdy.

I am able to disable & enable a queue @ a compute node with:

$ qmod -d bio@compute-1-1
me@sys  changed state of bio@compute-1-1.local (disabled)

$ qmod -e bio@compute-1-1
me@sys changed state of bio@compute-1-1.local (enabled)


But how can I query the state of a queue @ a node? In other words, how can 
I find the state of a queue @ a node without modifying it? I'd like to know 
whether it's disabled or enabled.

Thanks,
Joseph


Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread MacMullan, Hugh
As an alternative, you could create a simple queue (onejobpernode) with 'slots 
1'.

-Hugh

-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Skylar Thompson
Sent: Wednesday, April 02, 2014 11:04 AM
To: Tina Friedrich
Cc: users@gridengine.org
Subject: Re: [gridengine users] array job / node allocation / 'spread' question

An exclusive host consumable is the right way to approach the problem. If
the task elements might be part of a parallel environment, then you'll want
to set the scaling to JOB as well.
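
Concretely, the exclusive-consumable approach might look like the sketch below; the complex name 'exclusive' and the values shown are assumptions, so check complex(5) for the exact attribute format:

```shell
# Add an EXCL-relop boolean to the complex list (qconf -mc), e.g. a line:
#   exclusive   excl   BOOL   EXCL   YES   YES   0   1000
# Attach it to each execution host (qconf -me <host>):
#   complex_values   exclusive=true
# Then each array task that requests it gets a host to itself:
qsub -t 1-10 -l exclusive=true job.sh
```
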

On Wed, Apr 02, 2014 at 03:39:03PM +0100, Tina Friedrich wrote:
 Hello,
 
 I'm sure this has been asked time and time before, only I can't find
 it (search foo failling, somehow).
 
 What's the best way to run an array job so that each task ends up on
 a different node (but they run concurrently)? I don't mind other
 jobs running on the nodes at the time, but only want one of mine
 (network IO intensive tasks, best use of file system would be lots
 of them but spread as far and wide as the can).
 
 I've thought about introducing a consumable - apart from there's no
 node-level consumables at the moment - but am unsure whether that's
 the best way to handle this?

-- 
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine


Re: [gridengine users] What signal does qdel send to the job?

2014-01-23 Thread MacMullan, Hugh
Hi Joe:

Per your request I confirm (with help from the e-mail that is sent when 
specifying ‘-m e’) that a regular ‘qdel job-ID’ uses KILL (9):


Job 5493527 (sleeper) Aborted

 Exit Status  = 137
 Signal       = KILL
 User         = hughmac
 Queue        = all.q@mysgehost1
 Host         = mysgehost1
 Start Time   = 01/23/2014 12:15:37
 End Time     = 01/23/2014 12:15:45
 CPU          = 00:00:00
 Max vmem     = 204.609M

failed assumedly after job because:
job 5493527.1 died through signal KILL (9)

From the ‘qdel’ man page:

   -f   Force deletion of job(s). The job(s) are deleted from the list of
        jobs registered at sge_qmaster(8) even if the sge_execd(8) controlling
        the job(s) does not respond to the delete request sent by
        sge_qmaster(8).

So without the ‘-f’, a request for a KILL is sent from the master to the execd, 
and the job isn’t removed from the master DB if that request fails (though 
perhaps it is once the execd comes back up on the node?). With the ‘-f’, the 
job IS removed from the master DB even if the kill request sent to the execd by 
the master fails. I find ‘-f’ useful when an exec node is down for a long 
period of time and I want to clean up the output of qstat (after notifying 
users of the demise of their jobs).
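The practical consequence is easy to see outside of SGE too: a TERM handler gets to run, while KILL gives the process no chance to react, and the 137 is 128+9, matching the ‘Exit Status = 137’ above (a sketch, using plain bash):

```shell
# SIGTERM (15) can be trapped, so a cleanup handler gets to run:
bash -c 'trap "echo caught TERM; exit 0" TERM; kill -s TERM $$'
# SIGKILL (9) cannot be trapped; the shell dies with status 128+9 = 137:
bash -c 'kill -s KILL $$' || echo "exit status: $?"
```

Run together, this prints ‘caught TERM’ followed by ‘exit status: 137’.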

Cheers, Hugh

From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Joe Borg
Sent: Thursday, January 23, 2014 12:10 PM
To: users@gridengine.org
Subject: [gridengine users] What signal does qdel send to the job?

I thought it would be 15 for qdel and 9 for qdel -f.
But, it seems, that it's always 9, as I can't catch which ever signal is being 
sent.

Can someone confirm, please?


Regards,
Joseph David Borġ
http://josephb.org
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] sge_execd failing to pass complete path for stderr and stdout

2014-01-18 Thread MacMullan, Hugh
The script works fine as is (only changing the 'user1' part) in OGS 2011.11p1 
for me. Hmm.

-Hugh

From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf of 
Reuti [re...@staff.uni-marburg.de]
Sent: Saturday, January 18, 2014 10:46 AM
To: Srirangam Addepalli
Cc: users@gridengine.org
Subject: Re: [gridengine users] sge_execd failing to pass complete path for 
stderr and stdout

Hi,

Am 18.01.2014 um 06:14 schrieb Srirangam Addepalli:

 I see a strange problem with using  OGS/GE 2011.11p1.  It appears that 
 sge_execd is unable to get TASK_ID

 Following job submission script that is being used.

 #!/bin/bash
 #$ -N 
 #$ -o /home/user1/work/output/$TASK_ID.out
 #$ -e /home/user1/work/output/$TASK_ID.err
 #$ -t 1:1
 date
 hostname
 sleep 100

 However, the output is generated in /home/user1/work/. Looking at the log 
 files in active_jobs, the config file looks like this:

 config:stdout_path=/home/user1/work//1.o
 config:stderr_path=/home/user1/work//1.err

 It appears that the $ somehow is replacing the last subdirectory because of 
 some substitution or some sort of aliases.

It's not replaced, but removed - or? I mean the output is missing here.


 I do not have any sge_aliases other than the default ones. Any suggestions.

They are only honored in case you use the option -cwd, and they target the 
start directory for the job, not the paths given to the -o/-e options.

I never used OGS and I can't reproduce it with other variants. I assume the 
output of `qstat -j job_id` shows the same false entry?

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Count slots for a specific User

2013-06-27 Thread MacMullan, Hugh
Hi Rémy:

Hmm, not sure why that wouldn’t work. What do you mean by ‘many threads’? It 
works for me with a mix of single and a few mpi jobs. Only suggestion I would 
make would be to do the column sum in awk:

qstat -s r | grep -w $USER | awk '{ttl+=$9} END {print ttl}'

There’s probably a better way though I bet. ☺
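To illustrate on fabricated input (the two qstat lines below are hypothetical; in real `qstat -s r` output the slot count is the ninth column):

```shell
# Two hypothetical running jobs for user alice, using 4 and 1 slots:
printf '%s\n' \
  '123 0.5 job1 alice r 06/27/2013 08:00:00 all.q@n1 4' \
  '124 0.5 job2 alice r 06/27/2013 08:01:00 all.q@n2 1' |
  awk '{ttl+=$9} END {print ttl}'   # prints 5
```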

-Hugh


From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Rémy Dernat
Sent: Thursday, June 27, 2013 8:22 AM
To: users@gridengine.org
Subject: [gridengine users] Count slots for a specific User

Hi,

I would like to count the number of slots used for a specific user.

I am doing something like this:

nb_slots=`qstat -s r|grep -w $USER |awk 'BEGIN { ORS="+" } {print $9}'`;echo 
$nb_slots 0|bc

However, I don't think it is working for jobs with many threads.

Is there any possibility to check that in an easy way ?

Thanks,

Regards,
Remy
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Blocking all the slots on the node for one job

2013-03-27 Thread MacMullan, Hugh
Hi Bhavishya:

One way is to have a resource quota set that limits the number of jobs on any 
host to the number of processors on that host ($num_proc):

# qconf -srqs host_slot_limits
{
   name host_slot_limits
   description  Limit Slots for Hosts
   enabled  TRUE
   limithosts {*} to slots=$num_proc
}

Cheers, Hugh


From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Bhavishya Goel
Sent: Wednesday, March 27, 2013 8:09 AM
To: users@gridengine.org
Subject: Re: [gridengine users] Blocking all the slots on the node for one job

So the option of using openmp parallel environment doesn't really work. So if I 
have two queues, q1 and q2 for two set of users, and if q1 has occupied all the 
slots on one node, the grid engine still schedules the jobs from q2 on the same 
node!  So my qstat output may look like this:

queuename                      qtype resv/used/tot. load_avg arch       states
------------------------------------------------------------------------------
q1@node-1                      BIP   0/8/8          0.01     linux-x64
------------------------------------------------------------------------------
q2@node-1                      BIP   0/4/8          0.01     linux-x64

Is there a way to fix this?

On Tue, Mar 26, 2013 at 5:02 PM, Reuti re...@staff.uni-marburg.de wrote:
Hi,

Am 26.03.2013 um 16:39 schrieb Bhavishya Goel:

 I want to schedule a single job on the cluster using grid engine, but while 
 running that job, I want to block all the slots on that particular node 
 during the execution of that job. I need to do this for benchmarking reasons. 
 The easiest way that I can think of is to use the openmp parallel environment 
 with the number of required slots equal to the number of slots on one node. 
 That way (as per my understanding) grid engine won't schedule my job unless 
 there are 8 slots available on a single node and won't allocate the slots on 
 that node to other jobs while job is executing. Is my understanding correct? 
 Is there a better/easier way of doing this?
Correct, this is one way of doing it: requesting a PE with allocation_rule 
$pe_slots and requesting all slots on a machine which still allows you to 
execute a serial job only.

Another way could be to define a complex as BOOL EXCL (`man complex`), attach 
it at least to this particular exechost and request it at the job submission. 
This will ensure that you get this exechost *) for this job alone too.

-- Reuti

*) The complex can also be attached to a queue to get a queue instance on your 
own, but this won't help here in case you have more than one queue per 
exechost. Then attaching it to the exechost is the only working way.


 --
 ಠ_ಠ
 ___
 users mailing list
 users@gridengine.org
 https://gridengine.org/mailman/listinfo/users



--
ಠ_ಠ
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Fwd: Unable to ssh into node

2013-02-14 Thread MacMullan, Hugh
Hi Joseph:

This looks like you are hitting a 'normal' sshd connection limit specified in 
sshd_config (MaxSessions), or if you're using xinetd a 'per_source' that's too 
low (in /etc/xinetd.d/sshd). I'd guess the former, although we do the latter on 
our head node to try to keep our researchers from opening unlimited numbers of 
sessions on hosts here and yon. :)

Have you looked at your connection log (/var/log/secure or messages)? The 
default MaxSessions is 10 (when MaxSessions is commented out in sshd_config), 
at least in RedHat 6 that we are currently using.
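A quick way to check is to grep the server config (a sketch over a hypothetical excerpt; on a live host, point grep at /etc/ssh/sshd_config, or use `sshd -T` as root to dump the effective values):

```shell
# Hypothetical sshd_config excerpt; a commented-out MaxSessions means the
# compiled-in default (10) is in effect:
conf='#MaxSessions 10
MaxStartups 10'
printf '%s\n' "$conf" | grep -i maxsessions   # prints: #MaxSessions 10
```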

Good luck and let us know what you discover!

-Hugh


-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Joseph Farran
Sent: Thursday, February 14, 2013 7:04 PM
To: users@gridengine.org
Subject: Re: [gridengine users] Fwd: Unable to ssh into node

Using ssh -vvv when the node refuses a connection from the user gives the clue 
of it being no-more-sessi...@openssh.com

debug1: Requesting no-more-sessi...@openssh.com
debug1: Entering interactive session.
debug3: Wrote 192 bytes for a total of 2581
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
   #0 client-session (t3 r-1 i0/0 o0/0 fd 4/5 cfd -1)
debug3: channel 0: close_fds r 4 w 5 e 6 c -1
Connection to compute-1-1 closed by remote host.


It sure looks to me like Grid Engine is modifying this ssh feature to not allow 
any more new sessions from the user when the node is overloaded.

Joseph
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] How possible to use matlab in mutithreading mode on SGE?

2012-09-12 Thread MacMullan, Hugh
Hi Semi:

The real question is: how to NOT use multithreading, eh? :)

Seriously though, if anyone has figured out a solid way to integrate Matlab 
multithreading, I'd love to hear about it as well! It seems to automatically 
assume it has access to all cores on a box, so the only integration paths that 
I can see would involve reserving whole nodes.

For us, if users want to use multiple cores they need to use matlabpool. We 
'force' singlethreading for Matlab and all workers by modifying the matlab and 
worker scripts in $MATLAB_ROOT/bin (we're running 2012a but should work for 
2011b and probably others) as follows:

# diff matlab.dist matlab
483c483
< arglist=
---
> arglist=-singleCompThread

# diff worker.dist worker
20c20
< exec ${bindir}/matlab -dmlworker -nodisplay -r distcomp_evaluate_filetask $*
---
> exec ${bindir}/matlab -singleCompThread -dmlworker -nodisplay -r distcomp_evaluate_filetask $*

Obviously we also would prefer to be able to integrate the multithreading 
correctly with SGE. It would make the SGE Experience nicer for our users ... 
code that works on their quad-core desktop would then just work with more 
cores on our bigger SGE nodes, and startup times would be faster as well.

Cheers, Hugh


-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of Semi
Sent: Wednesday, September 12, 2012 7:49 AM
To: users@gridengine.org
Subject: [gridengine users] How possible to use matlab in mutithreading mode on 
SGE?


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] DRMAA timeout with gridMathematica

2012-07-07 Thread MacMullan, Hugh
Hi! Thanks for the reply.

So: should I be passing this along to Wolfram in hopes that they do something 
on the Mathematica side? It would be nice if I could pass a timeout as an 
option to the SGE function, but I'm guessing that's no simple task (a change in 
Mathematica).

Can I be doing something on the SGE side? I'd love it to just ALWAYS do 
DRMAA_TIMEOUT_WAIT_FOREVER. :)

Any further clarification would be much appreciated.

Cheers, Hugh

-Original Message-
From: Hung-Sheng Tsao Ph. D. [mailto:laot...@gmail.com] 
Sent: Saturday, July 07, 2012 8:09 AM
To: MacMullan, Hugh
Cc: users@gridengine.org
Subject: Re: [gridengine users] DRMAA timeout with gridMathematica

hi
please see
please see
http://gridscheduler.sourceforge.net/htmlman/htmlman3/drmaa_wait.html

regards

On 7/6/2012 9:46 PM, MacMullan, Hugh wrote:
 Folks:

 We've been using parallel Mathematica for a while now to great success. SGE 
 6.2_U5, gridMathematica 8.0.4. It 'just worked'. No fiddling (well, had to 
 specify a specific NIC in Mathematica code, more below).

 Suddenly I'm getting this:
 ---
 qlogin
 math
 Mathematica 8.0 for Linux x86 (64-bit) Copyright 1988-2011 Wolfram 
 Research, Inc.

 In[1]:= Needs["ClusterIntegration`"]

 In[2]:= LaunchKernels[SGE["localhost"], 2]

 Java::excptn: A Java exception occurred:
  org.ggf.drmaa.InternalException: cl_com_setup_commlib failed: timeout.
 while waiting for thread start
  at com.sun.grid.drmaa.SessionImpl.nativeInit(Native Method)
  at com.sun.grid.drmaa.SessionImpl.init(SessionImpl.java:291)

 SGE::load: Cannot find components required for SGE.

 Out[2]= $Failed
 ---

 I've got a good CLASSPATH and LD_LIBRARY_PATH set.

 It was all working, and was working intermittently this past week ... but now 
 it seems permanently wedged.

 Any thoughts? Any way to increase that cl_com_setup_commlib timeout?

 Thanks for any inspiration, we're feeling pretty stuck here. Haven't 
 heard back from Wolfram after their suggestion of forcing the 
 EnginePath, but that wasn't it. :(

 Cheers, Hugh

 ___
 users mailing list
 users@gridengine.org
 https://gridengine.org/mailman/listinfo/users



___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] DRMAA timeout with gridMathematica

2012-07-06 Thread MacMullan, Hugh
Folks:

We've been using parallel Mathematica for a while now to great success. SGE 
6.2_U5, gridMathematica 8.0.4. It 'just worked'. No fiddling (well, had to 
specify a specific NIC in Mathematica code, more below).

Suddenly I'm getting this:
---
qlogin
math
Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= Needs["ClusterIntegration`"]

In[2]:= LaunchKernels[SGE["localhost"], 2]

Java::excptn: A Java exception occurred: 
org.ggf.drmaa.InternalException: cl_com_setup_commlib failed: timeout.
   while waiting for thread start
at com.sun.grid.drmaa.SessionImpl.nativeInit(Native Method)
at com.sun.grid.drmaa.SessionImpl.init(SessionImpl.java:291)

SGE::load: Cannot find components required for SGE.

Out[2]= $Failed
---

I've got a good CLASSPATH and LD_LIBRARY_PATH set.

It was all working, and was working intermittently this past week ... but now 
it seems permanently wedged.

Any thoughts? Any way to increase that cl_com_setup_commlib timeout?

Thanks for any inspiration, we're feeling pretty stuck here. Haven't heard back 
from Wolfram after their suggestion of forcing the EnginePath, but that wasn't 
it. :(

Cheers, Hugh

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users