Re: [gridengine users] Functional shares autonomously reset to 0 for recently added user

2020-08-25 Thread William Hay
On Mon, Aug 24, 2020 at 08:50:51PM +, Mun Johl wrote:
>Hi all,
> 
> 
> 
>We are running SGE v8.1.9 on systems running Red Hat Enterprise Linux v6.8
>.
> 
> 
> 
>This anomaly isn’t a showstopper by any means, but it has happened enough
>that I decided to reach out and ask if anyone else has experienced this
>phenomenon and if a fix/workaround is available.
> 
> 
> 
>Here’s the anomaly I’ve experienced many times:
> 
> 
> 
>We deploy the Functional Policy, and often times when a new user has been
>added and I have updated the user’s Functional Shares via QMON, that
>user’s Functional Shares will reset to 0 after a while.  I typically don’t
>notice until the user reports that his/her jobs are queued longer than
>others, at which time I will update the user’s functional shares again. 
>After one or two times through this loop the value sticks and I don’t have
>any further problems … until I have to add another new user.
> 
> 
> 
>Anyone else experience this phenomenon?
Never seen this, and we use a largely functional policy on our clusters.  Could
the user be reaching delete_time and then being recreated with the functional
shares from sge_conf's auto_user_fshare?


William





Re: [gridengine users] grid engine check of pending jobs before resuming

2020-08-14 Thread William Hay
On Thu, Aug 13, 2020 at 07:29:32PM +, Derek Stephenson wrote:
>HI,
> 
> 
> 
>We’re running SoGE 8.1.9 and we’re running into an issue with preemptive
>queueing that I’m curious if others have ever had to address. We have a
>regression queue that is pre-empted by a daily use queue, the intention
>being that so long as the daily queue isn’t active those slots and licenses can
>be used for regression.
> 
> 
> 
>Now the issue we are observing is that if there are a high number of daily
>use jobs, regression jobs will get continually suspended and resumed to
>the point that it corrupts the regression job and ultimately fails.
> 
> 
> 
>One thought that has been discussed is whether we can have jobs coming out
>of suspension first detect if there is a job in the queue that would end up
>causing the job to suspend again immediately and, if so, stay in
>suspension.  We’re quite novice with SoGE, so no one is particularly
>knowledgeable about whether such a mechanism can work.

Rather than have the jobs come out of suspension and try to detect if
there is anything queued for the "daily use" queue [1] it might be
better to add a "suspend threshold" to the regression queue that keeps
the queue suspended if there is anything waiting for "daily use".
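A rough sketch of that approach (not a drop-in config): "daily_pending" is a
hypothetical load value that a load sensor would have to report as the number
of jobs waiting for the daily use queue; the queue attributes themselves are
standard, and the queue name is illustrative.

qconf -mattr queue suspend_thresholds daily_pending=0.5 regression.q
qconf -mattr queue nsuspend 999 regression.q          # suspend many jobs per interval
qconf -mattr queue suspend_interval 00:01:00 regression.q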

William

[1] All batch schedulers have something called a queue that isn't.
Gridengine queues are particularly unqueue-like.


Re: [gridengine users] How to export an X11 back to the client?

2020-05-12 Thread William Hay
On Mon, May 11, 2020 at 09:39:26PM +, Mun Johl wrote:
> Hi William,
> 
> Thank you for your reply.
> See my comments below.
> 
> > -Original Message-
> > On Thu, May 07, 2020 at 06:29:05PM +, Mun Johl wrote:
> > > Hi William, et al.,
> > > I am not explicitly setting the DISPLAY--as that is how I normally use 
> > > 'ssh -X'.  Nor have I done anything to open any additional ports.
> > Again, since 'ssh -X' is working for us.  As a reminder, there is no way 
> > for me to know what to set DISPLAY to even if I wanted to set it.
> > >
> > Do you invoke qrsh with -V by any chance?  I think that might cause the
> > DISPLAY from the login node to override the one set by ssh -X.  If you
> > do could you switch to using -v to transfer individual environment
> > variables instead?
> 
> [Mun] Yes, I do use -V normally.  When I once again get to a point where qrsh 
> is able to launch I will certain try your suggestion.  But I may tweak it by 
> simply "unset'ing" DISPLAY from the wrapper script rather than using -v 
> because we have many env vars that are required in order to correctly run a 
> job.
The problem with unsetting DISPLAY is that if you do it then ssh won't
be able to forward it.  

Possibly something like:

env -u DISPLAY XDISPLAY="${DISPLAY}" qrsh -now n -V

and then a wrapper around ssh as your rsh_command (which qrsh uses):

#!/bin/sh
# restore the submit-side DISPLAY (stashed in XDISPLAY) so that the ssh -X
# run by qrsh can set up X11 forwarding back to the user's session
env DISPLAY="${XDISPLAY}" /usr/bin/ssh -X "$@"





Re: [gridengine users] How to export an X11 back to the client?

2020-05-12 Thread William Hay
On Mon, May 11, 2020 at 09:30:14PM +, Mun Johl wrote:
> Hi William, et al.,
> [Mun] Thanks for the tip; I'm still trying to get back to where I can launch 
> qsrh again.  Even after I put the requisite /etc/pam.d/sshd line at the head 
> of the file I'm still getting the "Your "qrsh" request could not be 
> scheduled, try again later." message for some reason.  But I will continue to 
> debug that issue.

The pam_sge-qrsh-setup.so shouldn't have anything to do with this, since
the message occurs before any attempt to launch the job.  You could try
running qrsh -w p and/or qrsh -w v to get a report on why the qrsh
isn't being scheduled.  They aren't always easy to read, and -w v doesn't
reliably ignore exclusive resources already in use, but they can nevertheless
be helpful.
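For example, reusing the command from earlier in the thread:

qrsh -w p -now no tclsh wrapper.tcl    # verify against the cluster as it is now
qrsh -w v -now no tclsh wrapper.tcl    # verify as if the cluster were empty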


William




Re: [gridengine users] How to export an X11 back to the client?

2020-05-11 Thread William Hay
On Thu, May 07, 2020 at 06:29:05PM +, Mun Johl wrote:
> Hi William, et al.,
> I am not explicitly setting the DISPLAY--as that is how I normally use 'ssh 
> -X'.  Nor have I done anything to open any additional ports.  Again, since 
> 'ssh -X' is working for us.  As a reminder, there is no way for me to know 
> what to set DISPLAY to even if I wanted to set it.
> 
Do you invoke qrsh with -V by any chance?  I think that might cause the
DISPLAY from the login node to override the one set by ssh -X.  If you
do could you switch to using -v to transfer individual environment
variables instead?


William




Re: [gridengine users] How to export an X11 back to the client?

2020-05-11 Thread William Hay
On Thu, May 07, 2020 at 06:29:05PM +, Mun Johl wrote:
> Hi William, et al.,
> 
> Thank you kindly for your response and insight.
> Please see my comments below.
> 
> > On Wed, May 06, 2020 at 11:10:40PM +, Mun Johl wrote:
> > > [Mun] In order to use ssh -X for our jobs that require an X11 window to 
> > > be pushed to a user's VNC display, I am planning on the
> > following changes.  But please let me know if I have missed something (or 
> > everything).
> > >
> > > 1. Update the global configuration with the following parameters:
> > >
> > >  rsh_command /usr/bin/ssh -X
> > >  rsh_daemon  /usr/sbin/sshd -i
> > 
> > As you are using the pam_sge-qrsh-setup.so you will need to set
> > rsh_daemon to point to a rshd-wrapper which you should find in
> > $SGE_ROOT/util/resources/wrappers eg if $SGE_ROOT is /opt/sge
> > 
> > rsh_command /usr/bin/ssh -X
> > rsh_daemon /opt/sge/util/resources/wrappers/rshd-wrapper
> 
> [Mun] Thanks for pointing out my mistake!
> 
> > > 2. Use a PAM module to attach an additional group ID to sshd.  The 
> > > following line will be added to /etc/pam.d/sshd on all SGE
> > hosts:
> > >
> > >   auth required /opt/sge/lib/lx-amd64/pam_sge-qrsh-setup.so
> > >
> > > 3. Do I need to restart all of the SGE daemons at this point?
> > 
> > No it should be fine without a restart
> > 
> > >
> > > 4. In order to run our GUI app, launch it thusly:
> > >
> > >   $ qrsh -now no wrapper.tcl
> > 
> > That looks fine, assuming sensible default resource requests, although
> > obviously I don't know the details of the wrapper or application.
> 
> [Mun] After making the above changes, I'm still experiencing problems.  
> First, let me point out that I should have more accurately represented how 
> qrsh will be used:
> 
> $ qrsh -now no tclsh wrapper.tcl <arguments to the script>
> 
> Now for the issues:
> 
> I first added the pam_sge-qrsh-setup.so at the top of the /etc/pam.d/sshd 
> file.  When I did that the qrsh job was launched but quickly terminated with 
> the following error from the tool I was attempting to launch:
> 
> ncsim/STRPIN =
>   The connection to SimVision could not be established due to an error
>   in SimVision. Check your DISPLAY environment variable,
>   which may be one of the reasons for this error.
> 
> I am not explicitly setting the DISPLAY--as that is how I normally use 'ssh 
> -X'.  Nor have I done anything to open any additional ports.  Again, since 
> 'ssh -X' is working for us.  As a reminder, there is no way for me to know 
> what to set DISPLAY to even if I wanted to set it.

If you can get it back to the point where the job actually launches, then
running qrsh -now n /bin/env to list out the environment you are getting might
help debug.

> 
> Now, the /etc/pam.d/sshd update caused an ssh issue: Users could no longer 
> ssh into our servers :(  I didn't realize the order of the lines in the sshd 
> is significant.
> 
> Therefore, I moved the pam_sge-qrsh-setup.so entry below the other "auth" 
> lines.  Although, that resulted in the following error when I tried the qrsh 
> command again:
> 
>  Your "qrsh" request could not be scheduled, try again later.
Did you remember the -now no option?  That looks like the sort of
message one might get if you forgot it.

> 
> One final note is that we have "selinux" enabled on our servers.  I don't 
> know if that makes any difference, but I thought I'd throw it out there.
Depends how it is configured I guess.  Which linux distro are you using?


William




Re: [gridengine users] How to export an X11 back to the client?

2020-05-07 Thread William Hay
On Wed, May 06, 2020 at 11:10:40PM +, Mun Johl wrote:
> [Mun] In order to use ssh -X for our jobs that require an X11 window to be 
> pushed to a user's VNC display, I am planning on the following changes.  But 
> please let me know if I have missed something (or everything).
> 
> 1. Update the global configuration with the following parameters:
> 
>  rsh_command /usr/bin/ssh -X
>  rsh_daemon  /usr/sbin/sshd -i

As you are using the pam_sge-qrsh-setup.so you will need to set
rsh_daemon to point to the rshd-wrapper, which you should find in
$SGE_ROOT/util/resources/wrappers, e.g. if $SGE_ROOT is /opt/sge:

rsh_command /usr/bin/ssh -X
rsh_daemon /opt/sge/util/resources/wrappers/rshd-wrapper

> 
> 2. Use a PAM module to attach an additional group ID to sshd.  The following 
> line will be added to /etc/pam.d/sshd on all SGE hosts:
> 
>   auth required /opt/sge/lib/lx-amd64/pam_sge-qrsh-setup.so
> 
> 3. Do I need to restart all of the SGE daemons at this point?

No, it should be fine without a restart.

> 
> 4. In order to run our GUI app, launch it thusly:
> 
>   $ qrsh -now no wrapper.tcl

That looks fine, assuming sensible default resource requests, although
obviously I don't know the details of the wrapper or application.


William




Re: [gridengine users] users Digest, Vol 113, Issue 2

2020-05-04 Thread William Hay
On Mon, May 04, 2020 at 09:06:46AM -0400, Korzennik, Sylvain wrote:
>We have no problem having jobs w/ X11 enabled, BUT users must use qlogin,
>not qsub or qrsh (the way we have configured it). 
>We have switched from SGE to UGE, but I'm sure the 'issue'  is the same,
>you need to have the "[qlogin|rlogin|rsh]_command" and "_daemon" set
>accordingly, and if you use ssh, you need to have X-tunnelling enabled
>(ssh -X) - it's not just simply a matter of setting up DISPLAY or 
>adjusting ~/.Xauthority. Here is our config:
The problem with using qlogin is that it doesn't provide an easy way to pass a
command to it as you always get an interactive shell.  The original
poster wanted to run a command.  You could I suppose feed the command you want
run into qlogin's stdin but that feels a little more fragile.  

Having the rsh_command and rsh_daemon set up using ssh -X/sshd -i (as in
ssh tight integration) lets
you pass commands in a simpler way.  
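For example, with ssh tight integration in place (the X application is just an
illustration), passing a command is simply:

qrsh -now no xclock
# versus the qlogin route, which would mean piping the command into an
# interactive shell, e.g. printf 'xclock\nexit\n' | qlogin   (the fragile
# stdin approach mentioned above)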

If you don't want to or can't fiddle with the various _command and _daemon
settings then your only real options are fiddling with DISPLAY and
~/.Xauthority






Re: [gridengine users] How to export an X11 back to the client?

2020-05-04 Thread William Hay
On Fri, May 01, 2020 at 06:44:08PM +, Mun Johl wrote:
>Hi,
> 
> 
> 
>I am using SGE on RHEL6.  I am trying to launch a qsub job (a TCL script)
>via grid that will result in a GUI application being opened on the
>caller’s display (which is a VNC session).
Using qsub for this makes it more difficult than it needs to be, since
qsub jobs run largely disconnected from the submit host.  I wouldn't
have thought you would want a delay with something interactive like
this.  As Reuti suggested you could set up ssh tight integration (with X
forwarding enabled in ssh) and then use qrsh -now n <command> to launch
your app.




>What I’m seeing is that if I set DISPLAY to the actual VNC display (e.g.
>host1:4) in the wrapper script that invokes qsub, the GUI application
Have you checked what host1 resolves to on the machine where you submit
the job and the machine where it runs?  If you are getting a failure to
connect it might be because you need to use the FQDN.

>complains that it cannot make a connection.  On a side note, I noticed
In general X11 servers don't allow random clients to talk to
the display.  The app may be (mis-)reporting a failure to authorise as a
failure to connect.  This has little to do with grid engine per se; it is just
a side effect of running the app on a different machine from the X server.
You may need to perform some manipulations with xauth to enable it.

Do the grid engine servers share home directories with the machine where
you are running qsub (eg via NFS or a cluster file system)?
Authorisation tokens are usually stored in/read from  ~/.Xauthority.
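If home directories are not shared, one way to copy the cookie across by hand
(the exec host name is illustrative):

# run on the submitting host, inside the VNC session
xauth extract - "$DISPLAY" | ssh exec-node01 xauth merge -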

William




Re: [gridengine users] qsub -V doesn't set $PATH

2020-04-03 Thread William Hay
On Fri, Apr 03, 2020 at 02:54:19AM +, Shiel, Adam wrote:
> I finally had a chance to experiment with this some.
> 
> I think one basic problem was that I had bash as a login shell. Removing bash 
> from the login shell and specifying "qsub -S /bin/bash " passed my local 
> PATH to the remote job.
> 
> But when I don't specify "-S /bin/bash" I get the csh login PATH settings. 
> That's our default shell for the queue I'm using. 
> 
> This happens even when csh isn't in the login shell list. I find that 
> unexpected.
> 
Shells read some initialisation files even when not invoked as a login
shell.  From the man page for csh (which is really tcsh) on one of our
clusters:

Non-login shells read only /etc/csh.cshrc and ~/.tcshrc or ~/.cshrc on startup.

So if the PATH is configured in one of those places then it will
override/modify whatever you pass in with -V or -v.
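A common workaround, sketched here for bash (the same guard can be written in
csh syntax for ~/.cshrc): skip the PATH munging inside batch jobs, which SGE
marks by setting JOB_ID.

# near the top of ~/.bashrc
if [ -n "$JOB_ID" ]; then
    return    # leave the environment passed in with -v/-V alone
fi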

William

> Adam
> 
> -Original Message-
> From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
> Behalf Of Hay, William
> Sent: Wednesday, January 22, 2020 9:55 AM
> To: Skylar Thompson 
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] qsub -V doesn't set $PATH
> 
> On Tue, Jan 21, 2020 at 03:51:01PM +, Skylar Thompson wrote:
> > -V strips out PATH and LD_LIBRARY_PATH for security reasons, since 
> > prolog
> 
> I don't think this is the case.  I've just experimented with one of our 8.1.9 
> clusters and I can set arbitrary PATHs, run qsub -V, and have the value I set 
> show up in the environment of the job.  More likely the job is being run with 
> a shell that is configured as a login shell and the init scripts for the 
> shell are stomping on the value of PATH.
> 
> > and epilog scripts run with the submission environment but possibly in 
> > the context of a different user (i.e. a user could point a 
> > root-running prolog script at compromised binaries or C library).
> 
> This is something slightly different. The prolog and epilog used to run with 
> the exact same environment as the job.  This opened up an attack vector, 
> especially if the prolog or epilog were run as a privileged user rather than 
> the job owner.  The environment in which the prolog and epilog are run is now 
> sanitised.
> 
> William
> 




Re: [gridengine users] Dave Love repository issue

2018-10-17 Thread William Hay
On Tue, Oct 16, 2018 at 06:53:11PM -0500, Jerome wrote:
> Dear William
> 
> I'm watching this trac system, and it seem's to be reserved for
> developper only.. That's seems that to report a bug, one need to follow
> some specifications, which i don't really know... WHere can i read about
> this?

Even without a login you can submit bugs via the sge-bugs mailing list and they 
will show up
in trac.  You should be able to create a login in order to use the web interface
by clicking on the Register link and entering your details.

It is a bug tracker not a general user support system though.  If you are 
having issues
with SoGE it is usually best to ask here to eliminate causes other than bugs 
before 
using the bug tracker.

If you use the web interface it is probably safe to leave most of the settings 
at their default
values unless they are obviously wrong (eg defect vs enhancement).

You can DuckDuckGo guides on how to write good bug reports that should be fine 
for the 
Description field/text of the e-mail to sge-bugs.


Apologies if any of the above was teaching grandma how to suck eggs.


William






Re: [gridengine users] Dave Love repository issue

2018-10-15 Thread William Hay
On Fri, Oct 12, 2018 at 02:13:32PM -0400, Daniel Povey wrote:
>There is an issue tracker here
>https://arc.liv.ac.uk/trac
>but it's not clear whether Dave Love still has access to it (he moved to
The issue tracker has its own login system.  I still have access to it and 
I've never worked for the University of Liverpool :).  You should be fine 
submitting bug reports there.

I think where Dave might run into problems is uploading new tarballs.
>Manchester and for a while at least he did not have access; and he doesn't
>seem to have been working on GridEngine lately anyway).  Also I couldn't
>figure out where in the issue tracker you are supposed to make a new
>issue; you probably have to create an account first.
>I made an attempt to re-start a GitHub-based version of the repo, here
>https://github.com/son-of-gridengine/sge
>but the project is not exactly off the ground, partly due to Dave's
>objections and also due to lack of clarity about whether he plans to
>continue maintaining GridEngine.   You could create an issue on the github
>if you want, but I don't promise that that project will necessarily live
>on.
>If you look at the issues in the issue tracker
>https://arc.liv.ac.uk/trac/SGE/query?status=!closed=3=priority
>there are a rather scary number of existing, un-resolved issues.
>To me it raises the question of whether GridEngine might be just too big,
>too old, and too encumbered with features, to be maintainable as an
>open-source project.  But I also don't know what the most viable
>alternative is.
A lot of these were inherited from the original grid engine project
and some of them are reports of unreproducible errors.

William




Re: [gridengine users] cpu usage calculation

2018-08-31 Thread William Hay
On Fri, Aug 31, 2018 at 10:27:39AM +, Marshall2, John (SSC/SPC) wrote:
>Hi,
>When gridengine calculates cpu usage (based on wallclock) it uses:
>cpu usage = wallclock * nslots
>This does not account for the number of cpus that may be used for
>each slot, which is problematic.
>I have written up an article at:
>
> https://expl.info/display/MISC/Slot+Multiplier+for+Calculating+CPU+Usage+in+Gridengine
>which explains the issue and provides a patch (against sge-8.1.9)
>so that:
>cpu usage = wallclock * nslots * ncpus_per_slot
>This makes the usage information much more useful/accurate
>when using the fair share.
>Have others encountered this issue? Feedback is welcome.
>Thanks,
>John

We used to do something similar (our magic variable was "thr", short for
threads).  The one thing that moved us away from that was that in 8.x grid
engine binds cores to slots via -binding.

Rather than adding support for another mechanism to specify cores (slots,
-binding) it might be a better idea to support calculating cores per
slot based on -binding.

That said I'm not a huge fan of -binding.  If a job has exclusive access
to a node then the job can handle its own core binding.  If the job
doesn't have exclusive access then binding strategies other than linear
don't seem likely to be successful.

William




Re: [gridengine users] Start jobs on exec host in sequential order

2018-08-06 Thread William Hay
On Wed, Aug 01, 2018 at 11:06:19AM +1000, Derrick Lin wrote:
>HI Reuti,
>The prolog script is set to run by root indeed. The xfs quota requires
>root privilege.
>I also tried the 2nd approach but it seems that the addgrpid file has not
>been created when the prolog script executed:
>/opt/gridengine/default/common/prolog_exec.sh: line 21:
>/opt/gridengine/default/spool/omega-1-27/active_jobs/1187086.1/addgrpid:
You can also extract the groupid from the config file which should be present 
on the master node
when the prolog is run.

# extract the job's additional group id from the job spool config file
# (SGE_JOB_SPOOL_DIR is set in the prolog's environment)
XFS_PROJID="$(awk -F= '/^add_grp_id=/{print $2}' <${SGE_JOB_SPOOL_DIR}/config)"
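A hypothetical follow-on in the same prolog, just to show where the value would
typically go (paths and the quota limit are illustrative and, as noted, xfs_quota
needs root):

mkdir -p "/scratch/${JOB_ID}"
xfs_quota -x -c "project -s -p /scratch/${JOB_ID} ${XFS_PROJID}" /scratch
xfs_quota -x -c "limit -p bhard=100g ${XFS_PROJID}" /scratch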

NB: If you want this on the slave node of a multi-node job and you allow
multi-node jobs to share nodes (we don't) then you will need to extract 
a project id on each slave node.  Probably the best place to do this 
would be in a wrapper around rsh_daemon. However you'll need some sort of 
locking in case a program launches multiple slave tasks(most codes
just launch one slave task per node which then forks) or launches
a slave task on the master node.  

William




Re: [gridengine users] Make All Usersets Their Own Department

2018-07-12 Thread William Hay
On Wed, Jul 11, 2018 at 09:21:10AM -0400, Douglas Duckworth wrote:
>Hi
>We are running GE 6.2u5 and moving to Slurm.  Though before we do some
>changes need to be made within GE.
>For example we have 66 user sets within our share tree.  However none of
>them were configured as a department.  
>Using qmon I see an option to make the user set a department.  Though
>using the GUI does not seem very fun.  
> 
>Any way to script this change out using qconf?
>Thanks,
>Douglas Duckworth, MSc, LFCS
>HPC System Administrator
>Scientific Computing Unit
>Physiology and Biophysics
>Weill Cornell Medicine
>E: d...@med.cornell.edu
>O: 212-746-6305
>F: 212-746-8690


qconf -sul |xargs qconf -mattr userset type ACL,DEPT

That assumes you want all your usersets to work as both ACLs and departments;
otherwise replace qconf -sul with something that lists the sets you want
to change and replace ACL,DEPT with just DEPT (or whatever you want).
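If your qconf objects to being handed every userset name in one go, a more
cautious variant runs one qconf call per set (same attribute value as above):

qconf -sul | xargs -I{} qconf -mattr userset type ACL,DEPT {}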

William




Re: [gridengine users] SGE accounting file getting too big...

2018-05-23 Thread William Hay
On Fri, May 18, 2018 at 05:42:42PM +0200, Reuti wrote:
> Note: to read old accouting files in `qacct` on-the-fly you can use:
> 
> $ qacct -o reuti -f <(zcat /usr/sge/default/common/accounting.1.gz)
> 
If you specify the -f flag twice the later one takes precedence.
You can therefore easily create a wrapper script that processes the current and 
previous accounting files by default but otherwise acts like regular qacct.
This means a bare qacct command won't come up empty just after a logrotate.

#!/bin/bash
# prepend the previous accounting file to the current one; a -f given on the
# command line comes later and therefore still wins
exec ${SGE_ROOT}/bin/${SGE_ARCH}/qacct -f <(cat \
    ${SGE_ROOT}/${SGE_CELL:-default}/common/accounting.1 \
    ${SGE_ROOT}/${SGE_CELL:-default}/common/accounting) "$@"

NB: If you are backing up the logfiles then you may find, depending on how 
smart your backup software is,  that using the dateext option
will save some strain on incremental/differential  backups.  If you do use 
dateext then finding the immediately previous accounting file will require
a bit more smarts than the above.
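A sketch of those extra smarts, assuming logrotate's default dateext naming
(accounting-YYYYMMDD):

prev="$(ls -1 "${SGE_ROOT}/${SGE_CELL:-default}/common/"accounting-* 2>/dev/null | sort | tail -n 1)"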

William






Re: [gridengine users] Clean up old jobs/spooldb?

2018-05-04 Thread William Hay
On Wed, May 02, 2018 at 07:24:39PM -0700, Simon Matthews wrote:
> That solution requires working "db_dump" and "db_restore" executables,
> which don't appear to be available for the SoGE version.
db_dump etc are part of Berkeley DB.  You need to match the version against 
which
SoGE is built I think.
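One way to see which Berkeley DB version your SoGE build expects (path and
architecture are illustrative; if the build uses BDB spooling the library
should show up here):

ldd "${SGE_ROOT}/bin/lx-amd64/sge_qmaster" | grep -i libdb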


William





[gridengine users] 04/26/2018 11:58:01| main|node-s03a-003|E|shepherd of job 5083806.1 exited with exit status = 28

2018-04-26 Thread William Hay
04/26/2018 11:58:01|  main|node-s03a-003|E|shepherd of job 5083806.1 exited 
with exit status = 28

We had a shepherd exit with the above error code after about 12 hours.  As a 
result it appears not to have run its epilog.  This appears to be ENOSPC; 
however we can't see any sign of filesystems running out of space.
We have a load sensor that triggers an alarm if space gets too low and this is 
tracked by our monitoring.  It hasn't whinged.  Anyone got an idea what else 
might have caused the shepherd to exit with an error code of 28?
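Two quick checks, assuming the exit status really is an errno and the kernel
headers are installed (the spool path is illustrative):

grep -w 28 /usr/include/asm-generic/errno-base.h   # 28 = ENOSPC
df -i /var/spool/sge                               # ENOSPC can also mean inode exhaustion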

William




Re: [gridengine users] Corrupt user config?

2018-04-17 Thread William Hay
On Mon, Apr 16, 2018 at 04:52:33PM +0100, Mark Dixon wrote:
> > share-tree only as a tie breaker.  But deleting jobs would be bad.  Is
> > the probably lose any jobs queued something you know from experience? It
> > seems odd that we can have jobs queued and running with the running
> > qmaster knowing nothing of the user but deleting the file would kill
> > them on restart.
> 
> No, not from experience. Maybe it'll all be fine, then :)

I've just done an experiment on our dev cluster.  Submit job, stop grid
engine, delete the file representing me, restart grid engine.  The job
survives and I can then recreate the user with qconf -auser which I
couldn't when the user had a file on disk.

This makes sense as the user wouldn't need to exist if enforce_user
was set to false.  I suspect enforce_user is only checked at job submission.

William




Re: [gridengine users] Corrupt user config?

2018-04-16 Thread William Hay
On Mon, Apr 16, 2018 at 12:16:26PM +0100, Mark Dixon wrote:
> Hi William,
> 
> I've seen this before back in the SGE 6.2u5 days when it used to write out
> core binding options it couldn't subsequently read back in.
> 
> IIRC, users are read from disk at startup in turn and then the files are
> only written to from then on - so this sort of thing only tends to be
> noticed when the qmaster is restarted. If it finds a user file that it
> cannot read properly, SGE gives up reading any more user files and you'll
> appear to lose a big chunk of your user base even if those other user files
> are ok.
I don't think that can be right given that the qmaster complains about multiple
user files on start up.  If it gave up after the first then presumably it 
wouldn't 
complain about the others.

> 
> Your instinct is right: stop the qmaster, delete or preferably modify the
> file for the reported problem user so that the bit it's complaining about is
> removed, and start the qmaster again. Repeat if it complains about another
> user.
> 
> Feel free to post the main bit of the user file if you want an opinion about
> the edit.
The user who first drew our attention to this has a user file that looks like 
this:

name ucbptba
oticket 0
fshare 1
delete_time 0
usage NONE
usage_time_stamp 1522981780
long_term_usage NONE
project AllUsers 
cpu=10018380758.728712,mem=566859015.524199,io=485.211850,binding_inuse!SS=0.00,iow=0.00,vmem=2649215676489.408691,maxvmem=0.00,submission_time=185478188707.216309,priority=0.00,exit_status=0.00,signal=12732.427329,start_time=185484299059.331726,end_time=185486650486.414856,ru_wallclock=24949894.118398,ru_utime=352365452.860190,ru_stime=227668.336061,ru_maxrss=2312860731.515988,ru_ixrss=0.00,ru_ismrss=0.00,ru_idrss=0.00,ru_isrss=0.00,ru_minflt=139382336529.001709,ru_majflt=13666248.060146,ru_nswap=0.00,ru_inblock=885794734.354178,ru_oublock=19889227.200863,ru_msgsnd=0.00,ru_msgrcv=0.00,ru_nsignals=0.00,ru_nvcsw=2535937951.970052,ru_nivcsw=3926968166.964020,acct_cpu=376560487.827089,acct_mem=654584364.211819,acct_io=207.611235,acct_iow=0.00,acct_maxvmem=30119711845145.703125,finished_jobs=0.00
 
cpu=10021139983.02,mem=567054438.077435,io=485.452673,binding_inuse!SS=0.00,iow=0.00,vmem=2650784657408.00,maxvmem=0.00,submission_time=185581632895.00,priority=0.00,exit_status=0.00,signal=12740.00,start_time=185587744917.00,end_time=185590097173.00,ru_wallclock=24960037.00,ru_utime=352504013.044632,ru_stime=227747.889479,ru_maxrss=2313909568.00,ru_ixrss=0.00,ru_ismrss=0.00,ru_idrss=0.00,ru_isrss=0.00,ru_minflt=139433350974.00,ru_majflt=13670114.00,ru_nswap=0.00,ru_inblock=886048536.00,ru_oublock=19900264.00,ru_msgsnd=0.00,ru_msgrcv=0.00,ru_nsignals=0.00,ru_nvcsw=2536948317.00,ru_nivcsw=3928530274.00,acct_cpu=376713375.452962,acct_mem=654806517.378607,acct_io=207.726701,acct_iow=0.00,acct_maxvmem=30134842138624.00,finished_jobs=125.00;
default_project NONE
debited_job_usage 251393 
binding_inuse!SS=0.00,cpu=11215611.00,mem=0.00,io=0.648810,iow=0.00;

It is possible that this file has fixed itself as two tasks from the problem 
array job have started and the file
has changed since I first looked at it.  However qconf -suser still doesn't 
show the user in question and the array job
is apparently stuck at the back of the queue due to not getting any functional 
tickets.

The messages file complains thusly:
04/16/2018 11:06:53|  main|util01|E|line 12 should begin with an attribute name
04/16/2018 11:06:53|  main|util01|E|error reading file: 
"/var/opt/sge/shared/qmaster/users/ucbptba"
04/16/2018 11:06:53|  main|util01|E|unrecognized characters after the attribute 
values in line 12: "mem"

Line 12 being the line starting project.


> 
> If you delete the user file, you'll lose all usage for that user - including
> that user's contribution to projects in any share tree you might have.
> You'll also probably lose any jobs queued up by them.

Oh fun.  Fortunately we're mostly per-user functional share and use share-tree
only as a tie breaker.  But deleting jobs would be bad.  Is the "probably lose
any jobs queued" something you know from experience?  It seems odd that we can
have jobs queued and running with the running qmaster knowing nothing of the
user but deleting the file would kill them on restart.




> 
> Mark
> 
> On Mon, 16 Apr 2018, William Hay wrote:
> 
> > We had a user report that one of their array jobs wasn't scheduling.  A
> > bit of poking around showed that qconf -suser knew nothing of the user
> > despite them having a queued job.  However there was a file in the spool
> > that should have defined the user.  Seve

[gridengine users] Corrupt user config?

2018-04-16 Thread William Hay
We had a user report that one of their array jobs wasn't scheduling.  A
bit of poking around showed that qconf -suser knew nothing of the user
despite them having a queued job.  However there was a file in the spool
that should have defined the user.  Several other users appear to be
affected as well.

I bounced the qmaster in the hopes of getting it to reread the users'
details from disk.  And got several messages like this:

04/16/2018 11:06:53| main|util01|E|error reading file: 
"/var/opt/sge/shared/qmaster/users/zccag81" 
04/16/2018 11:06:53| main|util01|E|unrecognized characters after the attribute 
values in line 12: "mem" 
04/16/2018 11:06:53| main|util01|E|line 12 should begin with an attribute name

I suspect that my next step should be to stop the qmaster, delete the
problem files and then restart the qmaster.  Hopefully grid engine will
then recreate the user or I can create them manually.

However if anyone has a better idea or has seen this before I'd be glad
to hear of it.

Creation of the user object on our cluster is done by means of enforce_user auto:

#qconf -sconf |grep auto
enforce_user                  auto
auto_user_oticket             0
auto_user_fshare              1
auto_user_default_project     none
auto_user_delete_time         0


William





Re: [gridengine users] qstat strange statistic

2018-04-13 Thread William Hay
On Fri, Apr 13, 2018 at 01:54:14PM +0200, leconte jérôme wrote:
> Hello,
>  I'm using SGE 8.1.9 under debian Stretch
> 
> I have a strange problem.
> 
> When I use qstat, sometimes the stats displayed are wrong. Then, I
> believe that gridengine doesn't work properly.
> 
> I explain what I see:
> 
> On Master : qstat -f |grep para.q
> 
> para.q@node10.example.org    BIP   0/0/40     14.45   lx-amd64
> para.q@node11.example.org    BIP   0/1/40     57.78   lx-amd64
> para.q@node12.example.org    BIP   0/0/40      1.00   lx-amd64
> para.q@node13.example.org    BIP   0/100/40   57.81   lx-amd64
> para.q@node14.example.org    BIP   0/40/40    98.00   lx-amd64   a
> para.q@node15.example.org    BIP   0/0/40      0.00   lx-amd64
> 
> 
> But if I connect a terminal on node13 and type "top" the load avg is 0.0
> or something else that is not similar to 57.81
> 
> When I stop gridengine_execd and then restart it,
> 
> on Master, qstat -f gives me the right value.
> 
> 
> I suppose there is a better way to do this , but I can't find it.
> 
> Have you some advices ?
Does your sched_conf set job_load_adjustments to anything interesting?

Is the environment variable SGE_LOAD_AVG set?

Also 0/100/40 is a tad odd as you are apparently using more slots than 
configured.
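Both are quick to check:

qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
env | grep SGE_LOAD_AVG    # in the shell where you run qstat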

William




Re: [gridengine users] [SGE-discuss] case-insensitive user names?

2018-04-13 Thread William Hay
On Thu, Apr 12, 2018 at 04:40:03PM -0400, berg...@merctech.com wrote:
> We're using SoGE 8.1.6 in an environment where users may login to the
> cluster from a Linux workstation (typically using a lower-case login
> name) or a Windows desktop, where their login name (as supplied by the
> enterprise Active Directory) is usually mixed-case.
> 
> On the cluster, we've created two passwd entries per-user with an
> identical UID, so there's no distinction in file ownership or any
> permissions or access rights at the Linux shell level. Most users don't
> notice (or care) about the case that's shown when they login.
> 
> However, SoGE seems to use the login name literally, not the UID.
> 
> This causes two problems:
> 
>   job management
>   User "smithj" cannot manage (qdel, qalter) jobs
>   that they submitted as "SmithJ"
> 
> 
>   scheduler weighting
>   Using fair-share scheduling, John Smith will get
>   a disproportinate share of resources if he submits
>   jobs as both "smithj" and "SmithJ" vs. Jane Doe
>   who only submits jobs from her Linux machine as
>   "doej".
> 
> Is there a way to configure SoGE to treat login IDs with a
> case-insensitive match, or to use UIDs?
> 
> We use a JSV pretty extensively, but I didn't see a way to alter login
> names via a JSV -- any suggestions?

I don't think you can make grid engine treat usernames case insensitively 
without
patching it.

I think you need to arrange for Grid engine to only see one name regardless of 
how 
the user logged in.

Not sure if this will work but:
If you are adding users to /etc/passwd or some other mechanism with
an obvious sort order then ensure the preferred form of the name comes
first in search order and have the profile scripts adjust the
environment variables USER, LOGNAME and anything else embedding the
user's login name to match.
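A minimal sketch of that profile fragment, assuming the canonical lower-case
passwd entry sorts first for the shared UID (the file name is illustrative):

# /etc/profile.d/canonical-login.sh
CANON="$(id -un)"    # maps the UID back to the first matching passwd name
export USER="$CANON" LOGNAME="$CANON"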

Alternatively:
Write a replacement shell for the less preferred name that invokes sudo to 
switch
to the more preferred name.

William




Re: [gridengine users] Job finishes correctly but master is not notified

2018-04-05 Thread William Hay
On Thu, Apr 05, 2018 at 03:38:18PM +0200, Paul Paul wrote:
> William,
> 
> Thanks for your reply.
> 
> In the 'messages' file of the exec host, there is nothing (the last message 
> was 2 weeks ago).

Might be worth increasing the loglevel to get more info about what is going on 
there. 
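For example (the host name is illustrative; see sge_conf(5) for the accepted
values):

qconf -mconf node42    # opens an editor; set:  loglevel  log_info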

William




Re: [gridengine users] Job finishes correctly but master is not notified

2018-04-05 Thread William Hay
On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> Hello,
> 
> We're using SGE 8.1.9 and randomly, we have jobs that finish with success 
> (our jobs logs confirm this) but the master is not notified.
> On the compute, all the folders related to such a job are still here, 
> correctly filled:
> 
> trace file:
> ...
> 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" 
> "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> 04/04/2018 21:50:23 [300:38327]: no epilog script to start
> 
> exit_status:
> 0
> 
> error:
> (empty)
> 
> but the process no longer appears in the 'ps' output.
> 
> On the master, doing a 'qstat -j 1376090' works and so, to get rid of such a 
> job, we are performing 'qdel -f 1376090'.
> 
> This happens 3 or 4 times a day (we submit more than 100k jobs per day), on 
> different exec hosts.
> 
> Do you know what could be the cause of this behavior?
Is there anything in the messages log?

Alternatively this might just be networks being less than 100% reliable.  
Possibly tweaking gdi_timeout and gdi_retries 
might help.
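I believe these are set under qmaster_params in the global configuration, but
check sge_conf(5) for your build; the values below are only illustrative:

qconf -mconf    # qmaster_params   gdi_timeout=120 gdi_retries=4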

William




Re: [gridengine users] Problems with quotas

2018-04-05 Thread William Hay
On Wed, Mar 28, 2018 at 01:52:59PM +0200, Sms Backup wrote:
>Thanks for reply !
>You are right, this is a systemd unit file. So for filtering I just
>use ExecStart=/bin/sh -c /opt/sge/bin/sge_qmaster | grep -v '^RUE_' ?
>Sorry, but I cannot understand this part.

I think you probably want ExecStart=/bin/sh -c "/opt/sge/bin/sge_qmaster | 
grep -v '^RUE_'"

AIUI systemd understands double quotes and you want to pass the whole construct 
to
shell as a script.  Without the double quotes everything after sge_qmaster is 
interpreted
by the shell as arguments to the script rather than part of the script itself.

William




Re: [gridengine users] Problems with quotas

2018-03-23 Thread William Hay
On Fri, Mar 23, 2018 at 12:07:39PM +0100, Sms Backup wrote:
>Thanks for your replies,
>So in total it would be something like this: ExecStart=/bin/sh -c
>/opt/sge/bin/sge_qmaster | grep -v '^RUE_' >&/dev/null ?

No. The grep is intended to replace the redirection to /dev/null so
as to remove only unwanted messages while allowing anything important
through.  You shouldn't use both.  Also you probably need to place
everything after the -c in double quotes so it is interpreted by the
shell, as I don't think ExecStart parses redirection that way.

If your current unit file is calling sge_qmaster directly and you want
to throw away (rather than filter) stderr you could use StandardOutput=null
and StandardError=null in the unit file rather than using >&/dev/null.

Incidentally an earlier post implies you are trying to set the SGE_CLUSTER_NAME
via shell magic in a context (systemd unit file?) where the shell magic doesn't 
work.

William





Re: [gridengine users] Problems with quotas

2018-03-23 Thread William Hay
On Fri, Mar 23, 2018 at 09:36:29AM +, Mark Dixon wrote:
> Hi Jakub,
> 
> That's right: if you need to cut down the logging, one option is to add the
> redirection in the start script.
> 
> You're looking for the line starting "sge_qmaster", and you might want to
> try adding a ">/dev/null" after it. You'll lose all syslog messages from
> sge_qmaster though (normally, most useful messages end up in
> $SGE_ROOT/$SGE_CELL/spool/qmaster/messages, but you might want to remove the
> redirection if you ever find that sge_qmaster refuses to start).

The qmaster man page says debug output goes to stderr not stdout so that
would probably need to be a ">&/dev/null" rather than a plain ">/dev/null"

Assuming qmaster is launched from a shell script you might be able to add a 
"| grep -v '^RUE_'"
after the qmaster invocation to just get rid of these messages while keeping 
others.

There might be a few issues with that though if the script replaces itself with 
the qmaster
via exec or similar in order to ensure the qmaster has the same pid as the 
script systemd 
launched.

William





Re: [gridengine users] Is it possible to nohup a command within a script dispatched via qsub?

2018-03-23 Thread William Hay
On Fri, Mar 23, 2018 at 12:27:48AM +0100, Reuti wrote:
> Hi,
> 
> Am 22.03.2018 um 20:51 schrieb Mun Johl:
> 
> > Hi,
> >  
> > I?m using SGE v8.1.9 on RHEL6.8 .  In my script that I submit via qsub 
> > (let?s call it scriptA), I have a gxmessage (gxmessage is similar to 
> > xmessage, postnote, etc) statement which pops up a small status window 
> > notifying the user of the results of the qsub job.
> 
> Is SGE and your job running local on your workstation only? I wonder how the 
> gxmessage could display something on the terminal of the user when the job 
> runs on an exechost in the cluster and was submitted at some time in the past.
> 
> 
> >  However, I don?t want the gxmessage to exit when scriptA terminates.  So 
> > far, I have not figured out a what to satisfy my wants.  That is, when 
> > scriptA terminates, so does gxmessage.  nohup  does not help because 
> > gxmessage gets a SIGKILL.
> 
> SGE kills the complete process group when the jobs ends (or is canceled), not 
> just a single process. One might circumvent this with a `setsid foobar &` 
> command. The `nohup` isn't necessary here.
> 
> As a second measure to kill orphaned processes one can use the additional 
> group id, which is attached to all SGE processes. Although it would be 
> counterproductive in your case as it would kill the leftover process despite 
> the newly created process group. This would need to set:
> 
> $ qconf -sconf
> #global:
> ...
> execd_params ENABLE_ADDGRP_KILL=TRUE
>
According to https://arc.liv.ac.uk/repos/darcs/sge/NEWS
ENABLE_ADDGRP_KILL defaults to on after SoGE 8.1.7 so it probably needs to be 
explicitly set false.

As this is about notifying the user of a completed job I'm wondering if an 
alternative might be to write a mail-compatible wrapper for gxmessage and 
specify that as the mailer in sge_conf.
The wrapper might need to be somewhat smart to distinguish different uses of 
mailer by SGE though.
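A very rough sketch of such a wrapper, with everything in it an assumption: the
mail(1)-style "-s subject recipients" calling convention should be checked
against sge_conf(5), and getting DISPLAY right per user is left open.

#!/bin/sh
# hypothetical gxmessage-based mailer for sge_conf; the message body arrives on stdin
subject="SGE notification"
[ "$1" = "-s" ] && { subject="$2"; shift 2; }
body="$(cat)"
gxmessage -title "$subject" "$body" &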

William




Re: [gridengine users] mpirun without ssh

2018-03-23 Thread William Hay
On Thu, Mar 22, 2018 at 04:29:27PM +0100, leconte jérôme wrote:
> Thank you,
> 
> But I'm not sure I know what to look for.
> 
> If I understand correctly,
> 
> I must see qrsh or qlogin when I type
> 
> ompi_info
> 
> and if not I must recompile grid_engine with that option
> 
> Best Regards
> 
At UCL we route cluster internal ssh via qrsh which means we can restrict the 
normal sshd to admin staff only and parallel libraries that don't support
grid engine directly still operate under grid engine control.

https://github.com/UCL-RITS/GridEngine-OpenSSH

You still need to worry about how to tell the parallel code
where to run but getting it wrong causes problems for that job rather
than other jobs.

William




Re: [gridengine users] qsub in specific nodes

2018-03-21 Thread William Hay
On Wed, Mar 21, 2018 at 11:55:14AM -0300, Dimar Jaime González Soto wrote:
>Hi, I need to know how can I execute grid engine in specific hosts. I
>tried the follow execution line:
>qsub -v NR_PROCESSES=60  -l
>h='ubuntu-node2|ubuntu-node11|ubuntu-node12|ubuntu-node13'  -b y -j y -t
>1-60 -cwd /usr/local/OMA/bin/OMA
>but doesn't work.

I assume by "doesn't work" you mean the job doesn't start.
What output do you get if you run qalter -w v on the jobid of a stuck job?

The host selection syntax you are using works for me.  Grid Engine can be 
quite picky about hostnames.  Are those the exact names you get from a command 
like qconf -sel?  Did you tell grid engine that all hosts were in one domain 
while installing?

William




Re: [gridengine users] Problem with the way environment variables are exported in sge

2018-03-21 Thread William Hay
On Wed, Mar 21, 2018 at 03:23:01PM +, srinivas.chakrava...@wipro.com wrote:
> Hi,
> 
> The version in our environment is 2011.11.
Looking at https://arc.liv.ac.uk/repos/darcs/sge-release/NEWS
it looks like it was fixed in SoGE 8.1.4 which was released in 2013 a couple
of years after the release you are using so wouldn't have been ported across the
forks.  Since you see the problem I guess the regression predates the fork.  I 
guess your options
at this point are to switch forks or try to find some way to kludge around the 
problem.

William




Re: [gridengine users] Problems with quotas

2018-03-21 Thread William Hay
On Wed, Mar 21, 2018 at 07:59:41AM +0100, Sms Backup wrote:
>William,
>Thanks for reply. Unfortunately I have few non-interactive queues, so I
>cannot limit slots this way.
>99% of messages printed to system log look like this below, so I believe
>that are the messages which are suppressed:
>Mar 20 21:55:18 qmaster sh: ---
>Mar 20 21:55:18 qmaster sh: RUE_name (String)=
>thomas///medium.q//
>Mar 20 21:55:18 qmaster sh: RUE_utilized_now (Double)= 2.00
>Mar 20 21:55:18 qmaster sh: RUE_utilized (List)  = empty
>Mar 20 21:55:18 qmaster sh: RUE_utilized_now_non (Double)= 0.00
>Mar 20 21:55:18 qmaster sh: RUE_utilized_nonexcl (List)  = empty

I'm wondering if your qmaster is running with debugging enabled.
If you dump the environment of the qmaster (/proc/<pid>/environ)
is there a mention of SGE_DEBUG_LEVEL?

If so you should try to figure out where it gets set and tweak it to a 
more appropriate level before restarting the qmaster.
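One way to dump it on the qmaster host (assuming a single qmaster process that
pgrep can find, and a user allowed to read its /proc entry):

tr '\0' '\n' < "/proc/$(pgrep -f sge_qmaster | head -n 1)/environ" | grep SGE_DEBUG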

William






Re: [gridengine users] Problem with the way environment variables are exported in sge

2018-03-20 Thread William Hay
On Mon, Mar 19, 2018 at 12:11:04PM +, srinivas.chakrava...@wipro.com wrote:
>Hi,
> 
> 
> 
>We have some functions in our environment which are not being parsed
>properly by sge, which is causing errors on the stdout while launching
>interactive jobs
> 

> 
> 
>But when the environment variables are exported, sge is having trouble
>exporting mutli line functions, leading to the aforementioned error. Is
>there any way we can circumvent this ?
Which version of Grid Engine are you using?  I have a vague recollection
of there being some fixes for environment file parsing around bash
functions in recent versions of SoGE but the only relevant bug I can
find in the SoGE bug database looks to be older.


William




Re: [gridengine users] gridengine rpm complaining about perl(XML::Simple) even though it's installed

2018-03-19 Thread William Hay
On Thu, Mar 15, 2018 at 10:19:29PM +, Mun Johl wrote:
>Hi,
> 
> 
> 
>I am trying to install gridengine-8.1.9-1.el6.x86_64.rpm on a RedHat EL6
>system.  The yum command exits with the following error:
> 
> 
> 
>Error: Package: gridengine-8.1.9-1.el6.x86_64
>(/gridengine-8.1.9-1.el6.x86_64)
> 
>   Requires: perl(XML::Simple)
> 
> 
> 
>However, I have installed the XML::Simple package via the following
>command:
> 
> 
> 
>$ cpanm XML::Simple
> 
> 
> 
>I have verified via perldoc that XML::Simple is in fact installed; so I'm
>at a loss as to why yum still is unhappy.
AFAIK yum/rpm don't know about anything installed via cpan by default.

A quick google suggests that rpm --justdb would let you tell the rpm database
about XML::Simple being installed.  Alternatively install XML::Simple via
rpm/yum instead of CPAN.  If it isn't part of your distro then you could look 
for EPEL packages.
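For example (package name from memory, so worth confirming with yum search):

yum install perl-XML-Simple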

William




Re: [gridengine users] shepherd timeout when using qmake and qrsh

2018-03-01 Thread William Hay
On Tue, Feb 27, 2018 at 10:46:57AM +0100, Reuti wrote:
> Hi Nils:
> 
> > Am 27.02.2018 um 10:11 schrieb Nils Giordano :
> > however you were right: `ssh` is definitively used to access nodes
> > (probably on purpose since we have access to several GUI apps). Your
> > answer made me check my ~/.ssh/ directory, and I found dozens of
> > *.socket files in there.
> > 
> > After removing these files, qmake and qrsh perform flawlessly (shepherd
> > exit code 0).

> > I still do not know what caused this problem and at which
> > point these files were created, but I will know what to look for would
> > this problem reappear.
> 
> For me I never saw any socket files created in my ~/.ssh Maybe it's custom 
> with your other graphical apps.

I wonder if the ssh_config has ControlMaster set?  That apparently creates
a socket to allow multiple ssh commands to share a single connection.
I suspect it would run into problems with grid engine tight integration
and might also have issues when the directory is on a shared filesystem.
If this is the root cause then setting 'ControlMaster no' in ~/.ssh/config
should prevent a recurrence for you personally.  Qmake is probably
affected more than most parallel libraries because most parallel libraries
execute a single qrsh and then fork on the remote side while I suspect
(I haven't checked) that qmake launches one qrsh per process.
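That per-user override would look like this in ~/.ssh/config:

Host *
    ControlMaster no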

If this is the case then I would suggest contacting the cluster admin
and requesting they disable ControlMaster for the ssh launched by qrsh.

Assuming a fairly normal linux box you should be able to check the
setting with:
grep -i controlmaster /etc/ssh/ssh_config

William





Re: [gridengine users] Converting from supplemental groups to cgroups for management

2018-02-16 Thread William Hay
On Thu, Feb 15, 2018 at 11:28:58AM -0600, Calvin Dodge wrote:
> While the help we received from this and other gridengine lists helped
> us resolve the issue of jobs being mysteriously killed, we've been
> asked to look into converting the customer's SGE cluster, using
> cgroups for job management.
> 
> The cluster is running gridengine 8.1.9, on Linux kernel
> 3.10.0-514.el7.x86_64 (Centos 7.3.1611). Can anyone comment on the
> reliability of using cgroups for job management in that environment?
> I ask because I ran across some web pages which stated that for some
> versions of gridengine and kernel this was a Bad Idea.
>
Setting USE_CGROUPS in execd_params should handle job killing fine in
8.1.9.  It was buggy in 8.1.8 and earlier.  Note that USE_CGROUPS
configures the use of cpuset cgroups (or original cpusets on older kernels).
It doesn't configure memory cgroups or any other variety of cgroup.  If 
you want that then you'll have to write wrapper scripts.
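For reference, a sketch of where that lives (check sge_conf(5) on 8.1.9 for the
exact accepted values):

qconf -mconf    # execd_params   USE_CGROUPS=true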

William




Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-09 Thread William Hay
On Thu, Feb 08, 2018 at 03:42:03PM -0800, Joshua Baker-LePain wrote:
>  153758 0.51149 tomography USER1   qw02/08/2018 14:03:05  
> 192
>  153759 0.0 qss_svk_ge USER2   qw02/08/2018 14:15:06  
>   1 1
>  153760 0.0 qss_svk_ge USER2   qw02/08/2018 14:15:06  
>   1 1
> 
> with more jobs below that, all with 0. priority.  Starting at 14:03:06
> in the messages file, I see this:
> 
> 02/08/2018 14:03:06|worker|wynq1|E|not enough (1) free slots in queue 
> "ondemand.q@cin-id3" for job 153758.1
> 
> And in the schedule file I see this:
> 
> 153758:1:STARTING:1518127386:82860:P:mpi:slots:192.00
> 153758:1:STARTING:1518127386:82860:H:msg-id19:mem_free:16106127360.00
> 153758:1:STARTING:1518127386:82860:Q:member.q@msg-id19:slots:15.00 
> 153758:1:STARTING:1518127386:82860:L:member_queue_limits:/USER1lab:15.00
> 153758:1:STARTING:1518127386:82860:H:qb3-id1:mem_free:1073741824.00
> 153758:1:STARTING:1518127386:82860:Q:ondemand.q@qb3-id1:slots:1.00 
> 153758:1:STARTING:1518127386:82860:L:ondemand_queue_limits:USER1/:1.00
> 153758:1:STARTING:1518127386:82860:H:qb3-id1:mem_free:11811160064.00
> 153758:1:STARTING:1518127386:82860:Q:long.q@qb3-id1:slots:11.00
> So why is it trying to give the job slots in ondemand.q?
> 
Has the job in question requested the ondemand queue via -masterq by any 
chance?  I have heard people who should know say that -masterq is somewhat
buggy.  I've never had a problem with -masterq myself but I don't use it much 
and we don't use RQS either.  Possibly the alleged bugginess of -masterq
manifests in the presence of RQS.

Does the pe in question have job_is_first_task set to false?  If so this may be 
a funny with treatment of the MASTER task by RQS.

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-08 Thread William Hay
On Wed, Feb 07, 2018 at 02:15:05PM -0800, Joshua Baker-LePain wrote:
> On Wed, 7 Feb 2018 at 12:46am, William Hay wrote
> 
> > IIRC resource quotas and reservations don't always play nicely together.
> > The same error can come about for multiple different reasons so having
> > had this error in the past when the queue is defined as having 0 slots
> > doesn't eliminate RQS as a suspect.
> > 
> > I would set MONITOR=1 in the sched_conf and have a look at the schedule
> > file to see a little more detail about what is going on.
> 
> I've done this and will have a look next time something gets stuck.
> 
> > As a slightly less drastic method than restarting the qmaster you could
> > try reducing the priority (qalter -p)  on the problem job for a scheduling
> > cycle to below the jobs stuck behind it to see if they will start even
> > if the problem job won't.
> 
> IIRC from past episodes, anything *submitted* after the "skipping orders"
> messages start appearing is not assigned a priority -- they sit there in the
> queue at 0.000.  But I can certainly try this if there *is* something with a
> priority score that's stuck behind the problem job.

The 0.000 you see in qstat is the qmaster thread's idea of priority.  Part of
the order that is skipped is updating the qmaster thread's notion of priority
for a job.  The scheduler has usually assigned such jobs a priority lower than
your problem job, so they don't get updated because the order to update the
qmaster's notion of priority for that job has been skipped.  Should
circumstances change so that a later-submitted job has higher priority than
the problem job, the real priority used by the scheduler thread will become
visible.  You can see a similar phenomenon with errored jobs: the scheduler
doesn't consider them, so their priority isn't updated.

> 
> > I don't know how your cluster is set up but I would try to tweak the
> > config so that larger jobs (that need reservations) don't even consider
> > queue instances that are constrained by RQS.
> 
> Unforunately, RQSes are pretty integral to our setup.  Our cluster is run as
> a co-op model.  Each lab gets a number of slots (limited via RQS) in our
> high priority queue proportional to their "ownership" of the cluster. Jobs
> in the lower priority queues run niced and have fewer available slots.  I'll
> certainly ask the users to try without reservations to see a) if they can
> still get their jobs through and b) if that keeps the error from cropping
> up.
> 
> Is the lack of fair play between RQSes and reservations a bug or simply a
> side effect of how these 2 systems operate?

Mostly it is that some of the scheduler heuristics can lead to odd scheduling 
decisions when combined with RQS.

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

2018-02-07 Thread William Hay
On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote:
> I'm back again -- is it obvious that my new cluster just went into
> production?  Again, we're running SoGE 8.1.9 on a cluster with nodes of
> several different sizes.  We're running into an odd issue where SGE stops
> scheduling jobs despite available slots.  The messages file contains many
> instances of messages like this:
> 
> 02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue 
> "ondemand.q@cc-hmid1" for job 142497.1
> 02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
> 
> Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run in
> the named queue instance -- an RQS limits the usage to 0 slots.  Also, if I
> run "qalter -w p" on the job, it reports "verification: found possible
> assignment with 500 slots".  But the job will never get scheduled.  And
> neither will *any* other jobs.  The only way I've found to get things
> flowing again is to stop and restart sgemaster.
> 
> Since it's possibly (probably?) related, I should say that I have
> max_reservation set to 1024 in the scheduler config.  Also, I've had
> instances of this error in the past where the queue@host instance mentioned
> in the error is actually defined as having 0 slots.  So it's not tied to the
> RQS.
> 
> Can anyone give me some pointers on how to debug this?  Thanks.

IIRC resource quotas and reservations don't always play nicely together.
The same error can come about for multiple different reasons so having
had this error in the past when the queue is defined as having 0 slots
doesn't eliminate RQS as a suspect.

I would set MONITOR=1 in the sched_conf and have a look at the schedule 
file to see a little more detail about what is going on.

As a slightly less drastic method than restarting the qmaster you could
try reducing the priority (qalter -p)  on the problem job for a scheduling
cycle to below the jobs stuck behind it to see if they will start even
if the problem job won't.

I don't know how your cluster is set up but I would try to tweak the
config so that larger jobs (that need reservations) don't even consider
queue instances that are constrained by RQS.

William  


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Minimum number of slots

2018-02-01 Thread William Hay
On Thu, Feb 01, 2018 at 11:44:25AM +0100, Ansgar Esztermann-Kirchner wrote:
> Now, I think I can improve upon this choice by creating separate
> queues for different machines "sizes", i.e. an 8-core queue, a
> 20-core queue and so on. However, I do not see a (tractable) way to
> enforce proper job-queue association: allocation_rule 8 (etc) comes to
> mind, but I would lose the crucial one-host limit. This could be
> circumvented by creating one PE per node, but that would mean a huge
> administrative burden (and possible also a lot of extra load on the
> scheduler).

If I understand you correctly:
Create a $pe_slots PE for each type of node and associate it with the
appropriate nodes.  Have a jsv tweak the requested pe based on the number
of slots requested.
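A very rough server-side JSV sketch of that idea (the PE names here are
invented; the real ones would be whatever you call the per-node-type
$pe_slots PEs):

  #!/bin/sh
  . $SGE_ROOT/util/resources/jsv/jsv_include.sh

  jsv_on_start()
  {
     return
  }

  jsv_on_verify()
  {
     if [ "$(jsv_get_param pe_name)" = "smp" ]; then
        slots=$(jsv_get_param pe_max)
        if [ "$slots" -le 8 ]; then
           jsv_set_param pe_name smp_8core
        elif [ "$slots" -le 20 ]; then
           jsv_set_param pe_name smp_20core
        fi
        jsv_correct "PE adjusted to match node size"
        return
     fi
     jsv_accept "Job accepted"
  }

  jsv_main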

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] gid_range values

2018-01-24 Thread William Hay
On Tue, Jan 23, 2018 at 06:22:28PM -0600, Calvin Dodge wrote:
> The docs we've found say that gid_range must be greater than the
> number of jobs expected to run currently on one host.
> 
> Our recent experience suggests that it has to be greater than the
> total number of jobs in the queue.  If it's not, then a few jobs get
> mysteriously killed (typically about 1 in 30-40).
> 
> Has anyone else had that experience? We did fix this by expanding the
> range (it was the default of 2-20100, which we changed to
> 20200-21000), but would like to know if there's a "best practice"
> regarding the range of values.

Queued jobs shouldn't make a difference.  It is possible that there might
be some sort of race where the gid is held onto by grid engine while it runs
the epilog (I'm not sure; I haven't checked exactly when the group is
deallocated).  Having the range be twice the number of concurrently running
jobs should cover this unless your epilog is getting stuck for some reason.
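For reference, the setting lives in the global configuration; the range below
is just the one you already switched to:

  # qconf -mconf
  gid_range   20200-21000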

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Exporting environment variables using -V doesn't work from RHEL7 to RHEL6

2018-01-19 Thread William Hay
On Fri, Jan 05, 2018 at 08:02:18AM +, srinivas.chakrava...@wipro.com wrote:
>Hi,
> 
> 
> 
>We have recently upgraded one submit host from RHEL6.7 to RHEL7.2.
> 
>Most of our grid execution servers are RHEL6.7, with a few RHEL6.3
>servers. When we run any jobs by using "-V" option to export environment
>variables, it does not work and the job fails.
In RHEL7 the /bin directory has been replaced by a symlink to /usr/bin and 
/sbin by a symlink
to /usr/sbin.  The former contents have been merged.

The default path in RHEL7 includes /usr/bin but not /bin.  In RHEL6 /bin and 
/usr/bin
are different directories and most binaries only exist in one place or the 
other.

I don't have a RHEL6 box to hand but on RHEL5 both sleep and date live in /bin 
rather
than /usr/bin.

Adding a line to the script PATH="${PATH}:/bin" will probably result in the 
binary being found.

Rather than passing the entire environment through it might be better to make 
your scripts
run a login shell:

#!/bin/bash -l 

so that the environment gets set up correctly for the machine on which the job 
is run.
Then pass any environment variables you actually need via -v options.
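For example (the variable names are only placeholders):

  qsub -v LM_LICENSE_FILE -v RUN_MODE=production myjob.sh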


William



> 
> 
> 
>As an example examine a simple job:
> 
> 
> 
>
> 
>#!/bin/bash
> 
> 
> 
>date
> 
>echo "Sleeping now"
> 
>sleep $1
> 
> 
> 
>
> 
> 
> 
>When we run this job without exporting environment variables, it works
>fine. But exporting environment variables causes errors as shown below:
> 
> 
> 
>/job_scripts/2666843: line 3: date: command not found
> 
>/job_scripts/2666843: line 5: sleep: command not found
> 
> 
> 
>Of course, this issue is persistent in interactive jobs also, which causes
>applications not working properly.
> 
> 
> 
>Is there any way to circumvent this issue?
> 
> 
> 
>Thanks and regards,
>Srinivas.
> 
>The information contained in this electronic message and any attachments
>to this message are intended for the exclusive use of the addressee(s) and
>may contain proprietary, confidential or privileged information. If you
>are not the intended recipient, you should not disseminate, distribute or
>copy this e-mail. Please notify the sender immediately and destroy all
>copies of this message and any attachments. WARNING: Computer viruses can
>be transmitted via email. The recipient should check this email and any
>attachments for the presence of viruses. The company accepts no liability
>for any damage caused by any virus transmitted by this email.
>www.wipro.com
>__
>This email has been scanned by the Symantec Email Security.cloud service.
>For more information please visit http://www.symanteccloud.com
>__

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] Happy new year GridEngine Users

2018-01-18 Thread William Hay
Testing if the users@gridengine.org mailing list works in 2018.


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

2018-01-18 Thread William Hay
On Fri, Dec 22, 2017 at 05:55:26PM -0500, berg...@merctech.com wrote:
> True, but even with that info, there doesn't seem to be any universal
> way to tell an arbitrary GPU job which GPU to use -- they all default
> to device 0.

With Nvidia GPUs we use a prolog script that manipulates lock files
to select a GPU then chgrp's the selected /dev/nvidia? file so the group is
the group associated with the job.   An epilog script undoes all of this.  
The /dev/nvidia? files' permissions are set so they are inaccessible to anyone
other than the owner (root) and the group.  However you have to pass
a magic option to the kernel to prevent permissions from being reset
whenever anyone tries to access the device.

This seems to be a fairly bullet proof way of restricting jobs to
their assigned GPU.
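A heavily simplified sketch of the idea (the lock directory, the use of the
user's primary group and the handover file are assumptions for illustration,
not our production script; the prolog has to run as root, e.g. prolog
root@/path/to/prolog in queue_conf(5), for the chgrp to work):

  #!/bin/bash
  # prolog: grab a free GPU via a lock file and hand its device to the job
  LOCKDIR=/var/lock/sge-gpu
  mkdir -p "$LOCKDIR"
  for dev in /dev/nvidia[0-9]*; do
      [ -e "$dev" ] || continue
      idx=${dev#/dev/nvidia}
      # noclobber makes the redirect an atomic "create only if absent"
      if ( set -o noclobber; echo "$JOB_ID" > "$LOCKDIR/gpu$idx" ) 2>/dev/null; then
          chgrp "$(id -gn "$USER")" "$dev"
          chmod 660 "$dev"
          echo "$idx" > "$TMPDIR/chosen_gpu"   # the epilog reads this to undo
          exit 0
      fi
  done
  echo "prolog: no free GPU on $(hostname)" >&2
  exit 1   # see queue_conf(5) for how non-zero prolog exits are treated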


William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

2018-01-18 Thread William Hay
On Mon, Jan 08, 2018 at 06:23:20PM -0500, berg...@merctech.com wrote:
> Yeah, I've looked at that, but it brings up the 'accounting problem'
> of changing the variable each time a GPU-enabled job begins or ends.
Set it in starter_method (and sshd's force command if you want to support
qlogin etc).


> => > Does this affect things like "nvidia-smi" (user-land, accesses all
> => > GPUs, but does not run jobs)?
> => 
> => It should. If you want it to scan everything (for load sensor purposes
> => or otherwise) you can run it as root or the user owning the /dev/nvidia?
> => files if that isn't root.
> 
> Yes, but some users want to query the GPU as well.
sudo invoking wrapper perhaps?

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] I'm getting an "Unable to initialize env" error; but our simultaneous ECs should be small

2017-11-23 Thread William Hay
On Wed, Nov 22, 2017 at 09:53:17AM -0800, Mun Johl wrote:
>Hi,
>Periodically I am seeing the following error:
> 
>  Unable to initialize environment because of error: cannot register event
>  client. Only 100 event clients are allowed in the system
> 
>The error first showed up a few days ago but stated "950 event clients are
>allowed".  Because MAX_DYN_EC was not set in my config, I equated it to
>100.
I am not sure what you mean by "I equated it to 100"?  Did you set it to 100
after getting  the error?  IIRC the default is 1000.


>However, our sim ring is fairly small at this point and we shouldn't be
>getting anywhere near 100 outstanding qsub's (let alone 950).  Therefore,
>I'm wondering what other factors could result in this error?
>For example, could a slow network or slow grid master result in this
>error?
>Any suggestions on how I can get to root cause would be most appreciated.
>Thanks,

Are you actually using qsub?  IIRC when using DRMAA it is possible to leak
event clients if you launch multiple jobs from the same process (i.e. an event
client is created when a job is submitted but isn't automatically freed when
the job terminates, only when the client program exits).
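You can also list the event clients currently registered with the qmaster:

  qconf -secl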

If you are using qsub -sync y check that the qsub processes are actually being
reaped (ie there aren't a bunch of zombie qsubs hanging around).

Also check that you aren't short of filehandles (ie ulimit) either where the 
submit 
program runs or where the qmaster lives.


William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Integration of GPUs into GE2011.11p1

2017-10-31 Thread William Hay
On Mon, Oct 30, 2017 at 09:56:37PM +0530, ANS wrote:
>Hi,
>Thank you for the detailed info.
>But can let me know how can i submit a job using 4 GPUs, 8 cores from
>2nodes consisting of 2 GPUs, 4 cores from each node.
>Thanks,

That's not something the free versions of grid engine do easily.
However we fake per HOST consumption at UCL by means of policy.  
We have a JSV enforced rule that multihost jobs  request exclusive 
access to nodes via an exclusive complex and have our PEs set up
so that all nodes assigned to a job are identical (with load sensors 
to detect brokenness etc).  The prolog runs subordinate prologs on 
the slave nodes of a job via ssh.  This allows a per JOB complex to act as
a faux per HOST complex.
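For reference, the exclusive complex is just a boolean consumable with the
EXCL relational operator; a sketch of the qconf -mc line (name and urgency are
local choices):

  exclusive   excl   BOOL   EXCL   YES   YES   0   1000

It then gets set as exclusive=true in complex_values on each exec host, and
jobs request it with -l exclusive=true.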

I believe Univa Grid Engine has genuine per host complexes but only for
the RSMAP (ie named resource) complexes.

William




signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Integration of GPUs into GE2011.11p1

2017-10-30 Thread William Hay
On Wed, Oct 25, 2017 at 04:59:05PM +0200, Reuti wrote:
> Hi,
> 
> > Am 25.10.2017 um 16:06 schrieb ANS :
> > 
> > Hi all,
> > 
> > I am trying to integrate GPUs into my existing cluster with 2 GPUs per 
> > node. I have gone through few sites and done the following
> > 
> > qconf -mc
> > gpu gpuINT <=YES YES0   
> >  0
> > 
> > qconf -me gpunode1
> > complex_valuesgpu=2
> > 
> > But still i am unable to launch the jobs using GPUs. Can anyone help me.
> 
> What do you mean by "unable to launch the jobs using GPUs"? How do you submit 
> the jobs? The jobs are stuck or never accepted by SGE?
> 
> There is no way to determine which GPU was assigned to which job. Univa GE 
> has an extension for it called "named resources" or so. You could define two 
> queues with each having one slot and the name of the name of the chosen queue 
> determines the GPU to be used after some mangling.

We set permissions so that only owner and group, not world, can access the
/dev/nvidia? file (/dev/nvidiactl by contrast needs to be accessible by
anyone).  We use a prolog script  here which uses lock files to logically
assign nvidia GPUs to jobs. The script then chgrp's the /dev/nvidia? file
associated with the GPU to be owned by the group associated with the job.
The epilog undoes what the prolog did.  We need to pass a magic flag to
the kernel or virtually any time you touch them the permissions of the
/dev files get reset.

This seems to work to prevent programs from seeing GPUs they weren't
assigned by the prolog

We use JOB for consumable so GPUs are allocated per job on the head node
not per slot.


William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] running SOGE/execd on Cygwin

2017-10-20 Thread William Hay
On Thu, Oct 19, 2017 at 04:49:40PM -0700, Simon Matthews wrote:
> Does anyone have any pointers on running execd on cygwin?
> 
> inst_sge -x fails, because the 'uidgid' command doesn't seem to have been 
> built.
> 
> I can set the environment variables myself and then start the program,
> but the qmaster doesn't see the execd client. I have disabled the
> firewall:
> 
> $ ./CYGWIN_X86/sge_execd.exe

When you say it doesn't see it, do you mean it is not listed, or it is listed
but the host is uncontactable?  Or to put it another way: what commands are
you 
using to determine that "the qmaster doesn't see the execd client" 
and what is their output?

If the execd is just uncontactable by the qmaster qping might help determine
whether the execd is responding on the ports it is supposed to.  If it is then 
you should
check where the qmaster is looking.
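For example (6445 is the usual sge_execd port; adjust to whatever your
installation uses):

  qping -info <exec_host> 6445 execd 1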

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Cygwin?

2017-10-16 Thread William Hay
On Fri, Oct 13, 2017 at 05:02:51PM -0700, Simon Matthews wrote:
> William, thanks for the assistance. I was able to get further.
> 
> Can anyone help me with this?
> 
> SGE_INPUT_CFLAGS="-I/usr/include/tirpc" ./aimk -only-core -no-secure
That looks like the same problem at link time.  
Setting SGE_INPUT_LDFLAGS="-ltirpc" should do it, although you might need
to pass the path to your tirpc library in with -L as well.
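i.e. something along these lines (the -L path is only a guess at where your
libtirpc lives):

  SGE_INPUT_CFLAGS="-I/usr/include/tirpc" SGE_INPUT_LDFLAGS="-L/usr/lib -ltirpc" ./aimk -only-core -no-secure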


William

> 
> ...
> gcc -DSGE_ARCH_STRING=\"cygwin-x86\" -O3 -Wall -Wstrict-prototypes
> -DUSE_POLL -DLINUX -D_GNU_SOURCE -DGETHOSTBYNAME -DGETHOSTBYADDR
> -DHAVE_XDR_H=1  -DTARGET_32BIT -I/usr/include/tirpc -DSGE_PQS_API
> -DSPOOLING_dynamic  -D_FILE_OFFSET_BITS=64 -DHAVE_HWLOC=1 -DNO_JNI
> -DCOMPILE_DC -D__SGE_NO_USERMAPPING__ -I../common -I../libs
> -I../libs/uti -I../libs/juti -I../libs/gdi -I../libs/japi
> -I../libs/sgeobj -I../libs/cull -I../libs/comm -I../libs/comm/lists
> -I../libs/sched -I../libs/evc -I../libs/evm -I../libs/mir
> -I../daemons/common -I../daemons/qmaster -I../daemons/execd
> -I../clients/common -I. -o test_sge_object -L.
> -Wl,-rpath,\$ORIGIN/../../lib/cygwin-x86  test_sge_object.o
> libsgeobj.a libsgeobjd.a libcull.a libcomm.a libcommlists.a libuti.a
> -luti -ldl  -lm -lpthread
> libcull.a(pack.o):pack.c:(.text+0x515): undefined reference to `xdrmem_create'
> libcull.a(pack.o):pack.c:(.text+0x525): undefined reference to `xdr_double'
> libcull.a(pack.o):pack.c:(.text+0x94d): undefined reference to `xdrmem_create'
> libcull.a(pack.o):pack.c:(.text+0x959): undefined reference to `xdr_double'
> collect2: error: ld returned 1 exit status
> make: *** [../libs/sgeobj/Makefile:364: test_sge_object] Error 1
> not done
> 
> Simon
> 
> On Fri, Oct 13, 2017 at 1:02 AM, William Hay <w@ucl.ac.uk> wrote:
> > On Thu, Oct 12, 2017 at 04:12:48PM -0700, System Administrator wrote:
> >> I think it should be part of the ./configure step.  If you exported it as 
> >> an
> >> env variable, then re-run the ./configure part.  Or put it at the beginning
> >> of the command, for example:
> >>
> >> CPPFLAGS=-I/usr/include/tirpc ./configure
> > Grid engine doesn't really use configure (some components have their own 
> > build process
> > that involve configure but the top level doesn't).  A similar role is 
> > performed
> > by aimk (which also controls the build).
> >
> > According to section 4.6 of 
> > https://arc.liv.ac.uk/trac/SGE/browser/sge/source/README.aimk you can
> > pass flags to the compiler by setting the variable SGE_INPUT_CFLAGS when 
> > invoking aimk.
> > So something likei this should work:
> > SGE_INPUT_CFLAGS="-I/usr/include/tirpc" ./aimk 
> >
> > William
> >
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Cygwin?

2017-10-13 Thread William Hay
On Thu, Oct 12, 2017 at 04:12:48PM -0700, System Administrator wrote:
> I think it should be part of the ./configure step.  If you exported it as an
> env variable, then re-run the ./configure part.  Or put it at the beginning
> of the command, for example:
> 
> CPPFLAGS=-I/usr/include/tirpc ./configure
Grid engine doesn't really use configure (some components have their own build 
process
that involves configure but the top level doesn't).  A similar role is performed
by aimk (which also controls the build).

According to section 4.6 of 
https://arc.liv.ac.uk/trac/SGE/browser/sge/source/README.aimk you can
pass flags to the compiler by setting the variable SGE_INPUT_CFLAGS when 
invoking aimk.
So something like this should work:
SGE_INPUT_CFLAGS="-I/usr/include/tirpc" ./aimk 

William



signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Cygwin?

2017-10-10 Thread William Hay
On Mon, Oct 09, 2017 at 07:46:05PM -0700, Simon Matthews wrote:
> Is it possible to build SOGE for Cygwin?
> 
> SOGE says it is based on OGS which claimed that it supported Cygwin.
> 
> I only need execd on Cygwin. Qmaster and the GUI tools need only run
> under CentOS 6 and 7.
> 
> Simon
I don't think SoGE is (or claims to be) based on OGS although it may have 
imported
the odd feature/patch from there.  

According to 

https://arc.liv.ac.uk/trac/SGE/ticket/1557

Changes have been made to support cygwin compile so it should build OK.  How 
well it works I couldn't say.
In the long run the Windows Subsystem for Linux may be a better target.

William


signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Max jobs per user

2017-10-10 Thread William Hay
On Sat, Sep 30, 2017 at 02:21:12AM +, John_Tai wrote:
>Currently if I set a max job per user in the cluster, a new job will be
>rejected if it exceeds the max.
> 
> 
> 
>> qrsh
> 
>job rejected: Only 100 jobs are allowed per user (current job count: 264)
> 
> 
> 
>Is there a way to queue the job instead of rejecting it?
> 
> 
> 
>Thanks
> 
>John
> 
There are two similar settings in gridengine.  In the grid engine configuration 
you will find
max_u_jobs which controls the number of active (running OR queued) jobs per user.
In the scheduler configuration you will find maxujobs which controls the 
maximum number of running
jobs per user.  You probably want to set the latter rather than the former.
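For example (100 is just the figure from your error message):

  # global configuration (qconf -mconf): running + pending jobs per user
  max_u_jobs 100

  # scheduler configuration (qconf -msconf): running jobs per user
  maxujobs 100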

William



signature.asc
Description: PGP signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] load scaling

2017-09-04 Thread William Hay
On Fri, Sep 01, 2017 at 06:28:43PM +0900, Ueki Hikonuki wrote:
> Hi,
> 
> I tried to understand load scaling. But it is still unclear for me.
> 
> Let's assume two hosts.
> 
> hostA very fast machine
> hostB regular speed machine
> 
> Even though np_load_avg of hostA is much higher than hostB,
> a job still runs faster on hostA. But in this case gridengine usually
> submits the job to hostB because its np_load_avg is lower.
> 
> I want to tune this environment by changing load scaling.
> But I don't know how.
>
There probably isn't a built-in load sensor that directly reflects what you
want.  You could:
1) Write your own load sensor that reflects your notion of which node is
   preferable, then use that in the load_formula (a minimal sketch follows
   after this list).
2) Change the queue_sort_method to seqno and give the queues on hostA a lower
   sequence number.  You could add an np_load_avg based load_threshold that
   will stop jobs being assigned to a host if it is overloaded.
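A minimal load sensor sketch for option 1 (the complex name "host_speed" and
the /etc/host_speed value file are invented; the complex has to be created
with qconf -mc and the script listed in load_sensor in qconf -mconf):

  #!/bin/sh
  HOST=$(hostname)
  while :; do
      # sge_execd writes a line on stdin when it wants a report; "quit" means exit
      read -r line || exit 0
      [ "$line" = "quit" ] && exit 0
      echo begin
      echo "$HOST:host_speed:$(cat /etc/host_speed 2>/dev/null || echo 1)"
      echo end
  done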

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Fonts issue with RHEL 7.3

2017-07-26 Thread William Hay
On Tue, Jul 25, 2017 at 01:23:43AM +, Matt Hohmeister wrote:
>When trying to run qmon on RHEL 7.3, I get this. Can someone share which
>packages would take care of this?
Hopefully one of these pages should sort it out.  The last one should 
definitely do it but is fixing things
user by user rather than globally.
https://kence.org/2011/02/08/gridengine-qmon-font-problems/
https://bugzilla.redhat.com/show_bug.cgi?id=657406
http://talby.rcs.manchester.ac.uk/~ri/_bits/sge_qmon_fonts.html

> 
> 
> 
>Warning: Cannot convert string
>"-adobe-helvetica-medium-r-*--14-*-*-*-p-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-helvetica-bold-r-*--14-*-*-*-p-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-helvetica-medium-r-*--20-*-*-*-p-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-helvetica-medium-r-*--12-*-*-*-p-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-helvetica-medium-r-*--24-*-*-*-p-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*"
>to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
> 
>Warning: Cannot convert string
>"-adobe-helvetica-medium-r-*--10-*-*-*-p-*-*-*" to type FontStruct
> 
>X Error of failed request:  BadName (named color or font does not exist)
> 
>  Major opcode of failed request:  45 (X_OpenFont)
> 
>  Serial number of failed request:  654
> 
>  Current serial number in output stream:  665
> 
> 
> 
>Matt Hohmeister, M.S.
> 
>Systems and Network Administrator
> 
>Department of Psychology
> 
>Florida State University
> 
>PO Box 3064301
> 
>Tallahassee, FL 32306-4301
> 
>Phone: +1 850 645 1902
> 
>Fax: +1 850 644 7739
> 
>https://psy.fsu.edu/
> 
> 

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] complex error

2017-07-26 Thread William Hay
On Tue, Jul 25, 2017 at 12:57:47AM +, John_Tai wrote:
>I have configured virtual_free as a requestable resource:
> 
> 
> 
>virtual_freememMEMORY  <=YES JOB   
>00
> 
> 
> 
>And it's been working great for months.
> 
> 
> 
>However today all of a sudden I got this error in messages:
> 
> 
> 
>07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free"
>exceeded: capacity is 95945748480.262146, job 5983416 requests additional
>2680.00
> 
>07/25/2017 08:45:41|worker|ibm068|E|cannot start job 5983416.1, as
>resources have changed during a scheduling run
> 
>07/25/2017 08:45:41|worker|ibm068|W|Skipping remaining 7 orders
> 
> 
> 
>And any job would not get scheduled at all, they'd be in waiting state
>"qw", no matter how many resources it's requesting:
Are they all failing to start on the same host?  Might be worth disabling the 
queues
on that host so the scheduler looks for another place to put it.  Have a look 
at the host 
to see if something is eating virtual memory there.
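e.g.:

  qmod -d '*@ibm068'    # disable every queue instance on the suspect host
  qmod -e '*@ibm068'    # re-enable once it is sorted out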

William

> 
> 
> 
># qstat -j 5983416
> 
>==
> 
>job_number: 5983416
> 
>exec_file:  job_scripts/5983416
> 
>submission_time:Tue Jul 25 08:18:46 2017
> 
>owner:  jumbo
> 
>uid:986
> 
>group:  memory
> 
>gid:41
> 
>sge_o_home: /home/jumbo
> 
>sge_o_log_name: jumbo
> 
>sge_o_path:
>
> /home/eda/cadence/IC616.500.3_20131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/ho
>
> me/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda/cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/
>
> usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:/home/IPproj/IOproject/quan/Flatten
> 
>sge_o_shell:/bin/csh
> 
>sge_o_workdir: 
>/home/memorytemp/jumbo/180G_RK/S018DP/design_review
> 
>sge_o_host: ibm041
> 
>account:sge
> 
>cwd:   
>/home/memorytemp/jumbo/180G_RK/S018DP/design_review
> 
>merge:  y
> 
>hard resource_list: virtual_free=2000m
> 
>mail_list:  jumbo@ibm041
> 
>notify: FALSE
> 
>job_name:   run.pl
> 
>jobshare:   0
> 
>hard_queue_list:256g.q
> 
>env_list:  
>
> REMOTEHOST=dsls11,MANPATH=/home/sge/sge6.2u6/man:/opt/SUNWspro/man:/usr/man:/usr/openwin/man:/usr/dt/man:/
>
> usr/local/man:/usr/local/mysql/man:/usr/local/samba/man,VNCDESKTOP=ibm041:344
>(jumbo),HOSTNAME=ibm041,HOST=ibm041,SHELL=/bin/csh,TERM=
>
> xterm,GROUP=memory,USER=jumbo,LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/usr/ccs/lib:/usr/local/lib:/usr/local/mysql/lib,L
>
> S_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.
>
> exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip
>
> =00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*
>
> .xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,HOSTTYPE=x86_64-linux,MAIL=/var/spool/mail/jumbo,PATH=/home/eda/cadence/IC616.500.3_20
>
> 131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda
>
> /cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:
>
> /home/IPproj/IOproject/quan/Flatten,INPUTRC=/etc/inputrc,PWD=/home/memorytemp/jumbo/180G_RK/S018DP/design_review,EDITOR=xterm
>-e vi,LA
>
> NG=en_US.UTF-8,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=6,HOME=/home/jumbo,OSTYPE=linux,VENDOR=unknown,MACHTYPE=x86_64
>,LOGNAME=jumbo,LESSOPEN=|/usr/bin/lesspipe.sh
>
> %s,DISPLAY=:344.0,G_BROKEN_FILENAMES=1,_=/usr/bin/gnome-session,GTK_RC_FILES=/etc/gtk/gt
>
> krc:/home/jumbo/.gtkrc-1.2-gnome2,SESSION_MANAGER=local/ibm041:/tmp/.ICE-unix/17118,GNOME_KEYRING_SOCKET=/tmp/keyring-FJMO4E/socket,GN
>
> OME_DESKTOP_SESSION_ID=Default,DESKTOP_STARTUP_ID=NONE,COLORTERM=gnome-terminal,WINDOWID=38263354,SGE_ROOT=/home/sge/sge6.2u6,SGE_CELL
>
> =cell1,SGE_CLUSTER_NAME=p5098,IC61=/home/eda/cadence/IC616.500.3_20131102,MMSIMHOME=/home/eda/cadence/Spectre161ISR2,LM_LICENSE_FILE=5
>
> 280@ibm041:5280@ibm001:5280@ibm002:5280@ibm003:5260@cadlic:5280@cadlic:5280@dsw3:5280@dsw7:5280@ibm004:5280@ibm005:5280@ibm006:5280@10
>

Re: [gridengine users] DISPLAY problem in RHEL6.8

2017-07-18 Thread William Hay
On Tue, Jul 18, 2017 at 09:23:05AM +, John_Tai wrote:
>I'm having a DISPLAY issue in RHEL6.8 that I don't have in RHEL5. I am
>using SGE6.2u6
> 
> 
> 
>I use VNC to connect to a linux server. By default the DISPLAY is set to
>:4.0 and I can start GUI jobs locally:
> 
> 
> 
># echo $DISPLAY
> 
>:4.0
> 
># xclock
> 
> 
> 
>However if I submit a GUI job it gives me a display error:
> 
> 
> 
># qrsh -V -b y -cwd -now n -q test.q@ibm065 xclock
> 
>Error: Can't open display: :4.0
> 


> 
> 
>At this point I set the display by adding the hostname and I can
>successfully submit the job:
> 
> 
> 
># setenv DISPLAY `hostname`:4.0
> 
># echo $DISPLAY
> 
>ibm041:4.0
> 
># qrsh -V -b y -cwd -now n -q test.q@ibm065 xclock
> 
> 
> 
>At least, it is successful in RHEL5.
> 
> 
> 
>However in RHEL6.8 setting the display with hostname doesn't allow GUi
>jobs to start either locally or remotely:

Given the way you are doing this I doubt it has much to do with gridengine per 
se.

Check whether the xserver is listening on all interfaces or just localhost 
(possible tightening for security).

In general I think you would need to authorise remote hosts to connect to your
X server, possibly via xauth.  Possibly your old setup did this behind the
scenes.

The other option would be to set up qrsh with ssh integration and configure the 
ssh for  X forwarding
(which is what we do here).  Then you wouldn't need to fiddle with the 
environment variables.
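The relevant bits of that setup live in the global configuration (qconf
-mconf); a rough sketch for qrsh-with-a-command is below.  qlogin and plain
qrsh additionally need the qlogin_*/rlogin_* equivalents, and qlogin normally
wants a small wrapper script around ssh rather than the bare binary:

  rsh_command   /usr/bin/ssh -X
  rsh_daemon    /usr/sbin/sshd -i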

William




signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] New installation

2017-07-18 Thread William Hay
On Mon, Jul 17, 2017 at 08:00:31PM +, Matt Hohmeister wrote:
> Thank you; this is a big help. :-)
> 
> Along those lines, what do you all suggest for the shared directory? From 
> these instructions, it *appears* that the best choice is to share out NFS 
> from the master's /opt/sge/default, having that mounted on the execution host 
> as /opt/sge/default.
> 
> Thoughts?
That is what we do here on our three clusters.

William




signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Repeated error message in logs from RQS rules

2017-07-17 Thread William Hay
On Fri, Jul 14, 2017 at 08:36:06AM +, Simon Andrews wrote:
>Can anyone shed any light on an error I'm getting repeated thousands of
>times in my grid engine messages log.  This happens when I have a job
>which is submitted and which is stopped from running by an RQS rule I have
>set up.  The error I get is:
> 
> 
> 
>07/14/2017 09:27:08|schedu|rocks1|C|not a single host excluded in
>rqs_excluded_hosts()
> 
> 
> 
>The RQS ruleset I have which triggers this looks like:
> 
Not so much a fix but a possible workaround:
Send your logs to syslog (rather than having qmaster log directly into files) 
and rely on syslog replacing repeated messages with 'last message repeated N
times'.

You could also try tweaking the log_level parameter.

I don't use RQS myself but my best guess is that you have two sorts of hosts:
regular hosts with a batch queue, and the hosts in @interactive with an
interactive queue.  Because the hosts {@interactive} clause doesn't further
restrict where the limit applies (jobs are already limited by being batch or
interactive), grid engine complains that you appear to have a no-op in your
limit.  I think this complaint by SGE is spurious.

Possibly:
Give the interactive queue a different name from the regular batch queue.  Make 
sure the batch 
queue can't run on the interactive hosts and vice versa.  Then apply the limit 
to the queue
rather than the host.

> 
> 
>{
> 
>   name per_user_slot_limit
> 
>   description  "limit the number of slots per user"
> 
>   enabled  TRUE
> 
>   limitusers {*} hosts {@interactive} to slots=8
> 
>   limitusers {andrewss} to slots=2
> 
>   limitusers {@bioinf} to slots=616
> 
>   limitusers {*} to slots=411
> 
>}
> 
> 
> 
>The rule seems to work, and jobs are held, and then started as expected. 
>A job which fails to schedule gets a state like this:
> 
> 
> 
>scheduling info:cannot run in queue instance
>"all.q@compute-1-6.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-5.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-7.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-0.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-3.local" because it is not of type batch
> 
>cannot run because it exceeds limit
>"andrewss/" in rule "per_user_slot_limit/3"
> 
>cannot run in queue instance
>"all.q@compute-1-4.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-1.local" because it is not of type batch
> 
>cannot run in queue instance
>"all.q@compute-1-2.local" because it is not of type batch
> 
> 
> 
>So it's seeing the rule and is applying it correctly, but the spurious
>errors are causing my messages file to inflate quickly when there are a
>lot of queued jobs.
> 
> 
> 
>Can anyone suggest how to debug or fix this?  I can't find anything
>relevant from googling around for the specific error outside of the
>library API it comes from.
> 
> 
> 
>This is using SGE-6.2u5p2-1.x86_64.
> 
> 
> 
>Thanks for any help you can offer!
> 
> 
> 
>Simon.
> 
> 
> 
> 
> 
>The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT
>Registered Charity No. 1053902.
> 
>The information transmitted in this email is directed only to the
>addressee. If you received this in error, please contact the sender and
>delete this email from your system. The contents of this e-mail are the
>views of the sender and do not necessarily represent the views of the
>Babraham Institute. Full conditions at: www.babraham.ac.uk

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] New installation

2017-07-17 Thread William Hay
On Fri, Jul 14, 2017 at 08:58:59PM +, Matt Hohmeister wrote:
>Hello-
> 
> 
> 
>First off, please accept my apologies for this post, as I have _never_
>used gridengine before. I have two servers, both running RHEL 7.3, and
>both linked to a shared xfs-formatted iSCSI volume at /mnt/shared.
> 
> 
> 
>I have done quite a bit of searching, but I'm trying to get the basic
>answer of: "Step-by-step installation instructions for someone who has
>never used gridengine before, but knows their way around RHEL."


RPMS are available from:
https://copr.fedorainfracloud.org/coprs/loveshack/SGE/

You'll need to install a queue master:
http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_master_host.shtml
And an execution host:
http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml

With only two hosts you'll probably want to make the queue master your submit 
host 
(see the man page for qconf for how to do this once the qmaster is installed).  
You may also
want to have it do double duty as an execution host as well.
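e.g. run this on the qmaster machine once it is installed:

  qconf -as $(hostname)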

William

> 
> 
> 
>Any thoughts?
> 
> 
> 
>Thanks!
> 
> 
> 
>Matt Hohmeister, M.S.
> 
>Systems and Network Administrator
> 
>Department of Psychology
> 
>Florida State University
> 
>PO Box 3064301
> 
>Tallahassee, FL 32306-4301
> 
>Phone: +1 850 645 1902
> 
>Fax: +1 850 644 7739
> 
>https://psy.fsu.edu/
> 
> 

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Ulimit for max open files

2017-06-28 Thread William Hay
On Tue, Jun 27, 2017 at 02:06:51PM +, Luis Huang wrote:
> I???ve tried setting S_DESCRIPTORS, it still cap it at 65535. When running PE 
> jobs, it also multiplies the cores with the max open file value. Even if I 
> set a realistic ulimit, it will hit the threshold with high slots job.
> 
> Thanks!
> Luis
Assuming Linux, what does sysctl fs.file-max report?

William



> 
> On 6/27/17, 4:22 AM, "William Hay" <w@ucl.ac.uk> wrote:
> 
> On Mon, Jun 26, 2017 at 05:24:57PM +, Luis Huang wrote:
> >Hi,
> >
> >
> >
> >To increase the max open file, we have set execd_params in qconf 
> -mconf
> >and also on the OS level:
> >
> >execd_params
> >H_DESCRIPTORS=262144,H_LOCKS=262144,H_MAXPROC=262144
> >
> >
> >
> >On our execution nodes we can see that SGE sets a soft limit of 65535
> >despite that we told it to set it to 262144.
> From the above it appears you set the hard limit
> 
> 
> >
> >Max open files65535262144   
> files
> >
> >
> Which appears to have been set.Perhaps you could try setting the soft 
> limit
> (S_DESCRIPTORS) to whatever value you think appropriate.
> 
> William
> 
> 
> This electronic message is intended for the use of the named recipient only, 
> and may contain information that is confidential, privileged or protected 
> from disclosure under applicable law. If you are not the intended recipient, 
> or an employee or agent responsible for delivering this message to the 
> intended recipient, you are hereby notified that any reading, disclosure, 
> dissemination, distribution, copying or use of the contents of this message 
> including any of its attachments is strictly prohibited. If you have received 
> this message in error or are not the named recipient, please notify us 
> immediately by contacting the sender at the electronic mail address noted 
> above, and destroy all copies of this message. Please note, the recipient 
> should check this email and any attachments for the presence of viruses. The 
> organization accepts no liability for any damage caused by any virus 
> transmitted by this email.


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Ulimit for max open files

2017-06-27 Thread William Hay
On Mon, Jun 26, 2017 at 05:24:57PM +, Luis Huang wrote:
>Hi,
> 
> 
> 
>To increase the max open file, we have set execd_params in qconf -mconf
>and also on the OS level:
> 
>execd_params
>H_DESCRIPTORS=262144,H_LOCKS=262144,H_MAXPROC=262144
> 
> 
> 
>On our execution nodes we can see that SGE sets a soft limit of 65535
>despite that we told it to set it to 262144.
From the above it appears you set the hard limit


> 
>Max open files65535262144   files 
>   
>
Which appears to have been set.  Perhaps you could try setting the soft limit
(S_DESCRIPTORS) to whatever value you think appropriate.
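i.e. something like:

  execd_params   S_DESCRIPTORS=262144,H_DESCRIPTORS=262144,H_LOCKS=262144,H_MAXPROC=262144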

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] taming qlogin

2017-06-26 Thread William Hay
On Fri, Jun 23, 2017 at 08:24:23AM -0700, Ilya wrote:
> Hello,
> 
> I am running 6.2u5 with ssh transport for qlogin (not tight integration) and
> users are abusing this service: run jobs for days, abandon their sessions
> that stay opened forever, etc. So I want to implement mandatory time limits
> for all interactive jobs and, perhaps, limit the number of interactive
> sessions available to any user.

> 
> I was thinking about limiting time one of the two ways: either set h_rt via
> JSV (server side) or by forcing all interactive jobs to a dedicated queue
> with time limit. However, there seem to be issues with both approaches.
> 
> There seems to be no way to reliably identify interactive job in JSV:
> - The only telling attribute is jobname, i.e., QLOGIN or QRLOGIN. However
> some users rename their interactive jobs, so this method will fails.

Assuming you are using a server side JSV then checking whether
the environment variable QRSH_PORT is set for the job from the JSV
distinguishes interactive jobs from batch mode (qsub) commands.  

If you are using a client side JSV then the CLIENT parameter
should tell you which binary did the invocation.

The various JSV are not called for qrsh -inherit so no interference with 
parallel jobs should occur.
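A rough sketch of the server-side version (the 8 hour limit is just an
example value):

  #!/bin/sh
  . $SGE_ROOT/util/resources/jsv/jsv_include.sh

  jsv_on_start()
  {
     jsv_send_env
  }

  jsv_on_verify()
  {
     if [ -n "$(jsv_get_env QRSH_PORT)" ]; then
        jsv_sub_add_param l_hard h_rt 8:00:00
        jsv_correct "h_rt forced onto interactive job"
        return
     fi
     jsv_accept "Job accepted"
  }

  jsv_main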

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] 6.2 Update 5 Patch 3 not available?

2017-06-13 Thread William Hay
On Mon, Jun 12, 2017 at 05:46:46PM -0400, Jeff Blaine wrote:
> The Open Grid Scheduler homepage at
> http://gridscheduler.sourceforge.net/ says:
> 
> The current bugfix & LTS (Long Term Support) release is version
> 6.2 update 5 patch 3 (SGE 6.2u5p3), which is based on Sun Grid
> Engine 6.2 update 5 (SGE 6.2u5). It was released on April 26, 2012
> and only contains critical bug fixes.

That particular fork hasn't been updated in a long time.  You are probably
better off switching to the Son of Grid Engine fork.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Throttling job starts (thundering herd)

2017-03-24 Thread William Hay
On Thu, Feb 16, 2017 at 01:43:47PM -0500, Stuart Barkley wrote:
> Is there a way to throttle job starts on Grid Engine (we are using Son
> of Grid Engine)?
Use a load sensor plus job_load_adjustments.  Tweak jobs to request
a low load (via sge_request or a jsv), or set an alarm on all queues
when the load is high enough.  The sched_conf man page implies the
load adjustment is instantaneous, so one or both should work.  I've never
tried it myself though.
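The relevant scheduler settings (qconf -msconf) would be along these lines;
the values shown are just the shipped defaults:

  job_load_adjustments        np_load_avg=0.50
  load_adjustment_decay_time  0:7:30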


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] John's cores pe (Was: users Digest...)

2017-03-23 Thread William Hay
On Thu, Mar 23, 2017 at 08:11:02AM +, John_Tai wrote:
> Can I still download 6.2? Haven't been able to find it.
> 
> John

If you're going to upgrade you might as well go all the way to SoGE 8.1.9.  

William
> 
> -Original Message-
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Wednesday, March 22, 2017 8:27
> To: John_Tai
> Cc: Christopher Black; users@gridengine.org; Coleman, Marcus [JRDUS Non-J]
> Subject: Re: [gridengine users] John's cores pe (Was: users Digest...)
> 
> Hi,
> 
> > Am 22.03.2017 um 04:24 schrieb John_Tai :
> >
> > I am now using sge6.1, however it doesn't have the option "JOB" for complex 
> > consumable value. Is there another way to NOT multiply consumable memory 
> > resource by number of pe slots?
> 
> Not that I'm aware of. It was a features introduced with SGE 6.2u2:
> 
> https://arc.liv.ac.uk/trac/SGE/ticket/197
> 
> -- Reuti
> 
> 
> >
> > Thanks
> > John
> >
> >
> >
> > -Original Message-
> > From: Reuti [mailto:re...@staff.uni-marburg.de]
> > Sent: Wednesday, December 21, 2016 7:05
> > To: Christopher Black
> > Cc: John_Tai; users@gridengine.org; Coleman, Marcus [JRDUS Non-J]
> > Subject: Re: [gridengine users] John's cores pe (Was: users Digest...)
> >
> >
> > Am 20.12.2016 um 23:42 schrieb Christopher Black:
> >
> >> We have found that the behavior that multiples consumable memory resource 
> >> requests by number of pe slots can be confusing (and requires extra math 
> >> in automation scripts), so we've have the complex consumable value set to 
> >> "JOB" rather than "YES". When this is done (at least on SoGE), the memory 
> >> requested is NOT multiplied by the number of slots. We also use h_vmem 
> >> rather than virtual_free.
> >
> > Correct, it's not multiplied. But only the master exechost will get its 
> > memory reduced in the bookeeping. The slave exechosts might still show a 
> > too high value of the available memory I fear.
> >
> > -- Reuti
> >
> >
> >> Best,
> >> Chris
> >>
> >> On 12/20/16, 5:11 AM, "users-boun...@gridengine.org on behalf of Reuti" 
> >>  
> >> wrote:
> >>
> >>
> >>> Am 20.12.2016 um 02:45 schrieb John_Tai :
> >>>
> >>> I spoke too soon. I can request PE and virtual_free separately, but I 
> >>> cannot request both:
> >>>
> >>>
> >>>
> >>> # qsub -V -b y -cwd -now n -pe cores 7 -l mem=10G -q all.q@ibm037
> >>> xclock
> >>
> >>   Above you request "mem" (which is a snapshot of the actual usage and may 
> >> vary over the runtime of other jobs [unless they request the total amount 
> >> already at the beginning of the job and stay with it]).
> >>
> >>> Your job 180 ("xclock") has been submitted # qstat
> >>> job-ID  prior   name   user state submit/start at queue   
> >>>slots ja-task-ID
> >>> -
> >>>  180 0.55500 xclock johntqw12/20/2016 09:43:41
> >>> 7
> >>> # qstat -j 180
> >>> ==
> >>> job_number: 180
> >>> exec_file:  job_scripts/180
> >>> submission_time:Tue Dec 20 09:43:41 2016
> >>> owner:  johnt
> >>> uid:162
> >>> group:  sa
> >>> gid:4563
> >>> sge_o_home: /home/johnt
> >>> sge_o_log_name: johnt
> >>> sge_o_path: 
> >>> /home/sge/sge8.1.9-1.el5/bin:/home/sge/sge8.1.9-1.el5/bin/lx-amd64:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/johnt/bin:.
> >>> sge_o_shell:/bin/tcsh
> >>> sge_o_workdir:  /home/johnt/sge8
> >>> sge_o_host: ibm005
> >>> account:sge
> >>> cwd:/home/johnt/sge8
> >>> hard resource_list: virtual_free=10G
> >>
> >>   10G times 7 = 70 GB
> >>
> >>   The node has this amount of memory installed and it is defined this way 
> >> in `qconf -me ibm037`?
> >>
> >>   -- Reuti
> >>
> >>
> >>> mail_list:  johnt@ibm005
> >>> notify: FALSE
> >>> job_name:   xclock
> >>> jobshare:   0
> >>> hard_queue_list:all.q@ibm037
> >>> env_list:   TERM=xterm,DISPLAY=dsls11:3. [..]
> >>> script_file:xclock
> >>> parallel environment:  cores range: 7
> >>> binding:NONE
> >>> job_type:   binary
> >>> scheduling info:cannot run in queue "sim.q" because it is not 
> >>> contained in its hard queue list (-q)
> >>>  cannot run in queue "pc.q" because it is not 
> >>> contained in its hard queue list (-q)
> >>>  cannot run in PE "cores" because it only
> >>> offers 0 slots
> 

Re: [gridengine users] Make qmaster buffer larger

2017-03-10 Thread William Hay
On Thu, Mar 09, 2017 at 05:20:37PM +0100, Jerome Poitout wrote:
> Hello,
> 
> OGS/GE 2011.11p1
> 
> I have an issue while submitting numerous jobs in a short time (over 300
> - not so much for me...) with -sync y option. It seems that qmaster
> cannot handle all the requests and i get huge load on the head server
> (>400) and memory gets almost full (32GB).
> 
> These jobs are run by a third party product that does not support job
> arrays (as far as we currently know).
> 
> Then I get some timeout while trying to qstat something...
> 
> [root@ ~]# qstat -u user
> error: failed receiving gdi request response for mid=1 (got syncron
> message receive timeout error).

You could try fiddling with the gdi_timeout and gdi_retries settings in the 
qmaster_params to see if that helps - depends where the timeout is 
happening though.  
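e.g. (the values are only examples):

  qmaster_params   gdi_timeout=120,gdi_retries=4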

> 
> Any idea on how to raise the number a jobs that can be qsub in a short
> time ? I am almost sure that a qmaster params can be used but as I am in
> production environment, I prefer to be careful...

Where is the qmaster relative to the job spool's backing storage?  While in 
theory the qmaster can access the job spool over NFS in practice it can slow 
things down enough to cause timeouts like the above.  I like to keep the 
qmaster on the same machine where the disks which host the job spool live.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qsub and reservation

2017-03-10 Thread William Hay
On Thu, Mar 09, 2017 at 07:29:25PM +0100, Roberto Nunnari wrote:
> I don't mean move from node to node.. by moving I mean that something
> happens in the scheduler.. that the scheduler reserves a slot for the
> pending job requesting reservation.. in the schedule file, I see only lines
> with the word RESERVING.. and never something like RESERVED.. or little
> changes that tell me that something is changing.. I always see lines like
> these:
> 3653372:1:RESERVING:1489043424:660:P:smp:slots:32.00
> 3653372:1:RESERVING:1489043424:660:Q:long.q@node19.cluster:slots:32.00
> I believe that if the scheduler reserves a slot, something in these lines
> should change..

Nope.  RESERVING is what the schedule file says when the scheduler
reserves a resource.  The file is describing what the scheduler is
doing not the state of the resource.  One quirk is that the scheduler
reconsiders where the reservation goes with each scheduling cycle.
The reservation just prevents jobs from starting that would conflict with
the reservation this scheduling cycle.  This means that the same resources
should have the same predicted availability next scheduling cycle..

As long as the number in the 4th field isn't continually increasing, your job
should eventually get the resources marked as RESERVING.

One quirk is that if you heavily weight a per user functional share
policy (like we do at UCL) then small jobs from a user backfilling can
deprive large jobs from that same user of the priority needed to hold
onto a reservation.  The workaround for this is to educate said users
to have their small jobs wait for (depend on) the large job to run.
If you use functional share based on something other than users, have fun
convincing said users to co-ordinate.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qsub and reservation

2017-03-10 Thread William Hay
On Thu, Mar 09, 2017 at 02:24:38PM +0100, Roberto Nunnari wrote:
> Hi Reuti.
> Hi William.
> 
> here's my settings you required:
> paramsMONITOR=1
> max_reservation   32
> default_duration  0:10:0
> 
> I cannot understand how What I see in
> ${SGE_ROOT}/${SGE_CELL}/common/schedule can help me.. here's a little
> extract for a job submitted with -R y, and it keeps repeating without change
> ...
> 3653372:1:RESERVING:1489043424:660:P:smp:slots:32.00
> 3653372:1:RESERVING:1489043424:660:Q:long.q@node19.cluster:slots:32.00
> 3653372:1:RESERVING:1489043424:660:P:smp:slots:32.00
> 3653372:1:RESERVING:1489043424:660:Q:long.q@node19.cluster:slots:32.00
> ...
> 
> Thank you for your help.
> Roberto
The 4th field is the date of the reservation in seconds since the unix epoch.

date -d@1489043424 

Will convert it to something readable (early yesterday morning).  The 5th field
(despite what the man page says) is the duration of the reservation.  You 
usually 
want to look at the information from the last scheduling run (scheduling runs 
are 
separated from each other by lines of consecutive colons).

That default_duration looks a little short.  The scheduler is assuming any 
running jobs
will terminate in under 10 minutes and as a result is probably trying to
reserve resources that won't actually be free when the reservation comes due.

To make reservations work you really need most jobs to have a hard time limit 
associated with them
and a long default_duration (as in Reuti's example) to encourage the scheduler 
not to schedule jobs
on resources currently occupied by jobs without such a limit.
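In qconf -msconf terms that means something like the following (the duration
is the effectively-infinite value from Reuti's example):

  max_reservation    32
  default_duration   8760:00:00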

> On 08.03.2017 18:48, Reuti wrote:
> >- do you request any expected runtime in the job submissions (-l h_rt=???)?
> >- is a sensible default set in `qconf -msconf` for the runtime 
> >(default_duration 8760:00:00)?
> >- is a sensible default set in `qconf -msconf` for the number of 
> >reservations (max_reservation 20)?

Bear in mind that, with your current config, only the 32 highest priority jobs 
in the queue that request
one will get a reservation.


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qsub and reservation

2017-03-09 Thread William Hay
On Wed, Mar 08, 2017 at 06:33:23PM +0100, Roberto Nunnari wrote:
> Hello.
> 
> I am using Oracle Grid Engine 6.2u7 and have some trouble understanding
> reservation (qsub -R y ..).
> 
> I'm trying to use this because of big jobs starving because of queues always
> full of smaller jobs..
> 
> Apparently the -R y switch doesn't help at all.. somebody long ago told me
> it's a bug in my version of grid engine..
> 
> Is there a way to find out what is going on with reservation? qstat -j jobID
> doesn't show nothing about it..
> 
> Any ideas or hints?

Make sure you have max_reservation set to some positive number in the scheduler 
configuration.

In order to see what is going on you can set MONITOR=1 in the scheduler's 
params.

The schedulers view of what is happening (including reservations) will then be 
recorded in
${SGE_ROOT}/${SGE_CELL}/common/schedule
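
For example (illustrative):

qconf -msconf                                  # set:  params  MONITOR=1
tail -f $SGE_ROOT/$SGE_CELL/common/schedule    # watch the scheduler's decisions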

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] limtation the number of submission job in queue waiting list

2017-02-23 Thread William Hay
On Thu, Feb 23, 2017 at 09:30:20AM +0900, Sangmin Park wrote:
>Yes, it is.
>I can handle the number of running jobs using resource quota policy.
>However, the number of queue waiting jobs can't.
>Basic rule is FIFO, so if one user submits  hundre of jobs, another user
>has to wait long time.
>Is there no way to do this in SGE?
max_u_jobs does approximately what you want but doesn't stop someone submitting
an array job with lots of tasks, which has a similar effect.  You could de facto
block array jobs by setting max_aj_tasks to 1 but you might not want to do that.
Also max_u_jobs is global, i.e. the limit is the same for all users.

If you want fair share you could use the functional share or sharetree policies
but these produce non-FIFO behavior.  You could mix either of these policies
with some weight given to waiting time to produce something that approximates
what you want.

If you want to control the total number of tasks as opposed to jobs then you
could have a JSV run an appropriate qstat command and reject jobs from a user
if they already have too many tasks queued.  In general you can implement any
policy you like for job submission from a JSV; the question is whether you should.
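
A very rough server-side JSV sketch of that idea (untested; it assumes the
stock shell helpers in $SGE_ROOT/util/resources/jsv/jsv_include.sh and an
arbitrary limit of 500 tasks):

#!/bin/sh
. $SGE_ROOT/util/resources/jsv/jsv_include.sh

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   user=$(jsv_get_param USER)
   # qstat -g d prints one line per array task; drop the two header lines
   tasks=$(qstat -u "$user" -g d 2>/dev/null | tail -n +3 | wc -l)
   if [ "$tasks" -ge 500 ]; then
      jsv_reject "You already have $tasks tasks queued or running"
   else
      jsv_accept "Job is accepted"
   fi
}

jsv_main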


>On Wed, Feb 22, 2017 at 7:19 PM, Reuti  wrote:
> 
>  Usually the fair share police targets only running jobs.
>
Love the concept of fair share police.  Arresting people who exceed their
fair share of the cluster...


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] making certain jobs or queues not count for tickets..

2017-02-15 Thread William Hay
On Wed, Feb 15, 2017 at 12:34:08AM +0200, Ben Daniel Pere wrote:
>The suggestion sounds good but I'm not sure I understand step 1 - if it's
>going to be assigned by project and not user - can I still have "powerful"
>users in my normal default project that have more tickets there?
>I should really read about projects. thanks for the lead.

If you mean you have a pre-existing project and want different
users within that project to have different priorities then no that
is incompatible with this scheme.  With functional share (which it
sounds like you are using) projects are compared to projects and users
to users independently.  Therefore if you already have projects with
multiple members and want to prioritise different users within a project
differently you would have to assign priority directly to users and that
means jobs in the cheap queue would affect the regular queues.

The scheme I suggested is based on the idea of each project (except
the cheap project) having a single member and therefore being able to
act as a proxy for that user when they are not requesting the cheap
project/queue.  This precludes using projects for grouping users in 
other ways.

If you are already using projects for something there may be a way
to adapt it but without details of how you currently use projects I can't 
really speculate how.  You might, depending on what you are doing with 
projects, be able to replace them with departments.


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Requesting GPUs on the qsub command line

2017-02-14 Thread William Hay
On Tue, Feb 14, 2017 at 12:22:59PM +, Mark Dixon wrote:
> On Tue, 14 Feb 2017, William Hay wrote:
> ...
> >We tweak the permissions on the device nodes from a privileged prolog but
> >otherwise I suspect we're doing something similar.
> 
> Hi William,
> 
> Yeah, but I've put the permission tweaker in the starter, as that fits our
> existing model a bit better (looking ahead to multi-node GPU codes in
> future).

None of our gpu nodes have infiniband at the moment so we don't allow
multi-node gpu jobs.

Our prolog does a parallel ssh (passing through appropriate envvars) into
every node assigned to the job and does the equivalent of a run-parts on
a directory filled with scripts.  Some of these scripts check if they are
running on the head node.

> 
> >One thing to watch out for is that unless you disable it the device driver
> >can change the permissions on the device nodes behind your back.
> 
> The device driver, or do you mean any CUDA program at all?
> 
> It's a bit of an eye-opener to see no dev entries being created by the
> kernel module / udev, then strace a simple CUDA program and watch it try to
> mknod some /dev entries and call a privileged binary to do some
> modprobe/mknod's before actually doing what the program's supposed to do.

Sounds like your setup is a bit different from ours.  Our devices show up
in the normal way but we need a file in /etc/modprobe.d with the following 
magic module option:

options nvidia NVreg_ModifyDeviceFiles=0 

Our prolog ensures /dev/nvidiactl is world accessible and the
relevant /dev/nvidia? file is owned by the per job sge group.

Without the magic option various things trying to access
the gpus reset the permissions on the /dev/nvidia? devices.

With the magic option permissions are left alone and
jobs only access the gpu we intend for them.  Given that this
is an option to a kernel module I assume that it is responsible
for the reset of permissions.
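
For reference, the permission tweak in the prolog boils down to something like
this (a sketch, not our exact code; the device number comes from whatever
allocates cards to jobs, and I'm assuming the per-job group id is read from
the addgrpid file in the job spool directory):

# run as root from a privileged prolog
GPU=0                                          # picked by the allocator (illustrative)
JOBGID=$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")    # per-job additional group id
chmod 666 /dev/nvidiactl                       # control device open to everyone
chgrp "$JOBGID" /dev/nvidia${GPU}
chmod 660 /dev/nvidia${GPU}                    # only this job's group can use the card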


> 
> Would really like to know how to stop it doing that: had been wondering
> about offering the ability to reconfigure or reset the GPU card via a job
> request / JSV / starter method, but at the moment I cannot run anything
> interesting with root privs without screwing up permissions. Grr.
> 
> ...
> >We have separate requests for memory, gpus ,local scratch space, etc with
> >sensible defaults.  If someone did use the command line it could end up
> >looking quite like the example you give.
> ...
> 
> Do people fiddle with them and stick funny numbers in, resulting in GPUs
> unintentionally left idle?

GPUs are not terribly popular on Legion.  Most of the heavy GPU users use
a shared facility we don't run (Emerald).  So idle, but somewhat intentional.

This is probably a good thing as most of the GPUs are:
a) Old.
b) In an external PCI box, so they have a habit of disappearing from the PCI bus
if you look at them funny.

Emerald is shutting down shortly so we may get some people attempting
to use them.  This may lead to some contention for the one machine with
a half-decent GPU actually inside it.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Requesting GPUs on the qsub command line

2017-02-14 Thread William Hay
On Mon, Feb 13, 2017 at 03:52:20PM +, Mark Dixon wrote:
> Hi,
> 
> I've been playing with allocating GPUs using gridengine and am wondering if
> I'm trying to make it too complicated.
> 
> We have some 24 core, 128G RAM machines, each with two K80 GPU cards in
> them. I have a little client/server program that allocates named cards to
> jobs (via a starter method and the handy job gid).
> 
> What's left is the most important question: how do users request these
> resources?
> 
> I'm worried that, if I ask them to specify all the job's resources, a
> mouthful like "-pe smp 12 -l h_rt=1:0:0,h_vmem=64G,k80=1", just to get one
> card, could all too easily result in a K80 sitting idle if the numbers are
> tweaked a little.
> 
> Instead, I'm wondering if I should encourage users to request a number of
> cards and then we allocate a proportion of cpu and memory based on that (via
> per-slot complex, JSV and starter method).
> 
> Is that too simplistic, or would it be a welcome relief? What does the qsub
> command line look like at other sites for requesting GPUs?

IIRC Univa Grid Engine has a mechanism for job templates which allows the
definition of standard varieties of job.  One could implement a similar
mechanism via the jsv and contexts.

qsub -ac Template=GPU -l gpu=1 script

Your jsv could spot that a Template had been requested and fill
in sensible defaults based on other requests.  If no template is requested
users have access to the full power of this fully operational command line.
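
A rough sketch of the spotting logic (untested; assumes the stock shell JSV
helpers, and the default value is purely illustrative):

jsv_on_verify()
{
   if [ "$(jsv_sub_get_param ac Template)" = "GPU" ]; then
      # fill in a sensible default memory request unless the user set one
      if [ -z "$(jsv_sub_get_param l_hard h_vmem)" ]; then
         jsv_sub_add_param l_hard h_vmem 64G
      fi
   fi
   jsv_accept "Job is accepted"
}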

I'm not sure what purpose contexts originally served in Grid engine but
AFAICT their only practical use is to allow users to pass info to
JSVs without affecting anything important.

You can do almost anything in Grid Engine with a combination of JSVs and
carefully designed starter_methods.  The hard question is whether you should.

William





signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] GE 6.2u5 Duplicate Job IDs

2017-02-14 Thread William Hay
On Mon, Feb 13, 2017 at 03:17:29PM -0500, Douglas Duckworth wrote:
>Hello
>About a month ago we recently started seeing duplicate job in SGE.
>For example:
>sysadmin@panda2[~]$ qacct -j 878815
>==
>qnamestandard.q  
>hostname node127.panda.pbtech
>groupabc 
>ownerdeveloper 
>project  NONE
>department   cmlab.u 
>jobname old job
>jobnumber878815  
>taskid   undefined
>account  sge 
>priority 0   
>qsub_timeTue Jan 10 11:49:45 2017
>start_time   Tue Jan 10 11:51:40 2017
>end_time Tue Jan 10 11:51:40 2017
>granted_pe   smp 
>slots1   
>failed   0
>exit_status  0   
>ru_wallclock 0
>ru_utime 0.001
>ru_stime 0.006
>ru_maxrss1428
>ru_ixrss 0   
>ru_ismrss0   
>ru_idrss 0   
>ru_isrss 0   
>ru_minflt1254
>ru_majflt0   
>ru_nswap 0   
>ru_inblock   0   
>ru_oublock   8   
>ru_msgsnd0   
>ru_msgrcv0   
>ru_nsignals  0   
>ru_nvcsw 60  
>ru_nivcsw4   
>cpu  0.007
>mem  0.000 
>io   0.000 
>iow  0.000 
>maxvmem  0.000
>arid undefined
>==
>qnamestandard.q  
>hostname node120.panda.pbtech
>groupabc 
>ownerdeveloper 
>project  NONE
>department   cmlab.u 
>jobname  newjob
>jobnumber878815  
>taskid   undefined
>account  sge 
>priority 0   
>qsub_timeWed Feb  8 12:37:38 2017
>start_time   Wed Feb  8 13:20:49 2017
>end_time Wed Feb  8 13:41:01 2017
>granted_pe   smp 
>slots12  
>failed   100 : assumedly after job
>exit_status  137 
>ru_wallclock 1212 
>ru_utime 0.002
>ru_stime 0.022
>ru_maxrss1280
>ru_ixrss 0   
>ru_ismrss0   
>ru_idrss 0   
>ru_isrss 0   
>ru_minflt623 
>ru_majflt0   
>ru_nswap 0   
>ru_inblock   0   
>ru_oublock   8   
>ru_msgsnd0   
>ru_msgrcv0   
>ru_nsignals  0   
>ru_nvcsw 47  
>ru_nivcsw2   
>cpu  13816.930
>mem  48585.941 
>io   34.210
>iow  0.000 
>maxvmem  3.692G
>arid undefined
>As you can see the jobs are nearly a month apart.  This does not affect
>their ability to complete though it's required that we not have
>these duplicates.
>Has anyone experienced this issue or have an idea of what could be causing
>this behavior?
>We are not rotating our accounting logs.
>Thanks,
>Douglas Duckworth, MSc, LFCS
>HPC System Administrator
>Scientific Computing Unit
>Physiology and Biophysics
>Weill Cornell Medicine
>E: d...@med.cornell.edu
>O: 212-746-6305
>F: 212-746-8690

Apart from Reuti's suggestion: possibly the original job got returned to the
queue for some reason and the user then qalter'd it so that it was unrecognisable?


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Requesting GPUs on the qsub command line

2017-02-14 Thread William Hay
On Mon, Feb 13, 2017 at 03:52:20PM +, Mark Dixon wrote:
> Hi,
> 
> I've been playing with allocating GPUs using gridengine and am wondering if
> I'm trying to make it too complicated.
> 
> We have some 24 core, 128G RAM machines, each with two K80 GPU cards in
> them. I have a little client/server program that allocates named cards to
> jobs (via a starter method and the handy job gid).

We tweak the permissions on the device nodes from a privileged prolog
but otherwise I suspect we're doing something similar.  One thing to watch out
for is that unless you disable it the device driver can change the permissions
on the device nodes behind your back.


> 
> What's left is the most important question: how do users request these
> resources?
> 
> I'm worried that, if I ask them to specify all the job's resources, a
> mouthful like "-pe smp 12 -l h_rt=1:0:0,h_vmem=64G,k80=1", just to get one
> card, could all too easily result in a K80 sitting idle if the numbers are
> tweaked a little.
> 
> Instead, I'm wondering if I should encourage users to request a number of
> cards and then we allocate a proportion of cpu and memory based on that (via
> per-slot complex, JSV and starter method).

Around here our examples put the options in the script after #$ rather
than on the command line.  That makes things a lot more readable.  We
save jobscripts for support purposes so having the requested options 
in there is helpful (interactive jobs with qrsh etc excepted obviously).
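
For example, a (made-up) GPU job script with the requests embedded, using the
complex names from your example:

#!/bin/bash
#$ -l h_rt=1:0:0
#$ -l h_vmem=64G
#$ -l k80=1
#$ -pe smp 12
./my_gpu_program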
> 
> Is that too simplistic, or would it be a welcome relief? What does the qsub
> command line look like at other sites for requesting GPUs?

We have separate requests for memory, gpus ,local scratch space, etc with 
sensible defaults.   If someone did use the command line it could end up 
looking quite like the example you give.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] SGE 6.1 binaries

2017-02-13 Thread William Hay
On Mon, Feb 13, 2017 at 10:20:51AM +0100, Julien Nicoulaud wrote:
>Hi all,
>I'm looking for Sun GridEngine 6.1 binaries (or sources) for some backward
>compatibility testing, and I can't find it anywhere on the web:
> * ge-6.1u5-common.tar.gz
> * ge-6.1u5-bin-lx24-amd64.tar.gz
>If anyone still has it somewhere, I would be very grateful :)
>Cheers,
>Julien
If you aren't too picky about the exact update, the Univa
Gridengine Core repo has a 6.1 branch:
https://github.com/gridengine/gridengine/tree/V61_BRANCH

William



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] error qlogin_starter sent: 137 during qrsh

2016-10-13 Thread William Hay
On Thu, Oct 13, 2016 at 11:39:15AM +, Duje Drazin wrote:
> Hi William,
> 
> Sorry, I didn't catch your answer, how to "Check the nodes involved for a 
> firewall/packet filter" 
> 
Assuming this is a linux box, try running iptables -L on a worker node of the
cluster to see if it has an iptables configuration.

On most machines you want some sort of firewall, but grid engine listens on so
many dynamic ports that setting one up that both protects the machine and lets
qrsh work is rather complicated.  The normal solution is to just disable the
firewall on any cluster internal network interfaces (usually the only interface
for an execution host) and defend only the outward facing network interfaces.
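
For example, on a node whose cluster-internal interface is eth1 (the interface
name is an assumption):

iptables -L -n                        # see what filtering is currently in place
iptables -I INPUT -i eth1 -j ACCEPT   # trust traffic on the internal interface,
                                      # inserted ahead of any REJECT/DROP rules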

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] error qlogin_starter sent: 137 during qrsh

2016-10-13 Thread William Hay
On Thu, Oct 13, 2016 at 07:21:33AM +, Duje Drazin wrote:
>Hi all,
> 
> 
> 
>I have configured following:
> 
> 
> 
>qlogin_command   telnet
> 
>qlogin_daemon/usr/sbin/in.telnetd
> 
>rlogin_command   /usr/bin/ssh -X
> 
>rlogin_daemon/usr/sbin/sshd -i
> 
>rsh_command  /usr/bin/ssh -X
> 
>rsh_daemon   /usr/sbin/sshd -i
> 
> 
> 
> 
> 
>but with this configuration on qrsh I have following error:
> 
> 
> 
>   1654  23344 main R E A D I N GJ O B ! ! ! ! ! ! ! ! ! !
>!
> 
>  1655  23344 main
>
> 
>  1656  23344 main random polling set to 5
> 
>  1657  23344 140545581356800 --> sge_gettext_() {
> 
>  1658  23344 140545581356800 -->
>sge_get_message_id_output_implementation() {
> 
>  1659  23344 140545581356800 <--
>sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582
>}
> 
>  1660  23344 140545581356800 <-- sge_gettext_()
>../libs/uti/sge_language.c 730 }
> 
>  1661  23344 140545581356800 --> sge_gettext__() {
> 
>  1662  23344 140545581356800 sge_gettext() called without valid
>gettext function pointer!
> 
>  1663  23344 140545581356800 <-- sge_gettext__()
>../libs/uti/sge_language.c 781 }
> 
>  1664  23344 140545581356800 --> sge_htable_resize() {
> 
>  1665  23344 140545581356800 <-- sge_htable_resize()
>../libs/uti/sge_htable.c 204 }
> 
>  1666  23344 main --> wait_for_qrsh_socket() {
> 
>  1667  23344 main accepted client connection, fd = 3
> 
>  1668  23344 main <-- wait_for_qrsh_socket() ../clients/qsh/qsh.c
>404 }
> 
>  1669  23344 main --> get_client_server_context() {
> 
>  1670  23344 main --> read_from_qrsh_socket() {
> 
>  1671  23344 main <-- read_from_qrsh_socket()
>../clients/qsh/qsh.c 461 }
> 
>  1672  23344 main qlogin_starter sent: 137
> 
> 
> 
>When I manually execute  ssh toward servers everithing works fine, so I
>would like to understand which unix command are executed in background of
>command qrsh?

There's a diagram of the process which qrsh uses below.  The commands you
specified replace rsh, rshd and rlogind in the diagrams.
http://arc.liv.ac.uk/repos/hg/sge/source/clients/qrsh/qrsh.html


Check the nodes involved for a firewall/packet filter as for qrsh to work 
various things need to listen and accept connections  on dynamically assigned 
ports rather than just port 22 like normal ssh.

William
> 
> 
> 
> 
> 
> 
> 
> 

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] SoGE 8.1.8 - Very slow schedule time qw ==> r taking 30-60 sec.

2016-10-11 Thread William Hay
On Mon, Oct 10, 2016 at 02:39:21PM +, Yuri Burmachenko wrote:
>We are using SoGE 8.1.8 and since recently approximately 2 months ago our
>job schedule time raised up to 30-60 sec.
> 
> 
>Any tips and advices where to look for the root cause and/or how can we
>improve the situation, will be greatly appreciated.
> 
It isn't clear to me if the cluster has ever run quickly with something like 
the current load
or whether this is a decline in performance from a scheduler that used to 
handle load better.

You could enable PROFILE in the scheduler's params to get a bit more info on
what is going on.

Check if people are soft requesting resources/queues as these are supposed to 
slow it down a lot.

Try to run the qmaster on a dedicated machine or if that isn't possible on 
dedicated cores.

You could also try setting JC_FILTER in params even though it is officially 
deprecated.
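
For example (the profiling lines should then show up in the qmaster messages
file):

qconf -msconf
# then in the editor:
params    PROFILE=1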

We have two clusters run centrally at UCL.  The main difference between them
being that the newer one has a uniform flat infiniband network and identical
hosts.  The older cluster has a lot of different hardware and isolated
infiniband islands.  The infiniband islands each have their own PE which is
matched by a PE wildcard in job submissions.  The older cluster takes a lot
longer to schedule.  It sounds like you might have a similarly complex config.


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-10-10 Thread William Hay
On Thu, Oct 06, 2016 at 12:47:49PM +0100, Mark Dixon wrote:
> On Wed, 5 Oct 2016, William Hay wrote:
> ...
> >Our prolog and epilog (parallel) ssh into the slave nodes and do the
> >equivalent of run-parts on directories full of scripts some of which check
> >if they are running on the head node of the job before doing anything. If
> >we did want the epilog to save TMPDIRS from slave nodes we'd just have to
> >decide how to name them I guess.
> ...
> 
> Presumably this would work for you capture-wise because you're creating your
> own TMPDIRs rather than using the ones provided by the execd. (As Reuti
> pointed out, the execd TMPDIRs on slave nodes are ephemeral.)

> It'd be a pity to switch to doing it that way: the execd TMPDIR can be
> paired with an xfs project quota scheme which is nice and tidy. I imagine
> that deleting TMPDIRs via an epilog has a greater number of failure modes,
> not all of which can be avoided by purging old directories at boot, like
> intermittent network problems. How has that worked for you in practice?

Pretty well.  The epilog is augmented by a load sensor that checks for
TMPDIRs that aren't associated with a job on the node, raises an alarm
and attempts a cleanup.  Doesn't fire very often.
> 
> Also, passwordless ssh between compute nodes has been useful to avoid. Not
> only principle of least privilege - it's handy to help identify applications
> that aren't tightly integrated.

Our prolog/epilog don't run as the user and the port 22 sshd restricts who can 
log in (with or without password).

We also use the real ssh as a wrapper around qrsh:
https://github.com/UCL-RITS/GridEngine-OpenSSH

Which means it is really hard for a code to avoid being tightly integrated.
The prolog/epilog invoke ssh with -o ProxyCommand=none.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-10-05 Thread William Hay
On Wed, Oct 05, 2016 at 12:31:52PM +0100, Mark Dixon wrote:
> On Wed, 5 Oct 2016, William Hay wrote:
> ...
> >It was originally head node only so per job until a user requested local
> >TMPDIR on each node so historical reasons.
> ...
> 
> Hi William,
> 
> What do you do with people who want to keep the contents of $TMPDIR at the
> end of the job?
At the moment, if the user requests it via appropriate incantations,
we save the TMPDIR on the head node only.  So far that's been good enough.
Our generic advice to users is to use the cluster file system for multi-node 
jobs  and $TMPDIR for single node jobs.  The user who wanted TMPDIR on multiple
hosts was a bit of a special case and either didn't need the data saved or 
handled it themselves.  

> 
> It's easy to use the epilog to capture $TMPDIR on the master node and
> present that to the user - that workflow looks natural for checkpointing
> codes - but the slaves look more problematic (both in capture and
> presentation).
> 
Our prolog and epilog (parallel) ssh into the slave nodes and do the
equivalent of run-parts on directories full of scripts some of which check
if they are running on the head node of the job before doing anything.
If we did want the epilog to save TMPDIRS from slave nodes we'd just
have to decide how to name them I guess.
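
If we did, a minimal epilog fragment for the head node might look something
like this (a sketch only; SAVE_TMPDIR is a made-up opt-in variable the user
would pass with -v):

if [ -n "$SAVE_TMPDIR" ] && [ -d "$TMPDIR" ]; then
    tar czf "$SGE_O_WORKDIR/tmpdir_${JOB_ID}.tar.gz" -C "$TMPDIR" .
fi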

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-10-05 Thread William Hay
On Tue, Oct 04, 2016 at 04:51:42PM +0100, Mark Dixon wrote:
> On Tue, 4 Oct 2016, William Hay wrote:
> ...
> >I have a per-job consumable and the TMPDIR filesystem is created on every
> >node of the job.  We have a (jsv enforced) policy that all multi-node jobs
> >have exclusive access to the node and run on identical nodes so it works
> >as a faux per-host consumable.
> ...
> 
> Hi William,
> 
> You probably already answered this on list years ago but I've mislaid your
> answer. Why do you want a per-host consumable for local disk space again -
> surely per-slot can work here, although users will need to do some
> arithmetic?
> 
> Mark
It was originally head node only, and therefore per job, until a user requested
a local TMPDIR on each node.  So: historical reasons.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-10-04 Thread William Hay
On Tue, Oct 04, 2016 at 09:32:43AM +0100, Mark Dixon wrote:
> It'd be interesting for people to share what they've done with parallel
> jobs. Rightly or wrongly, I currently have a per-job consumable and the
> $TMPDIR is only on the node with the MASTER task.

I have a per-job consumable  and the TMPDIR filesystem is created on every 
node of the job.  We have a (jsv enforced) policy that all multi-node jobs 
have exclusive access to the node and run on identical nodes so it works as
a faux per-host consumable.
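
For reference, such a per-job consumable is defined along these lines in
qconf -mc (the name and values are illustrative):

#name          shortcut  type    relop  requestable  consumable  default  urgency
tmp_requested  tmpr      MEMORY  <=     YES          JOB         0        0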

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] access definition in grid

2016-09-26 Thread William Hay
On Tue, Sep 20, 2016 at 08:07:02AM +, sudha.penme...@wipro.com wrote:
>Hi,
> 
> 
> 
>Regarding access rules in grid, users primary UNIX group should be the one
>which is defined in ACL to be able to access.
> 
> 
> 
>Would it be possible to configure it such that user just needs to belong
>to the defined UNIX group and gid can be whatever?
I don't think this is supported by any grid engine fork currently.  You could
request it as an enhancement from whoever maintains the fork you use.

Depending on what you are trying to do, you might be able to just change
your primary group with newgrp or sg before running qsub.
William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-15 Thread William Hay
On Wed, Sep 14, 2016 at 08:52:12PM +, Lee, Wayne wrote:
> HI William,
> 
> I've performed some tests by submitting a basic shell script which dumps the 
> environment (i.e. env) and performs either an "exit 0", "exit 99", "exit 
> 100", "exit 137" other exit status codes.If I set my script to "exit 0", 
> the job exits normally.   If I set my script to "exit 99", then the job gets 
> requeued for execution and if I set my script to "exit 100", the job goes 
> into error state.   All of these scenarios are what I expect based on the man 
> pages for "queue_conf".   However, I am unable to use any other "exit ##", 
> trap it and force the job to error state by the method I describe.  
> 
What I was after was what happens when you try.  You've described your setup in
detail but your results are missing.  When the job exits with, for example, 107
and the epilog exits 100, then what happens?  Does the queue go into an error
state?

> I'm not sure if what I'm trying to do makes sense or should I consider a 
> different way to do what I am attempting.   I can look at the 
> "starter_method" to see if this is a viable way.

As per my prior message I think using the starter_method as a wrapper will work 
more reliably than tweaking things in the epilog.

William



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

2016-09-14 Thread William Hay
On Tue, Sep 13, 2016 at 06:52:53PM +, Lee, Wayne wrote:
>In the epilog script that I've setup for our jobs, I've attempted to
>capture the value of the "exit_status" of a job or job task and if it
>isn't 0, 99 or 100, exit the epilog script with an "exit 100".   However
>this doesn't appear to work.  

In general when describing an issue or problem it is more helpful to describe 
what
does happen than what doesn't.  The number of things that didn't happen when you
made the epilog script exit 100 is almost infinite.

> 
> 
> 
>Anyway way of stating what I'm trying to convey is if the exit status a
>job or job task is anything other than 0, 99 or 100 put the job in error
>state.  If this can be done, then we would know that a job didn't
>complete correctly and if it is in Eqw state we have the option of
>clearing error state (i.e. qmod -cj) and re-executing the job again.

One possibility would be to write a starter_method that wraps the real job and
does an exit 100 when the job terminates with an exit status other than 0 or 
99. 
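
A minimal sketch of such a starter_method (untested):

#!/bin/sh
# grid engine passes the job script (and its arguments) to the starter method
"$@"
status=$?
case $status in
    0|99) exit $status ;;    # normal exit or requeue: pass straight through
    *)    exit 100 ;;        # anything else: push the job into error state
esac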

William 


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-13 Thread William Hay
On Tue, Sep 13, 2016 at 03:15:19PM +1000, Derrick Lin wrote:
>Thanks guys,
>I am implementing the solution as outlined by William, except we are using
>XFS here, so we are trying to do it by using XFS's project/directory
>quota. Will do more testing and see how it goes..
>Cheers,
>Derrick
I should probably add that one reason we're going to btrfs is that it allows us 
to 
take a snapshot when checkpointing so we can have a file system image that is 
consistent with the process image while pausing the process for the shortest 
time possible.  Snapshotting XFS is a little trickier.
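
For example (paths illustrative, and assuming the job's scratch area is a
btrfs subvolume):

# read-only snapshot, taken near-instantaneously while the process is paused
btrfs subvolume snapshot -r /scratch/job_$JOB_ID /scratch/job_${JOB_ID}.ckpt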


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-09 Thread William Hay
On Fri, Sep 09, 2016 at 01:29:52PM +0200, Reuti wrote:
> 
> > Am 09.09.2016 um 12:52 schrieb William Hay <w@ucl.ac.uk>:
> > Grid engine doesn't provide a mechanism to pass the resource requests to 
> > the prolog
> > AFAIK so a mechanism to obtain the value is needed.  Qstat would work if 
> > the execution
> > host is an admin or submit node(ours aren't).
> 
> Being an admin host should be sufficient to execute `qstat`.
> 
I believe I said that.  Our exechosts aren't admin hosts though (minimum 
privilege).

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-09 Thread William Hay
On Fri, Sep 09, 2016 at 09:26:53AM +1000, Derrick Lin wrote:
>Hi William,
>Actually I don't quite get the need of:
>2. Our JSV adds an environment variable to the job recording the amount
>of disk requested (you could try parsing it out of the job spool but
>this is easier).
>If a user has specify the disk usage via consumable complex (like -l
>disk_requested=100G), can the prolog script simply uses that value?

Grid engine doesn't provide a mechanism to pass the resource requests to the
prolog AFAIK so a mechanism to obtain the value is needed.  Qstat would work
if the execution host is an admin or submit node (ours aren't).  Otherwise
something like the above is needed.
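
A sketch of what that JSV fragment might look like (not our actual code; the
complex and variable names are illustrative):

jsv_on_verify()
{
   disk=$(jsv_sub_get_param l_hard disk_requested)
   # record the request somewhere the prolog can see it
   [ -n "$disk" ] && jsv_add_env SGE_DISK_REQUEST "$disk"
   jsv_accept "Job is accepted"
}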

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-09 Thread William Hay
On Fri, Sep 09, 2016 at 10:37:13AM +0100, Mark Dixon wrote:
> On Thu, 8 Sep 2016, William Hay wrote:
> ...
> >Remember tmpfs is not a ramdisk but the linux VFS layer without an attempt
> >to provide real file system guarantees.  It shouldn't be cached any more
> >agressively than other filesystems under normal circumstances. Most of the
> >arguments against it seem to be along the lines of "tmpfs uses swap
> >therefore it must cause swapping" which doesn't seem to be the case.
> 
> You've lived with this in production for some time now, so I should bow to
> your judgement.
> 
> But wouldn't you see a big difference in the amount of I/O activity when a
> system sees a memory pressure, as non-tmpfs filesystems already have more of
> the data on disk at that point?

Sure, but on the other hand tmpfs is going to be less picky about where on disk
that data has to go than most filesystems, so less seeking.  It doesn't seem to
be an issue in practice as we usually either don't have much memory pressure or
the memory pressure is bad enough that anything tmpfs causes is small change.


William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-08 Thread William Hay
On Thu, Sep 08, 2016 at 02:40:38PM +0100, Mark Dixon wrote:
> On Thu, 8 Sep 2016, William Hay wrote:
> ...
> >At present we're using a huge swap partition and TMPFS instead of btrfs.
> >You could probably do this with a volume manager and creating a
> >regular filesystem as well but it would be slower.
> ...
> 
> Hi William,
> 
> I always liked your idea for handling scratch space, apart from the
> contention with RAM that using tmpfs would cause. But I thought you
> considered that a feature, not a bug? Your btrfs variation sounds much
> better on that score :)

We always added a fudge factor so the swap was a fair bit larger than
the maximum size of the tmpfs filesystems, which should prevent most
interference by the filesystem with processes.  Remember tmpfs is not
a ramdisk but the linux VFS layer without an attempt to provide real file
system guarantees.  It shouldn't be cached any more aggressively than other
filesystems under normal circumstances.  Most of the arguments against
it seem to be along the lines of "tmpfs uses swap therefore it must cause
swapping" which doesn't seem to be the case.

swap: "a place for the kernel to store objects without a permanent home
that need to survive for now but which it has no imminent use for."

swapping (the bad sort at least) is where objects are pushed out to swap or
a filesystem and then quickly brought back into memory repeatedly.

You'll get swapping if there are more objects needed in memory imminently
than will fit or if the kernel is guessing wrong about which objects will 
be needed imminently.  Using tmpfs vs another file system shouldn't alter 
the kernel's ability to guess which objects are likely to be used 
imminently just where the objects get pushed out to.

Our planned switch to btrfs is motivated more by a switch to SSDs than any 
issues with swapping.

> 
> It's just occurred to me that you might also be able to do this on more
> stable^W^Wless advanced filesystems, by assigning a group quota on the GID
> that gridengine allocated to the job and using some chmod g+s trickery. I
> guess it could be circumvented by the determined using chgrp. What do you
> think?

In general I'm fairly comfortable with using btrfs for data that doesn't
need to be preserved in the long term (ie working storage or data that can be
easily recreated) and it has some neat features.  I use other filesystems
for data that would take work to recreate or is worth backing up/restoring.

Not sure if you can chgrp a file to a group that doesn't have a quota
on a filesystem with group quotas enabled.  Being able to do so would
make group quotas a pain to administer.  However, if you can, I'd be more
worried about innocent subversion than malicious.  Anything that did
a copy from another filesystem or archive while attempting to preserve
permissions could potentially break it.

William 




signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Control tmpdir usage on SGE

2016-09-08 Thread William Hay
On Thu, Sep 08, 2016 at 10:10:51AM +1000, Derrick Lin wrote:
>Hi all,
>Each of our execution nodes has a scratch space mounted as /scratch_local.
>I notice there is tmpdir variable can be changed in a queue's conf.
>According to doc, SGE will create a per job dir on tmpdir, and set path in
>var TMPDIR and TMP. 
>I have setup a complex tmp_requested which a job can specify during
>submission. I want to ensure a job could not utilize what it claims at
>tmp_requested. For example, I would like to set a quota to a job's TMPDIR
>according to tmp_requested.
>What is the best way for doing that?
>Cheers,
>Derrick
We do something along those lines here and making some improvements to it
is on my todo list.  I'll outline my current plan (a rough sketch of the
prolog/epilog steps follows below):
1. Hand all the spare disk space on a node over to a nice btrfs filesystem.
2. Our JSV adds an environment variable to the job recording the amount
of disk requested (you could try parsing it out of the job spool but 
this is easier).
3. The prolog creates a btrfs subvolume and assigns it an appropriate quota.
4. In starter_method point TMPDIR and TMP at the created volume.  We use
ssh for qlogin etc and ForceCommand can do the same job for it.
5. In the epilog delete the subvolume.

At present we're using a huge swap partition and TMPFS instead of btrfs.
You could probably do this by using a volume manager and creating a
regular filesystem as well, but it would be slower.

We don't use the grid engine configured TMPDIR as this is also used
for internal gridengine communication and mounting filesystems on
it causes problems.
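
A rough sketch of steps 3 and 5 (not production code; the mount point and the
environment variable carrying the request are illustrative, and btrfs quotas
need to be enabled on the filesystem beforehand):

# prolog (runs privileged)
btrfs subvolume create /scratch/job_$JOB_ID
btrfs qgroup limit "$SGE_DISK_REQUEST" /scratch/job_$JOB_ID   # value recorded by the JSV in step 2
chown "$SGE_O_LOGNAME" /scratch/job_$JOB_ID

# epilog
btrfs subvolume delete /scratch/job_$JOB_ID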


> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] sgemaster service fails to stay up after harddisk maxout incident

2016-08-30 Thread William Hay
On Fri, Aug 26, 2016 at 01:35:06PM +0100, Ram??n Fallon wrote:
>Thanks for the reply, William.
> 
>Yes, that's true.
> 
>It's a pity there's not a way to reset gridengine to begin anew on a new
>database for example.
> 
>It seems quite a radical step to have to re-install gridengine just
>because of the database. Decidedly unmodular. But, I suppose, that's the
>nature of Berkeley DB ... I do however miss the stressing of this point in
>the documentation. Oh well, my turn to do it now, I expect. Cheers!
> 
>Any other suggestions are still welcome!

You should be able to get away with just wiping the cell and re-running
inst_sge, now that I think about it.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] sgemaster service fails to stay up after harddisk maxout incident

2016-08-26 Thread William Hay
On Thu, Aug 25, 2016 at 04:40:55PM +0100, Ram??n Fallon wrote:
>* sgemaster still fails to come up. "messages" in
>$SGE_ROOT/$SGE_CELL/spool/qmaster now says:
>main|frontend0|W|local configuration frontend0 not defined - using global
>configuration
>main|frontend0|E|global configuration not defined
>main|frontend0|C|setup failed
>* Seems to exonerate the database, but I'm not so sure ... database repair
>was not "satisfying"
This would appear to be the sort of thing that backups are for.  Your bdb
database is now self-consistent but doesn't contain the data grid engine
expects anymore.  If you don't have a backup I suspect you'll probably need
to do a clean install somewhere.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] firewall on submit host

2016-08-25 Thread William Hay
On Thu, Aug 25, 2016 at 09:15:26AM +0100, William Hay wrote:
> On Wed, Aug 24, 2016 at 09:07:44PM +0200, Alexander Hasselhuhn wrote:
> > Dear Reuti,
> > 
> > thanks for the reply, indeed at the moment there is a login node, but we 
> > have plans to remove it (by setting up a route through our gateway, which 
> > makes some administrative tasks more smooth) and restricting access using 
> > firewalls. I like your idea of restricting the address range instead of the 
> > port range.
> > 
> > Yours,
> > Alex
> > 
> > On 08/24/2016 08:51 PM, Reuti wrote:
> > >Hi,
> > >
> > >Am 24.08.2016 um 19:33 schrieb Alexander Hasselhuhn:
> > >
> > >>does anyone know which ports I would have to insert into my firewall 
> > >>config for qrsh to work? It seems qrsh opens a port on the submit host 
> > >>and listens on it. The ports seem to change randomly for each execution 
> > >>of qrsh.
> > >
> An alternative would be something like using a qrsh_command that invokes ssh 
> -w to connect to the port in question.
> 
> Something like:
> #!/bin/sh
> HOST=$3
> PORT=$2
> ssh -w ${HOST}:${PORT} ${HOST} 
> 
> Which would access the remote host via the regular sshd then connect to the 
> destination host and port.
> 
> You then need an rshd_command that upon receiving a connection executes the 
> qrsh_starter:
> Something like:
> #!/bin/sh
> su "$(sed -n -e 's/^job_owner=//p' ${SGE_JOB_SPOOL_DIR}')" -c 
> "${SGE_ROOT}/utilbin/${SGE_ARCH}/qrsh_starter ${SGE_JOB_SPOOL_DIR}" -
> 
> The above is thoroughly untested and probably has syntax errors and security 
> holes.
> 
> Then all you need is some means of passwordless ssh authentication, a 
> suitably nailed down sshd on the receiving host,
> port 22 open to the world and the dynamic port range accessible from the 
> localhost.
> 
> William

Oops, I thought you were talking about the destination host for some reason.
My trick won't work for the random port on the submit host.

William




signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] firewall on submit host

2016-08-25 Thread William Hay
On Wed, Aug 24, 2016 at 09:07:44PM +0200, Alexander Hasselhuhn wrote:
> Dear Reuti,
> 
> thanks for the reply, indeed at the moment there is a login node, but we have 
> plans to remove it (by setting up a route through our gateway, which makes 
> some administrative tasks more smooth) and restricting access using 
> firewalls. I like your idea of restricting the address range instead of the 
> port range.
> 
> Yours,
> Alex
> 
> On 08/24/2016 08:51 PM, Reuti wrote:
> >Hi,
> >
> >Am 24.08.2016 um 19:33 schrieb Alexander Hasselhuhn:
> >
> >>does anyone know which ports I would have to insert into my firewall config 
> >>for qrsh to work? It seems qrsh opens a port on the submit host and listens 
> >>on it. The ports seem to change randomly for each execution of qrsh.
> >
An alternative would be something like using a qrsh_command that invokes ssh -W
to connect to the port in question.

Something like:
#!/bin/sh
HOST=$3
PORT=$2
# -W forwards stdin/stdout to HOST:PORT over the ssh connection to HOST
ssh -W ${HOST}:${PORT} ${HOST}

Which would access the remote host via the regular sshd then connect to the 
destination host and port.

You then need an rshd_command that upon receiving a connection executes the 
qrsh_starter:
Something like:
#!/bin/sh
# job_owner= is recorded in the config file in the job spool directory
su - "$(sed -n -e 's/^job_owner=//p' "${SGE_JOB_SPOOL_DIR}/config")" \
   -c "${SGE_ROOT}/utilbin/${SGE_ARCH}/qrsh_starter ${SGE_JOB_SPOOL_DIR}"

The above is thoroughly untested and probably has syntax errors and security 
holes.

Then all you need is some means of passwordless ssh authentication, a suitably 
nailed down sshd on the receiving host,
port 22 open to the world and the dynamic port range accessible from the 
localhost.

William


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] "Decoding gridengine" workshop

2016-08-24 Thread William Hay
On Wed, Aug 24, 2016 at 10:20:06AM +0100, Mark Dixon wrote:
> Hi there,
> 
> Is there any interest for a meeting in the UK looking at the internals of
> gridengine? Potential topics might be:
> 
> * Building from source
> * How the code is organised
> * How to debug or develop gridengine
> 
> The principles discussed ought to be applicable to any flavour of gridengine
> that you happen to have the source for.

While the idea of being open to "any flavour of gridengine" has some
appeal, I wonder if it might be better to focus on SoGE.  If you aren't
using SoGE then presumably you either have a support contract from
Univa or Scalable, or you've been using the same version for a long time
and are willing to live with its bugs.


William 


signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] reporting doesn't log which host receives a task

2016-08-22 Thread William Hay
On Fri, Aug 19, 2016 at 03:59:34PM +0100, Lars van der Bijl wrote:
>Hey William,
>is the schedule log different from the reporting file? i've had a look
>through the common and spool directory but can't find mention of it.
You have to enable it with MONITOR=1 in the scheduler params.  It writes a
record of jobs starting, running and reserving resources with each scheduler
run.

William



signature.asc
Description: Digital signature
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

