[gridengine users] h_vmem and parallel jobs, or "why exclusive=true is important"

2011-09-22 Thread Mark Dixon
ostlist such that the host with the greatest number of slots assigned to the job is first in the list, reducing the frequency that the problem is hit. * Enhance GE by making qrsh more light-weight. If you made it down to the bottom of this post, my thanks :) Mark -- ---

[gridengine users] Removing qrsh from h_vmem resource limits?

2011-09-26 Thread Mark Dixon
Thanks, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Lee

[gridengine users] Getting sharetree usage data from the command line?

2011-10-27 Thread Mark Dixon
's source to find out how it does it. Cheers, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0

Re: [gridengine users] Getting sharetree usage data from the command line?

2011-10-27 Thread Mark Dixon
liant, thanks to you both. For what it's worth, it seems to be giving similar answers to qmon. I'll keep an eye out and report back if I notice anything amiss... Cheers, Mark -- ----- Mark Dixon Email:

Re: [gridengine users] qrsh wrappers

2011-11-17 Thread Mark Dixon
ou got on :) Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University o

Re: [gridengine users] Simplifying Parallel Environments

2012-02-02 Thread Mark Dixon
is (and many other improvements). All the best, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ex

Re: [gridengine users] Simplifying Parallel Environments

2012-02-03 Thread Mark Dixon
pirun". For really crazy things, we use the queue starter_method :) Perhaps we've been lucky, but I don't think we've got anything that cannot be wrapped in some shape or form at the moment. Cheers, Mark -- -

Re: [gridengine users] "Packing" jobs on nodes v2

2012-05-15 Thread Mark Dixon
e instance. You might want to give that a go. All the best, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ex

Re: [gridengine users] "Packing" jobs on nodes v2

2012-05-15 Thread Mark Dixon
. Mark PS v20z? I liked them, but for some reason I always ended up bleeding slightly from my right hand index finger whenever I took the lids off. -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-23 Thread Mark Dixon
equest, but are not otherwise enforced. How far along with your solution are you? Am I just duplicating work someone else has already done? Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Sy

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-23 Thread Mark Dixon
e execd, and editing the "obvious" places (data structures and sizes) doesn't seem to be the end of it. gdb ahoy! Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Supp

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-23 Thread Mark Dixon
On Wed, 23 May 2012, Rayson Ho wrote: On Wed, May 23, 2012 at 4:09 AM, Mark Dixon wrote: Intended notable features of the patchset: * Two new resources h_mem and s_mem to limit total memory + swap usage (i.e. not just rss). In my implementation, I did not add any new queue resource limits

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-24 Thread Mark Dixon
the development work that has gone into it, and thanks again: it's very pleasing to see this sort of thing done under an open source model. -- - Mark Dixon Email: m.c.di...@leeds.ac.uk

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-25 Thread Mark Dixon
them under the SISSL. Nothing was mentioned about copyright assignment. [actually, that conversation with them went stale... must pick it up again]. Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Gr

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-25 Thread Mark Dixon
u can just translate the percentage to an absolute value using the JSV feature. * If you never want to use an absolute value, you can put in a consumable value of 100 and hack the execd code to interpret the number however you like. Mark --

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-25 Thread Mark Dixon
to make $ but since Oracle gave us the maintainership to maintain open source Grid Engine we have the obligation to maintain it in an *Open Source* way. ... I'm very reluctant to end on this, but here goes; that paragraph should have at least one of: s/as the$/as an/ Or: s/^open so

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-25 Thread Mark Dixon
to reimplement functionality already found in ssh(d) and rsh(d) on almost every host? :) Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-25 Thread Mark Dixon
port to the wildcard address (instead of just the loopback) and puts a real hostname in your DISPLAY variable. "qsub -V" then makes sure the compute nodes get it. Mark -- ----- Mark Dixon Email: m.c.di

Re: [gridengine users] Requesting h_vmem or memfree

2012-05-28 Thread Mark Dixon
't repeat the good stuff that other people have already written about what you can do today but, if you haven't already, you might want to skim-read the ongoing cgroups thread that describes a much better approach that should be available in the future. All the best, Mark -- -

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-28 Thread Mark Dixon
On Fri, 25 May 2012, Rayson Ho wrote: On Fri, May 25, 2012 at 12:01 PM, Mark Dixon wrote: That's what I was wondering was the answer :) In my opinion there are simpler ways round it as long as not having encrypted X11 within the cluster is ok: As I've mentioned befo

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-29 Thread Mark Dixon
de BTW?? Rayson It's in an email from me to d...@gridengine.org, dated 9/12/2011 with subject "[PATCH] Fix PE task array job failure due to missing job script on execd". I'm happy for any of the gridengine forks to take it (hint). Mark -- -

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-29 Thread Mark Dixon
On Tue, 29 May 2012, Mark Dixon wrote: ... Is this the gridscheduler-developers sourceforge list? I tried looking for a button to push or address to email to join it, but didn't have any luck. The web frontend doesn't show anything posted to it recently - is archiving turned off

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-30 Thread Mark Dixon
ase you missed it, here's the nub of what I've previously said: On Fri, 25 May 2012, Mark Dixon wrote: ... memsw usage and virtual address space usage can wildly diverge under fairly common use cases: * Processes under 64-bit mode. Comparing the VIRT column in "top" with

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-30 Thread Mark Dixon
On Wed, 30 May 2012, Mark Dixon wrote: Hi Rayson, Sorry to sound needy, but have you had time to consider controlling cgroup's memory.memsw.limit_in_bytes via a new attribute defined as the OS-dependent way to measure memory usage (typically RAM+swap), rather than overloading h_vmem/s

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-05-31 Thread Mark Dixon
limit's RLIMIT_AS, as previously? Best wishes, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-06-01 Thread Mark Dixon
're putting into it. I've obviously not done a very good job at being clear and concise. All the best, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int

Re: [gridengine users] guess required memory

2012-06-11 Thread Mark Dixon
h_vmem is needed for? Thx Aside from the other methods mentioned in this thread, I often find the following CLI tool useful: https://software.sandia.gov/trac/utilib/wiki/Documentation/memmon Mark -- - Mark Dixon

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-06-14 Thread Mark Dixon
h, but it's a much-needed break from the past which takes advantage of cgroups to greatly improve utilisation of the resources available. How does that sound? Cheers, Mark -- - Mark Dixon Email: m.

Re: [gridengine users] cgroups Integration in OGS/GE 2011.11 update 1

2012-06-29 Thread Mark Dixon
On Thu, 14 Jun 2012, Mark Dixon wrote: ... * I do NOT believe that AS should be summed when the processes of a job are being polled. Instead it should be RSS+SWAP (or similar). (sorry, ignore that 2nd sentence) * I DO believe that the per-process AS setrlimit should be settable to a value

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-05 Thread Mark Dixon
tude of sins being passed through to my starter_method, generally when qrsh gets involved (e.g. tightly-integrated parallel jobs), which generally get coped with better that way. Mark -- - Mark Dixon Email: m

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-05 Thread Mark Dixon
ne="$1" shift if ! which "$arg_one" > /dev/null 2>&1; then eval "$arg_one" "$@" exit $? fi exec "$arg_one" "$@" Perhaps there's a better way. Mark -- -

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-06 Thread Mark Dixon
tle more complicated - cheers! Good luck, Mark PS One other thing I recall, is that I think the exact behaviour of what is passed to the starter_method depends on the queue's shell_start_mode setting. Ours is set to unix_behavior to try and get a more intuitive result than you get from the de

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-06 Thread Mark Dixon
/env to simulate unix_behavior for reasons of backward compatibility. Should be fun. W.Hay Esq ornate grid engine configurations a speciality Yowsah. Go on - what's the subtlety you've hit there? Mark -- ----- Mark Dixon

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-10 Thread Mark Dixon
environment variables. Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 Un

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-10 Thread Mark Dixon
cification. I doubt it will generally behave the same as without the starter, but it's probably relatively safe in the absence of Liverpool access to the system. ... Is there any particular use case that's concerning you here? Cheers, Mark --

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-10 Thread Mark Dixon
'm not the biggest fan of PE customisation (taste). This may or may not be one of them :) Cheers, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 I

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-16 Thread Mark Dixon
using the same languages, as the JSV bindings? If Python or Lua are important, perhaps JSV bindings should be written for them (if not available already). Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk H

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-16 Thread Mark Dixon
On Thu, 12 Jul 2012, Dave Love wrote: Mark Dixon writes: Things I think we've used starter_methods for in the past: Gosh. You live in interesting times^Wclusters. I've certainly had some interesting problems to tackle. Something's got to keep me busy in f

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-16 Thread Mark Dixon
This is to create some functionality that is currently missing. It's an experiment at the moment: I'll report back if there's anything positive to say (or negative for that matter, if people are interested). Mark -- ----

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-19 Thread Mark Dixon
On Tue, 11 Sep 2012, Schmidt U. wrote: Hi Mark Dixon, is the mentioned patch maybe a little bit helpful as well for my problem with the virtual memory overload of the first node in massive parallel jobs ? overhead_vmem = bash_vmem + mpirun_vmem + (nodes -1)*qrsh_vmem Udo Hi Udo, Apologies

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-19 Thread Mark Dixon
y controller. "This can may contain worms." ... Well said! This is obviously a new area and, aside from the obvious problems, we may still get interesting interactions with some of the more exotic things regularly found in our environments (e.g. Lustre, InfiniBand, etc.). It'

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-19 Thread Mark Dixon
On Wed, 19 Sep 2012, Mark Dixon wrote: ... "This can may contain worms." ... Well said! This is obviously a new area and, aside from the obvious problems, we may still get interesting interactions with some of the more exotic things regularly found in our environments (e

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-25 Thread Mark Dixon
a poke to see if there was any conclusion. Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-25 Thread Mark Dixon
onment too, so I'm not sure if this helps. Would using SGE_CLUSTER_NAME help? I don't think so. ... Ho-hum :) Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support

Re: [gridengine users] Memory values reported by SGE too high

2012-09-27 Thread Mark Dixon
be really significant or just a minor stuff. Jérémie Depending on what your program does, the overestimate can pretty much be as big as you like (until you hit architectural 32-bit/64-bit limits). Mark -- - Mark Dixon

[gridengine users] Integration code for serial BLCR checkpointing

2012-10-03 Thread Mark Dixon
On Mon, 16 Jul 2012, Mark Dixon wrote: ... * Transparent (from the job script's perspective) serial BLCR integration Could you post the recipe/code? DMTCP is facing the knife for exactly that, but C++ encourages displacement activities. Sure, I'll dig it out (to follow under

Re: [gridengine users] Raising an old share tree bug...

2013-02-06 Thread Mark Dixon
about tackling this bug? ... Hi Orlando, Great to hear you're game :) Are you after advice on debugging gridengine, or pointers for this bug specifically? Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.u

Re: [gridengine users] Raising an old share tree bug...

2013-02-06 Thread Mark Dixon
s the qmaster is being given to be used in the accounting file and the share tree. If you're unlucky, the problem is in how the qmaster aggregates, records and decays the share tree values over time. If you're really unlucky, the problem might only occur if the vario

Re: [gridengine users] open the output

2013-03-18 Thread Mark Dixon
rted while it's queuing, you'll probably lose the job entirely. TTFN Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Service

Re: [gridengine users] X11 not working after qlogin

2013-03-19 Thread Mark Dixon
ungrid-dev machine localhost:16.0? Shouldn't it be :0? ... You might want to read the bottom of the following post: http://gridengine.org/pipermail/dev/2011-November/70.html Mark -- ----- Mark Dixon Em

Re: [gridengine users] X11 not working after qlogin

2013-03-19 Thread Mark Dixon
). It could do MIT-MAGIC-COOKIE-1, but had to be hit until it did it. Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel

Re: [gridengine users] X11 not working after qlogin

2013-03-19 Thread Mark Dixon
ers, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, L

Re: [gridengine users] Making the fair-share policy/scheduler algorithm "more fair"

2013-04-30 Thread Mark Dixon
details. If you don't have the same amount of RAM everywhere, you might also want to play with "usage_scaling" parameters in the execd host definitions. Good luck :) Mark -- ----- Mark Dixon Emai

Re: [gridengine users] Making the fair-share policy/scheduler algorithm "more fair"

2013-05-01 Thread Mark Dixon
's usage via a usage_scaling of cpu=0.00,mem=0.00,io=0.00. All the best, Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429

Re: [gridengine users] Making the fair-share policy/scheduler algorithm "more fair"

2013-05-01 Thread Mark Dixon
On Tue, 30 Apr 2013, Dave Love wrote: Mark Dixon writes: We use the share tree here, rather than the functional policy, so this might not be applicable. By default, the "usage" of a job is wholly based on slots*seconds. I think it's only (effectively) s

Re: [gridengine users] priority formula

2013-05-03 Thread Mark Dixon
formula to calculate the normalized value? ... If you're looking at that level of detail, I would read the source code if I were you. I'd be interested to hear what you find. All the best, Mark -- -

Re: [gridengine users] cgroups integration with ogs 2011.11p1

2013-06-04 Thread Mark Dixon
em. All the best, Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 Univers

Re: [gridengine users] Drain whole nodes for maintenance

2013-07-18 Thread Mark Dixon
better way for doing that? ... Hi Christoph, We normally use an advance reservation to do this, draining node(s) for a particular point in time: man qrsub All the best, Mark -- - Mark Dixon Email

Re: [gridengine users] [SGE-discuss] variable getting truncated in soge8.1.3 and OGS 2011.11p1

2013-08-12 Thread Mark Dixon
-- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Information Systems Services Tel (ext): +44(0)113 343 5429 University of Leeds, LS2 9JT, UK

Re: [gridengine users] Welcome Home Grid Engine!

2013-10-22 Thread Mark Dixon
to call me naive, but this announcement sounds like good news to me - Oracle were clearly not interested in gridengine. Congratulations to Fritz, the engineers and Univa :) Mark -- - Mark Dixon Email

Re: [gridengine users] (Seemingly) Random failures of OpenMPI jobs

2014-01-08 Thread Mark Dixon
ound in Univa's public git repo here: https://github.com/gridengine/gridengine Alternatively, it was integrated into Son of Gridengine some time ago. All the best, Mark -- ----- Mark Dixon Email: m.c.d

[gridengine users] Jobs in error state still attract policy tickets

2014-02-04 Thread Mark Dixon
ority when they shouldn't. Given that ticket allocations are calculated afresh every scheduling interval, I don't think there's any point in errored jobs attracting tickets like this. Is that right? Mark -- ----- Mark Dix

Re: [gridengine users] Jobs in error state still attract policy tickets

2014-02-13 Thread Mark Dixon
matters? Or is that what you meant? Cheers, Mark (sorry for delay - got distracted on a training course) -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429 Inform

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-03 Thread Mark Dixon
then submitting the array into that. Not sure which option I favour, still. The AR approach sounds horrendous! Mark -- ----- Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (in

Re: [gridengine users] Project status

2014-04-03 Thread Mark Dixon
On Mon, 10 Mar 2014, Joshua Baker-LePain wrote: I'm hoping to get some idea as to the status and future of OGS/GE. As background, we're a moderately sized (4000+ cores) academic cluster and have been running SGE for several years (we've been through versions 6, 6.1, 6.2, and are now running OGS

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-03 Thread Mark Dixon
it explained to me by reuti on this list a while ago...) Reuti's great for that :) Mark -- - Mark Dixon Email: m.c.di...@leeds.ac.uk HPC/Grid Systems Support Tel (int): 35429

[gridengine users] Incorrect share tree usage for task array jobs

2014-04-04 Thread Mark Dixon
Hi, I think we've been bitten by something that others have seen and brought up on this list over the years, where the amount of usage reported in the share tree can become unexpectedly large when using task array jobs. I am trying to reproduce this on a test install, to take a closer look at

Re: [gridengine users] Incorrect share tree usage for task array jobs

2014-04-07 Thread Mark Dixon
On Fri, 4 Apr 2014, Joshua Baker-LePain wrote: ... If you have any issues recreating this, let me know and I'll see if I can still do so. ... Thanks for the pointers :) Mark -- - Mark Dixon Email: m

Re: [gridengine users] Incorrect share tree usage for task array jobs

2014-04-07 Thread Mark Dixon
On Fri, 4 Apr 2014, Cameron Brunner wrote: Mark - I analyzed this issue on a UGE 8.1.4 cluster last fall and found that this process was the simplest way to get some overbooking reported. ... Thanks for the pointers :) Very useful Cameron and Joshua, thanks again for taking the trouble to

Re: [gridengine users] Incorrect share tree usage for task array jobs

2014-05-19 Thread Mark Dixon
On Fri, 4 Apr 2014, Cameron Brunner wrote: Mark - I analyzed this issue on a UGE 8.1.4 cluster last fall and found that this process was the simplest way to get some overbooking reported. I think any version of SGE with original share tree code will show this issue. The following steps show

Re: [gridengine users] Using ssh with qrsh and qlogin but disable users direct ssh

2014-10-13 Thread Mark Dixon
On Tue, 30 Sep 2014, Derrick Lin wrote: ... I am trying to configure SSH as underlying protocol for qrsh, qlogin. However, this requires allowing users to SSH into compute nodes. In such case, users can simply go to compute nodes with SSH, bypassing SGE (qrsh, qlogin etc). I am wondering what th

Re: [gridengine users] Using ssh with qrsh and qlogin but disable users direct ssh

2014-10-14 Thread Mark Dixon
On Mon, 13 Oct 2014, Prentice Bisbal wrote: ... I think what he wants to do is this, which is actually a pretty common desire: 1. Not let users ssh directly into cluster nodes and bypass the scheduler. 2. If a user is in a qrsh or qlogin session and has requested multiple nodes, for debugging p

Re: [gridengine users] sanity check on usage of "-p" priority value: per-user effect or global across waitlist?

2015-04-30 Thread Mark Dixon
On Thu, 30 Apr 2015, Chris Dagdigian wrote: ... - Does my use of "-p" to send lower-than-zero values for my submitted jobs affect just MY jobs and the order in which they get dispatched or will I end up penalizing myself globally because all the other jobs from other users on the cluster are ru

Re: [gridengine users] sanity check on usage of "-p" priority value: per-user effect or global across waitlist?

2015-04-30 Thread Mark Dixon
On Thu, 30 Apr 2015, Fritz Ferstl wrote: Nah, the weight_priority won't help. It just determines how much influence the -p has vs things like job wait time or urgency. If you have none of those then all being equal it would have the same effect as if you left it untouched. And if you have infl

Re: [gridengine users] Jobs in error state still attract policy tickets

2015-08-21 Thread Mark Dixon
On Tue, 4 Feb 2014, Mark Dixon wrote: ... Over the years, we've run various versions of SGE and SoGE with a share tree policy. I think it's always been the case that jobs in an error state still attract tickets from the share tree policy - despite the fact that the job isn't

Re: [gridengine users] how to get a particular job up the queue?

2015-11-10 Thread Mark Dixon
On Tue, 10 Nov 2015, Marlies Hankel wrote: ... We are using OGS/Grid Engine 2011.11. I have recently implemented a fair share policy which seems to work OK. However, on occasion, when a user comes up to a deadline I would like to advance them up the queue. Previously I could change their priori

Re: [gridengine users] Incorrect share tree usage for task array jobs

2015-11-18 Thread Mark Dixon
On Fri, 4 Apr 2014, Joshua Baker-LePain wrote: On Fri, 4 Apr 2014 at 8:45am, Mark Dixon wrote I think we've been bitten by something that others have seen and brought up on this list over the years, where the amount of usage reported in the share tree can become unexpectedly large when

[gridengine users] "Decoding gridengine" workshop

2016-08-24 Thread Mark Dixon
Hi there, Is there any interest for a meeting in the UK looking at the internals of gridengine? Potential topics might be: * Building from source * How the code is organised * How to debug or develop gridengine The principles discussed ought to be applicable to any flavour of gridengine that

Re: [gridengine users] Control tmpdir usage on SGE

2016-09-08 Thread Mark Dixon
On Thu, 8 Sep 2016, William Hay wrote: ... At present we're using a huge swap partition and TMPFS instead of btrfs. You could probably do this with a volume manager and creating a regular filesystem as well but it would be slower. ... Hi William, I always liked your idea for handling scratch s

Re: [gridengine users] Control tmpdir usage on SGE

2016-09-09 Thread Mark Dixon
On Thu, 8 Sep 2016, William Hay wrote: ... Remember tmpfs is not a ramdisk but the linux VFS layer without an attempt to provide real file system guarantees. It shouldn't be cached any more aggressively than other filesystems under normal circumstances. Most of the arguments against it seem to

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-04 Thread Mark Dixon
On Tue, 4 Oct 2016, Derrick Lin wrote: ... I have had a simple implementation working. Now I need to look at a situation when -pe is specified. It looks like the accurate way to determine host/slot allocation is to get from $pe_hostfile. But $pe_hostfile seems to be available only in start_proc

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-04 Thread Mark Dixon
On Tue, 4 Oct 2016, Reuti wrote: ... Do you mean your implementation or the general behavior? The $TMPDIR will be created when a `qrsh -inherit ...` spawns a process to a node and is removed once it returns. ... Hi Reuti, Thanks: I thought $TMPDIR only appeared on the host with the MASTER task

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-04 Thread Mark Dixon
On Tue, 4 Oct 2016, William Hay wrote: ... I have a per-job consumable and the TMPDIR filesystem is created on every node of the job. We have a (jsv enforced) policy that all multi-node jobs have exclusive access to the node and run on identical nodes so it works as a faux per-host consumable.

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-05 Thread Mark Dixon
On Tue, 4 Oct 2016, Reuti wrote: ... Yeah, I had the idea of different temporary directories some time ago, as some applications like Molcas need a persistent one across several `mpiruns` on each node and how to delete them again. ... Hi Reuti, Interesting! I wonder what happens with $TMPDIR

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-05 Thread Mark Dixon
On Wed, 5 Oct 2016, William Hay wrote: ... It was originally head node only so per job until a user requested local TMPDIR on each node so historical reasons. ... Hi William, What do you do with people who want to keep the contents of $TMPDIR at the end of the job? It's easy to use the epilog

Re: [gridengine users] Control tmpdir usage on SGE

2016-10-06 Thread Mark Dixon
On Wed, 5 Oct 2016, William Hay wrote: ... Our prolog and epilog (parallel) ssh into the slave nodes and do the equivalent of run-parts on directories full of scripts some of which check if they are running on the head node of the job before doing anything. If we did want the epilog to save TMP

[gridengine users] Requesting GPUs on the qsub command line

2017-02-13 Thread Mark Dixon
Hi, I've been playing with allocating GPUs using gridengine and am wondering if I'm trying to make it too complicated. We have some 24 core, 128G RAM machines, each with two K80 GPU cards in them. I have a little client/server program that allocates named cards to jobs (via a starter method

Re: [gridengine users] Requesting GPUs on the qsub command line

2017-02-14 Thread Mark Dixon
On Tue, 14 Feb 2017, William Hay wrote: ... We tweak the permissions on the device nodes from a privileged prolog but otherwise I suspect we're doing something similar. Hi William, Yeah, but I've put the permission tweaker in the starter, as that fits our existing model a bit better (looking

Re: [gridengine users] Requesting GPUs on the qsub command line

2017-02-14 Thread Mark Dixon
On Tue, 14 Feb 2017, William Hay wrote: ... qsub -ac Template=GPU -l gpu=1 script Your jsv could spot that a Template had been requested and fill in sensible defaults based on other requests. If no template is requested users have access to the full power of this fully operational command l

Re: [gridengine users] Requesting GPUs on the qsub command line

2017-03-03 Thread Mark Dixon
On Tue, 14 Feb 2017, William Hay wrote: ... options nvidia NVreg_ModifyDeviceFiles=0 ... Hi William, Many thanks, much appreciated :) Mark

Re: [gridengine users] Requesting GPUs on the qsub command line

2017-03-16 Thread Mark Dixon
On Tue, 14 Feb 2017, William Hay wrote: ... Our prolog does a parallel ssh (passing through appropriate envvars) into every node assigned to the job and does the equivalent of a run-parts on a directory filled with scripts. Some of these scripts check if they are running on the head node. (be

Re: [gridengine users] Random issues with SoGE

2017-03-29 Thread Mark Dixon
On Mon, 27 Mar 2017, Joshua Baker-LePain wrote: Investigating my odd 'some commands don't generate output' issue revealed a couple other issues. First, on my old cluster (running OGS 2011.11p2), only a very few hosts are admin hosts -- notably, the exec nodes are not. On those nodes, 'qconf -

Re: [gridengine users] Problems with quotas

2018-03-22 Thread Mark Dixon
Hi, It's this bit that's doing it: "SGE_ND=true". It's there so that the qmaster doesn't daemonise, in order to play nicely with systemd. Unfortunately, as it was originally put in to aid debugging, it also enables some debug messages. If too much is being generated, I'd suggest either redir

Re: [gridengine users] Problems with quotas

2018-03-23 Thread Mark Dixon
Hi Jakub, That's right: if you need to cut down the logging, one option is to add the redirection in the start script. You're looking for the line starting "sge_qmaster", and you might want to try adding a ">/dev/null" after it. You'll lose all syslog messages from sge_qmaster though (normal

Re: [gridengine users] Corrupt user config?

2018-04-16 Thread Mark Dixon
Hi William, I've seen this before back in the SGE 6.2u5 days when it used to write out core binding options it couldn't subsequently read back in. IIRC, users are read from disk at startup in turn and then the files are only written to from then on - so this sort of thing only tends to be no

Re: [gridengine users] Corrupt user config?

2018-04-16 Thread Mark Dixon
On Mon, 16 Apr 2018, William Hay wrote: ... I don't think that can be right given that the qmaster complains about multiple user files on start up. If it gave up after the first then presumably it wouldn't complain about the others. All I know is that, when we had this sort of problem, most o

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-19 Thread Mark Dixon
On Tue, 17 Apr 2018, Joshua Baker-LePain wrote: As an alternative to fixing our current setup, I'd be most interested to hear if/how other folks are handling GPUs in their SoGE setups. I was considering changing the slot count in gpu.q to match the number of GPUs in a host (rather than CPU core

Re: [gridengine users] Possible opportunity for development work

2018-05-14 Thread Mark Dixon
Hi Daniel, Well done on wanting to work on gridengine, it's really good to see people interested. Although the topmost layers have clearly suffered from years of applying patches on top of patches on top of patches and so are in sore need of a bit of refactoring, there are some really nice b

Re: [gridengine users] Scheduling maintenance and using advance reservation

2018-06-06 Thread Mark Dixon
On Tue, 5 Jun 2018, Ilya M wrote: ... Is there a way to submit AR when there are projects attached to queues? I am using SGE 6.2u5. ... Hi Ilya, I've run into this, too: I'm afraid that there isn't. I logged it here: https://arc.liv.ac.uk/trac/SGE/ticket/1466 I started to fix it but ran out

Re: [gridengine users] Scheduling maintenance and using advance reservation

2018-06-07 Thread Mark Dixon
projects from queues' configuration. Ilya. On Wed, Jun 6, 2018 at 2:41 AM, Mark Dixon wrote: On Tue, 5 Jun 2018, Ilya M wrote: ... Is there a way to submit AR when there are projects attached to queues? I am using SGE 6.2u5. ... Hi Ilya, I've run into this, too: I'm afraid that
