Re: [gridengine users] [SGE-discuss] sge in Fedora/EPEL7

2015-02-25 Thread Dave Love
Orion Poplawski or...@cora.nwra.com writes: This would be fairly manual, something like: - Backup /usr/share/gridengine, /var/spool/gridengine/cell - use yum swap/shell to replace the gridengine-* packages with sge-* packages - Move the old /usr/share/gridengine/, /var/spool/gridengine/cell

Re: [gridengine users] suspended jobs continue to run

2014-12-19 Thread Dave Love
berg...@merctech.com writes: We've got a job that was suspended via: qmod -sj $jobid that's continuing to run. The job consists of a BASH script, which in turn submits other jobs in a loop, sleeping for 30 seconds after each loop. When I examine the job status on the node where it

[gridengine users] SGE 8.1.8 available

2014-11-04 Thread Dave Love
Version 8.1.8 of the Son of Grid Engine distribution is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.8/, with a few bug fixes and enhancements. The packaging now should work on any recent Fedora and Debian, and there are now RPMs for all the current Fedora/EPEL releases

[gridengine users] SGE 8.1.7 available

2014-06-03 Thread Dave Love
Version 8.1.7 of the Son of Grid Engine distribution is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.7/, with various bug fixes and enhancements. A notable fix is for the longstanding problem of occasional major space leaks in qmaster with schedd_job_info=true. Please report

Re: [gridengine users] SGE 8.1.7 available

2014-06-03 Thread Dave Love
I discovered too late that this release doesn't build with later (than in RHEL6 and Debian wheezy) versions of GNU binutils, due to one test program. You can either just delete the dependence on test_drmaa.1.0 in libs/japi/Makefile or patch it like

[gridengine users] Son of Grid Engine 8.1.6 available

2013-11-04 Thread Dave Love
SGE 8.1.6 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.6/, fixing various bugs. Please report bugs, patches and suggestions for enhancement https://arc.liv.ac.uk/trac/SGE#mail. Release notes: * Bug fixes * Man and message fixes * Fix building of patched openssh for

Re: [gridengine users] Can a 6.2u5 node talk to a 6.2u3 cluster?

2013-10-02 Thread Dave Love
You can't, in general, mix grid engine versions -- not 6.2u5 with anything earlier, I'm pretty sure. The qmaster messages file probably says so. Trying to can crash the qmaster https://arc.liv.ac.uk/trac/SGE/ticket/1441. Alan McKay alan.mc...@gmail.com writes: On Mon, Sep 23, 2013 at 11:00

Re: [gridengine users] Help with stdin

2013-10-02 Thread Dave Love
François-Michel L'Heureux fmlheur...@datacratic.com writes: Hi I need help to write to stdin of a qrsh job. The easiest example I came with is: echo test | qrsh cat It prints test as expected but then it hangs. That was fixed in SGE 8.0something; download link via URL below. --

Re: [gridengine users] qlogin sets TERM var to dumb?

2013-10-02 Thread Dave Love
Prentice Bisbal prentice.bis...@rutgers.edu writes: Hi, everyone, After not being on this list for about 15 months, I'm now supporting SGE again and need your assistance with a problem. The problem I have is the same one Alex Chekholko reported just about a year ago:

Re: [gridengine users] How to manage grid nodes

2013-10-02 Thread Dave Love
Lionel SPINELLI spine...@ciml.univ-mrs.fr writes: Hello all, I have a question that is not directly linked to SGE but relates to the same. Which tool administrators that have to install, manage, configure and ensure coherence between lot of grid nodes use? I mean, if I have 10 nodes in my

Re: [gridengine users] Job Checkpoint: BLCR or DMTCP ?

2013-10-02 Thread Dave Love
Joseph Farran jfar...@uci.edu writes: [Please don't post content-type: text/html.] Hi all. We have Grid Engine 8.1.4 running on a cluster with CentOS 6.4, using kernel 2.6.32-358.18.1. We are just getting started on setting up job checkpoint. We got BLCR compiled and are currently

[gridengine users] SGE 8.1.5 available

2013-09-29 Thread Dave Love
Son of Grid Engine 8.1.5 is now available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.5/. The changes are mainly to fix build problems, particularly on MS Windows: Version 8.1.5 - * Bug fixes * Fix strsignal-related build failure on MS Windows, at least * Fix MS

Re: [gridengine users] start_gui_installer SGE 8.1.4

2013-09-19 Thread Dave Love
Joseph Farran jfar...@uci.edu writes: Howdy. We are running Son of Grid Engine 8.1.3. I compiled 8.1.4 and downloaded and un-tar the gui_installer-8.1.4.tar into the compiled directory. Why do you want the GUI installer if you're doing an upgrade? Does it actually support upgrades?

Re: [gridengine users] start_gui_installer SGE 8.1.4

2013-09-19 Thread Dave Love
Adam Brenner aebre...@uci.edu writes: FYI: The fix for this was to include the IzPack jar files into CLASSPATH: export CLASSPATH=/path/to/IzPack/lib/*:$CLASSPATH; ./start_gui_installer You don't need an installed IzPack, just the installer .jar. Perhaps the problem is that you have one, and

Re: [gridengine users] reservation progress

2013-09-16 Thread Dave Love
William Hay w@ucl.ac.uk writes: my second problem, i have to find out the progress of reserving. may be a job requests 4 slots, i have to find out the time when it starts reserving, when it has got 2 slots, when it has got the 3rd slot and so on. This appears to embody a slight

Re: [gridengine users] sge_execd dies silently with 0 exit status

2013-09-16 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Please have a look at your /tmp. The starting execd will write the cause of not being able to start in a file therein. For what it's worth, that depends on the version. sge-8.0.0e+ writes to syslog, as you'd expect a daemon to. (The previous

Re: [gridengine users] Upgrading form SGE to OGE

2013-09-16 Thread Dave Love
Txema Heredia Genestar txema.llis...@gmail.com writes: Hi all, I have a cluster in production running rocks-cluster 6.0 using SGE6.2u5. SGE6.2u5 has a bug that kills the qmaster when an amount of jobs using both -pe and -hold_jid are used. OGE (theoretically) has this bug fixed. OGE is

Re: [gridengine users] Upgrading form SGE to OGE

2013-09-16 Thread Dave Love
Tina Friedrich tina.friedr...@diamond.ac.uk writes: Hi Txema, I recently upgraded our Grid Engine from SGE6.2u4 (I think it was) to OGE8.1.3. I think that means SGE 8.1.3. SGE6.2 and OGE8.1.3 execd's happily coexist on the same nodes; however, I found that the qmaster processes don't,

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-16 Thread Dave Love
Brendan Moloney molo...@ohsu.edu writes: Hello, I use multiple queues to divide up available resources based on job run times. Large parallel jobs will typically span multiple queues and this has generally been working fine thus far. I'd strongly recommend avoiding that. Another reason

Re: [gridengine users] Web monitor xml-qstat or similar alternatives

2013-09-16 Thread Dave Love
Guillermo Marco Puche guillermo.ma...@sistemasgenomicos.com writes: Hello, I'm very interested in a web application like xml-qstat: http://xml-qstat.org/. Is there any other alternative to this project (I would like to know the available options before start installing stuff on my

Re: [gridengine users] reservation progress

2013-09-16 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: You can switch on params MONITOR=1 in `qconf -msconf` (`man sched_conf`), And perhaps use qsched(1) from recent SoGE to analyze the results. -- Community Grid Engine: http://arc.liv.ac.uk/SGE/

Re: [gridengine users] qsub -tc / max tasks not working on SoGE 8.1.x

2013-09-16 Thread Dave Love
Black, Chris chris.bl...@roche.com writes: The qsub cmdline I am using is: qsub -q rnd.q -t 1-50 -tc 2 -N cbTA -j y -cwd /path/to/task.sh task.sh just contains some echos and a sleep 600. On our 6.2u4 cluster, the scheduler properly only runs two of the 50 tasks at once. On SoGE 8.1.x all

Re: [gridengine users] adaptive computing spam?

2013-09-06 Thread Dave Love
ChrisDag d...@sonsorol.org writes: Kind of funny as I have an active consulting gig right now to move a financial services group away from Torque/Moab due to poor performance. I heard poor reports of Moab a while ago from a site which was supposed to be a flagship for it. Anyone else get

Re: [gridengine users] [SGE-discuss] [SGE] Does SGE filter the reboot command

2013-09-06 Thread Dave Love
Joe Borġ m...@jdborg.com writes: Yep, removing that from the sudoers file has fixed the issue! Thanks guys. Regards, Joseph David Borġ http://www.jdborg.com On 5 September 2013 13:54, Joe Borġ m...@jdborg.com wrote: I'm about to try hashing out `requiretty` in the sudoers file. I

Re: [gridengine users] Dead nodes running jobs

2013-09-06 Thread Dave Love
François-Michel L'Heureux fmlheur...@datacratic.com writes: While investigating jobs that have been running for way too long, I've found out that qhost shows nodes that are dead with alive stats such as load, memuse and swapus. qstat also shows them processing jobs with state r, as if the

Re: [gridengine users] Core binding and qrsh

2013-09-06 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: I don't know whether it will be different in old versions, but that's what I'd expect. Argh, true - I missed the 6 and tried with 1. - Reuti I know the feeling!

Re: [gridengine users] Manage ressouce

2013-09-06 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: If you want to keep a dedicated CPU and its cores out of the business: you can try to set an affinity mask on these CPUs before you start the sge_execd. I would expect that all forks of this process will inherit this. The shepherd won't take any

[gridengine users] Gold integration (was: Implementing user chargeback in 6.1u4)

2013-09-06 Thread Dave Love
William Hay w@ucl.ac.uk writes: You could have a look at Gold http://www.clusterresources.com/products/gold-allocation-manager.php We've been integrating it into our SGE cluster/JSV recently for a similar but slightly different usage but have yet to use it in anger. The Gold docs

[gridengine users] SGE 8.1.4 available (milestone release)

2013-09-05 Thread Dave Love
Son of Grid Engine 8.1.4 is now available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.4/. The changes are mainly bug fixes plus some minor enhancements. See below for the significant ones. There are various binaries available as well as the source, including Debian ARM packages in

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-28 Thread Dave Love
berg...@merctech.com writes: In a continuing effort to resolve the problem where the queue gets put into an error state with the message (in qstat): "can't open file /opt/sge/6.2u5/default/spool/r820-1/active_jobs/93629.1/pe_hostfile: Permission denied" I've enabled

Re: [gridengine users] Exclude job from rule.

2013-08-28 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: • Job comes back to R status. Do you use any checkpointing interface, to restart the job? If so, it should output Rr in `qstat` instead of a plain R for the SGE job state. No, I don't use any checkpointing interface. Then the state should

Re: [gridengine users] Core binding and qrsh

2013-08-28 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Am 26.08.2013 um 15:57 schrieb Julien Nicoulaud: I meant non-interactive qrsh, eg: $ qrsh -cwd -now no -binding env linear:6 -V -b yes /bin/env = No SGE_BINDING env I don't know whether it will be different in old versions, but that's what I'd

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-28 Thread Dave Love
Jewell, Chris c.p.jew...@massey.ac.nz writes: The message is from a failure of setuid(2) or similar. I don't know if it's a libc bug that errno seems not to be set (Success) as it should be. The two possible cases are: EAGAIN The uid does not match the current uid and uid brings

Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

2013-08-23 Thread Dave Love
Sorry, I apparently missed this before. William Hay w@ucl.ac.uk writes: On 01/08/13 04:09, Jewell, Chris wrote: Hello all, A while since I posted here, so good to be back! My installation of GE 8.1.3 from the Scientific Linux 6.3 RPM

Re: [gridengine users] Use cgroup functionality using prolog

2013-08-23 Thread Dave Love
William Hay w@ucl.ac.uk writes: You could just add a starter_method to your queue configuration. This gets run instead of your command and passed enough info to recreate the command that would have been run. We do something similar with cpusets rather than cgroups. I'd expect the

Re: [gridengine users] [SGE-discuss] variable getting truncated in soge8.1.3 and OGS 2011.11p1

2013-08-22 Thread Dave Love
Ed Lauzier elauzi...@perlstar.com writes: Hi Dave, I found the section where the static buffer is defined. I'm thinking on the best way to handle this for the patch. Having a static buffer is fine especially for the size it is. Properly handling variables that exceed this limit is where

Re: [gridengine users] Use cgroup functionality using prolog

2013-08-22 Thread Dave Love
Jean-Baptiste Denis jbde...@pasteur.fr writes: Hello everybody, I'm using OGS/GE 2011.11p1 on Linux and I want to play with cgroups. What do you want to do with them? (They're not without problems.) My idea was simply to dynamically create a cgroup on an execution node per JOBID using a

Re: [gridengine users] Use cgroup functionality using prolog

2013-08-22 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Hi, Am 22.08.2013 um 10:12 schrieb Jean-Baptiste Denis: Hello everybody, I'm using OGS/GE 2011.11p1 on Linux and I want to play with cgroups. My idea was simply to dynamically create a cgroup on an execution node per JOBID using a prolog (and

Re: [gridengine users] Re : Question about load average and slots and non-SGE-managed tasks...

2013-05-23 Thread Dave Love
Stephen Spencer spen...@cs.washington.edu writes: Dave, [...] don't let users use direct ssh to compute nodes. isn't an option. How might I configure SGE to take into account the current node load - it sounds like you may know. Reuti pointed at the documentation, but you seem to have a

Re: [gridengine users] Grid Engine accounting question

2013-05-23 Thread Dave Love
Jesse Becker becker...@mail.nih.gov writes: That rather depends on how unsafe it was to start with... Nah, 5 halflives is 5 halflives. It's just a question of *time* more than initial danger. ;-) Doubtless. I've learned that we physicists don't understand such things. This is especially

Re: [gridengine users] building Son of Grid Engine

2013-05-23 Thread Dave Love
Tina Friedrich tina.friedr...@diamond.ac.uk writes: The real and effective user is not root, and never was. Never caused us any problems. The NFS share is exported with root_squash. If that's the spool area, it means you get world-writable files in it. This is quite interesting. And all

Re: [gridengine users] Have qhost -xml slot reporting bug?

2013-05-23 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: That isn't getting to my question though. The man page says the number of slots reserved for the job is the number of slots requested with the -pe switch + 1. But I don't see any evidence of that, unless I don't understand what slots reserved

Re: [gridengine users] problem with output format of scripts run by the queue

2013-05-23 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: The file was created on Windows and the lines end in CR/LF, the ? represents the CR. There is a tool dos2unix (or use an `sed` command) to remove these characters from the script file. I use JSV to catch that: # Check for the common case of
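The check described above can be sketched in plain bash. This is only an illustration, not the JSV code from the post (which is cut off here): the function names `has_cr` and `strip_cr` are hypothetical, and the in-place `sed -i` form assumes GNU sed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed (exit status 0) if any line of the file ends in CR.
has_cr() {
    # $'\r$' is bash quoting for "carriage return at end of line";
    # grep -q prints nothing and only sets the exit status.
    grep -q $'\r$' -- "$1"
}

# Hypothetical helper: remove a trailing CR from every line, in place
# (assumes GNU sed for -i; dos2unix does the same job where installed).
strip_cr() {
    sed -i $'s/\r$//' -- "$1"
}
```

A submission wrapper could call `has_cr job.sh` and refuse the job, or run `strip_cr job.sh` first; which of the two is preferable is a site policy choice.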

Re: [gridengine users] cgroups integration with ogs 2011.11p1

2013-05-23 Thread Dave Love
Ray Pete rp...@broadinstitute.org writes: Udo, *Assuming you are running a RH like distro and your service cgconfig is running* It was quite some time ago I set this up but If I remember I had trouble with 2011.11p1 for some reason. ( probably user config error on my part :-) ) I don't

Re: [gridengine users] building Son of Grid Engine

2013-05-15 Thread Dave Love
Tina Friedrich tina.friedr...@diamond.ac.uk writes: Hello list, have finally decided to look into upgrading our SGE6.2 installation - mainly to see if it helps with my job scheduling problem. From which version? It should be essentially trivial if you're upgrading from 6.2u5 (after doing

Re: [gridengine users] SGE 8.1.3 Qmaster RPMS have dependencies on Flac, Ogg and PulseAudio ?

2013-05-15 Thread Dave Love
Kevin Buckley kevin.buckley.ecs.vuw.ac...@gmail.com writes: FLAC ? Ogg ? PulseAudio ? So then I read the way the dependencies have developed and it seems that it's down to the requirement for Java, as satisfied in the CentOS 6.4 distro. -- Processing Dependency: java = 1.6.0 for package:

Re: [gridengine users] Allow users to choose how many concurrent jobs they have

2013-05-15 Thread Dave Love
Txema Heredia Genestar txema.llis...@gmail.com writes: Hi all, I was wondering if there is any way to allow a user to choose how many jobs they want to have running concurrently in the cluster. I am aware that I, as an administrator, can specify limits in the slot usage for each user with

Re: [gridengine users] Have qhost -xml slot reporting bug?

2013-05-15 Thread Dave Love
Orion Poplawski or...@cora.nwra.com writes: On 05/13/2013 09:40 AM, Orion Poplawski wrote: Would it be possible for qhost -xml output to include the number of slots used by a job on that host? Okay, I see how it is done - there are multiple job entries for each slot. Indeed, like the

Re: [gridengine users] qstat generates invalid XML

2013-05-15 Thread Dave Love
Esztermann, Ansgar ansgar.eszterm...@mpibpc.mpg.de writes: Hello List, the JATASK problem is back. It seems to be limited to jobs in E state (but does not disappear upon qmod -c). Is anyone willing to check this? I can't reproduce it, and I've fixed all the cases of malformed XML I know of

Re: [gridengine users] Re : Question about load average and slots and non-SGE-managed tasks...

2013-05-15 Thread Dave Love
Farid Chabane farid.chab...@ymail.com writes: Hi Stephen, Yes, SGE take into account the current load of nodes even if the load was caused with a non-SGE-job. That depends on how you configure it. I'd say don't let users use direct ssh to compute nodes.

Re: [gridengine users] Grid Engine accounting question

2013-05-15 Thread Dave Love
Brian McNally bmcna...@uw.edu writes: It seems that using qacct to display overall usage per user (-o), for example, might be a little misleading if the actual accounting information is stored internally. Users might draw conclusions about their usage and how that'll impact their job

Re: [gridengine users] Grid Engine accounting question

2013-05-15 Thread Dave Love
Brian McNally bmcna...@uw.edu writes: See my comment to Reuti about halflife and why that seems confusing. For GE to penalize users beyond one halflife, it'd have to retain job data for longer than one halflife. ?? All it has to do is to decay/accumulate over the relevant interval. You can
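As a worked illustration of the "decay/accumulate" point: a single running total per user is enough, because each update only needs the previous total, the elapsed time, and the new usage. The sketch below assumes plain exponential decay with a half-life; the exact formula Grid Engine applies is not given in this thread, and the function name `decayed_usage` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: decay a stored usage total and add new usage.
# Arguments: previous total, elapsed time, half-life (same time unit), new usage.
decayed_usage() {
    awk -v u="$1" -v dt="$2" -v hl="$3" -v add="$4" \
        'BEGIN { printf "%.3f\n", u * exp(-log(2) * dt / hl) + add }'
}

decayed_usage 100 1 1 0    # one half-life later: prints 50.000
decayed_usage 100 5 1 0    # five half-lives later: prints 3.125
```

No per-job history is kept in this scheme, yet usage from more than one half-life ago still contributes, just with a smaller weight.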

Re: [gridengine users] Making the fair-share policy/scheduler algorithm more fair

2013-05-01 Thread Dave Love
Mark Dixon m.c.di...@leeds.ac.uk writes: We use the share tree here, rather than the functional policy, so this might not be applicable. By default, the usage of a job is wholly based on slots*seconds. I think it's only (effectively) slots if you use sharetree_reserved_usage, as we do. I'm

Re: [gridengine users] set processor affinity inside gridengine

2013-05-01 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Does anyone know a way to define this option in the gridengine directly ? I'm using ogs 2011.11p1 on Intel 6core processors. With 6 cores per machine I would simply request full nodes. [I guess they're two-socket; we don't insist on filling those,

Re: [gridengine users] power management

2013-04-29 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Do you mean it takes resource requests and their reservation into account properly? No, not directly. It scans the output of `qstat` of any given queue for waiting jobs (hence: which may run in this queue) and bases the decision on this. I'm sure

Re: [gridengine users] GE Network license

2013-04-29 Thread Dave Love
Quy NGUYEN DAI clus...@gdtech.eu writes: Hi all, I just setup Son GE on a small Linux cluster. It works well. Now I would like to setup this GE with LS-DYNA which uses a Network License server on LAN. Already look for that on Internet but I can not found any example to do that. Can GE

Re: [gridengine users] how to reserve all cluster slots for maintenance?

2013-04-29 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: I don't use calendars, and I didn't know there was that interaction with reservation, but I suppose it makes sense in terms of implementing look-ahead. Is it documented somewhere already? It seems to be an effect of the enabled backfilling by this

Re: [gridengine users] Making the fair-share policy/scheduler algorithm more fair

2013-04-29 Thread Dave Love
Jake Carroll jake.carr...@uq.edu.au writes: Hi Grid-Engine gurus of the grid engine list. I'll answer anyway. I have a question about the fair-share policy and the subsequent algorithm/ratios it uses to manage user workload. Currently, we've set our grid engine up for fair-share in a very

Re: [gridengine users] run gamess under GE

2013-04-25 Thread Dave Love
Mahbube Rustaee rust...@gmail.com writes: Hi all, How can run GAMESS under GE? You need to be specific about which GAMESS, and how it was built, at least.

Re: [gridengine users] PE only offers 0 slots

2013-04-25 Thread Dave Love
Jesse Becker becker...@mail.nih.gov writes: This is a problem that has plagued me, and various people time and again, and every time it gets fixed, the method and cause seems to get lost in the aether. I've hit it several times over the years, and each time the problem and solution seem to

Re: [gridengine users] WALLTIME by qacct?

2013-04-25 Thread Dave Love
Sangamesh Banappa sangamesh.bana...@locuz.com writes: Hi, I need some details on the accounting data captured by GridEngine. The qacct output for the last 90 days is: Please make a bug report with suggestions for improvement if it's not adequately explained under

Re: [gridengine users] Threshold T State Locks User out of SSH to Execution Host - How to Disable?

2013-04-25 Thread Dave Love
Adam Brenner aebre...@uci.edu writes: In our case, it is very helpful for our users to directly SSH into the nodes to determine what is wrong with their qsub scripts, etc. This is a follow-up to the following thread by Joseph and Harry:

Re: [gridengine users] Execd on Windows 7 using SFU

2013-04-25 Thread Dave Love
Joe Borġ m...@jdborg.com writes: Hi Guys, I was wondering if anyone has any guidance with installing execd onto Windows 7 using Microsoft's SFU. In what respect? If you want to build it, see source/README.windows in SGE 8.1.3. Otherwise you probably need to look at the Oracle docs for more

[gridengine users] Debian packages (was: Open MPI jobs randomly fail to run)

2013-04-25 Thread Dave Love
Bernard Massot bernard.mas...@u-psud.fr writes: I'm using gridengine 6.2u5 on Debian Squeeze. I recommend not using that for various reasons. You can build Debian packages from the SGE 8.1.3 distribution (installing into /opt). Unfortunately it looks as if SGE in Debian is dead -- no-one seems

[gridengine users] TMPDIR naming change (was: orphan tmp directories)

2013-04-25 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: You mean one level above - in /tmp or alike? The $TMPDIR name is usually $JOB_ID.${SGE_TASK_ID/undefined/1}.$QUEUE. I've changed that to use the cell name, not the queue for SGE 8.1.4. If anyone thinks that will break more than it fixes, please speak
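The naming pattern quoted above relies on bash pattern substitution: SGE_TASK_ID holds the string "undefined" for non-array jobs, and `${SGE_TASK_ID/undefined/1}` maps that to 1. A minimal sketch, with a hypothetical function name `tmpdir_name` and made-up job values, shows the pre-8.1.4 form using the queue name:

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the per-job directory name from
# job ID, task ID and queue name, following the quoted pattern.
tmpdir_name() {
    local job_id=$1 task_id=$2 queue=$3
    # Replace the literal string "undefined" (non-array job) with 1.
    printf '%s.%s.%s\n' "$job_id" "${task_id/undefined/1}" "$queue"
}

tmpdir_name 4711 undefined all.q   # prints 4711.1.all.q
tmpdir_name 4711 7 all.q           # prints 4711.7.all.q
```

Scripts that reconstruct this name themselves, rather than reading $TMPDIR, are the ones that would notice the change from queue name to cell name.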

Re: [gridengine users] orphan tmp directories

2013-04-25 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: On the exechost? I don't do it at all on a per job basis. In case your users fight for the disk space you can implement a consumable for the disk space in combination with a load sensor:

Re: [gridengine users] how to reserve all cluster slots for maintenance?

2013-04-25 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: I guess a calendar is the simplest option then? Will SGE refrain from scheduling jobs if I create a maintenance calendar and connect it to all queues? If max_reservation is set to a value different from zero: yes. Then it will take the set calendar

Re: [gridengine users] suspension / load balancing problem

2013-04-25 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: For serial and parallel SMP jobs there is: http://wiki.gridengine.info/wiki/index.php/StephansBlog as an option. Maybe instead of using slots a custom complex is necessary which is called medium and all jobs of this type have to request it. This is

Re: [gridengine users] power management

2013-04-25 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Some time ago Fritz mentioned that SCM to control it is abandoned and Univa integrated something new to control it. SDM is still available if you want it, and the others I'm aware of are on http://arc.liv.ac.uk/SGE/tools.html, apart from one funded with

Re: [gridengine users] Monitor GPU memory usage per job

2013-04-25 Thread Dave Love
Nicolás Serrano Martínez-Santos nserr...@dsic.upv.es writes: Hi, We are currently using SGE6.2u5 in our little cluster (~150 cores) and I am trying to configure it to manage GPU correct usage. I have been able to define multiple slots for each GPU card and also to reserve memory using

Re: [gridengine users] Monitor GPU memory usage per job

2013-04-25 Thread Dave Love
Stephen Willey step...@esstec.co.uk writes: You could use a load sensor to do this. We use one to detect if people are logged in and suspend/requeue the jobs if someone logs in while a job's on their workstation. I don't understand how that addresses the question (as I understand it).

Re: [gridengine users] SGE 8.1.3 and high sge_execd load

2013-03-24 Thread Dave Love
Thomas Mainka t.mai...@science-computing.de writes: Hi everyone, currently I am trying to upgrade an old SGE cluster from an 6.2 release to SGE 8.1.3. The upgrade itself is no problem, however the sge_execd of 8.1.3 consumes a lot of CPU on the exec host when a job is running--sometimes

Re: [gridengine users] SGE 8.1.3 and USE_CGROUPS sets hosts in error state

2013-03-24 Thread Dave Love
Mikael Brandström Durling mikael.durl...@slu.se writes: Hi sge users, I have been testing the USE_CGROUPS option that is available to execd. When USE_CGROUPS is enabled it works fine to submit jobs one by one. But when I submitted 70 serial jobs, all queues on all hosts were set to error

Re: [gridengine users] gridengine 8.1.3 and queue in alarm state: null np_load_avg

2013-03-24 Thread Dave Love
Stefano Bridi stefano.br...@gmail.com writes: Many thanks, in this way it works.. strange behaviour anyway.. stef It was the result of trying to accommodate the system distributed over different locales with the floating point data exchange in text format (ugh) -- from long experience of that

Re: [gridengine users] Mark an execution host as 'errored' in case of a NIS error

2013-03-24 Thread Dave Love
Campbell McLeay campbell.mcl...@primefocusworld.com writes: Thanks for all the suggestions. In the end the easiest was to just monitor the qmaster log and run a qmod -sq on the execution host to suspend it, then a qmod -cj to resubmit the job. This will do until sssd gets fixed Off topic,

Re: [gridengine users] what is IO in qstat

2013-03-24 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Hi, Am 20.03.2013 um 15:58 schrieb Lars van der bijl: hey everyone, a few weeks ago we were having issues with users submitting jobs that did massive IO to the file system. each task was writing out about 4 GB of data and reading in about 2GB. as

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-08 Thread Dave Love
and I'm surprised the problem hasn't come to light before. Thanks. * Don't pass any user environment to remote startup daemons -- better fix for half of CVE-2012-0208 I think this change should fix it, but I can't test it immediately. Fri Mar 8 22:04:13 GMT 2013 Dave Love d.l

Re: [gridengine users] Limiting number of cores used

2013-03-08 Thread Dave Love
Jim Phillips j...@ks.uiuc.edu writes: The slots feature is only needed to support pe size ranges, e.g., qsub -pe smp 10-20. If so, I wouldn't have bothered with the small effort to do it. Otherwise you can set the binding size to match the pe size in a jsv script. Multi-host jobs should

Re: [gridengine users] The newbie is back

2013-03-08 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: You'll be lucky if RPMs work on a Debian system. Well, `rpm2cpio` or `alien` could help. I meant because of shared library names and versions rather than the packaging.

Re: [gridengine users] SGE-6.2u5: express resource ignored - job suspends itself

2013-03-06 Thread Dave Love
Erik Soyez e.so...@science-computing.de writes: Thanks Reuti for your quick-as-always answer! I have added an additional express-PE for each existing PE which should ensure that express jobs really stick to the express queue; users submit with wildcard-PEs anyway, so they don't need to

Re: [gridengine users] QRSH exiting with 0 but sets Error on queue

2013-03-06 Thread Dave Love
Ian Johnson ian.john...@capita-ti.com writes: Reuti, Problem solved. It was a hostname lookup problem. Once all the hosts had correct hosts files qrsh can now connect to any slave from all submit hosts. Thank you very much for your help over the last few days. Below is the trace from

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-06 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: I can't reproduce that (with openmpi tight integration). Doing this (which gets three four-core nodes): qsub -pe openmpi 12 -l h_vmem=256M echo Script $(hostname): $TMPDIR $NSLOTS ulimit -v for HOST in $(tail -n +2 $PE_HOSTFILE|cut -f1 -d' ');
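The host loop in the quoted test script takes every line of $PE_HOSTFILE except the first and keeps the first space-separated field. A minimal sketch of just that step, assuming the usual one-host-per-line hostfile layout (host, slot count, queue, processor range) and using a hypothetical function name and made-up host names:

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the host names from a PE hostfile,
# skipping the first line (normally the host running the job script).
pe_remote_hosts() {
    tail -n +2 -- "$1" | cut -f1 -d' '
}

# Demonstration with a fabricated hostfile for three four-core nodes.
demo=$(mktemp)
printf '%s\n' \
    'node01 4 all.q@node01 UNDEFINED' \
    'node02 4 all.q@node02 UNDEFINED' \
    'node03 4 all.q@node03 UNDEFINED' > "$demo"
pe_remote_hosts "$demo"
rm -f "$demo"
```

Inside a job the same function would be called as `pe_remote_hosts "$PE_HOSTFILE"`, for example to drive a `qrsh -inherit` call per remote host.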

Re: [gridengine users] Limiting number of cores used

2013-03-05 Thread Dave Love
Jim Phillips j...@ks.uiuc.edu writes: Just enable core binding in the scheduler and use a jsv script to force binding for all jobs. Jobs run only on their assigned set of cores regardless of how many threads they launch. I don't know what platform that refers to, but it is simply false on

Re: [gridengine users] The newbie is back

2013-03-05 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: Am 01.03.2013 um 15:17 schrieb Jacques Foucry: It looks like qmaster is not running # ps aux | grep qmaster root 2218 0.0 0.0 7548 832 pts/0 S+ 17:32 0:00 grep qmaster So I try to launch it: # /etc/init.d/gridengine-master start

Re: [gridengine users] Different Error Codes for Job Failure

2013-03-01 Thread Dave Love
Kshitiz B kshiti...@tcs.com writes: How to distinguish between the following scenarios which leads to job deletion : 1. Slave Node Failure 2. Master/Shepherd Node Failure 3. Job deleted by User 4. Job deleted by Admin I just answered on the SGE list; please don't post separately to both

Re: [gridengine users] Limiting number of cores used

2013-03-01 Thread Dave Love
Raymond Wan rwan.w...@gmail.com writes: RQS sounds like what I'm looking for (for #1) -- the problem I'm having is that users are being a bit greedy when I'm not watching. It doesn't seem to have been said that RQS only affects scheduling the job, not what happens at run time. It specifically

Re: [gridengine users] Slots for queues

2013-03-01 Thread Dave Love
Gaya Nadarajan gay...@ed.ac.uk writes: Hi, I'm from Edinburgh but I'm not using the cluster here, I have a private cluster set up for a project elsewhere. Anyhow, the problem is associated with the execution daemon failing to acknowledge jobs sent from the qmaster, causing some jobs to be

Re: [gridengine users] Issue in Distributed jobs

2013-03-01 Thread Dave Love
Jim Phillips j...@ks.uiuc.edu writes: That error means that the process launched by qrsh on node09 exited before the rest of the slots so qmaster killed everything for you. It's not just that it exited, but that it failed for some reason. I'd set the log level to info and maybe check the

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-01 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: On 27.02.2013 at 20:56, Mikael Brandström Durling wrote: [snip] In case you look deeper into the issue, it's also worth noting that there is no option to specify the target queue for `qrsh -inherit` in case you get slots from different queues on

[gridengine users] SuSE support (was: Grid Scheduler question)

2013-03-01 Thread Dave Love
rk mag hastyt...@gmail.com writes: hi guys.. i am really just new in using a mailing lists.. anyways, I just wanna ask if Open grid scheduler works on openSUSE? I don't know, but you can expect http://arc.liv.ac.uk/downloads/SGE/releases/8.1.3/ to work, although I haven't tried the latest

[gridengine users] SGE 8.1.3 available

2013-02-27 Thread Dave Love
Son of Grid Engine 8.1.3 is available, not before time, from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.3/. The changes are mainly bug fixes, plus support for packaging as a Debian add-on and a few enhancements pulled in by dependencies. Building on MS Windows is now tested and documented

Re: [gridengine users] How redirect terminal output to the file?

2013-01-28 Thread Dave Love
William Hay w@ucl.ac.uk writes: On 25 January 2013 09:56, Semion Chernin s...@bgu.ac.il wrote: How I can help to this user? The trouble is that for some reason the program I am running will output to the console while run directly on a node, but while submitted as a job - will not

Re: [gridengine users] automatically use of scratch area on compute node

2013-01-28 Thread Dave Love
Stefano Bridi stefano.br...@gmail.com writes: Hi all, is there a way to use the scratch area (local disk) on the compute node in a transparent way from the submitted script point of view? What I want to do is to copy to and from the compute node scratch area the job data using the

Re: [gridengine users] automatically use of scratch area on compute node

2013-01-28 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: On 25.01.2013 at 18:37, Tina Friedrich wrote: Don't know if that helps you, but remember that SGE creates a temporary directory (something like /tmp/#JOB_NO.#TASK_NO.QUEUE). Available to submitted scripts as $TMPDIR. I believe it even cleans it up
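A job script can use that per-job directory for staging. This is a rough sketch, not a tested recipe from the thread: the file names input.dat and output.dat, the function name `run_in_scratch`, and the use of `tr` as a stand-in for the real work are all hypothetical.

```shell
#!/bin/sh
# Hypothetical staging helper: copy input to the per-job scratch
# directory, work there, then copy the result back.
run_in_scratch() {
    src_dir=$1                  # directory holding input.dat (placeholder name)
    scratch=${TMPDIR:-/tmp}     # $TMPDIR as set for the job; /tmp if unset
    cp "$src_dir/input.dat" "$scratch/" || return 1
    # Stand-in for the real computation: upper-case the input.
    ( cd "$scratch" && tr 'a-z' 'A-Z' < input.dat > output.dat ) || return 1
    # Copy the result back before the scratch directory goes away.
    cp "$scratch/output.dat" "$src_dir/"
}
```

In a submitted script the call might be `run_in_scratch "${SGE_O_WORKDIR:-$PWD}"`, so that the input is taken from the directory the job was submitted from.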

Re: [gridengine users] qmake: best practices, etc....

2013-01-28 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: I copied the `qmake` out of http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/gridengine-8.1.2-1.el5.x86_64.rpm and used it under SGE 6.2u5 and the error is gone. There's then https://arc.liv.ac.uk/trac/SGE/changeset/4265/sge, though, which needs a

Re: [gridengine users] Error 137 - trying to figure out what it means.

2013-01-28 Thread Dave Love
Jake Carroll jake.carr...@uq.edu.au writes: We figured it out! Specific user binary was not respecting vf memory complex and decided to use all the RAM on random nodes it landed on! There's nothing to respect. If you use vf, it's only relevant to scheduling jobs, not memory usage while
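The difference between a scheduling request and an enforced limit can be illustrated without SGE. The sketch below sets an address-space limit in a subshell and reads it back with `ulimit -v`, the same check used in the h_vmem thread in this listing. The function name `limit_seen` is hypothetical, and treating h_vmem as the source of such a limit is an assumption drawn from that thread, not something this message states.

```shell
#!/bin/sh
# Hypothetical helper: apply an address-space limit of $1 KiB inside a
# subshell and print the limit the shell then reports. The subshell keeps
# the limit from affecting the calling shell.
limit_seen() {
    ( ulimit -v "$1" && ulimit -v )
}
```

Calling `limit_seen 262144` (256M expressed in KiB) reports the limit a process would actually run under; a consumable such as vf sets no limit of this kind.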

Re: [gridengine users] A really really newbie

2013-01-28 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: This is unusual and looks like someone tried to cleanup the location of configuration files by good intention, but didn't make it in a nice way. Maybe the best would be to ask on the crosswords mailing list. The tutorials I could point you to are not

Re: [gridengine users] qmaster segmentation fault

2013-01-24 Thread Dave Love
Txema Heredia Genestar txema.here...@upf.edu writes: Thanks for the answer Dave. I have tried your old rpms and it didn't work. The problem still persists and the qmaster crashes. Then it's a different problem. I'll see if I can schedule a maintenance stop to upgrade the system to a newer

Re: [gridengine users] PE offers 0 slots (conflict in pe_list w. hostgroups?)

2013-01-24 Thread Dave Love
[Excuse any duplicates -- I'm not sure if gridengine.org is tits-up again as well as our mail hub sulking at my laptop.] Reuti re...@staff.uni-marburg.de writes: I think that's an old version. Suggestions are welcome for any improvements to the current one, which I tried to tidy up (from

Re: [gridengine users] fair-share: decay factor and usage_weight_list

2013-01-24 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes: You can only reset it in `qmon` completely. What is missing on qconf compared with qmon? (I think I've only used qmon the few times I've changed it, but it's a bug if you can't do the same with qconf.) I missed the -clearusage for `qconf`. So it's
