Re: [gridengine users] Tightly integrated parallel environment - Cleanly stopping qrsh -inherit sub-processes

2012-08-28 Thread Julien Nicoulaud
The FORBID_APPERROR parameter seems to be specific to applications returning 100. My concern was about a random slave process crashing in the middle of the run, but I realize after some testing that you really have to explicitly send a signal to the qrsh process to trigger task failure detection.

[gridengine users] Dispatching job over grid nodes

2012-08-28 Thread Lionel SPINELLI
Hello all, I need help from experts on how to configure my grid so that jobs are dispatched to the nodes that are most free. The jobs on my grid are launched in a PE using the $pe_slots rule. Each time a user submits a job, he uses the -pe option to indicate the number of slots to associate to

Re: [gridengine users] Dispatching job over grid nodes

2012-08-28 Thread Mazouzi
Hi, To enable *use least used host first* we run *qconf -msconf* and set queue_sort_method to load and load_formula to -slots. Source: http://wiki.gridengine.info/wiki/index.php/StephansBlog Regards, On Tue, Aug 28, 2012 at 4:55 PM, Lionel SPINELLI spine...@ciml.univ-mrs.fr wrote: Hello
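
As a rough illustration of the setting described in the message above, the two relevant lines in the scheduler configuration opened by qconf -msconf would look something like the fragment below. Only the two values named in the message are shown; every other parameter in the editor stays at whatever the site already uses, and the column alignment is cosmetic.

```
queue_sort_method                 load
load_formula                      -slots
```

Running qconf -ssconf afterwards prints the active scheduler configuration, which is a quick way to confirm that the change was saved.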

Re: [gridengine users] Dispatching job over grid nodes

2012-08-28 Thread Reuti
Hi, On 28.08.2012 at 17:16, Mazouzi wrote: To enable use least used host first we configure: qconf -msconf and set queue_sort_method load and load_formula -slots. source: http://wiki.gridengine.info/wiki/index.php/StephansBlog this is one way. The other option is to add some artificial

Re: [gridengine users] Tightly integrated parallel environment - Cleanly stopping qrsh -inherit sub-processes

2012-08-28 Thread Reuti
On 28.08.2012 at 11:48, Julien Nicoulaud wrote: The FORBID_APPERROR parameter seems to be specific to applications returning 100. My concern was about a random slave process crashing in the middle of the run, Your application is fault-tolerant in such a way that the other processes

[gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Dave Love
SGE 8.1.2 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large superset of the freely available Grid Engine features and fixes from other sources, specifically ~800 changes since 6.2u5 (Sun's last release). Currently binaries are available as RPMs for Red Hat-ish 5

Re: [gridengine users] Tightly integrated parallel environment - Cleanly stopping qrsh -inherit sub-processes

2012-08-28 Thread Julien Nicoulaud
Yes, exactly that. 2012/8/28 Reuti re...@staff.uni-marburg.de On 28.08.2012 at 11:48, Julien Nicoulaud wrote: The FORBID_APPERROR parameter seems to be specific to applications returning 100. My concern was about a random slave process crashing in the middle of the run, Your

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Thanks Dave. We just discovered that we cannot request nodes with -l mem_free=xxx. We are on 8.1.1. Does this new release fix this? Joseph On 08/28/2012 09:57 AM, Dave Love wrote: SGE 8.1.2 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large superset of the

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Reuti
Do you get an error when you try to do so? -- Reuti On 28.08.2012 at 23:20, Joseph Farran jfar...@uci.edu wrote: Thanks Dave. We just discovered that we cannot request nodes with -l mem_free=xxx. We are on 8.1.1. Does this new release fix this? Joseph On 08/28/2012 09:57 AM,

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
I don't use it, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free|fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G hl:mem_free=499.143G hl:mem_free=498.959G hl:mem_free=499.198G $ qrsh -q bio

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Reuti
On 28.08.2012 at 23:57, Joseph Farran wrote: I don't use it, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free|fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G hl:mem_free=499.143G

Re: [gridengine users] sge inspect

2012-08-28 Thread Dave Love
Chakravarthy Girda cha...@girada.com writes: Hi, I am looking for something like Platform-RTM, to configure the cluster and also to maintain history for the overall cluster and individual nodes. To my knowledge qmon is just an admin tool. Yes on qmon (though it can also submit jobs). I

Re: [gridengine users] SoGE Upgrade Method

2012-08-28 Thread Dave Love
Smith, David [EESUS] dsmit...@its.jnj.com writes: Yes, unzipping. I believe it was because, for whatever reason, the libraries needed for Berkeley spooling were not present in the zip files after the March release if memory serves. They previously had been. I don't understand that, if you

Re: [gridengine users] Verifying behavior of max_reservations

2012-08-28 Thread Dave Love
Brian Smith b...@mail.usf.edu writes: Hi, Dave, I'm mostly trying to verify the behavior of max_reservations as the clarity of the man page is a little lacking. No great surprise... I can try to clarify it for things I understand -- maybe after Reuti explains. I don't know whether the

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Dave Love
Joseph Farran jfar...@uci.edu writes: I don't use it, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free|fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G hl:mem_free=499.143G hl:mem_free=498.959G

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Reuti
On 29.08.2012 at 00:48, Dave Love wrote: Joseph Farran jfar...@uci.edu writes: I don't use it, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free|fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Hi Reuti. Here it is with the additional info: $ qrsh -w v -q bio -l mem_free=190G Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue bio@compute-2-7.local because job requests unknown resource (mem_free) Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue