Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-18 Thread Reuti
On 18.11.2010 at 11:57, Terry Dontje wrote: > Yes, I believe this solves the mystery. In short OGE and ORTE both work. In > the linear:1 case the job is exiting because there are not enough resources > for the orte binding to work, which actually makes sense. In the linear:2 > case I think

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-18 Thread Terry Dontje
Yes, I believe this solves the mystery. In short OGE and ORTE both work. In the linear:1 case the job is exiting because there are not enough resources for the orte binding to work, which actually makes sense. In the linear:2 case I think we've proven that we are binding to the right amount

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
More than OGE uses external bindings. We have tested it using some tricks, and in environments where binding is available from the RM (e.g., slurm). So we know the basic code works. Whether or not it works with OGE is another matter. On Wed, Nov 17, 2010 at 9:09 AM, Terry Dontje

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje
On 11/17/2010 10:48 AM, Ralph Castain wrote: No problem at all. I confess that I am lost in all the sometimes disjointed emails in this thread. Frankly, now that I search, I can't find it either! :-( I see one email that clearly shows the external binding report from mpirun, but not from any

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
No problem at all. I confess that I am lost in all the sometimes disjointed emails in this thread. Frankly, now that I search, I can't find it either! :-( I see one email that clearly shows the external binding report from mpirun, but not from any daemons. I see another email (after you asked if

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje
On 11/17/2010 10:00 AM, Ralph Castain wrote: --leave-session-attached is always required if you want to see output from the daemons. Otherwise, the launcher closes the ssh session (or qrsh session, in this case) as part of its normal operating procedure, thus terminating the stdout/err

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
--leave-session-attached is always required if you want to see output from the daemons. Otherwise, the launcher closes the ssh session (or qrsh session, in this case) as part of its normal operating procedure, thus terminating the stdout/err channel. On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje
On 11/17/2010 09:32 AM, Ralph Castain wrote: Chris' output is coming solely from the HNP, which is correct given the way things were executed. My comment was from another email where he did what I asked, which was to include the flags: --report-bindings --leave-session-attached so we could

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
Chris' output is coming solely from the HNP, which is correct given the way things were executed. My comment was from another email where he did what I asked, which was to include the flags: --report-bindings --leave-session-attached so we could see the output from each orted. In that email, it

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje
On 11/17/2010 07:41 AM, Chris Jewell wrote: On 17 Nov 2010, at 11:56, Terry Dontje wrote: You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Chris Jewell
On 17 Nov 2010, at 11:56, Terry Dontje wrote: >> >> You are absolutely correct, Terry, and the 1.4 release series does include >> the proper code. The point here, though, is that SGE binds the orted to a >> single core, even though other cores are also allocated. So the orted >> detects an

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje
On 11/16/2010 08:24 PM, Ralph Castain wrote: On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje wrote: On 11/16/2010 01:31 PM, Reuti wrote: Hi Ralph, On 16.11.2010 at 15:40, Ralph Castain wrote: 2. have SGE bind

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Daniel Gruber
Hi, I'm interested in what is expected from OGE/SGE in order to support most of your scenarios. First of all, the "-binding pe" request is not flexible and only makes sense in scenarios where the architecture is the same on each host, each involved host is used exclusively for the job (SGE

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje wrote: > On 11/16/2010 01:31 PM, Reuti wrote: > > Hi Ralph, > > On 16.11.2010 at 15:40, Ralph Castain wrote: > > > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE > does this automatically to

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje
On 11/16/2010 01:31 PM, Reuti wrote: Hi Ralph, On 16.11.2010 at 15:40, Ralph Castain wrote: 2. have SGE bind procs it launches to -all- of those cores. I believe SGE does this automatically to constrain the procs to running on only those cores. This is another "bug/feature" in SGE: it's a

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi Ralph, On 16.11.2010 at 15:40, Ralph Castain wrote: > > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE > > does this automatically to constrain the procs to running on only those > > cores. > > This is another "bug/feature" in SGE: it's a matter of discussion,

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
On 16 Nov 2010, at 17:25, Terry Dontje wrote: >>> >> Sure. Here's the stderr of a job submitted to my cluster with 'qsub -pe >> mpi 8 -binding linear:2 myScript.com' where myScript.com runs 'mpirun -mca >> ras_gridengine_verbose 100 --report-bindings ./unterm': >> >> [exec4:17384] System

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje
On 11/16/2010 12:13 PM, Chris Jewell wrote: On 16 Nov 2010, at 14:26, Terry Dontje wrote: In the original case of 7 nodes and processes if we do -binding pe linear:2, and add the -bind-to-core to mpirun I'd actually expect 6 of the nodes' processes to bind to one core and the 7th node with 2

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
On 16 Nov 2010, at 14:26, Terry Dontje wrote: > > In the original case of 7 nodes and processes if we do -binding pe linear:2, > and add the -bind-to-core to mpirun I'd actually expect 6 of the nodes' > processes to bind to one core and the 7th node with 2 processes to have each of > those

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje
On 11/16/2010 10:59 AM, Reuti wrote: On 16.11.2010 at 15:26, Terry Dontje wrote: 1. allocate a specified number of cores on each node to your job this is currently the bug in the "slot <=> core" relation in SGE, which has to be removed, updated or clarified. For now slot and core count

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
On 16.11.2010 at 15:26, Terry Dontje wrote: >>> >>> 1. allocate a specified number of cores on each node to your job >>> >> this is currently the bug in the "slot <=> core" relation in SGE, which has >> to be removed, updated or clarified. For now slot and core count are out of >> sync

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Hi Reuti > > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE > does this automatically to constrain the procs to running on only those > cores. > > This is another "bug/feature" in SGE: it's a matter of discussion, whether > the shepherd should get exactly one core (in

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje
On 11/16/2010 09:08 AM, Reuti wrote: Hi, On 16.11.2010 at 14:07, Ralph Castain wrote: Perhaps I'm missing it, but it seems to me that the real problem lies in the interaction between SGE and OMPI during OMPI's two-phase launch. The verbose output shows that SGE dutifully allocated the

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi, On 16.11.2010 at 14:07, Ralph Castain wrote: > Perhaps I'm missing it, but it seems to me that the real problem lies in the > interaction between SGE and OMPI during OMPI's two-phase launch. The verbose > output shows that SGE dutifully allocated the requested number of cores on > each

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Perhaps I'm missing it, but it seems to me that the real problem lies in the interaction between SGE and OMPI during OMPI's two-phase launch. The verbose output shows that SGE dutifully allocated the requested number of cores on each node. However, OMPI launches only one process on each node (the

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
On 16.11.2010 at 10:26, Chris Jewell wrote: > Hi all, > >> On 11/15/2010 02:11 PM, Reuti wrote: >>> Just to give my understanding of the problem: >> Sorry, I am still trying to grok all your email as to what problem you >> are trying to solve. So is the issue trying to have

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje
On 11/16/2010 04:26 AM, Chris Jewell wrote: Hi all, On 11/15/2010 02:11 PM, Reuti wrote: Just to give my understanding of the problem: Sorry, I am still trying to grok all your email as to what problem you are trying to solve. So is the issue trying to have two jobs having processes on

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
Hi all, > On 11/15/2010 02:11 PM, Reuti wrote: >> Just to give my understanding of the problem: >>> > Sorry, I am still trying to grok all your email as to what problem you > are trying to solve. So is the issue trying to have two jobs having > processes on the same node be

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Correction: On 15.11.2010 at 20:23, Terry Dontje wrote: > On 11/15/2010 02:11 PM, Reuti wrote: >> Just to give my understanding of the problem: >> >> On 15.11.2010 at 19:57, Terry Dontje wrote: >> >> >>> On 11/15/2010 11:08 AM, Chris Jewell wrote: >>> > Sorry, I am still trying to grok

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
On 15.11.2010 at 20:23, Terry Dontje wrote: > >>> Is your complaint really the fact that exec6 has been allocated two slots >>> but there seems to only be one slot worth of resources allocated >>> >> All are wrong except exec6. They should only get one core assigned. >> >> > Huh? I would

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje
On 11/15/2010 02:11 PM, Reuti wrote: Just to give my understanding of the problem: On 15.11.2010 at 19:57, Terry Dontje wrote: On 11/15/2010 11:08 AM, Chris Jewell wrote: Sorry, I am still trying to grok all your email as to what problem you are trying to solve. So is the issue trying

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Just to give my understanding of the problem: On 15.11.2010 at 19:57, Terry Dontje wrote: > On 11/15/2010 11:08 AM, Chris Jewell wrote: >>> Sorry, I am still trying to grok all your email as to what problem you >>> are trying to solve. So is the issue trying to have two jobs having >>>

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje
On 11/15/2010 11:08 AM, Chris Jewell wrote: Sorry, I am still trying to grok all your email as to what problem you are trying to solve. So is the issue trying to have two jobs having processes on the same node be able to bind their processes on different resources. Like core 1 for the first

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Hi, On 15.11.2010 at 17:06, Chris Jewell wrote: > Hi Ralph, > > Thanks for the tip. With the command > > $ qsub -pe mpi 8 -binding linear:1 myScript.com > > I get the output > > [exec6:29172] System has detected external process binding to cores 0008 > [exec6:29172] ras:gridengine: JOB_ID:

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> Sorry, I am still trying to grok all your email as to what problem you > are trying to solve. So is the issue trying to have two jobs having > processes on the same node be able to bind their processes on different > resources. Like core 1 for the first job and core 2 and 3 for the 2nd

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Ralph, Thanks for the tip. With the command $ qsub -pe mpi 8 -binding linear:1 myScript.com I get the output [exec6:29172] System has detected external process binding to cores 0008 [exec6:29172] ras:gridengine: JOB_ID: 59282 [exec6:29172] ras:gridengine: PE_HOSTFILE:
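A small sketch for reading the "external process binding to cores 0008" report quoted above. It rests on an assumption that the thread does not confirm: that the reported value is a hexadecimal bitmask in which bit i set means "bound to core i". The helper name mask_to_cores is hypothetical and is not part of Open MPI or SGE.

```shell
# ASSUMPTION: the value printed after "binding to cores" is a hexadecimal
# bitmask where bit i set means the process may run on core i.
# mask_to_cores is a hypothetical helper, not an Open MPI or SGE tool.
mask_to_cores() {
    val=$((0x$1))      # interpret the argument as a hex number
    core=0             # index of the bit currently being examined
    out=""             # space-separated list of core indices found so far
    while [ "$val" -gt 0 ]; do
        if [ $((val & 1)) -eq 1 ]; then
            out="${out:+$out }$core"
        fi
        val=$((val >> 1))
        core=$((core + 1))
    done
    printf '%s\n' "$out"
}

mask_to_cores 0008
mask_to_cores 0003
```

Under that assumption, a mask with a single bit set corresponds to a single bound core, which would be consistent with the -binding linear:1 request in the qsub command above.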

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Ralph Castain
The external binding code should be in that version. If you add --report-bindings --leave-session-attached to the mpirun command line, you should see output from each daemon telling you what external binding it detected, and how it is binding each app it launches. Thanks! On Mon, Nov 15, 2010
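A minimal sketch of a job script that follows the advice above. It assumes the parallel environment name "mpi", the 8 slots, the linear:2 binding request and the ./unterm binary used in other messages of this thread. The function emit_job_script is hypothetical and only prints the script text, so nothing is submitted or launched; the "#$" lines rely on SGE's convention of reading qsub options embedded in the script.

```shell
# Hypothetical helper: print (do not submit) a job script that asks each
# daemon to report the external binding it detected.
emit_job_script() {
    cat <<'EOF'
#!/bin/sh
# Values below are taken from other messages in this thread; adjust for
# your own cluster.
#$ -pe mpi 8
#$ -binding linear:2
# --leave-session-attached keeps the daemons' stdout/stderr channel open;
# --report-bindings asks for the binding report.
mpirun --report-bindings --leave-session-attached ./unterm
EOF
}

emit_job_script
```

The printed text could be saved to a file such as myScript.com and passed to qsub; whether the binding report then appears for every daemon is exactly what this thread is investigating.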

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> I confess I am now confused. What version of OMPI are you using? > > FWIW: OMPI was updated at some point to detect the actual cores of an > external binding, and abide by them. If we aren't doing that, then we have a > bug that needs to be resolved. Or it could be you are using a version

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje
Sorry, I am still trying to grok all your email as what the problem you are trying to solve. So is the issue is trying to have two jobs having processes on the same node be able to bind there processes on different resources. Like core 1 for the first job and core 2 and 3 for the 2nd job?

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
On 15.11.2010 at 15:29, Chris Jewell wrote: > Hi, > >>> If, indeed, it is not possible currently to implement this type of >>> core-binding in tightly integrated OpenMPI/GE, then a solution might lie in >>> a custom script run in the parallel environment's 'start proc args'. This >>> script

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi, > > If, indeed, it is not possible currently to implement this type of > > core-binding in tightly integrated OpenMPI/GE, then a solution might lie in > > a custom script run in the parallel environment's 'start proc args'. This > > script would have to find out which slots are allocated

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Ralph Castain
I confess I am now confused. What version of OMPI are you using? FWIW: OMPI was updated at some point to detect the actual cores of an external binding, and abide by them. If we aren't doing that, then we have a bug that needs to be resolved. Or it could be you are using a version that predates

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Hi, On 15.11.2010 at 13:13, Chris Jewell wrote: > Okay so I tried what you suggested. You essentially get the requested number > of bound cores on each execution node, so if I use > > $ qsub -pe openmpi 8 -binding linear:2 > > then I get 2 bound cores per node, irrespective of the number

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Reuti, Okay so I tried what you suggested. You essentially get the requested number of bound cores on each execution node, so if I use $ qsub -pe openmpi 8 -binding linear:2 then I get 2 bound cores per node, irrespective of the number of slots (and hence parallel processes) allocated by

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-13 Thread Chris Jewell
Hi Dave, Reuti, Sorry for kicking off this thread, and then disappearing. I've been away for a bit. Anyway, Dave, I'm glad you experienced the same issue as I had with my installation of SGE 6.2u5 and OpenMPI with core binding -- namely that with 'qsub -pe openmpi 8 -binding set linear:1 ',

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-14 Thread Reuti
Hi, On 14.10.2010 at 13:23, Dave Love wrote: > Reuti writes: > >> With the default binding_instance set to "set" (the default) the >> shepherd should bind the processes to cores already. With other types >> of binding_instance these selected cores must be forward

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-14 Thread Dave Love
Reuti writes: > With the default binding_instance set to "set" (the default) the > shepherd should bind the processes to cores already. With other types > of binding_instance these selected cores must be forwarded to the > application via an environment variable or in

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-12 Thread Reuti
On 12.10.2010 at 15:49, Dave Love wrote: > Chris Jewell writes: > >> I've scrapped this system now in favour of the new SGE core binding feature. > > How does that work, exactly? I thought the OMPI SGE integration didn't > support core binding, but good if it

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-12 Thread Dave Love
Chris Jewell writes: > I've scrapped this system now in favour of the new SGE core binding feature. How does that work, exactly? I thought the OMPI SGE integration didn't support core binding, but good if it does.

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-05 Thread Chris Jewell
> > It looks to me like your remote nodes aren't finding the orted executable. I > suspect the problem is that you need to forward the path and ld_library_path > to the remote nodes. Use the mpirun -x option to do so. Hi, problem sorted. It was actually caused by the system I currently use

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Ralph Castain
It looks to me like your remote nodes aren't finding the orted executable. I suspect the problem is that you need to forward the path and ld_library_path to the remote nodes. Use the mpirun -x option to do so. On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote: > Hi all, > > Firstly, hello to
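The suggestion above can be sketched as a command line. This is only an illustration under stated assumptions: the helper name build_mpirun_cmd is hypothetical, ./unterm is the binary named elsewhere in this thread, and the function prints the command instead of running it. "-x NAME" is the mpirun option for exporting the named environment variable from the submitting shell.

```shell
# Hypothetical helper: assemble (but do not run) an mpirun command line
# that forwards PATH and LD_LIBRARY_PATH with -x, as suggested above.
build_mpirun_cmd() {
    # $1: program to launch (an assumption; substitute your own binary)
    printf 'mpirun -x PATH -x LD_LIBRARY_PATH %s\n' "$1"
}

build_mpirun_cmd ./unterm
```

As the follow-up message in this thread shows, the original poster's failure turned out to have a different cause, so treat this as a diagnostic step rather than a guaranteed fix.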

[OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Chris Jewell
Hi all, Firstly, hello to the mailing list for the first time! Secondly, sorry for the nondescript subject line, but I couldn't really think how to be more specific! Anyway, I am currently having a problem getting OpenMPI to work within my installation of SGE 6.2u5. I compiled OpenMPI