Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-18 Thread Reuti
Am 18.11.2010 um 11:57 schrieb Terry Dontje:

> Yes, I believe this solves the mystery.  In short OGE and ORTE both work.  In 
> the linear:1 case the job is exiting because there are not enough resources 
> for the orte binding to work, which actually makes sense.  In the linear:2 
> case I think we've proven that we are binding to the right amount of 
> resources and to the correct physical resources at the process level.  
> 
> In the case you do not pass bind-to-core to mpirun with a qsub using 
> linear:2 the processes on the same node will actually bind to the same two 
> cores.  The only way to determine this is to run something that prints out 
> the binding from the system.  There is no way to do this via OMPI because it 
> only reports binding when you are requesting mpirun to do some type of 
> binding (like -bind-to-core or -bind-to-socket).
> 
> In the linear:1 case with no binding I think you are having the processes on 
> the same node run on the same core, which is exactly what you are asking 
> for, I believe.
> 
> So I believe we understand what is going on with the binding and it makes 
> sense to me.  As far as the allocation issue of slots vs. cores and trying to 
> not overallocate cores, I believe the new allocation rule makes sense to do, but 
> I'll let you hash that out with Daniel.  

I still vote for a flag "limit_to_one_qrsh_per_host true/false" in the PE 
definition which a) checks whether any attempt is made to make a second `qrsh 
-inherit ...` to one and the same node (similar to "job_is_first_task", which 
allows or denies a local `qrsh -inherit ...`), and b) as a side effect allocates 
*all* granted cores to the one and only shepherd started on that node.

And a second flag, "limit_cores_by_slot_count true/false", instead of new 
allocation_rules. Choosing $fill_up, $round_robin or another allocation rule is 
independent of limiting it, IMO.
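
For illustration only, a PE definition with such flags might look like the
sketch below (in `qconf -sp` style; the first fields are standard SGE, while
the last two entries are only the flags proposed in this thread and do not
exist in any released SGE/OGE):

   pe_name            mpi
   slots              999
   allocation_rule    $fill_up
   control_slaves     TRUE
   job_is_first_task  FALSE
   limit_to_one_qrsh_per_host   TRUE    (proposed, not implemented)
   limit_cores_by_slot_count    TRUE    (proposed, not implemented)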

-- Reuti


> In summary I don't believe there are any OMPI bugs related to what we've seen 
> and the OGE issue is just the allocation issue, right?
> 
> --td
> 
> 
> On 11/18/2010 01:32 AM, Chris Jewell wrote:
>>>> Perhaps if someone could run this test again with --report-bindings 
>>>> --leave-session-attached and provide -all- output we could verify that 
>>>> analysis and clear up the confusion?
>>>> 
>>> 
>>> Yeah, however I bet you we still won't see output.
>>> 
>> Actually, it seems we do get more output!  Results of 'qsub -pe mpi 8 
>> -binding linear:2 myScript.com'
>> 
>> with
>> 
>> 'mpirun -mca ras_gridengine_verbose 100 -report-bindings 
>> --leave-session-attached -bycore -bind-to-core ./unterm'
>> 
>> [exec1:06504] System has detected external process binding to cores 0028
>> [exec1:06504] ras:gridengine: JOB_ID: 59467
>> [exec1:06504] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
>> [exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to 
>> cpus 0008
>> [exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to 
>> cpus 0020
>> [exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to 
>> cpus 0008
>> [exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to 
>> cpus 0001
>> [exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to 
>> cpus 0001
>> [exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to 
>> cpus 0002
>> [exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to 
>> cpus 0001
>> [exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to 
>> cpus 0001
>> 
>> AHHA!  Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1 
>> myScript.com' with the above mpirun command:
>> 
>> [exec1:06552] System has detected external process binding to cores 0020
>> [exec1:06552] ras:gridengine: JOB_ID: 59468
>> [exec1:06552] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
>> [exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec1:06552] ras:gridengine: 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-18 Thread Terry Dontje
Yes, I believe this solves the mystery.  In short OGE and ORTE both 
work.  In the linear:1 case the job is exiting because there are not 
enough resources for the orte binding to work, which actually makes 
sense.  In the linear:2 case I think we've proven that we are binding to 
the right amount of resources and to the correct physical resources at 
the process level.


In the case you do not pass bind-to-core to mpirun with a qsub using 
linear:2 the processes on the same node will actually bind to the same 
two cores.  The only way to determine this is to run something that 
prints out the binding from the system.  There is no way to do this via 
OMPI because it only reports binding when you are requesting mpirun to 
do some type of binding (like -bind-to-core or -bind-to-socket).
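
(As a concrete illustration of "something that prints out the binding from the
system" - a minimal sketch, assuming a Linux /proc filesystem and that this
Open MPI version exports OMPI_COMM_WORLD_RANK to its launched processes:

  mpirun sh -c \
    'echo "$(hostname) rank $OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"'

Each output line then shows the core list the kernel actually allows that rank
to run on, independently of what mpirun itself reports.)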


In the linear:1 case with no binding I think you are having the 
processes on the same node run on the same core, which is exactly what 
you are asking for, I believe.


So I believe we understand what is going on with the binding and it 
makes sense to me.  As far as the allocation issue of slots vs. cores 
and trying to not overallocate cores, I believe the new allocation rule 
makes sense to do, but I'll let you hash that out with Daniel.


In summary I don't believe there are any OMPI bugs related to what we've 
seen and the OGE issue is just the allocation issue, right?


--td


On 11/18/2010 01:32 AM, Chris Jewell wrote:

Perhaps if someone could run this test again with --report-bindings 
--leave-session-attached and provide -all- output we could verify that analysis 
and clear up the confusion?


Yeah, however I bet you we still won't see output.

Actually, it seems we do get more output!  Results of 'qsub -pe mpi 8 -binding 
linear:2 myScript.com'

with

'mpirun -mca ras_gridengine_verbose 100 -report-bindings 
--leave-session-attached -bycore -bind-to-core ./unterm'

[exec1:06504] System has detected external process binding to cores 0028
[exec1:06504] ras:gridengine: JOB_ID: 59467
[exec1:06504] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
[exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to 
cpus 0008
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to 
cpus 0020
[exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to 
cpus 0008
[exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to 
cpus 0001
[exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to 
cpus 0001
[exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to 
cpus 0002
[exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to 
cpus 0001
[exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to 
cpus 0001

AHHA!  Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1 
myScript.com' with the above mpirun command:

[exec1:06552] System has detected external process binding to cores 0020
[exec1:06552] ras:gridengine: JOB_ID: 59468
[exec1:06552] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
[exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
--
mpirun was unable to start the specified application as it encountered an error:

Error name: Unknown error: 1
Node: exec1

when attempting to start process rank 0.
--
[exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to 
cpus 0020
--
Not enough processors were found on the local host to meet the requested
binding action:

   Local host:exec1
   Action requested:  bind-to-core
   Application name:  

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
OGE is not the only environment that uses external bindings. We have tested it
using some tricks, and in environments where binding is available from the RM
(e.g., Slurm). So we know the basic code works.

Whether or not it works with OGE is another matter.


On Wed, Nov 17, 2010 at 9:09 AM, Terry Dontje wrote:

>  On 11/17/2010 10:48 AM, Ralph Castain wrote:
>
> No problem at all. I confess that I am lost in all the sometimes disjointed
> emails in this thread. Frankly, now that I search, I can't find it either!
> :-(
>
>  I see one email that clearly shows the external binding report from
> mpirun, but not from any daemons. I see another email (after you asked if
> there was all the output) that states "yep", indicating that was all the
> output, and then proceeds to offer additional output that wasn't in the
> original email you asked about!
>
>  So I am now as thoroughly confused as you are...
>
>  That said, I am confident in the code in ORTE as it has worked correctly
> when I tested it against external bindings in other environments. So I
> really do believe this is an OGE issue where the orted isn't getting
> correctly bound against all allocated cores.
>
>  I am confused by your statement above because we don't even know what is
> being bound or not.  We know that it looks like the hnp is bound to 2
> cores which is what we asked for but we don't know what any of the processes
> themselves are bound to.   So I personally cannot point to ORTE or OGE as
> the culprit because I don't think we know whether there is an issue.
>
> So, until we are able to get the -report-bindings output from the a.out
> code (note I did not say orted) it is kind of hard to claim there is even an
> issue.  Which brings me back to the output question.  After some thinking
> the --report-bindings output I am expecting is not from the orted itself but
> from the a.out before it executes the user code.   Which now makes me wonder
> if there is some odd OGE/OMPI integration issue in which the -bind-to-core
> -report-bindings options are not being propagated/recognized/honored when
> qsub is given the -binding option.
>
>
>  Perhaps if someone could run this test again with --report-bindings
> --leave-session-attached and provide -all- output we could verify that
> analysis and clear up the confusion?
>
>  Yeah, however I bet you we still won't see output.
>
>
> --td
>
>
>
> On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje wrote:
>
>>  On 11/17/2010 10:00 AM, Ralph Castain wrote:
>>
>> --leave-session-attached is always required if you want to see output from
>> the daemons. Otherwise, the launcher closes the ssh session (or qrsh
>> session, in this case) as part of its normal operating procedure, thus
>> terminating the stdout/err channel.
>>
>>
>>   I believe you but isn't it weird that without the --binding option to
>> qsub we saw -report-bindings output from the orteds?
>>
>> Do you have the date of the email that has the info you talked about
>> below.  I really am not trying to be an a-hole about this but there has
>> been so much data and email flying around; it would be nice to actually see
>> the output you mention.
>>
>> --td
>>
>>
>>  On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje 
>> wrote:
>>
>>>  On 11/17/2010 09:32 AM, Ralph Castain wrote:
>>>
>>> Chris' output is coming solely from the HNP, which is correct given the
>>> way things were executed. My comment was from another email where he did
>>> what I asked, which was to include the flags:
>>>
>>>  --report-bindings --leave-session-attached
>>>
>>>  so we could see the output from each orted. In that email, it was clear
>>> that while mpirun was bound to multiple cores, the orteds are being bound to
>>> a -single- core.
>>>
>>>  Hence the problem.
>>>
>>>   Hmm, I see Ralph's comment on 11/15 but I don't see any output that
>>> shows what Ralph says above.  The only report-bindings output I see is when
>>> he runs without OGE binding.   Can someone give me the date and time of
>>> Chris' email with the --report-bindings and --leave-session-attached.  Or a
>>> rerun of the below with the --leave-session-attached option would also help.
>>>
>>> I find it confusing that --leave-session-attached is not required when
>>> the OGE binding argument is not given.
>>>
>>> --td
>>>
>>>  HTH
>>> Ralph
>>>
>>>
>>> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje 
>>> wrote:
>>>
  On 11/17/2010 07:41 AM, Chris Jewell wrote:

 On 17 Nov 2010, at 11:56, Terry Dontje wrote:

  You are absolutely correct, Terry, and the 1.4 release series does 
 include the proper code. The point here, though, is that SGE binds the 
 orted to a single core, even though other cores are also allocated. So the 
 orted detects an external binding of one core, and binds all its children 
 to that same core.

  I do not think you are right here.  Chris sent the following which looks 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje

On 11/17/2010 10:48 AM, Ralph Castain wrote:
No problem at all. I confess that I am lost in all the sometimes 
disjointed emails in this thread. Frankly, now that I search, I can't 
find it either! :-(


I see one email that clearly shows the external binding report from 
mpirun, but not from any daemons. I see another email (after you asked 
if there was all the output) that states "yep", indicating that was 
all the output, and then proceeds to offer additional output that 
wasn't in the original email you asked about!


So I am now as thoroughly confused as you are...

That said, I am confident in the code in ORTE as it has worked 
correctly when I tested it against external bindings in other 
environments. So I really do believe this is an OGE issue where the 
orted isn't getting correctly bound against all allocated cores.


I am confused by your statement above because we don't even know what is 
being bound or not.  We know that it looks like the hnp is bound to 2 
cores, which is what we asked for, but we don't know what any of the 
processes themselves are bound to.  So I personally cannot point to 
ORTE or OGE as the culprit because I don't think we know whether there 
is an issue.


So, until we are able to get the -report-bindings output from the a.out 
code (note I did not say orted) it is kind of hard to claim there is 
even an issue.  Which brings me back to the output question.  After some 
thinking, the --report-bindings output I am expecting is not from the 
orted itself but from the a.out before it executes the user code.  
Which now makes me wonder if there is some odd OGE/OMPI integration 
issue in which the -bind-to-core and -report-bindings options are not being 
propagated/recognized/honored when qsub is given the -binding option.


Perhaps if someone could run this test again with --report-bindings 
--leave-session-attached and provide -all- output we could verify that 
analysis and clear up the confusion?



Yeah, however I bet you we still won't see output.

--td



On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje wrote:


On 11/17/2010 10:00 AM, Ralph Castain wrote:

--leave-session-attached is always required if you want to see
output from the daemons. Otherwise, the launcher closes the ssh
session (or qrsh session, in this case) as part of its normal
operating procedure, thus terminating the stdout/err channel.



I believe you but isn't it weird that without the --binding option
to qsub we saw -report-bindings output from the orteds?

Do you have the date of the email that has the info you talked
about below.  I really am not trying to be an a-hole about this
but there has been so much data and email flying around; it would
be nice to actually see the output you mention.

--td



On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje wrote:

On 11/17/2010 09:32 AM, Ralph Castain wrote:

Chris' output is coming solely from the HNP, which is correct
given the way things were executed. My comment was from
another email where he did what I asked, which was to
include the flags:

--report-bindings --leave-session-attached

so we could see the output from each orted. In that email,
it was clear that while mpirun was bound to multiple cores,
the orteds are being bound to a -single- core.

Hence the problem.


Hmm, I see Ralph's comment on 11/15 but I don't see any
output that shows what Ralph says above.  The only
report-bindings output I see is when he runs without OGE
binding.   Can someone give me the date and time of Chris'
email with the --report-bindings and
--leave-session-attached.  Or a rerun of the below with the
--leave-session-attached option would also help.

I find it confusing that --leave-session-attached is not
required when the OGE binding argument is not given.

--td


HTH
Ralph


On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:

On 11/17/2010 07:41 AM, Chris Jewell wrote:

On 17 Nov 2010, at 11:56, Terry Dontje wrote:

You are absolutely correct, Terry, and the 1.4 release series does 
include the proper code. The point here, though, is that SGE binds the orted to 
a single core, even though other cores are also allocated. So the orted detects 
an external binding of one core, and binds all its children to that same core.

I do not think you are right here.  Chris sent the following which looks like 
OGE (fka SGE) actually did bind the hnp to multiple cores.  However that message I 
believe is not coming from the processes themselves and actually is only shown by the 
hnp.  I wonder if Chris adds a 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
No problem at all. I confess that I am lost in all the sometimes disjointed
emails in this thread. Frankly, now that I search, I can't find it either!
:-(

I see one email that clearly shows the external binding report from mpirun,
but not from any daemons. I see another email (after you asked if there was
all the output) that states "yep", indicating that was all the output, and
then proceeds to offer additional output that wasn't in the original email
you asked about!

So I am now as thoroughly confused as you are...

That said, I am confident in the code in ORTE as it has worked correctly
when I tested it against external bindings in other environments. So I
really do believe this is an OGE issue where the orted isn't getting
correctly bound against all allocated cores.

Perhaps if someone could run this test again with --report-bindings
--leave-session-attached and provide -all- output we could verify that
analysis and clear up the confusion?



On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje wrote:

>  On 11/17/2010 10:00 AM, Ralph Castain wrote:
>
> --leave-session-attached is always required if you want to see output from
> the daemons. Otherwise, the launcher closes the ssh session (or qrsh
> session, in this case) as part of its normal operating procedure, thus
> terminating the stdout/err channel.
>
>
>  I believe you but isn't it weird that without the --binding option to
> qsub we saw -report-bindings output from the orteds?
>
> Do you have the date of the email that has the info you talked about
> below.  I really am not trying to be an a-hole about this but there has
> been so much data and email flying around; it would be nice to actually see
> the output you mention.
>
> --td
>
>
>  On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje wrote:
>
>>  On 11/17/2010 09:32 AM, Ralph Castain wrote:
>>
>> Chris' output is coming solely from the HNP, which is correct given the way
>> things were executed. My comment was from another email where he did what I
>> asked, which was to include the flags:
>>
>>  --report-bindings --leave-session-attached
>>
>>  so we could see the output from each orted. In that email, it was clear
>> that while mpirun was bound to multiple cores, the orteds are being bound to
>> a -single- core.
>>
>>  Hence the problem.
>>
>>   Hmm, I see Ralph's comment on 11/15 but I don't see any output that
>> shows what Ralph says above.  The only report-bindings output I see is when
>> he runs without OGE binding.   Can someone give me the date and time of
>> Chris' email with the --report-bindings and --leave-session-attached.  Or a
>> rerun of the below with the --leave-session-attached option would also help.
>>
>> I find it confusing that --leave-session-attached is not required when the
>> OGE binding argument is not given.
>>
>> --td
>>
>>  HTH
>> Ralph
>>
>>
>> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:
>>
>>>  On 11/17/2010 07:41 AM, Chris Jewell wrote:
>>>
>>> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>>>
>>>  You are absolutely correct, Terry, and the 1.4 release series does include 
>>> the proper code. The point here, though, is that SGE binds the orted to a 
>>> single core, even though other cores are also allocated. So the orted 
>>> detects an external binding of one core, and binds all its children to that 
>>> same core.
>>>
>>>  I do not think you are right here.  Chris sent the following which looks 
>>> like OGE (fka SGE) actually did bind the hnp to multiple cores.  However 
>>> that message I believe is not coming from the processes themselves and 
>>> actually is only shown by the hnp.  I wonder if Chris adds a 
>>> "-bind-to-core" option  we'll see more output from the a.out's before they 
>>> exec unterm?
>>>
>>>  As requested using
>>>
>>> $ qsub -pe mpi 8 -binding linear:2 myScript.com'
>>>
>>> and
>>>
>>> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
>>> -bind-to-core ./unterm'
>>>
>>> [exec5:06671] System has detected external process binding to cores 0028
>>> [exec5:06671] ras:gridengine: JOB_ID: 59434
>>> [exec5:06671] ras:gridengine: PE_HOSTFILE: 
>>> /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
>>> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=2
>>> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=2
>>> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=1
>>> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=1
>>> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=1
>>> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>>> slots=1
>>>
>>> No more info.  I note that the external binding is slightly different to 
>>> what I had before, but our cluster is busier today :-)
>>>
>>>
>>>  I would have expected more output.
>>>
>>> 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje

On 11/17/2010 10:00 AM, Ralph Castain wrote:
--leave-session-attached is always required if you want to see output 
from the daemons. Otherwise, the launcher closes the ssh session (or 
qrsh session, in this case) as part of its normal operating procedure, 
thus terminating the stdout/err channel.



I believe you but isn't it weird that without the --binding option to 
qsub we saw -report-bindings output from the orteds?


Do you have the date of the email that has the info you talked about 
below.  I really am not trying to be an a-hole about this but there has 
been so much data and email flying around; it would be nice to actually 
see the output you mention.


--td

On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje wrote:


On 11/17/2010 09:32 AM, Ralph Castain wrote:

Chris' output is coming solely from the HNP, which is correct
given the way things were executed. My comment was from another
email where he did what I asked, which was to include the flags:

--report-bindings --leave-session-attached

so we could see the output from each orted. In that email, it was
clear that while mpirun was bound to multiple cores, the orteds
are being bound to a -single- core.

Hence the problem.


Hmm, I see Ralph's comment on 11/15 but I don't see any output
that shows what Ralph says above.  The only report-bindings
output I see is when he runs without OGE binding.   Can someone
give me the date and time of Chris' email with the
--report-bindings and --leave-session-attached.  Or a rerun of the
below with the --leave-session-attached option would also help.

I find it confusing that --leave-session-attached is not required
when the OGE binding argument is not given.

--td


HTH
Ralph


On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:

On 11/17/2010 07:41 AM, Chris Jewell wrote:

On 17 Nov 2010, at 11:56, Terry Dontje wrote:

You are absolutely correct, Terry, and the 1.4 release series does 
include the proper code. The point here, though, is that SGE binds the orted to 
a single core, even though other cores are also allocated. So the orted detects 
an external binding of one core, and binds all its children to that same core.

I do not think you are right here.  Chris sent the following which looks like OGE 
(fka SGE) actually did bind the hnp to multiple cores.  However that message I believe is 
not coming from the processes themselves and actually is only shown by the hnp.  I wonder 
if Chris adds a "-bind-to-core" option  we'll see more output from the a.out's 
before they exec unterm?

As requested using

$ qsub -pe mpi 8 -binding linear:2 myScript.com'

and

'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
-bind-to-core ./unterm'

[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE 
shows slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE 
shows slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE 
shows slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE 
shows slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE 
shows slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE 
shows slots=1

No more info.  I note that the external binding is slightly different 
to what I had before, but our cluster is busier today :-)


I would have expected more output.

--td


Chris



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
--leave-session-attached is always required if you want to see output from
the daemons. Otherwise, the launcher closes the ssh session (or qrsh
session, in this case) as part of its normal operating procedure, thus
terminating the stdout/err channel.


On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje wrote:

>  On 11/17/2010 09:32 AM, Ralph Castain wrote:
>
> Chris' output is coming solely from the HNP, which is correct given the way
> things were executed. My comment was from another email where he did what I
> asked, which was to include the flags:
>
>  --report-bindings --leave-session-attached
>
>  so we could see the output from each orted. In that email, it was clear
> that while mpirun was bound to multiple cores, the orteds are being bound to
> a -single- core.
>
>  Hence the problem.
>
>  Hmm, I see Ralph's comment on 11/15 but I don't see any output that shows
> what Ralph says above.  The only report-bindings output I see is when he
> runs without OGE binding.   Can someone give me the date and time of Chris'
> email with the --report-bindings and --leave-session-attached.  Or a rerun
> of the below with the --leave-session-attached option would also help.
>
> I find it confusing that --leave-session-attached is not required when the
> OGE binding argument is not given.
>
> --td
>
>  HTH
> Ralph
>
>
> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:
>
>>  On 11/17/2010 07:41 AM, Chris Jewell wrote:
>>
>> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>>
>>  You are absolutely correct, Terry, and the 1.4 release series does include 
>> the proper code. The point here, though, is that SGE binds the orted to a 
>> single core, even though other cores are also allocated. So the orted 
>> detects an external binding of one core, and binds all its children to that 
>> same core.
>>
>>  I do not think you are right here.  Chris sent the following which looks 
>> like OGE (fka SGE) actually did bind the hnp to multiple cores.  However 
>> that message I believe is not coming from the processes themselves and 
>> actually is only shown by the hnp.  I wonder if Chris adds a "-bind-to-core" 
>> option  we'll see more output from the a.out's before they exec unterm?
>>
>>  As requested using
>>
>> $ qsub -pe mpi 8 -binding linear:2 myScript.com'
>>
>> and
>>
>> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
>> -bind-to-core ./unterm'
>>
>> [exec5:06671] System has detected external process binding to cores 0028
>> [exec5:06671] ras:gridengine: JOB_ID: 59434
>> [exec5:06671] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
>> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>>
>> No more info.  I note that the external binding is slightly different to 
>> what I had before, but our cluster is busier today :-)
>>
>>
>>  I would have expected more output.
>>
>> --td
>>
>>  Chris
>>
>>


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje

On 11/17/2010 09:32 AM, Ralph Castain wrote:
Chris' output is coming solely from the HNP, which is correct given the 
way things were executed. My comment was from another email where he 
did what I asked, which was to include the flags:


--report-bindings --leave-session-attached

so we could see the output from each orted. In that email, it was 
clear that while mpirun was bound to multiple cores, the orteds are 
being bound to a -single- core.


Hence the problem.

Hmm, I see Ralph's comment on 11/15 but I don't see any output that 
shows what Ralph says above.  The only report-bindings output I see is 
when he runs without OGE binding.   Can someone give me the date and 
time of Chris' email with the --report-bindings and 
--leave-session-attached.  Or a rerun of the below with the 
--leave-session-attached option would also help.


I find it confusing that --leave-session-attached is not required when 
the OGE binding argument is not given.


--td

HTH
Ralph


On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:


On 11/17/2010 07:41 AM, Chris Jewell wrote:

On 17 Nov 2010, at 11:56, Terry Dontje wrote:

You are absolutely correct, Terry, and the 1.4 release series does include 
the proper code. The point here, though, is that SGE binds the orted to a 
single core, even though other cores are also allocated. So the orted detects 
an external binding of one core, and binds all its children to that same core.

I do not think you are right here.  Chris sent the following which looks like OGE 
(fka SGE) actually did bind the hnp to multiple cores.  However that message I believe is 
not coming from the processes themselves and actually is only shown by the hnp.  I wonder 
if Chris adds a "-bind-to-core" option  we'll see more output from the a.out's 
before they exec unterm?

As requested using

$ qsub -pe mpi 8 -binding linear:2 myScript.com'

and

'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
-bind-to-core ./unterm'

[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1

No more info.  I note that the external binding is slightly different to 
what I had before, but our cluster is busier today :-)


I would have expected more output.

--td


Chris



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Ralph Castain
Chris' output is coming solely from the HNP, which is correct given the way
things were executed. My comment was from another email where he did what I
asked, which was to include the flags:

--report-bindings --leave-session-attached

so we could see the output from each orted. In that email, it was clear that
while mpirun was bound to multiple cores, the orteds are being bound to a
-single- core.

Hence the problem.

HTH
Ralph


On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje wrote:

>  On 11/17/2010 07:41 AM, Chris Jewell wrote:
>
> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>
>  You are absolutely correct, Terry, and the 1.4 release series does include 
> the proper code. The point here, though, is that SGE binds the orted to a 
> single core, even though other cores are also allocated. So the orted detects 
> an external binding of one core, and binds all its children to that same core.
>
>  I do not think you are right here.  Chris sent the following which looks 
> like OGE (fka SGE) actually did bind the hnp to multiple cores.  However that 
> message I believe is not coming from the processes themselves and actually is 
> only shown by the hnp.  I wonder if Chris adds a "-bind-to-core" option  
> we'll see more output from the a.out's before they exec unterm?
>
>  As requested using
>
> $ qsub -pe mpi 8 -binding linear:2 myScript.com'
>
> and
>
> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
> -bind-to-core ./unterm'
>
> [exec5:06671] System has detected external process binding to cores 0028
> [exec5:06671] ras:gridengine: JOB_ID: 59434
> [exec5:06671] ras:gridengine: PE_HOSTFILE: 
> /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
> slots=2
> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
> slots=2
> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
>
> No more info.  I note that the external binding is slightly different to what 
> I had before, but our cluster is busier today :-)
>
>
>  I would have expected more output.
>
> --td
>
>  Chris
>
>


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje

On 11/17/2010 07:41 AM, Chris Jewell wrote:

On 17 Nov 2010, at 11:56, Terry Dontje wrote:

You are absolutely correct, Terry, and the 1.4 release series does include the 
proper code. The point here, though, is that SGE binds the orted to a single 
core, even though other cores are also allocated. So the orted detects an 
external binding of one core, and binds all its children to that same core.

I do not think you are right here.  Chris sent the following which looks like OGE (fka 
SGE) actually did bind the hnp to multiple cores.  However that message I believe is not 
coming from the processes themselves and actually is only shown by the hnp.  I wonder if 
Chris adds a "-bind-to-core" option  we'll see more output from the a.out's 
before they exec unterm?

As requested using

$ qsub -pe mpi 8 -binding linear:2 myScript.com'

and

'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
-bind-to-core ./unterm'

[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1

No more info.  I note that the external binding is slightly different to what I 
had before, but our cluster is busier today :-)


I would have expected more output.

--td

Chris



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Chris Jewell
On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>> 
>> You are absolutely correct, Terry, and the 1.4 release series does include 
>> the proper code. The point here, though, is that SGE binds the orted to a 
>> single core, even though other cores are also allocated. So the orted 
>> detects an external binding of one core, and binds all its children to that 
>> same core.
> I do not think you are right here.  Chris sent the following which looks like 
> OGE (fka SGE) actually did bind the hnp to multiple cores.  However that 
> message I believe is not coming from the processes themselves and actually is 
> only shown by the hnp.  I wonder whether, if Chris adds a "-bind-to-core" 
> option, we'll see more output from the a.out's before they exec unterm?

As requested using 

$ qsub -pe mpi 8 -binding linear:2 myScript.com'  

and 

'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
-bind-to-core ./unterm'

[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1

No more info.  I note that the external binding is slightly different to what I 
had before, but our cluster is busier today :-)
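
(For reference: the "cores 0028" value above reads like a hexadecimal affinity
bitmask, so it can be decoded with a small shell sketch like the one below.
That reading is an assumption from how these messages are formatted, not taken
from the OMPI source:

  mask=0x0028     # value from "external process binding to cores 0028"
  for i in $(seq 0 31); do
      (( (mask >> i) & 1 )) && echo "core $i allowed"
  done
  # 0x0028 = binary 101000, i.e. cores 3 and 5

By the same reading, the 0022 reported in the earlier run would be cores 1
and 5.)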

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Terry Dontje

On 11/16/2010 08:24 PM, Ralph Castain wrote:



On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje wrote:


On 11/16/2010 01:31 PM, Reuti wrote:

Hi Ralph,

Am 16.11.2010 um 15:40 schrieb Ralph Castain:


2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
does this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion, whether the 
shepherd should get exactly one core (in case you use more than one `qrsh` per node) for each call, 
or *all* cores assigned (which we need right now, as the processes in Open MPI will be forks of the 
orte daemon). About such a situation I filed an issue a long time ago and 
"limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this setting should 
then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

I believe this is indeed the crux of the issue

fantastic to share the same view.


FWIW, I think I agree too.


3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each 
node, but to bind each proc to all of them (i.e., don't bind a proc to a 
specific core). I'm pretty sure that is a standard SGE option today (at least, 
I know it used to be). I don't believe any patch or devel work is required (to 
either SGE or OMPI).

When you use a fixed allocation_rule and a matching -binding request it 
will work today. But any other case won't be distributed in the correct way.

Is it possible to not include the -binding request? If SGE is told to use a 
fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
the orted see
itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.



We would then be okay as the spawned children of orted would inherit its 
binding. Just don't tell mpirun to bind the processes and the threads of those 
MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
-binding given), but doesn't bind the orted to any two specific cores? If so, 
then that would be a problem as the orted would think itself unconstrained. If 
I understand the thread correctly, you're saying that this is what happens 
today - true?

Exactly. It won't apply any binding at all and orted would think of being 
unlimited. I.e. limited only by the number of slots it should use thereon.


So I guess the question I have for Ralph is this.  I thought, and this
might be mixing some of the ideas Jeff and I've been talking
about, that when an RM executes the orted with a bound set of
resources (i.e., cores), that orted would bind the individual
processes to a subset of the bound resources.  Is this not
really the case for the 1.4.X branch?  I believe it is the case for
the trunk based on Jeff's refactoring.


You are absolutely correct, Terry, and the 1.4 release series does 
include the proper code. The point here, though, is that SGE binds the 
orted to a single core, even though other cores are also allocated. So 
the orted detects an external binding of one core, and binds all its 
children to that same core.
I do not think you are right here.  Chris sent the following, which looks 
like OGE (fka SGE) actually did bind the hnp to multiple cores.  However, 
that message I believe is not coming from the processes themselves and 
is actually only shown by the hnp.  I wonder whether, if Chris adds a 
"-bind-to-core" option, we'll see more output from the a.out's before 
they exec unterm?


Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':
 
 [exec4:17384] System has detected external process binding to cores 0022

 [exec4:17384] ras:gridengine: JOB_ID: 59352
 [exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
 [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
 [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
 [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1




--td
What I had suggested to Reuti was to not include the -binding flag to 
SGE in the hopes that SGE would then bind the orted to all the 
allocated cores. However, as I 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Daniel Gruber
Hi, 

I'm interested in what is expected from OGE/SGE in order to support 
most of your scenarios. First of all, the "-binding pe" request is 
not flexible and only makes sense in scenarios where each host has the 
same architecture, each involved host is 
used exclusively for the job (SGE exclusive job feature), 
and the same number of slots is allocated for each 
host (fixed allocation rule). SGE just writes out the 
socket,core tuples (determined on the master task host) into 
the pe_hostfile (the same for each host!). SGE does no 
binding itself. Therefore I think we should have a deeper 
look at the more flexible "-binding [set] ". 

1. One qrsh (--inherit) per slot

If a (legacy) parallel application does a qrsh for *each* granted 
slot (regardless of whether it targets the local host or a remote host), 
this should work out of the box with OGE/SGE with the 
"-binding linear:1" request in OGE tight integration. 
What might be confusing here is that when doing a "qstat -cb -j " 
only one core is shown as allocated (which is a bug). 
But when looking at the host level (qstat -F m_topology_inuse) 
the allocated cores can be seen. This should work with 
different allocation rules.
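
A quick way to check this on a running job (a sketch; substitute the job id
that qsub reports, and note that myScript.com is just the script name already
used in this thread):

  qsub -pe mpi 8 -binding linear:1 myScript.com
  qstat -F m_topology_inuse    # host-level view of which cores are in use
  qstat -cb -j <jobid>         # job-level view (currently shows only one core, the bug above)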

2. One qrsh per host (OpenMPI case)

This should work under the following constraints:
- OGE tight integration (control_slaves true)
- fixed allocation schema (allocation_rule N)
Then what is needed is simply to call qsub with 
"-binding linear:N". Then the master script on 
the master host and all orteds on the remote 
hosts are bound (if there are free cores) to 
N successive cores. Here the orted detects 
this and binds its threads each to one of the 
detected cores (when the mpirun command line parameter 
is present) - right? 
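
A concrete sketch of that case (the PE name, N=2 and the a.out name are just
examples; the PE has to use a fixed allocation_rule matching the -binding
amount):

  # PE "mpi" configured with: allocation_rule 2, control_slaves TRUE
  qsub -pe mpi 8 -binding linear:2 myScript.com
  # inside myScript.com:
  mpirun -bycore -bind-to-core -report-bindings --leave-session-attached ./a.out

With 8 slots and 2 slots per host the job spans 4 hosts, and each orted should
then see itself externally bound to 2 cores.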

What does not work is having an OGE/SGE allocation_rule of
$round_robin or $fill_up, since the number of slots 
per host is unknown at submission time and differs 
for each host. Am I right that this is currently the 
only drawback when using SGE and OpenMPI?

The next thing in the discussion was the alignment of 
cores and slots. Because the term "slots" is 
very flexible in SGE/OGE and does not in all cases 
reflect the number of cores (in the case of SMT, for example), 
a compiled-in mapping does not exist at the moment.
What people could do is enforce such a mapping 
via JSV scripts, which do the necessary reformulation 
of the request (modify #slots or #cores if necessary).
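
A deliberately naive JSV sketch of that idea is below. It only makes sense for
a single-host PE (allocation_rule $pe_slots), since it simply copies the
requested slot count into the binding amount; a real JSV would need the
per-host slot count, which (as noted above) is not known at submission time.
The parameter names (pe_max, binding_strategy, binding_type, binding_amount)
are recalled from the JSV documentation and should be checked against the
jsv man page on your SGE/OGE version:

  #!/bin/bash
  # naive slots->cores JSV sketch (assumptions as described above)
  . "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

  jsv_on_start() {
      return
  }

  jsv_on_verify() {
      slots=$(jsv_get_param pe_max)           # upper bound of the -pe range
      if [ -n "$slots" ]; then
          jsv_set_param binding_strategy linear
          jsv_set_param binding_type set
          jsv_set_param binding_amount "$slots"
          jsv_correct "binding aligned to requested slot count"
          return
      fi
      jsv_accept "no PE request, nothing to change"
  }

  jsv_main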

Did I miss some important points from the SGE/OGE point of 
view? 


Cheers

Daniel


Am Dienstag, den 16.11.2010, 18:24 -0700 schrieb Ralph Castain:
> 
> 
> On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje
>  wrote:
> On 11/16/2010 01:31 PM, Reuti wrote: 
> > Hi Ralph,
> > 
> > Am 16.11.2010 um 15:40 schrieb Ralph Castain:
> > 
> > > > 2. have SGE bind procs it launches to -all- of those cores. I 
> believe SGE does this automatically to constrain the procs to running on only 
> those cores.
> > > This is another "bug/feature" in SGE: it's a matter of 
> discussion, whether the shepherd should get exactly one core (in case you use 
> more than one `qrsh` per node) for each call, or *all* cores assigned (which 
> we need right now, as the processes in Open MPI will be forks of the orte 
> daemon). About such a situation I filed an issue a long time ago and 
> "limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this 
> setting should then also change the core allocation of the master process):
> > > 
> > > http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> > > 
> > > I believe this is indeed the crux of the issue
> > fantastic to share the same view.
> > 
> FWIW, I think I agree too.
> 
> > > > 3. tell OMPI to --bind-to-core.
> > > > 
> > > > In other words, tell SGE to allocate a certain number of cores 
> on each node, but to bind each proc to all of them (i.e., don't bind a proc 
> to a specific core). I'm pretty sure that is a standard SGE option today (at 
> least, I know it used to be). I don't believe any patch or devel work is 
> required (to either SGE or OMPI).
> > > When you use a fixed allocation_rule and a matching -binding 
> request it will work today. But any other case won't be distributed in the 
> correct way.
> > > 
> > > Is it possible to not include the -binding request? If SGE is 
> told to use a fixed allocation_rule, and to allocate (for example) 2 
> cores/node, then won't the orted see 
> > > itself bound to two specific cores on each node?
> > When you leave out the -binding, all jobs are allowed to run on any 
> core.
> > 
> > 
> > > We would then be okay as the spawned children of orted would 
> inherit its binding. Just don't tell mpirun to bind the processes and the 
> threads of those MPI procs will be able to operate across the provided cores.
> > > 
> 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje wrote:

>  On 11/16/2010 01:31 PM, Reuti wrote:
>
> Hi Ralph,
>
> Am 16.11.2010 um 15:40 schrieb Ralph Castain:
>
>
>  2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> does this automatically to constrain the procs to running on only those cores.
>
>  This is another "bug/feature" in SGE: it's a matter of discussion, whether 
> the shepherd should get exactly one core (in case you use more than one 
> `qrsh` per node) for each call, or *all* cores assigned (which we need right 
> now, as the processes in Open MPI will be forks of the orte daemon). About such a 
> situation I filed an issue a long time ago and "limit_to_one_qrsh_per_host 
> yes/no" in the PE definition would do (this setting should then also change 
> the core allocation of the master process):
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>
> I believe this is indeed the crux of the issue
>
>  fantastic to share the same view.
>
>
>  FWIW, I think I agree too.
>
>   3. tell OMPI to --bind-to-core.
>
> In other words, tell SGE to allocate a certain number of cores on each node, 
> but to bind each proc to all of them (i.e., don't bind a proc to a specific 
> core). I'm pretty sure that is a standard SGE option today (at least, I know 
> it used to be). I don't believe any patch or devel work is required (to 
> either SGE or OMPI).
>
>  When you use a fixed allocation_rule and a matching -binding request it will 
> work today. But any other case won't be distributed in the correct way.
>
> Is it possible to not include the -binding request? If SGE is told to use a 
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
> the orted see
> itself bound to two specific cores on each node?
>
>  When you leave out the -binding, all jobs are allowed to run on any core.
>
>
>
>  We would then be okay as the spawned children of orted would inherit its 
> binding. Just don't tell mpirun to bind the processes and the threads of 
> those MPI procs will be able to operate across the provided cores.
>
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
> -binding given), but doesn't bind the orted to any two specific cores? If so, 
> then that would be a problem as the orted would think itself unconstrained. 
> If I understand the thread correctly, you're saying that this is what happens 
> today - true?
>
>  Exactly. It won't apply any binding at all, and the orted would think of itself as 
> unbound, i.e. limited only by the number of slots it should use there.
>
>
>  So I guess the question I have for Ralph.  I thought, and this might be
> mixing some of the ideas Jeff and I've been talking about, that when a RM
> executes the orted with a bound set of resources (ie cores) that orted would
> bind the individual processes on a subset of the bounded resources.  Is this
> not really the case for 1.4.X branch?  I believe it is the case for the
> trunk based on Jeff's refactoring.
>

You are absolutely correct, Terry, and the 1.4 release series does include
the proper code. The point here, though, is that SGE binds the orted to a
single core, even though other cores are also allocated. So the orted
detects an external binding of one core, and binds all its children to that
same core.

What I had suggested to Reuti was to not include the -binding flag to SGE in
the hopes that SGE would then bind the orted to all the allocated cores.
However, as I feared, SGE in that case doesn't bind the orted at all - and
so we assume the entire node is available for our use.

This is an SGE issue. We need them to bind the orted to -all- the allocated
cores (and only those cores) in order for us to operate correctly.
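
A quick way to see what the orted actually inherits (a sketch, assuming a Linux
execution host where /proc is available; core numbering is machine-specific) is
to print the allowed CPU list from inside the job script right before mpirun
starts:

    #!/bin/bash
    # myScript.com (sketch): show the CPU set the SGE shepherd handed us;
    # the orted and the MPI processes it forks will inherit this same set.
    grep Cpus_allowed_list /proc/self/status
    mpirun -report-bindings ./unterm

If SGE bound the shepherd to all of the allocated cores, the list shows exactly
those cores; if it lists every core on the node, no binding was applied and the
orted will treat the whole node as available.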



>
>
> --
> [image: Oracle]
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
>  Oracle * - Performance Technologies*
>  95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 01:31 PM, Reuti wrote:

Hi Ralph,

Am 16.11.2010 um 15:40 schrieb Ralph Castain:


2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core (in case you use more than one `qrsh` per node) for each call, or *all* 
cores assigned (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). About such a situation I filed an issue a long time ago, and a 
"limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this setting should 
then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

I believe this is indeed the crux of the issue

fantastic to share the same view.


FWIW, I think I agree too.

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

Is it possible to not include the -binding request? If SGE is told to use a 
fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
the orted see
itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.



We would then be okay as the spawned children of orted would inherit its 
binding. Just don't tell mpirun to bind the processes and the threads of those 
MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
-binding given), but doesn't bind the orted to any two specific cores? If so, 
then that would be a problem as the orted would think itself unconstrained. If 
I understand the thread correctly, you're saying that this is what happens 
today - true?

Exactly. It won't apply any binding at all, and the orted would think of itself as 
unbound, i.e. limited only by the number of slots it should use there.

So I guess the question I have for Ralph.  I thought, and this might be 
mixing some of the ideas Jeff and I've been talking about, that when a 
RM executes the orted with a bound set of resources (ie cores) that 
orted would bind the individual processes on a subset of the bounded 
resources.  Is this not really the case for 1.4.X branch?  I believe it 
is the case for the trunk based on Jeff's refactoring.


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi Ralph,

Am 16.11.2010 um 15:40 schrieb Ralph Castain:

> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> > does this automatically to constrain the procs to running on only those 
> > cores.
> 
> This is another "bug/feature" in SGE: it's a matter of discussion whether 
> the shepherd should get exactly one core (in case you use more than one 
> `qrsh` per node) for each call, or *all* cores assigned (which we need right 
> now, as the processes in Open MPI will be forks of the orte daemon). About such a 
> situation I filed an issue a long time ago, and a "limit_to_one_qrsh_per_host 
> yes/no" switch in the PE definition would do (this setting should then also change 
> the core allocation of the master process):
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> 
> I believe this is indeed the crux of the issue

fantastic to share the same view.


> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each 
> > node, but to bind each proc to all of them (i.e., don't bind a proc to a 
> > specific core). I'm pretty sure that is a standard SGE option today (at 
> > least, I know it used to be). I don't believe any patch or devel work is 
> > required (to either SGE or OMPI).
> 
> When you use a fixed allocation_rule and a matching -binding request it will 
> work today. But any other case won't be distributed in the correct way.
> 
> Is it possible to not include the -binding request? If SGE is told to use a 
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
> the orted see 
> itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.


> We would then be okay as the spawned children of orted would inherit its 
> binding. Just don't tell mpirun to bind the processes and the threads of 
> those MPI procs will be able to operate across the provided cores.
> 
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
> -binding given), but doesn't bind the orted to any two specific cores? If so, 
> then that would be a problem as the orted would think itself unconstrained. 
> If I understand the thread correctly, you're saying that this is what happens 
> today - true?

Exactly. It won't apply any binding at all, and the orted would think of itself as 
unbound, i.e. limited only by the number of slots it should use there.

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 17:25, Terry Dontje wrote:
>>> 
>> Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe 
>> mpi 8 -binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
>> ras_gridengine_verbose 100 --report-bindings ./unterm':
>> 
>> [exec4:17384] System has detected external process binding to cores 0022
>> [exec4:17384] ras:gridengine: JOB_ID: 59352
>> [exec4:17384] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
>> [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> 
>> 
>> 
> Is that all that came out?  I would have expected some output from each 
> process after the orted forked the processes but before the exec of unterm.

Yes.  It appears that if orted detects binding done by external processes, then 
this is all you get.  Scratch the GE enforced binding, and you get:

[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to 
cpus 0001
[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to 
cpus 0002
[exec7:06781] [[23443,0],2] odls:default:fork binding child [[23443,1],3] to 
cpus 0001
[exec2:24160] [[23443,0],1] odls:default:fork binding child [[23443,1],2] to 
cpus 0001
[exec6:30097] [[23443,0],4] odls:default:fork binding child [[23443,1],5] to 
cpus 0001
[exec5:02736] [[23443,0],6] odls:default:fork binding child [[23443,1],7] to 
cpus 0001
[exec1:30779] [[23443,0],5] odls:default:fork binding child [[23443,1],6] to 
cpus 0001
[exec3:12818] [[23443,0],3] odls:default:fork binding child [[23443,1],4] to 
cpus 0001
.
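
If I read the -report-bindings output right, the "cpus NNNN" values are
hexadecimal core masks: 0022 is binary 100010, i.e. cores 1 and 5, while 0001
and 0002 are cores 0 and 1. A quick way to decode such a mask (assuming bc is
installed):

    echo 'obase=2; ibase=16; 0022' | bc    # prints 100010, i.e. bits 1 and 5 set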


C
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 12:13 PM, Chris Jewell wrote:

On 16 Nov 2010, at 14:26, Terry Dontje wrote:

In the original case of 7 nodes and processes if we do -binding pe linear:2, 
and add the -bind-to-core to mpirun  I'd actually expect 6 of the nodes 
processes bind to one core and the 7th node with 2 processes to have each of 
those processes bound to different cores on the same machine.

Can we get a full output of such a run with -report-bindings turned on.  I 
think we should find out that things actually are happening correctly except 
for the fact that the 6 of the nodes have 2 cores allocated but only one is 
being bound to by a process.

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':

[exec4:17384] System has detected external process binding to cores 0022
[exec4:17384] ras:gridengine: JOB_ID: 59352
[exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
[exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1


Is that all that came out?  I would have expected some output from 
each process after the orted forked the processes but before the exec of 
unterm.


--td

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778






___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 14:26, Terry Dontje wrote:
> 
> In the original case of 7 nodes and processes if we do -binding pe linear:2, 
> and add the -bind-to-core to mpirun  I'd actually expect 6 of the nodes 
> processes bind to one core and the 7th node with 2 processes to have each of 
> those processes bound to different cores on the same machine.
> 
> Can we get a full output of such a run with -report-bindings turned on.  I 
> think we should find out that things actually are happening correctly except 
> for the fact that the 6 of the nodes have 2 cores allocated but only one is 
> being bound to by a process.

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':

[exec4:17384] System has detected external process binding to cores 0022
[exec4:17384] ras:gridengine: JOB_ID: 59352
[exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
[exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1


Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 10:59 AM, Reuti wrote:

Am 16.11.2010 um 15:26 schrieb Terry Dontje:



1. allocate a specified number of cores on each node to your job


this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync AFAICS.


Technically this isn't a bug but a gap in the allocation rule.  I think the 
solution is a new allocation rule.

Yes, you can phrase it this way. But what do you mean by "new allocation rule"?
The proposal is to have a slot allocation rule that forces the number of 
cores allocated on each node to equal the number of slots.

The slot allocation should follow the specified cores?

The other way around I think.
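
In other words, a sketch of what the pe_hostfile from the linear:2 example
elsewhere in this thread would then contain (host names and the socket,core
binding notation are reused from that example):

    exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
    exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1

and likewise a single bound core for each of the other five one-slot hosts.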



2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.


This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core (in case you use more than one `qrsh` per node) for each call, or *all* 
cores assigned (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). About such a situation I filed an issue a long time ago, and a 
"limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this setting should 
then also change the core allocation of the master process):


http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

Isn't it almost required to have the shepherd bind to all the cores so that the 
orted inherits that binding?

Yes, for orted. But if you have any other (legacy) application which issues 
N `qrsh` calls to an exechost when you got N slots there, then only one 
core should be bound to each of the started shepherds.

Blech.  Not sure of the solution for that but I see what you are saying 
now :-).

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).


When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.


Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

You posted the two cases yesterday. Do we agree that both cases aren't correct, or do you think it's a 
correct allocation for both cases? Even if it could be "repaired" in Open MPI, it would be 
better to fix the generated 'pe' PE hostfile and 'set' allocation, i.e. the "slot<=>  
cores" relation.


So I am not a GE type of guy but from what I've been led to believe what 
happened is correct (in some form of correct).  That is in case one we 
asked for a core allocation of 1 core per node and a core allocation of 
2 cores in the other case.  That is what we were given.  The fact that 
we distributed the slots in a non-uniform manner I am not sure is GE's 
fault.  Note I can understand where it may seem non-intuitive and not 
nice for people wanting to do things like this.

In the original case of 7 nodes and processes if we do -binding pe linear:2, 
and add the -bind-to-core to mpirun  I'd actually expect 6 of the nodes 
processes bind to one core and the 7th node with 2 processes to have each of 
those processes bound to different cores on the same machine.

Yes, possibly it could be repaired this way (for now I have no free machines to play with). But 
then the cores "reserved" by the "-binding pe linear:2" are lost for other 
processes on these 6 nodes, and the core count gets out of sync with the slot count.
Right, if you want to rightsize the amount of cores allocated to slots 
allocated on each node then we are stuck unless a new allocation rule is 
made.

Can we get a full output of such a run with -report-bindings turned on.  I 
think we should find out that things actually are happening correctly except 
for the fact that the 6 of the nodes have 2 cores allocated but only one is 
being bound to by a process.

You mean, to accept the current behavior as the intended one, since with only 
one job running on these machines we get what we asked for - despite the 
fact that cores are lost for other processes?

Yes, that is what I mean.  I first would like to prove at least to 
myself things are working the way we think they are.  I believe the 
discussion of recovering the lost cores is the next step.  Either we 
redefine what -binding linear:X means in light of slots, we make a new 
allocation rule -binding slots:X or live with the lost cores.  Note, the 
"we" here is loosely used.  I am by no means the keeper of GE and just 
injected myself in this discussion because, like Ralph, I have dealt 
with binding and I work for Oracle which develops GE.  Just to be clear 
I do not work 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti

Am 16.11.2010 um 15:26 schrieb Terry Dontje:

>>> 
>>> 1. allocate a specified number of cores on each node to your job
>>> 
>> this is currently the bug in the "slot <=> core" relation in SGE, which has 
>> to be removed, updated or clarified. For now slot and core count are out of 
>> sync AFAICS.
>> 
> Technically this isn't a bug but a gap in the allocation rule.  I think the 
> solution is a new allocation rule.

Yes, you can phrase it this way. But what do you mean by "new allocation rule"? 
The slot allocation should follow the specified cores? 


>>> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
>>> does this automatically to constrain the procs to running on only those 
>>> cores.
>>> 
>> This is another "bug/feature" in SGE: it's a matter of discussion whether 
>> the shepherd should get exactly one core (in case you use more than one 
>> `qrsh` per node) for each call, or *all* cores assigned (which we need right 
>> now, as the processes in Open MPI will be forks of the orte daemon). About such 
>> a situation I filed an issue a long time ago, and a 
>> "limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this 
>> setting should then also change the core allocation of the master process):
>> 
>> 
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> Isn't it almost required to have the shepherd bind to all the cores so that 
> the orted inherits that binding?

Yes, for orted. But if you have any other (legacy) application which issues 
N `qrsh` calls to an exechost when you got N slots there, then only one 
core should be bound to each of the started shepherds.
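
A sketch of that legacy pattern (worker.sh is just a placeholder here, and the
PE is assumed to allow `qrsh -inherit` to the granted hosts):

    #!/bin/bash
    # Legacy-style start-up: one `qrsh -inherit` per granted slot, so every
    # remote task gets its own shepherd - and could get its own single core.
    while read host slots queue binding; do
        for i in $(seq 1 "$slots"); do
            qrsh -inherit "$host" ./worker.sh &   # worker.sh: placeholder task
        done
    done < "$PE_HOSTFILE"
    wait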


>>> 3. tell OMPI to --bind-to-core.
>>> 
>>> In other words, tell SGE to allocate a certain number of cores on each 
>>> node, but to bind each proc to all of them (i.e., don't bind a proc to a 
>>> specific core). I'm pretty sure that is a standard SGE option today (at 
>>> least, I know it used to be). I don't believe any patch or devel work is 
>>> required (to either SGE or OMPI).
>>> 
>> When you use a fixed allocation_rule and a matching -binding request it will 
>> work today. But any other case won't be distributed in the correct way.
>> 
> Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

You posted the two cases yesterday. Do we agree that both cases aren't correct, 
or do you think it's a correct allocation for both cases? Even if it could be 
"repaired" in Open MPI, it would be better to fix the generated 'pe' PE 
hostfile and 'set' allocation, i.e. the "slot <=> cores" relation.


> In the original case of 7 nodes and processes if we do -binding pe linear:2, 
> and add the -bind-to-core to mpirun  I'd actually expect 6 of the nodes 
> processes bind to one core and the 7th node with 2 processes to have each of 
> those processes bound to different cores on the same machine.

Yes, possibly it could be repaired this way (for now I have no free machines to 
play with). But then the cores "reserved" by the "-binding pe linear:2" are 
lost for other processes on these 6 nodes, and the core count gets out of sync 
with the slot count.


> Can we get a full output of such a run with -report-bindings turned on.  I 
> think we should find out that things actually are happening correctly except 
> for the fact that the 6 of the nodes have 2 cores allocated but only one is 
> being bound to by a process.

You mean, to accept the current behavior as the intended one, since with only 
one job running on these machines we get what we asked for - despite the fact 
that cores are lost for other processes?

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Hi Reuti


> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE
> does this automatically to constrain the procs to running on only those
> cores.
>
> This is another "bug/feature" in SGE: it's a matter of discussion whether
> the shepherd should get exactly one core (in case you use more than one
> `qrsh` per node) for each call, or *all* cores assigned (which we need right
> now, as the processes in Open MPI will be forks of the orte daemon). About such
> a situation I filed an issue a long time ago, and a
> "limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this
> setting should then also change the core allocation of the master process):
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254


I believe this is indeed the crux of the issue


>
>
>
> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each
> node, but to bind each proc to all of them (i.e., don't bind a proc to a
> specific core). I'm pretty sure that is a standard SGE option today (at
> least, I know it used to be). I don't believe any patch or devel work is
> required (to either SGE or OMPI).
>
> When you use a fixed allocation_rule and a matching -binding request it
> will work today. But any other case won't be distributed in the correct way.
>

Is it possible to not include the -binding request? If SGE is told to use a
fixed allocation_rule, and to allocate (for example) 2 cores/node, then
won't the orted see itself bound to two specific cores on each node? We
would then be okay as the spawned children of orted would inherit its
binding. Just don't tell mpirun to bind the processes and the threads of
those MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no
-binding given), but doesn't bind the orted to any two specific cores? If
so, then that would be a problem as the orted would think itself
unconstrained. If I understand the thread correctly, you're saying that this
is what happens today - true?



>
> -- Reuti
>
>
> >
> >
> > On Tue, Nov 16, 2010 at 4:07 AM, Reuti 
> wrote:
> > Am 16.11.2010 um 10:26 schrieb Chris Jewell:
> >
> > > Hi all,
> > >
> > >> On 11/15/2010 02:11 PM, Reuti wrote:
> > >>> Just to give my understanding of the problem:
> > 
> > >> Sorry, I am still trying to grok all your email as what the
> problem you
> > >> are trying to solve. So is the issue is trying to have two jobs
> having
> > >> processes on the same node be able to bind there processes on
> different
> > >> resources. Like core 1 for the first job and core 2 and 3 for the
> 2nd job?
> > >>
> > >> --td
> > >> You can't get 2 slots on a machine, as it's limited by the core count
> to one here, so such a slot allocation shouldn't occur at all.
> > >
> > > So to clarify, the current -binding :
> allocates binding_amount cores to each sge_shepherd process associated with
> a job_id.  There appears to be only one sge_shepherd process per job_id per
> execution node, so all child processes run on these allocated cores.  This
> is irrespective of the number of slots allocated to the node.
> > >
> > > I agree with Reuti that the binding_amount parameter should be a
> maximum number of bound cores per node, with the actual number determined by
> the number of slots allocated per node.  FWIW, an alternative approach might
> be to have another binding_type ('slot', say) that automatically allocated
> one core per slot.
> > >
> > > Of course, a complex situation might arise if a user submits a combined
> MPI/multithreaded job, but then I guess we're into the realm of setting
> allocation_rule.
> >
> > IIRC there was a discussion on the [GE users] list about it, to get an
> uniform distribution on all slave nodes for such jobs, as also e.g.
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for
> hybrid jobs. Otherwise it would be necessary to adjust SGE to set this value
> in the "-builtin-" startup method automatically on all nodes to the local
> granted slots value. For now a fixed allocation rule of 1,2,4 or whatever
> must be used and you have to submit by reqeusting a wildcard PE to get any
> of these defined PEs for an even distribution and you don't care whether
> it's two times two slots, one time four slots, or four times one slot.
> >
> > In my understanding, any type of parallel job should always request and
> get the total number of slots equal to the cores it needs to execute.
> Independent whether these are threads, forks or any hybrid type of jobs.
> Otherwise any resource planing and reservation will most likely fail.
> Nevertheless, there might exist rare cases where you submit an exclusive
> serial job but create threads/forks in the end. But such a setup should be
> an exception, not the default.
> >
> >
> > > Is it going to be worth looking at creating a patch for this?
> >
> > 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 09:08 AM, Reuti wrote:

Hi,

Am 16.11.2010 um 14:07 schrieb Ralph Castain:


Perhaps I'm missing it, but it seems to me that the real problem lies in the interaction 
between SGE and OMPI during OMPI's two-phase launch. The verbose output shows that SGE 
dutifully allocated the requested number of cores on each node. However, OMPI launches 
only one process on each node (the ORTE daemon), which SGE "binds" to a single 
core since that is what it was told to do.

Since SGE never sees the local MPI procs spawned by ORTE, it can't assign 
bindings to them. The ORTE daemon senses its local binding (i.e., to a single 
core in the allocation), and subsequently binds all its local procs to that 
core.

I believe all you need to do is tell SGE to:

1. allocate a specified number of cores on each node to your job

this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync AFAICS.


Technically this isn't a bug but a gap in the allocation rule.  I think 
the solution is a new allocation rule.

2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core (in case you use more than one `qrsh` per node) for each call, or *all* 
cores assigned (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). About such a situation I filed an issue a long time ago, and a 
"limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this setting should 
then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
Isn't it almost required to have the shepherd bind to all the cores so 
that the orted inherits that binding?



3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

In the original case of 7 nodes and processes if we do -binding pe 
linear:2, and add the -bind-to-core to mpirun  I'd actually expect 6 of 
the nodes processes bind to one core and the 7th node with 2 processes 
to have each of those processes bound to different cores on the same 
machine.


Can we get a full output of such a run with -report-bindings turned on.  
I think we should find out that things actually are happening correctly 
except for the fact that the 6 of the nodes have 2 cores allocated but 
only one is being bound to by a process.


--td


-- Reuti




On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:
Am 16.11.2010 um 10:26 schrieb Chris Jewell:


Hi all,


On 11/15/2010 02:11 PM, Reuti wrote:

Just to give my understanding of the problem:

Sorry, I am still trying to grok all your email as what the problem you
are trying to solve. So is the issue is trying to have two jobs having
processes on the same node be able to bind there processes on different
resources. Like core 1 for the first job and core 2 and 3 for the 2nd job?

--td

You can't get 2 slots on a machine, as it's limited by the core count to one 
here, so such a slot allocation shouldn't occur at all.

So to clarify, the current -binding <binding_strategy>:<binding_amount>
allocates binding_amount cores to each sge_shepherd process associated with a job_id.  
There appears to be only one sge_shepherd process per job_id per execution node, so all 
child processes run on these allocated cores.  This is irrespective of the number of slots 
allocated to the node.

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

IIRC there was a discussion on the [GE users] list about it, to get an uniform 
distribution on all slave nodes for such jobs, as also e.g. $OMP_NUM_THREADS will be set 
to the same value for all slave nodes for hybrid jobs. Otherwise it would be necessary to 
adjust SGE to set this value in the "-builtin-" startup method automatically on 
all nodes to the local granted slots value. For now a fixed allocation rule of 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi,

Am 16.11.2010 um 14:07 schrieb Ralph Castain:

> Perhaps I'm missing it, but it seems to me that the real problem lies in the 
> interaction between SGE and OMPI during OMPI's two-phase launch. The verbose 
> output shows that SGE dutifully allocated the requested number of cores on 
> each node. However, OMPI launches only one process on each node (the ORTE 
> daemon), which SGE "binds" to a single core since that is what it was told to 
> do.
> 
> Since SGE never sees the local MPI procs spawned by ORTE, it can't assign 
> bindings to them. The ORTE daemon senses its local binding (i.e., to a single 
> core in the allocation), and subsequently binds all its local procs to that 
> core.
> 
> I believe all you need to do is tell SGE to:
> 
> 1. allocate a specified number of cores on each node to your job

this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync 
AFAICS.


> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> does this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the 
shepherd should get exactly one core (in case you use more than one `qrsh` per 
node) for each call, or *all* cores assigned (which we need right now, as the 
processes in Open MPI will be forks of the orte daemon). About such a situation I 
filed an issue a long time ago, and a "limit_to_one_qrsh_per_host yes/no" switch in the 
PE definition would do (this setting should then also change the core 
allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254


> 3. tell OMPI to --bind-to-core.
> 
> In other words, tell SGE to allocate a certain number of cores on each node, 
> but to bind each proc to all of them (i.e., don't bind a proc to a specific 
> core). I'm pretty sure that is a standard SGE option today (at least, I know 
> it used to be). I don't believe any patch or devel work is required (to 
> either SGE or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

-- Reuti


> 
> 
> On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:
> Am 16.11.2010 um 10:26 schrieb Chris Jewell:
> 
> > Hi all,
> >
> >> On 11/15/2010 02:11 PM, Reuti wrote:
> >>> Just to give my understanding of the problem:
> 
> >> Sorry, I am still trying to grok all your email as what the problem you
> >> are trying to solve. So is the issue is trying to have two jobs having
> >> processes on the same node be able to bind there processes on different
> >> resources. Like core 1 for the first job and core 2 and 3 for the 2nd 
> >> job?
> >>
> >> --td
> >> You can't get 2 slots on a machine, as it's limited by the core count to 
> >> one here, so such a slot allocation shouldn't occur at all.
> >
> > So to clarify, the current -binding : 
> > allocates binding_amount cores to each sge_shepherd process associated with 
> > a job_id.  There appears to be only one sge_shepherd process per job_id per 
> > execution node, so all child processes run on these allocated cores.  This 
> > is irrespective of the number of slots allocated to the node.
> >
> > I agree with Reuti that the binding_amount parameter should be a maximum 
> > number of bound cores per node, with the actual number determined by the 
> > number of slots allocated per node.  FWIW, an alternative approach might be 
> > to have another binding_type ('slot', say) that automatically allocated one 
> > core per slot.
> >
> > Of course, a complex situation might arise if a user submits a combined 
> > MPI/multithreaded job, but then I guess we're into the realm of setting 
> > allocation_rule.
> 
> IIRC there was a discussion on the [GE users] list about it, to get an 
> uniform distribution on all slave nodes for such jobs, as also e.g. 
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for hybrid 
> jobs. Otherwise it would be necessary to adjust SGE to set this value in the 
> "-builtin-" startup method automatically on all nodes to the local granted 
> slots value. For now a fixed allocation rule of 1,2,4 or whatever must be 
> used and you have to submit by reqeusting a wildcard PE to get any of these 
> defined PEs for an even distribution and you don't care whether it's two 
> times two slots, one time four slots, or four times one slot.
> 
> In my understanding, any type of parallel job should always request and get 
> the total number of slots equal to the cores it needs to execute. Independent 
> whether these are threads, forks or any hybrid type of jobs. Otherwise any 
> resource planing and reservation will most likely fail. Nevertheless, there 
> might exist rare cases where you submit an exclusive serial job but create 
> threads/forks in the end. But 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Perhaps I'm missing it, but it seems to me that the real problem lies in the
interaction between SGE and OMPI during OMPI's two-phase launch. The verbose
output shows that SGE dutifully allocated the requested number of cores on
each node. However, OMPI launches only one process on each node (the ORTE
daemon), which SGE "binds" to a single core since that is what it was told
to do.

Since SGE never sees the local MPI procs spawned by ORTE, it can't assign
bindings to them. The ORTE daemon senses its local binding (i.e., to a
single core in the allocation), and subsequently binds all its local procs
to that core.

I believe all you need to do is tell SGE to:

1. allocate a specified number of cores on each node to your job

2. have SGE bind procs it launches to -all- of those cores. I believe SGE
does this automatically to constrain the procs to running on only those
cores.

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node,
but to bind each proc to all of them (i.e., don't bind a proc to a specific
core). I'm pretty sure that is a standard SGE option today (at least, I know
it used to be). I don't believe any patch or devel work is required (to
either SGE or OMPI).
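
Concretely, the combination could look like this (a sketch that reuses the PE
and script names used elsewhere in the thread, and assumes the 'mpi' PE is set
up with a fixed allocation_rule of 2):

    # submit: 8 slots, 2 per node, and ask SGE to reserve 2 cores per node
    qsub -pe mpi 8 -binding pe linear:2 myScript.com

    # inside myScript.com: let Open MPI bind each local rank to one of those cores
    mpirun -bind-to-core -report-bindings ./unterm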



On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:

> Am 16.11.2010 um 10:26 schrieb Chris Jewell:
>
> > Hi all,
> >
> >> On 11/15/2010 02:11 PM, Reuti wrote:
> >>> Just to give my understanding of the problem:
> 
> >> Sorry, I am still trying to grok all your email as what the problem
> you
> >> are trying to solve. So is the issue is trying to have two jobs
> having
> >> processes on the same node be able to bind there processes on
> different
> >> resources. Like core 1 for the first job and core 2 and 3 for the
> 2nd job?
> >>
> >> --td
> >> You can't get 2 slots on a machine, as it's limited by the core count to
> one here, so such a slot allocation shouldn't occur at all.
> >
> > So to clarify, the current -binding :
> allocates binding_amount cores to each sge_shepherd process associated with
> a job_id.  There appears to be only one sge_shepherd process per job_id per
> execution node, so all child processes run on these allocated cores.  This
> is irrespective of the number of slots allocated to the node.
> >
> > I agree with Reuti that the binding_amount parameter should be a maximum
> number of bound cores per node, with the actual number determined by the
> number of slots allocated per node.  FWIW, an alternative approach might be
> to have another binding_type ('slot', say) that automatically allocated one
> core per slot.
> >
> > Of course, a complex situation might arise if a user submits a combined
> MPI/multithreaded job, but then I guess we're into the realm of setting
> allocation_rule.
>
> IIRC there was a discussion on the [GE users] list about it, to get an
> uniform distribution on all slave nodes for such jobs, as also e.g.
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for
> hybrid jobs. Otherwise it would be necessary to adjust SGE to set this value
> in the "-builtin-" startup method automatically on all nodes to the local
> granted slots value. For now a fixed allocation rule of 1,2,4 or whatever
> must be used and you have to submit by reqeusting a wildcard PE to get any
> of these defined PEs for an even distribution and you don't care whether
> it's two times two slots, one time four slots, or four times one slot.
>
> In my understanding, any type of parallel job should always request and get
> the total number of slots equal to the cores it needs to execute.
> Independent whether these are threads, forks or any hybrid type of jobs.
> Otherwise any resource planing and reservation will most likely fail.
> Nevertheless, there might exist rare cases where you submit an exclusive
> serial job but create threads/forks in the end. But such a setup should be
> an exception, not the default.
>
>
> > Is it going to be worth looking at creating a patch for this?
>
> Absolutely.
>
>
> >  I don't know much of the internals of SGE -- would it be hard work to
> do?  I've not that much time to dedicate towards it, but I could put some
> effort in if necessary...
>
> I don't know about the exact coding for it, but when it's for now a plain
> "copy" of the binding list, then it should become a loop to create a list of
> cores from the original specification until all granted slots got a core
> allocated.
>
> -- Reuti
>
>
> >
> > Chris
> >
> >
> > --
> > Dr Chris Jewell
> > Department of Statistics
> > University of Warwick
> > Coventry
> > CV4 7AL
> > UK
> > Tel: +44 (0)24 7615 0778
> >
> >
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Am 16.11.2010 um 10:26 schrieb Chris Jewell:

> Hi all,
> 
>> On 11/15/2010 02:11 PM, Reuti wrote: 
>>> Just to give my understanding of the problem: 
 
>> Sorry, I am still trying to grok all your email as what the problem you 
>> are trying to solve. So is the issue is trying to have two jobs having 
>> processes on the same node be able to bind there processes on different 
>> resources. Like core 1 for the first job and core 2 and 3 for the 2nd 
>> job? 
>> 
>> --td 
>> You can't get 2 slots on a machine, as it's limited by the core count to one 
>> here, so such a slot allocation shouldn't occur at all. 
> 
> So to clarify, the current -binding <binding_strategy>:<binding_amount>
> allocates binding_amount cores to each sge_shepherd process associated with a 
> job_id.  There appears to be only one sge_shepherd process per job_id per 
> execution node, so all child processes run on these allocated cores.  This is 
> irrespective of the number of slots allocated to the node.  
> 
> I agree with Reuti that the binding_amount parameter should be a maximum 
> number of bound cores per node, with the actual number determined by the 
> number of slots allocated per node.  FWIW, an alternative approach might be 
> to have another binding_type ('slot', say) that automatically allocated one 
> core per slot.
> 
> Of course, a complex situation might arise if a user submits a combined 
> MPI/multithreaded job, but then I guess we're into the realm of setting 
> allocation_rule.

IIRC there was a discussion on the [GE users] list about it, to get an uniform 
distribution on all slave nodes for such jobs, as also e.g. $OMP_NUM_THREADS 
will be set to the same value for all slave nodes for hybrid jobs. Otherwise it 
would be necessary to adjust SGE to set this value in the "-builtin-" startup 
method automatically on all nodes to the local granted slots value. For now a 
fixed allocation rule of 1,2,4 or whatever must be used and you have to submit 
by reqeusting a wildcard PE to get any of these defined PEs for an even 
distribution and you don't care whether it's two times two slots, one time four 
slots, or four times one slot.

In my understanding, any type of parallel job should always request and get the 
total number of slots equal to the cores it needs to execute. Independent 
whether these are threads, forks or any hybrid type of jobs. Otherwise any 
resource planing and reservation will most likely fail. Nevertheless, there 
might exist rare cases where you submit an exclusive serial job but create 
threads/forks in the end. But such a setup should be an exception, not the 
default.


> Is it going to be worth looking at creating a patch for this?

Absolutely.


>  I don't know much of the internals of SGE -- would it be hard work to do?  
> I've not that much time to dedicate towards it, but I could put some effort 
> in if necessary...

I don't know about the exact coding for it, but if for now it's a plain 
"copy" of the binding list, then it should become a loop that creates a list of 
cores from the original specification until all granted slots have got a core 
allocated.
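
As a rough illustration of the intended result (a sketch only, not SGE code; it
assumes the usual one-line-per-host pe_hostfile format and the $PE_HOSTFILE
variable SGE sets in the job environment):

    #!/bin/bash
    # For each granted host, the number of bound cores should follow the
    # granted slot count, capped by the amount given with -binding linear:N.
    AMOUNT=2   # the N from -binding linear:N (assumed here for illustration)
    while read host slots queue binding; do
        cores=$(( slots < AMOUNT ? slots : AMOUNT ))
        echo "$host: $slots slot(s) -> bind $cores core(s); pe_hostfile says: ${binding:-none}"
    done < "$PE_HOSTFILE"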

-- Reuti


> 
> Chris
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 04:26 AM, Chris Jewell wrote:

Hi all,


On 11/15/2010 02:11 PM, Reuti wrote:

Just to give my understanding of the problem:

Sorry, I am still trying to grok all your email as what the problem you
are trying to solve. So is the issue is trying to have two jobs having
processes on the same node be able to bind there processes on different
resources. Like core 1 for the first job and core 2 and 3 for the 2nd job?

--td

You can't get 2 slots on a machine, as it's limited by the core count to one 
here, so such a slot allocation shouldn't occur at all.

So to clarify, the current -binding <binding_strategy>:<binding_amount>
allocates binding_amount cores to each sge_shepherd process associated with a job_id.  
There appears to be only one sge_shepherd process per job_id per execution node, so all 
child processes run on these allocated cores.  This is irrespective of the number of slots 
allocated to the node.

I believe the above is correct.

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.

That might be correct, I've put in a question to someone who should know.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

Yes, that would get ugly.

Is it going to be worth looking at creating a patch for this?  I don't know 
much of the internals of SGE -- would it be hard work to do?  I've not that 
much time to dedicate towards it, but I could put some effort in if necessary...


Is the patch you're wanting for a "slot" binding_type?

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
Hi all,

> On 11/15/2010 02:11 PM, Reuti wrote: 
>> Just to give my understanding of the problem: 
>>> 
> Sorry, I am still trying to grok all your email as what the problem you 
> are trying to solve. So is the issue is trying to have two jobs having 
> processes on the same node be able to bind there processes on different 
> resources. Like core 1 for the first job and core 2 and 3 for the 2nd 
> job? 
> 
> --td 
> You can't get 2 slots on a machine, as it's limited by the core count to one 
> here, so such a slot allocation shouldn't occur at all. 

So to clarify, the current -binding <binding_strategy>:<binding_amount>
allocates binding_amount cores to each sge_shepherd process associated with a 
job_id.  There appears to be only one sge_shepherd process per job_id per 
execution node, so all child processes run on these allocated cores.  This is 
irrespective of the number of slots allocated to the node.  

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.
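
Under that alternative a submission might look like this (hypothetical syntax -
a 'slot' binding_type does not exist in SGE today):

    # hypothetical: bind as many cores on each host as slots granted there
    qsub -pe mpi 8 -binding pe slot myScript.com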

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

Is it going to be worth looking at creating a patch for this?  I don't know 
much of the internals of SGE -- would it be hard work to do?  I've not that 
much time to dedicate towards it, but I could put some effort in if necessary...

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Correction:

Am 15.11.2010 um 20:23 schrieb Terry Dontje:

> On 11/15/2010 02:11 PM, Reuti wrote:
>> Just to give my understanding of the problem:
>> 
>> Am 15.11.2010 um 19:57 schrieb Terry Dontje:
>> 
>> 
>>> On 11/15/2010 11:08 AM, Chris Jewell wrote:
>>> 
> Sorry, I am still trying to grok all your email as what the problem you 
> are trying to solve. So is the issue is trying to have two jobs having 
> processes on the same node be able to bind there processes on different 
> resources. Like core 1 for the first job and core 2 and 3 for the 2nd 
> job? 
> 
> --td 
> 
> 
 That's exactly it.  Each MPI process needs to be bound to 1 processor in a 
 way that reflects GE's slot allocation scheme.
 
 
 
>>> I actually don't think that I got it.  So you give two cases:
>>> 
>>> Case 1:
>>> $ qsub -pe mpi 8 -binding pe linear:1 myScript.com
>>> 
>>> and my pe_hostfile looks like:
>>> 
>>> exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
>>> 
>> Shouldn't here two cores be reserved for exec6 as it got two slots?
>> 
>> 
>> 
> That's what I was wondering.

You can't get 2 slots on a machine, as it's limited by the core count to one 
here, so such a slot allocation shouldn't occur at all.

==

If you want exactly N cores per machine, then also the allocation_rule should 
be set to N.
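
For example, a PE for the two-per-node case could look like this (a sketch of
`qconf -sp` output; every value except allocation_rule is just a plausible
default):

    pe_name            mpi
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    2
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

Together with a matching "-binding pe linear:2" at submission time this is the
fixed-allocation case that already works today.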

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti

Am 15.11.2010 um 20:23 schrieb Terry Dontje:

> 
>>> Is your complaint really the fact that exec6 has been allocated two slots 
>>> but there seems to only be one slot worth of resources allocated
>>> 
>> All are wrong except exec6. They should only get one core assigned.
>> 
>> 
> Huh?  I would have thought exec6 would get 4 cores and the rest are correct.

In my opinion it would be a violation of the granted slot count when you get 
more cores granted than slots. How should SGE deal with it for the jobs which 
are later on scheduled to such a machine: still 4 slots free, but all 8 cores 
already used up - what to do?!?

Hence the amount should be interpreted as a "reserve up to amount cores per 
machine", limited by the granted slot count per machine. So "-binding linear:4" 
would mean give me up to 4 cores per machine if possible.

- possibly only 3, when only 3 slots are granted on a machine

- you will never ever get more than 4 slots per machine, i.e. it's an upper 
limit for slots per machine for this particular job

-- Reuti


> 
> --td
> 
>> -- Reuti
>> 
>> 
>> 
>>> to it (ie in case one exec6 only has 1 core and case 2 it has two where 
>>> maybe you'd expect 2 and 4 cores allocated respectively)?
>>> 
>>> -- 
>>> 
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email 
>>> terry.don...@oracle.com
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> 
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> 
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje

On 11/15/2010 02:11 PM, Reuti wrote:

Just to give my understanding of the problem:

Am 15.11.2010 um 19:57 schrieb Terry Dontje:


On 11/15/2010 11:08 AM, Chris Jewell wrote:

Sorry, I am still trying to grok all your email as what the problem you
are trying to solve. So is the issue is trying to have two jobs having
processes on the same node be able to bind there processes on different
resources. Like core 1 for the first job and core 2 and 3 for the 2nd job?

--td


That's exactly it.  Each MPI process needs to be bound to 1 processor in a way 
that reflects GE's slot allocation scheme.



I actually don't think that I got it.  So you give two cases:

Case 1:
$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2
batch.q@exec6.cluster.stats.local
  0,1

Shouldn't two cores be reserved here for exec6, as it got two slots?



That's what I was wondering.

exec1.cluster.stats.local 1
batch.q@exec1.cluster.stats.local
  0,1
exec7.cluster.stats.local 1
batch.q@exec7.cluster.stats.local
  0,1
exec5.cluster.stats.local 1
batch.q@exec5.cluster.stats.local
  0,1
exec4.cluster.stats.local 1
batch.q@exec4.cluster.stats.local
  0,1
exec3.cluster.stats.local 1
batch.q@exec3.cluster.stats.local
  0,1
exec2.cluster.stats.local 1
batch.q@exec2.cluster.stats.local
  0,1


Case 2:
Notice that, because I have specified the -binding pe linear:1, each execution 
node binds processes for the job_id to one core.  If I have -binding pe 
linear:2, I get:

exec6.cluster.stats.local 2
batch.q@exec6.cluster.stats.local
  0,1:0,2
exec1.cluster.stats.local 1
batch.q@exec1.cluster.stats.local
  0,1:0,2
exec7.cluster.stats.local 1
batch.q@exec7.cluster.stats.local
  0,1:0,2
exec4.cluster.stats.local 1
batch.q@exec4.cluster.stats.local
  0,1:0,2
exec3.cluster.stats.local 1
batch.q@exec3.cluster.stats.local
  0,1:0,2
exec2.cluster.stats.local 1
batch.q@exec2.cluster.stats.local
  0,1:0,2
exec5.cluster.stats.local 1
batch.q@exec5.cluster.stats.local
  0,1:0,2

Is your complaint really the fact that exec6 has been allocated two slots but 
there seems to only be one slot worth of resources allocated

All are wrong except exec6. They should only get one core assigned.


Huh?  I would have thought exec6 would get 4 cores and the rest are correct.

--td


-- Reuti



to it (ie in case one exec6 only has 1 core and case 2 it has two where maybe 
you'd expect 2 and 4 cores allocated respectively)?

--

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Just to give my understanding of the problem:

On 15.11.2010 at 19:57, Terry Dontje wrote:

> On 11/15/2010 11:08 AM, Chris Jewell wrote:
>>> Sorry, I am still trying to grok all your email as to what problem you
>>> are trying to solve. So is the issue trying to have two jobs with
>>> processes on the same node be able to bind their processes to different
>>> resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?
>>> 
>>> --td 
>>> 
>> That's exactly it.  Each MPI process needs to be bound to 1 processor in a 
>> way that reflects GE's slot allocation scheme.
>> 
>> 
> I actually don't think that I got it.  So you give two cases:
> 
> Case 1:
> $ qsub -pe mpi 8 -binding pe linear:1 myScript.com
> 
> and my pe_hostfile looks like:
> 
> exec6.cluster.stats.local 2 
> batch.q@exec6.cluster.stats.local
>  0,1

Shouldn't two cores be reserved here for exec6, as it got two slots?


> exec1.cluster.stats.local 1 
> batch.q@exec1.cluster.stats.local
>  0,1
> exec7.cluster.stats.local 1 
> batch.q@exec7.cluster.stats.local
>  0,1
> exec5.cluster.stats.local 1 
> batch.q@exec5.cluster.stats.local
>  0,1
> exec4.cluster.stats.local 1 
> batch.q@exec4.cluster.stats.local
>  0,1
> exec3.cluster.stats.local 1 
> batch.q@exec3.cluster.stats.local
>  0,1
> exec2.cluster.stats.local 1 
> batch.q@exec2.cluster.stats.local
>  0,1
> 
> 
> Case 2:
> Notice that, because I have specified the -binding pe linear:1, each 
> execution node binds processes for the job_id to one core.  If I have 
> -binding pe linear:2, I get:
> 
> exec6.cluster.stats.local 2 
> batch.q@exec6.cluster.stats.local
>  0,1:0,2
> exec1.cluster.stats.local 1 
> batch.q@exec1.cluster.stats.local
>  0,1:0,2
> exec7.cluster.stats.local 1 
> batch.q@exec7.cluster.stats.local
>  0,1:0,2
> exec4.cluster.stats.local 1 
> batch.q@exec4.cluster.stats.local
>  0,1:0,2
> exec3.cluster.stats.local 1 
> batch.q@exec3.cluster.stats.local
>  0,1:0,2
> exec2.cluster.stats.local 1 
> batch.q@exec2.cluster.stats.local
>  0,1:0,2
> exec5.cluster.stats.local 1 
> batch.q@exec5.cluster.stats.local
>  0,1:0,2
> 
> Is your complaint really the fact that exec6 has been allocated two slots but 
> there seems to only be one slot worth of resources allocated

All are wrong except exec6. They should only get one core assigned.

-- Reuti


> to it (ie in case one exec6 only has 1 core and case 2 it has two where maybe 
> you'd expect 2 and 4 cores allocated respectively)?
> 
> -- 
> 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje

On 11/15/2010 11:08 AM, Chris Jewell wrote:

Sorry, I am still trying to grok all your email as to what problem you
are trying to solve. So is the issue trying to have two jobs with
processes on the same node be able to bind their processes to different
resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?

--td

That's exactly it.  Each MPI process needs to be bound to 1 processor in a way 
that reflects GE's slot allocation scheme.


I actually don't think that I got it.  So you give two cases:

Case 1:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1


Case 2:

Notice that, because I have specified the -binding pe linear:1, each execution 
node binds processes for the job_id to one core.  If I have -binding pe 
linear:2, I get:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2

Is your complaint really the fact that exec6 has been allocated two 
slots but there seems to only be one slot worth of resources allocated 
to it (ie in case one exec6 only has 1 core and case 2 it has two where 
maybe you'd expect 2 and 4 cores allocated respectively)?


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Hi,

On 15.11.2010 at 17:06, Chris Jewell wrote:

> Hi Ralph,
> 
> Thanks for the tip.  With the command
> 
> $ qsub -pe mpi 8 -binding linear:1 myScript.com
> 
> I get the output
> 
> [exec6:29172] System has detected external process binding to cores 0008
> [exec6:29172] ras:gridengine: JOB_ID: 59282
> [exec6:29172] ras:gridengine: PE_HOSTFILE: 
> /usr/sge/default/spool/exec6/active_jobs/59282.1/pe_hostfile
> [exec6:29172] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
> slots=2
> [exec6:29172] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec6:29172] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec6:29172] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec6:29172] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec6:29172] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> [exec6:29172] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
> slots=1
> 
> Presumably that means that OMPI is detecting the external binding okay.  If 
> so, then that confirms my problem as an issue with how GE sets the processor 
> affinity -- essentially the controlling sge_shepherd process  on each 
> physical exec node gets bound to the requested number of cores (in this case 
> 1) resulting in any child process (ie the ompi parallel processes) being 
> bound to the same core.  What we really need is for GE to set the binding on 
> each execution node according to the number of parallel processes that will 
> run there.  Not sure this is doable currently...

on SGE's side the problem could be that the local MPI processes on each slave
node are threads and don't invoke an additional `qrsh -inherit ...`. If you
have only one MPI process per node, does it work fine?

-- Reuti


> Cheers,
> 
> Chris
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> Sorry, I am still trying to grok all your email as to what problem you
> are trying to solve. So is the issue trying to have two jobs with
> processes on the same node be able to bind their processes to different
> resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?
> 
> --td 

That's exactly it.  Each MPI process needs to be bound to 1 processor in a way 
that reflects GE's slot allocation scheme.

C

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Ralph,

Thanks for the tip.  With the command

$ qsub -pe mpi 8 -binding linear:1 myScript.com

I get the output

[exec6:29172] System has detected external process binding to cores 0008
[exec6:29172] ras:gridengine: JOB_ID: 59282
[exec6:29172] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec6/active_jobs/59282.1/pe_hostfile
[exec6:29172] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec6:29172] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1

Presumably that means that OMPI is detecting the external binding okay.  If so, 
then that confirms my problem as an issue with how GE sets the processor 
affinity -- essentially the controlling sge_shepherd process  on each physical 
exec node gets bound to the requested number of cores (in this case 1) 
resulting in any child process (ie the ompi parallel processes) being bound to 
the same core.  What we really need is for GE to set the binding on each 
execution node according to the number of parallel processes that will run 
there.  Not sure this is doable currently...
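
A quick way to check what the shepherd's children actually inherited is to read
the mask back from the kernel (a sketch; Cpus_allowed_list needs a reasonably
recent kernel, and taskset comes from util-linux):

# started under mpirun, each rank reports the cores it is allowed to use
mpirun sh -c 'echo "`hostname` pid $$: `grep Cpus_allowed_list /proc/self/status`"'

# or, for an already running process with a known PID
taskset -cp <pid>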

Cheers,

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Ralph Castain
The external binding code should be in that version.

If you add --report-bindings --leave-session-attached to the mpirun command
line, you should see output from each daemon telling you what external
binding it detected, and how it is binding each app it launches.

Thanks!


On Mon, Nov 15, 2010 at 8:33 AM, Chris Jewell wrote:

> > I confess I am now confused. What version of OMPI are you using?
> >
> > FWIW: OMPI was updated at some point to detect the actual cores of an
> > external binding, and abide by them. If we aren't doing that, then we
> have a
> > bug that needs to be resolved. Or it could be you are using a version
> that
> > predates the change.
> >
> > Thanks
> > Ralph
>
> Hi Ralph,
>
> I'm using OMPI version 1.4.2.  I can upgrade and try it out if necessary.
>  Is there anything I can give you as potential debug material?
>
> Cheers,
>
> Chris
>
>
>
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> I confess I am now confused. What version of OMPI are you using? 
> 
> FWIW: OMPI was updated at some point to detect the actual cores of an 
> external binding, and abide by them. If we aren't doing that, then we have a 
> bug that needs to be resolved. Or it could be you are using a version that 
> predates the change. 
> 
> Thanks 
> Ralph

Hi Ralph,

I'm using OMPI version 1.4.2.  I can upgrade and try it out if necessary.  Is 
there anything I can give you as potential debug material?

Cheers,

Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Terry Dontje
Sorry, I am still trying to grok all your email as to what problem you
are trying to solve.  So is the issue trying to have two jobs with
processes on the same node be able to bind their processes to different
resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?


--td

On 11/15/2010 09:29 AM, Chris Jewell wrote:

Hi,


If, indeed, it is not possible currently to implement this type of core-binding 
in tightly integrated OpenMPI/GE, then a solution might lie in a custom script 
run in the parallel environment's 'start proc args'. This script would have to 
find out which slots are allocated where on the cluster, and write an OpenMPI 
rankfile.

Exactly this should work.

If you use "binding_instance" "pe" and reformat the information in the $PE_HOSTFILE to a 
"rankfile", it should work to get the desired allocation. Maybe you can share the script with this 
list once you got it working.


As far as I can see, that's not going to work.  This is because, exactly like 
"binding_instance" "set", for -binding pe linear:n you get n cores bound per 
node.  This is easily verifiable by using a long job and examining the pe_hostfile.  For example, I 
submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1

Notice that, because I have specified the -binding pe linear:1, each execution 
node binds processes for the job_id to one core.  If I have -binding pe 
linear:2, I get:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't give an accurate representation of the binding 
allocation for use by OpenMPI.  Question: is there a system file or command that I could 
use to check which processors are "occupied"?

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778






___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
On 15.11.2010 at 15:29, Chris Jewell wrote:

> Hi,
> 
>>> If, indeed, it is not possible currently to implement this type of 
>>> core-binding in tightly integrated OpenMPI/GE, then a solution might lie in 
>>> a custom script run in the parallel environment's 'start proc args'. This 
>>> script would have to find out which slots are allocated where on the 
>>> cluster, and write an OpenMPI rankfile. 
>> 
>> Exactly this should work. 
>> 
>> If you use "binding_instance" "pe" and reformat the information in the 
>> $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. 
>> Maybe you can share the script with this list once you got it working. 
> 
> 
> As far as I can see, that's not going to work.  This is because, exactly like 
> "binding_instance" "set", for -binding pe linear:n you get n cores bound per 
> node.  This is easily verifiable by using a long job and examining the 
> pe_hostfile.  For example, I submit a job with:
> 
> $ qsub -pe mpi 8 -binding pe linear:1 myScript.com
> 
> and my pe_hostfile looks like:
> 
> exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
> exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
> exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
> exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
> exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
> exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
> exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1
> 
> Notice that, because I have specified the -binding pe linear:1, each 
> execution node binds processes for the job_id to one core.  If I have 
> -binding pe linear:2, I get:
> 
> exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2

So the cores 1 and 2 on socket 0 aren't free?

-- Reuti


> exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
> exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
> exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
> exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
> exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
> exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2
> 
> So the pe_hostfile still doesn't give an accurate representation of the 
> binding allocation for use by OpenMPI.  Question: is there a system file or 
> command that I could use to check which processors are "occupied"?
> 
> Chris
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi,

> > If, indeed, it is not possible currently to implement this type of 
> > core-binding in tightly integrated OpenMPI/GE, then a solution might lie in 
> > a custom script run in the parallel environment's 'start proc args'. This 
> > script would have to find out which slots are allocated where on the 
> > cluster, and write an OpenMPI rankfile. 
> 
> Exactly this should work. 
> 
> If you use "binding_instance" "pe" and reformat the information in the 
> $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. 
> Maybe you can share the script with this list once you got it working. 


As far as I can see, that's not going to work.  This is because, exactly like 
"binding_instance" "set", for -binding pe linear:n you get n cores bound per 
node.  This is easily verifiable by using a long job and examining the 
pe_hostfile.  For example, I submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1

Notice that, because I have specified the -binding pe linear:1, each execution 
node binds processes for the job_id to one core.  If I have -binding pe 
linear:2, I get:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't give an accurate representation of the binding 
allocation for use by OpenMPI.  Question: is there a system file or command 
that I could use to check which processors are "occupied"?

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Ralph Castain
I confess I am now confused. What version of OMPI are you using?

FWIW: OMPI was updated at some point to detect the actual cores of an
external binding, and abide by them. If we aren't doing that, then we have a
bug that needs to be resolved. Or it could be you are using a version that
predates the change.

Thanks
Ralph


On Mon, Nov 15, 2010 at 5:38 AM, Reuti  wrote:

> Hi,
>
> On 15.11.2010 at 13:13, Chris Jewell wrote:
>
> > Okay so I tried what you suggested.  You essentially get the requested
> number of bound cores on each execution node, so if I use
> >
> > $ qsub -pe openmpi 8 -binding linear:2 
> >
> > then I get 2 bound cores per node, irrespective of the number of slots
> (and hence parallel processes) allocated by GE. This is irrespective of
> which setting I use for the allocation_rule.
>
> but it should work fine with an "allocation_rule 2" then.
>
>
> > My aim with this was to deal with badly behaved multithreaded algorithms
>
> Yep, this sometimes causes the overloading of a machine. When I know that I
> want to compile a parallel Open MPI application, I use non-threaded versions
> of ATLAS, MKL or other libraries.
>
>
> > which end up spreading across more cores on an execution node than the
> number of GE-allocated slots (thereby interfering with other GE scheduled
> tasks running on the same exec node).  By binding a process to one or more
> cores, one can "box in" processes and prevent them from spawning erroneous
> sub-processes and threads.  Unfortunately, the above solution sets the same
> core binding on every execution node.
> >
> >> From exploring the software (both OpenMPI and GE) further, I have two
> comments:
> >
> > 1) The core binding feature in GE appears to apply the requested
> core-binding topology to every execution node involved in a parallel job,
> rather than assuming that the topology requested is *per parallel process*.
>  So, if I request 'qsub -pe mpi 8 -binding linear:1 ' with
> the intention of getting each of the 8 parallel processes to be bound to 1
> core, I actually get all processes associated with the job_id on one exec
> node bound to 1 core.  Oops!
> >
> > 2) OpenMPI has its own core-binding feature (-mca mpi_paffinity_alone 1)
> which works well to bind each parallel process to one processor.
>  Unfortunately, the binding framework (hwloc) is different to that which GE
> uses (PLPA), resulting in binding overlaps between GE-bound tasks (eg serial
> and smp jobs) and OpenMPI-bound processes (ie my mpi jobs).  Again, oops ;-)
>
> > If, indeed, it is not possible currently to implement this type of
> core-binding in tightly integrated OpenMPI/GE, then a solution might lie in
> a custom script run in the parallel environment's 'start proc args'.  This
> script would have to find out which slots are allocated where on the
> cluster, and write an OpenMPI rankfile.
>
> Exactly this should work.
>
> If you use "binding_instance" "pe" and reformat the information in the
> $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation.
> Maybe you can share the script with this list once you got it working.
>
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Reuti
Hi,

On 15.11.2010 at 13:13, Chris Jewell wrote:

> Okay so I tried what you suggested.  You essentially get the requested number 
> of bound cores on each execution node, so if I use
> 
> $ qsub -pe openmpi 8 -binding linear:2 
> 
> then I get 2 bound cores per node, irrespective of the number of slots (and
> hence parallel processes) allocated by GE.  This is irrespective of which
> setting I use for the allocation_rule.

but it should work fine with an "allocation_rule 2" then.


> My aim with this was to deal with badly behaved multithreaded algorithms

Yep, this sometimes causes the overloading of a machine. When I know that I
want to compile a parallel Open MPI application, I use non-threaded versions of
ATLAS, MKL or other libraries.


> which end up spreading across more cores on an execution node than the number 
> of GE-allocated slots (thereby interfering with other GE scheduled tasks 
> running on the same exec node).  By binding a process to one or more cores, 
> one can "box in" processes and prevent them from spawning erroneous 
> sub-processes and threads.  Unfortunately, the above solution sets the same 
> core binding on every execution node.
> 
>> From exploring the software (both OpenMPI and GE) further, I have two 
>> comments:
> 
> 1) The core binding feature in GE appears to apply the requested core-binding 
> topology to every execution node involved in a parallel job, rather than 
> assuming that the topology requested is *per parallel process*.  So, if I 
> request 'qsub -pe mpi 8 -binding linear:1 ' with the intention 
> of getting each of the 8 parallel processes to be bound to 1 core, I actually 
> get all processes associated with the job_id on one exec node bound to 1 
> core.  Oops!
> 
> 2) OpenMPI has its own core-binding feature (-mca mpi_paffinity_alone 1) 
> which works well to bind each parallel process to one processor.  
> Unfortunately, the binding framework (hwloc) is different to that which GE 
> uses (PLPA), resulting in binding overlaps between GE-bound tasks (eg serial 
> and smp jobs) and OpenMPI-bound processes (ie my mpi jobs).  Again, oops ;-)

> If, indeed, it is not possible currently to implement this type of 
> core-binding in tightly integrated OpenMPI/GE, then a solution might lie in a 
> custom script run in the parallel environment's 'start proc args'.  This 
> script would have to find out which slots are allocated where on the cluster, 
> and write an OpenMPI rankfile.

Exactly this should work.

If you use "binding_instance" "pe" and reformat the information in the 
$PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. 
Maybe you can share the script with this list once you got it working.
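
A minimal sketch of such a start_proc_args script, assuming the four-column
pe_hostfile layout shown elsewhere in this thread (host, slots, queue,
":"-separated socket,core pairs) and numbering the ranks in file order:

#!/bin/sh
# turn $PE_HOSTFILE into an Open MPI rankfile, one rank per granted slot
RANKFILE="$TMPDIR/rankfile"
rank=0
while read host nslots queue cores rest; do
    n=0
    for pair in $(echo "$cores" | tr ':' ' '); do   # "0,1:0,2" -> "0,1" "0,2"
        [ "$n" -ge "$nslots" ] && break             # never exceed the slot count
        socket=${pair%,*}; core=${pair#*,}
        echo "rank $rank=$host slot=$socket:$core" >> "$RANKFILE"
        rank=$((rank + 1)); n=$((n + 1))
    done
done < "$PE_HOSTFILE"

The job script would then start the application with something like
`mpirun -rf $TMPDIR/rankfile ./myProg`; whether the "rank N=host slot=socket:core"
syntax matches your Open MPI version is worth checking against its rankfile
documentation.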

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Reuti,

Okay so I tried what you suggested.  You essentially get the requested number 
of bound cores on each execution node, so if I use

$ qsub -pe openmpi 8 -binding linear:2 

then I get 2 bound cores per node, irrespective of the number of slots (and
hence parallel processes) allocated by GE.  This is irrespective of which
setting I use for the allocation_rule.

My aim with this was to deal with badly behaved multithreaded algorithms which 
end up spreading across more cores on an execution node than the number of 
GE-allocated slots (thereby interfering with other GE scheduled tasks running 
on the same exec node).  By binding a process to one or more cores, one can 
"box in" processes and prevent them from spawning erroneous sub-processes and 
threads.  Unfortunately, the above solution sets the same core binding on
every execution node.

From exploring the software (both OpenMPI and GE) further, I have two comments:

1) The core binding feature in GE appears to apply the requested core-binding 
topology to every execution node involved in a parallel job, rather than 
assuming that the topology requested is *per parallel process*.  So, if I 
request 'qsub -pe mpi 8 -binding linear:1 ' with the intention of 
getting each of the 8 parallel processes to be bound to 1 core, I actually get 
all processes associated with the job_id on one exec node bound to 1 core.  
Oops!

2) OpenMPI has its own core-binding feature (-mca mpi_paffinity_alone 1) which 
works well to bind each parallel process to one processor.  Unfortunately, the 
binding framework (hwloc) is different to that which GE uses (PLPA), resulting 
in binding overlaps between GE-bound tasks (eg serial and smp jobs) and 
OpenMPI-bound processes (ie my mpi jobs).  Again, oops ;-)
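
(For reference, that kind of invocation would be something like
`mpirun -mca mpi_paffinity_alone 1 ./myMpiProg`, with the program name only a
placeholder - i.e. Open MPI picks and binds the cores itself, independently of
whatever GE has reserved for the job.)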


If, indeed, it is not possible currently to implement this type of core-binding 
in tightly integrated OpenMPI/GE, then a solution might lie in a custom script 
run in the parallel environment's 'start proc args'.  This script would have to 
find out which slots are allocated where on the cluster, and write an OpenMPI 
rankfile.

Any thoughts on that?

Cheers,

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-13 Thread Chris Jewell
Hi Dave, Reuti,

Sorry for kicking off this thread, and then disappearing.  I've been away for a 
bit.  Anyway, Dave, I'm glad you experienced the same issue as I had with my 
installation of SGE 6.2u5 and OpenMPI with core binding -- namely that with 
'qsub -pe openmpi 8 -binding set linear:1 ', if two or more of 
the parallel processes get scheduled to the same execution node, then the 
processes end up being bound to the same core.  Not good!

I've been playing around quite a bit trying to understand this issue, and ended 
up on the GE dev list:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39=285878

It seems that most people expect that calls to 'qrsh -inherit' (which I assume
OpenMPI uses to bind parallel processes to reserved GE slots) activate a
separate binding.  This does not appear to be the case.  I *was* hoping that
using -binding pe linear:1 might enable me to write a script that read the 
pe_hostfile and created a machine file for OpenMPI, but this fails as GE does 
not appear to give information as to which cores are unbound, only the number 
required.

So, for now, my solution has been to use a JSV to remove core binding for the 
MPI jobs (but retain it for serial and SMP jobs).  Any more ideas??

Cheers,

Chris

(PS. Dave: how is my alma mater these days??)
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-14 Thread Reuti
Hi,

On 14.10.2010 at 13:23, Dave Love wrote:

> Reuti  writes:
> 
>> With the default binding_instance set to "set" (the default) the
>> shepherd should bind the processes to cores already. With other types
>> of binding_instance these selected cores must be forwarded to the
>> application via an environment variable or in the hostfile.
> 
> My question was specifically about SGE/OMPI tight integration; are you
> actually doing binding successfully with that?  I think I read here that
> the integration doesn't (yet?) deal with SGE core binding, and when we
> turned on the SGE feature we got the OMPI tasks piled onto a single
> core.  We quickly turned it off for MPI jobs when we realized what was
> happening, and I didn't try to investigate further.

what did you request in particular in `qsub -binding`? When you request `qsub
-pe openmpi 2 -binding linear:1 ...` it would apply the core assignment per
`qrsh`. That means, when you are staying on one machine only (because of
"$pe_slots" for "allocation_rule"), you would indeed oversubscribe the core, as
Open MPI will then use threads (hence "-binding linear:2" should do in this
case). But if the "allocation_rule" is set to the integer value "1" and you are
sure to get a core on another machine, then "linear:1" would be fine. Similarly
`qsub -pe openmpi 4 -binding linear:2 ...` when you have an "allocation_rule"
of "2".

If in a similar scenario you get 4 cores on one and the same machine and SGE
creates a cpuset of 4 cores, these 4 threads can nevertheless be scheduled to
any granted core by the Linux kernel scheduler. It would be necessary to use
another binding_instance, "env" or "pe", to get the information about the
granted cores into the jobscript/hostfile and decide on your own how to forward
this to Open MPI, so that each thread is also bound to a unique core and
doesn't drift around the cores in the cpuset.

-- Reuti


>> As this is only a hint to SGE and not a hard request, the user must
>> plan a little bit the allocation beforehand. Especially if you
>> oversubscribe a machine it won't work. 
> 
> [It is documented that the binding isn't applied if the selected cores
> are occupied.]
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-14 Thread Dave Love
Reuti  writes:

> With the default binding_instance set to "set" (the default) the
> shepherd should bind the processes to cores already. With other types
> of binding_instance these selected cores must be forwarded to the
> application via an environment variable or in the hostfile.

My question was specifically about SGE/OMPI tight integration; are you
actually doing binding successfully with that?  I think I read here that
the integration doesn't (yet?) deal with SGE core binding, and when we
turned on the SGE feature we got the OMPI tasks piled onto a single
core.  We quickly turned it off for MPI jobs when we realized what was
happening, and I didn't try to investigate further.

> As this is only a hint to SGE and not a hard request, the user must
> plan a little bit the allocation beforehand. Especially if you
> oversubscribe a machine it won't work. 

[It is documented that the binding isn't applied if the selected cores
are occupied.]



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-12 Thread Reuti
On 12.10.2010 at 15:49, Dave Love wrote:

> Chris Jewell  writes:
> 
>> I've scrapped this system now in favour of the new SGE core binding feature.
> 
> How does that work, exactly?  I thought the OMPI SGE integration didn't
> support core binding, but good if it does.

With the default binding_instance set to "set" (the default) the shepherd 
should bind the processes to cores already. With other types of 
binding_instance these selected cores must be forwarded to the application via an
environment variable or in the hostfile.

As this is only a hint to SGE and not a hard request, the user must plan a 
little bit the allocation beforehand. Especially if you oversubscribe a machine 
it won't work. When I look at /proc/*/status it's mentioned there as it 
happened. And it's also noted in the "config" file of each job's
.../active_jobs/... directory. E.g. a top shows:

 9926 ms04  39  19  3756  292  228 R   25  0.0   0:19.31 ever
 9927 ms04  39  19  3756  292  228 R   25  0.0   0:19.31 ever
 9925 ms04  39  19  3756  288  228 R   25  0.0   0:19.30 ever
 9928 ms04  39  19  3756  292  228 R   25  0.0   0:19.30 ever

for 4 forks of an endless loop in one and the same jobscript when submitted 
with `qsub -binding linear:1 demo.sh`. Well, the funny thing is that with this 
kernel version I still get a load of 4, despite the fact that all 4 forks are 
bound to one core. Should it really be four?
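
A sketch of how to look at that from the command line (assuming procps' pgrep
and a kernel that exposes Cpus_allowed_list in /proc):

# allowed-core list of every sge_shepherd on this node; the job's child
# processes inherit this mask unless something rebinds them
for pid in $(pgrep sge_shepherd); do
    echo "shepherd $pid:" $(grep Cpus_allowed_list /proc/$pid/status)
done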

-- Reuti

> __
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-12 Thread Dave Love
Chris Jewell  writes:

> I've scrapped this system now in favour of the new SGE core binding feature.

How does that work, exactly?  I thought the OMPI SGE integration didn't
support core binding, but good if it does.



Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-05 Thread Chris Jewell

> 
> It looks to me like your remote nodes aren't finding the orted executable. I 
> suspect the problem is that you need to forward the path and ld_library_path 
> to the remote nodes. Use the mpirun -x option to do so.


Hi, problem sorted.  It was actually caused by the system I currently use to 
create Linux cpusets on the execution nodes.  Grid Engine was trying to execv 
on the slave nodes, and not supplying an executable to run, since this is 
deferred to OpenMPI.  I've scrapped this system now in favour of the new SGE 
core binding feature.

Thanks, sorry to waste people's time!

Chris








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Ralph Castain
It looks to me like your remote nodes aren't finding the orted executable. I 
suspect the problem is that you need to forward the path and ld_library_path 
to the remote nodes. Use the mpirun -x option to do so.
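
For example (a sketch; `hostname` just stands in for the real binary):

mpirun -x PATH -x LD_LIBRARY_PATH hostname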


On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
> 
> Firstly, hello to the mailing list for the first time!  Secondly, sorry for 
> the non-descript subject line, but I couldn't really think how to be more 
> specific!  
> 
> Anyway, I am currently having a problem getting OpenMPI to work within my 
> installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and 
> installed under /usr/local/packages/openmpi-1.4.2.  Software on my system is 
> controlled by the Modules framework which adds the bin and lib directories to 
> PATH and LD_LIBRARY_PATH respectively when a user is connected to an 
> execution node.  I configured a parallel environment in which OpenMPI is to 
> be used: 
> 
> pe_name            mpi
> slots              16
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> I then tried a simple job submission script:
> 
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
> 
> If the parallel environment runs within one execution host (8 slots per 
> host), then all is fine.  However, if scheduled across  several nodes, I get 
> an error:
> 
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
> 
> 
> I'm at a loss on how to start debugging this, and I don't seem to be getting 
> anything useful using the mpirun '-d' and '-v' switches.  SGE logs don't note 
> anything.  Can anyone suggest either what is wrong, or how I might progress 
> with getting more information?
> 
> Many thanks,
> 
> 
> Chris
> 
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Chris Jewell
Hi all,

Firstly, hello to the mailing list for the first time!  Secondly, sorry for the 
non-descript subject line, but I couldn't really think how to be more specific! 
 

Anyway, I am currently having a problem getting OpenMPI to work within my 
installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and installed 
under /usr/local/packages/openmpi-1.4.2.  Software on my system is controlled 
by the Modules framework which adds the bin and lib directories to PATH and 
LD_LIBRARY_PATH respectively when a user is connected to an execution node.  I 
configured a parallel environment in which OpenMPI is to be used: 

pe_name            mpi
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

I then tried a simple job submission script:

#!/bin/bash
#
#$ -S /bin/bash
. /etc/profile
module add ompi gcc
mpirun hostname

If the parallel environment runs within one execution host (8 slots per host), 
then all is fine.  However, if scheduled across  several nodes, I get an error:

execv: No such file or directory
execv: No such file or directory
execv: No such file or directory
--
A daemon (pid 1629) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


I'm at a loss on how to start debugging this, and I don't seem to be getting 
anything useful using the mpirun '-d' and '-v' switches.  SGE logs don't note 
anything.  Can anyone suggest either what is wrong, or how I might progress 
with getting more information?

Many thanks,


Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778