Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
Certainly, I reached out to several contacts I have inside QLogic (I
used to work there)...

On Fri, Apr 29, 2011 at 10:30 AM, Ralph Castain  wrote:
> Hi Michael
>
> I'm told that the Qlogic contacts we used to have are no longer there. Since 
> you obviously are a customer, can you ping them and ask (a) what that error 
> message means, and (b) what's wrong with the values I computed?
>
> You can also just send them my way, if that would help. We just need someone 
> to explain the requirements on that precondition value.
>
> Thanks
> Ralph


Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico
 wrote:
> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain  wrote:
>> Hi Michael
>>
>> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd 
>> the envar after adding it to the environ :-/
>
> The patch works great, i can now see the precondition environment
> variable if i do
>
> mpirun -n 2 -host node1 
>
> and my  runs just fine, However if i do
>
> srun --resv-ports -n 2 -w node1 
>
> I get
>
> [node1:16780] PSM EP connect error (unknown connect error):
> [node1:16780]  node1
> [node1:16780] PSM EP connect error (Endpoint could not be reached):
> [node1:16780]  node1
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
>
> I did notice a difference in the precondition env variable between the two 
> runs
>
> mpirun -n 2 -host node1 
>
> sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which
> changes with each run (aka random))
>
> srun --resv-ports -n 2 -w node1 

This should have been "srun --resv-ports -n 1 -w node1 "; I
can't run a 2-rank job, I get the PML error above.

>
> sets precondition_transports=1845-0001 (which
> doesn't seem to change run to run)
>



Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain  wrote:
> Hi Michael
>
> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd 
> the envar after adding it to the environ :-/

The patch works great; I can now see the precondition environment
variable if I do

mpirun -n 2 -host node1 

and my  runs just fine. However, if I do

srun --resv-ports -n 2 -w node1 

I get

[node1:16780] PSM EP connect error (unknown connect error):
[node1:16780]  node1
[node1:16780] PSM EP connect error (Endpoint could not be reached):
[node1:16780]  node1

PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

I did notice a difference in the precondition env variable between the two runs

mpirun -n 2 -host node1 

sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which
changes with each run, i.e., it's random)

srun --resv-ports -n 2 -w node1 

sets precondition_transports=1845-0001 (which
doesn't seem to change from run to run)
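For what it's worth, a quick way to compare what each launcher actually hands
the ranks is to run env itself under both and grep for the variable (just a
diagnostic sketch reusing the commands above):

$ mpirun -n 1 -host node1 env | grep precondition_transports
$ srun --resv-ports -n 1 -w node1 env | grep precondition_transports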



Re: [OMPI users] srun and openmpi

2011-04-29 Thread Ralph Castain
Hi Michael

Please see the attached updated patch to try for 1.5.3. I mistakenly free'd the 
envar after adding it to the environ :-/

Thanks
Ralph



slurmd.diff
Description: Binary data

On Apr 28, 2011, at 2:31 PM, Michael Di Domenico wrote:

> On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain  wrote:
>> 
>> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
 
 On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
 
> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  
>>> wrote:
 
 On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
 
> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?
 
 Not that I know of - I don't think the PSM developers ever looked at 
 it.
 
 Thought about this some more and I believe I have a soln to the problem. 
 Will try to commit something to the devel trunk by the end of the week.
>>> 
>>> Thanks
>> 
>> Just to save me looking back thru the thread - what OMPI version are you 
>> using? If it isn't the trunk, I'll send you a patch you can use.
> 
> I'm using OpenMPI v1.5.3 currently



Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain  wrote:
>
> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>>
 On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>>
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>>
>>> Thought about this some more and I believe I have a soln to the problem. 
>>> Will try to commit something to the devel trunk by the end of the week.
>>
>> Thanks
>
> Just to save me looking back thru the thread - what OMPI version are you 
> using? If it isn't the trunk, I'll send you a patch you can use.

I'm using OpenMPI v1.5.3 currently


Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain
Per earlier in the thread, it looks like you are using a 1.5 series release - 
so here is a patch that -should- fix the PSM setup problem.

Please let me know if/how it works as I honestly have no way of testing it.
Ralph



slurmd.diff
Description: Binary data
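In case it helps, applying it against a 1.5.3 tree usually amounts to something
like this (a sketch only; the patch level depends on how the diff was generated,
so adjust -p if patch complains):

$ cd openmpi-1.5.3
$ patch -p0 < /path/to/slurmd.diff
$ make all install    # rebuild and reinstall over the existing prefix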


On Apr 28, 2011, at 7:03 AM, Ralph Castain wrote:

> 
> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
> 
>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>>> 
>>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>> 
 On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
> 
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
> 
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>> 
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>> 
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>> 
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>> 
>>> Thought about this some more and I believe I have a soln to the problem. 
>>> Will try to commit something to the devel trunk by the end of the week.
>> 
>> Thanks
> 
> Just to save me looking back thru the thread - what OMPI version are you 
> using? If it isn't the trunk, I'll send you a patch you can use.
> 



Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain

On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
 
 On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
 
> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>> 
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>> does?
>> 
>> Not that I know of - I don't think the PSM developers ever looked at it.
>> 
>> Thought about this some more and I believe I have a soln to the problem. 
>> Will try to commit something to the devel trunk by the end of the week.
> 
> Thanks

Just to save me looking back thru the thread - what OMPI version are you using? 
If it isn't the trunk, I'll send you a patch you can use.





Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>>>
 On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>
>> Was this ever committed to the OMPI src as something not having to be
>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>> does?
>
> Not that I know of - I don't think the PSM developers ever looked at it.
>
> Thought about this some more and I believe I have a soln to the problem. Will 
> try to commit something to the devel trunk by the end of the week.

Thanks


Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain

On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
 
 On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
 
> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?
 
 Not that I know of - I don't think the PSM developers ever looked at it.

I thought about this some more and I believe I have a solution to the problem.
I will try to commit something to the devel trunk by the end of the week.

Ralph


 
> 
> I'm having some trouble getting Slurm/OpenMPI to play nice with the
> setup of this key.  Namely, with slurm you cannot export variables
> from the --prolog of an srun, only from an --task-prolog,
> unfortunately, if you use a task-prolog each rank gets a different
> key, which doesn't work.
> 
> I'm also guessing that each unique mpirun needs it's own psm key, not
> one for the whole system, so i can't just make it a permanent
> parameter somewhere else.
> 
> Also, i recall reading somewhere that the --resv-ports parameter that
> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
> tries to lock a port from the pool three times before giving up.
 
 Had to look back at the code - I think you misread this. I can find no 
 evidence in the code that we try to bind that port more than once.
>>> 
>>> Perhaps i misstated, i don't believe you're trying to bind to the same
>>> port twice during the same session.  i believe the code re-uses
>>> similar ports from session to session.  what i believe happens (but
>>> could be totally wrong) the previous session releases the port, but
>>> linux isn't quite done with it when the new session tries to bind to
>>> the port, in which case it tries three times and then fails the job
>> 
>> Actually, I understood you correctly. I'm just saying that I find no 
>> evidence in the code that we try three times before giving up. What I see is 
>> a single attempt to bind the port - if it fails, then we abort. There is no 
>> parameter to control that behavior.
>> 
>> So if the OS hasn't released the port by the time a new job starts on that 
>> node, then it will indeed abort if the job was unfortunately given the same 
>> port reservation.
> 
> Oh, okay, sorry...
> 




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 3:39 PM, Ralph Castain wrote:

> Nope, nope nope...in this mode of operation, we are using -static- ports.

Er.. right.  Sorry -- my bad for not reading the full context here... ignore 
what I said...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote:

> On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
> 
>> Actually, I understood you correctly. I'm just saying that I find no 
>> evidence in the code that we try three times before giving up. What I see is 
>> a single attempt to bind the port - if it fails, then we abort. There is no 
>> parameter to control that behavior.
>> 
>> So if the OS hasn't released the port by the time a new job starts on that 
>> node, then it will indeed abort if the job was unfortunately given the same 
>> port reservation.
> 
> FWIW, the OS may be trying multiple times under the covers, but from as far 
> as OMPI is concerned, we're just trying once.
> 
> OMPI asks for whatever port the OS has open (i.e., we pass in 0 when asking 
> for a specific port number, and the OS fills it in for us).  If it gives us 
> back a port that isn't actually available, that would be really surprising.

Nope, nope nope...in this mode of operation, we are using -static- ports.

The problem here is that srun is incorrectly handing out the same port 
reservation to the next job, causing the port binding to fail because the last 
job's binding hasn't yet timed out.
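One way to confirm that on a node is to check whether the reserved ports are
still sitting in TIME-WAIT when the next job starts (a diagnostic sketch; the
actual range is whatever MpiParams in slurm.conf reserves):

$ scontrol show config | grep MpiParams    # shows the reserved range, e.g. ports=12000-12999
$ ss -tan | grep TIME-WAIT                 # lingering bindings from the previous job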


> 
> If you have a bajiollion short jobs running, I wonder if there's some kind of 
> race condition occurring that some MPI processes are getting messages from 
> the wrong mpirun.  And then things go downhill from there.  
> 
> I can't immediately imagine how that would happen, but maybe there's some 
> kind of weird race condition in there somewhere...?  We pass specific IP 
> addresses and ports around on the command line, though, so I don't quite see 
> how that would happen...
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:

> Actually, I understood you correctly. I'm just saying that I find no evidence 
> in the code that we try three times before giving up. What I see is a single 
> attempt to bind the port - if it fails, then we abort. There is no parameter 
> to control that behavior.
> 
> So if the OS hasn't released the port by the time a new job starts on that 
> node, then it will indeed abort if the job was unfortunately given the same 
> port reservation.

FWIW, the OS may be trying multiple times under the covers, but as far as
OMPI is concerned, we're just trying once.

OMPI asks for whatever port the OS has open (i.e., we pass in 0 when asking for 
a specific port number, and the OS fills it in for us).  If it gives us back a 
port that isn't actually available, that would be really surprising.

If you have a bajillion short jobs running, I wonder if there's some kind of
race condition occurring where some MPI processes are getting messages from the
wrong mpirun. And then things go downhill from there.

I can't immediately imagine how that would happen, but maybe there's some kind 
of weird race condition in there somewhere...?  We pass specific IP addresses 
and ports around on the command line, though, so I don't quite see how that 
would happen...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>>
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>>

 I'm having some trouble getting Slurm/OpenMPI to play nice with the
 setup of this key.  Namely, with slurm you cannot export variables
 from the --prolog of an srun, only from an --task-prolog,
 unfortunately, if you use a task-prolog each rank gets a different
 key, which doesn't work.

 I'm also guessing that each unique mpirun needs it's own psm key, not
 one for the whole system, so i can't just make it a permanent
 parameter somewhere else.

 Also, i recall reading somewhere that the --resv-ports parameter that
 OMPI uses from slurm to choose a list of ports to use for TCP comm's,
 tries to lock a port from the pool three times before giving up.
>>>
>>> Had to look back at the code - I think you misread this. I can find no 
>>> evidence in the code that we try to bind that port more than once.
>>
>> Perhaps i misstated, i don't believe you're trying to bind to the same
>> port twice during the same session.  i believe the code re-uses
>> similar ports from session to session.  what i believe happens (but
>> could be totally wrong) the previous session releases the port, but
>> linux isn't quite done with it when the new session tries to bind to
>> the port, in which case it tries three times and then fails the job
>
> Actually, I understood you correctly. I'm just saying that I find no evidence 
> in the code that we try three times before giving up. What I see is a single 
> attempt to bind the port - if it fails, then we abort. There is no parameter 
> to control that behavior.
>
> So if the OS hasn't released the port by the time a new job starts on that 
> node, then it will indeed abort if the job was unfortunately given the same 
> port reservation.

Oh, okay, sorry...



Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>> 
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>> does?
>> 
>> Not that I know of - I don't think the PSM developers ever looked at it.
>> 
>>> 
>>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>>> setup of this key.  Namely, with slurm you cannot export variables
>>> from the --prolog of an srun, only from an --task-prolog,
>>> unfortunately, if you use a task-prolog each rank gets a different
>>> key, which doesn't work.
>>> 
>>> I'm also guessing that each unique mpirun needs it's own psm key, not
>>> one for the whole system, so i can't just make it a permanent
>>> parameter somewhere else.
>>> 
>>> Also, i recall reading somewhere that the --resv-ports parameter that
>>> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
>>> tries to lock a port from the pool three times before giving up.
>> 
>> Had to look back at the code - I think you misread this. I can find no 
>> evidence in the code that we try to bind that port more than once.
> 
> Perhaps i misstated, i don't believe you're trying to bind to the same
> port twice during the same session.  i believe the code re-uses
> similar ports from session to session.  what i believe happens (but
> could be totally wrong) the previous session releases the port, but
> linux isn't quite done with it when the new session tries to bind to
> the port, in which case it tries three times and then fails the job

Actually, I understood you correctly. I'm just saying that I find no evidence 
in the code that we try three times before giving up. What I see is a single 
attempt to bind the port - if it fails, then we abort. There is no parameter to 
control that behavior.

So if the OS hasn't released the port by the time a new job starts on that 
node, then it will indeed abort if the job was unfortunately given the same 
port reservation.


> 




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>
>> Was this ever committed to the OMPI src as something not having to be
>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>> does?
>
> Not that I know of - I don't think the PSM developers ever looked at it.
>
>>
>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>> setup of this key.  Namely, with slurm you cannot export variables
>> from the --prolog of an srun, only from an --task-prolog,
>> unfortunately, if you use a task-prolog each rank gets a different
>> key, which doesn't work.
>>
>> I'm also guessing that each unique mpirun needs it's own psm key, not
>> one for the whole system, so i can't just make it a permanent
>> parameter somewhere else.
>>
>> Also, i recall reading somewhere that the --resv-ports parameter that
>> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
>> tries to lock a port from the pool three times before giving up.
>
> Had to look back at the code - I think you misread this. I can find no 
> evidence in the code that we try to bind that port more than once.

Perhaps I misstated; I don't believe you're trying to bind to the same
port twice during the same session. I believe the code re-uses
similar ports from session to session. What I believe happens (though I
could be totally wrong) is that the previous session releases the port, but
Linux isn't quite done with it when the new session tries to bind to
the port, in which case it tries three times and then fails the job.



Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:

> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?

Not that I know of - I don't think the PSM developers ever looked at it.

> 
> I'm having some trouble getting Slurm/OpenMPI to play nice with the
> setup of this key.  Namely, with slurm you cannot export variables
> from the --prolog of an srun, only from an --task-prolog,
> unfortunately, if you use a task-prolog each rank gets a different
> key, which doesn't work.
> 
> I'm also guessing that each unique mpirun needs it's own psm key, not
> one for the whole system, so i can't just make it a permanent
> parameter somewhere else.
> 
> Also, i recall reading somewhere that the --resv-ports parameter that
> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
> tries to lock a port from the pool three times before giving up.

Had to look back at the code - I think you misread this. I can find no evidence 
in the code that we try to bind that port more than once.

> 
> Can someone tell me where that parameter is set, i'd like to set it to
> a higher value.  We're seeing issues where running a large number of
> short srun's sequentially is causing some of the mpirun's in the
> stream to be killed because they could not lock the ports.
> 
> I suspect because of the lag between when the port is actually closed
> in linux and when ompi re-opens a new port is very quick, we're trying
> three times and giving up.  I have more then enough ports in the
> resv-ports list, 30k.  but i suspect there is some random re-use being
> done and it's failing
> 
> thanks
> 
> 
> On Mon, Jan 3, 2011 at 10:00 AM, Jeff Squyres  wrote:
>> Yo Ralph --
>> 
>> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197. 
>>  Do you want to add a blurb in README about it, and/or have this executable 
>> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
>> ompi-psm-keygen)?
>> 
>> Right now, it's only compiled as part of "make check" and not installed, 
>> right?
>> 
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>> 
>>> Run the program only once - it can be in the prolog of the job if you like. 
>>> The output value needs to be in the env of every rank.
>>> 
>>> You can reuse the value as many times as you like - it doesn't have to be 
>>> unique for each job. There is nothing magic about the value itself.
>>> 
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>> 
 How early does this need to run? Can I run it as part of a task
 prolog, or does it need to be the shell env for each rank?  And does
 it need to run on one node or all the nodes in the job?
 
 On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm 
> system looks for the value early in the boot procedure.
> 
> What I can do is give you the attached key generator program. It outputs 
> the envar required to run your program. So if you run the attached 
> program and then export the output into your environment, you should be 
> okay. Looks like this:
> 
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
> 
> You compile the program with the usual mpicc.
> 
> Let me know if this solves the problem (or not).
> 




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
Was this ever committed to the OMPI source as something that does not have to be
run outside of Open MPI, but is instead done as part of the PSM setup that
Open MPI performs?

I'm having some trouble getting Slurm/Open MPI to play nicely with the
setup of this key. Namely, with Slurm you cannot export variables
from the --prolog of an srun, only from a --task-prolog;
unfortunately, if you use a task-prolog each rank gets a different
key, which doesn't work.

I'm also guessing that each unique mpirun needs its own PSM key, not
one for the whole system, so I can't just make it a permanent
parameter somewhere else.
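The only workaround I can think of (sketched below, untested; it assumes a path
visible from every node and relies on Slurm applying the "export NAME=VALUE"
lines a task prolog prints) is to generate the key once per job into a file and
have the task prolog merely re-emit it, so every rank sees the same value:

# run once per job (by hand inside the allocation, or from a per-job prolog),
# writing the key to a path visible on every node
$ ./psm_keygen > $HOME/.psm_key.$SLURM_JOB_ID

# contents of the script passed via srun --task-prolog; every rank re-reads
# the same file, so they all get the same key
#!/bin/sh
echo "export $(cat $HOME/.psm_key.$SLURM_JOB_ID)"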

Also, I recall reading somewhere that the --resv-ports mechanism that
OMPI uses from Slurm to choose a list of ports for TCP communication
tries to lock a port from the pool three times before giving up.

Can someone tell me where that parameter is set? I'd like to set it to
a higher value. We're seeing issues where running a large number of
short sruns sequentially causes some of the mpiruns in the
stream to be killed because they could not lock the ports.

I suspect that because the lag between when the port is actually closed
in Linux and when OMPI re-opens a new port is very short, we're trying
three times and giving up. I have more than enough ports in the
resv-ports list (30k), but I suspect there is some random re-use being
done and it's failing.
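As a sanity check on the reservation side, the range handed to a step can be
dumped by running env through srun (the output line is only illustrative; the
real range comes from MpiParams in slurm.conf):

$ srun --resv-ports -n 1 env | grep SLURM_RESV_PORTS
SLURM_RESV_PORTS=12000-12015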

thanks


On Mon, Jan 3, 2011 at 10:00 AM, Jeff Squyres  wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.  
> Do you want to add a blurb in README about it, and/or have this executable 
> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed, 
> right?
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like. 
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be 
>> unique for each job. There is nothing magic about the value itself.
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run? Can I run it as part of a task
>>> prolog, or does it need to be the shell env for each rank?  And does
>>> it need to run on one node or all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
 Well, I couldn't do it as a patch - proved too complicated as the psm 
 system looks for the value early in the boot procedure.

 What I can do is give you the attached key generator program. It outputs 
 the envar required to run your program. So if you run the attached program 
 and then export the output into your environment, you should be okay. 
 Looks like this:

 $ ./psm_keygen
 OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
 $

 You compile the program with the usual mpicc.

 Let me know if this solves the problem (or not).



Re: [OMPI users] srun and openmpi

2011-01-25 Thread Michael Di Domenico
Yes, I am setting the config correctly. Our IB machines seem to run
just fine so far using srun and Open MPI v1.5.

As another data point, we enabled MPI threads in Open MPI and that also
seems to trigger the srun/TCP behavior, but on the IB fabric. Running
the program within an salloc rather than a straight srun makes the problem
seem to go away.
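In other words, roughly (with a placeholder application name):

$ salloc -N 4 mpirun ./my_mpi_app              # runs fine
$ srun -N 4 --ntasks-per-node=8 ./my_mpi_app   # shows the srun/TCP behavior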



On Tue, Jan 25, 2011 at 2:59 PM, Nathan Hjelm  wrote:
> We are seeing the similar problem with our infiniband machines. After some
> investigation I discovered that we were not setting our slurm environment
> correctly (ref:
> https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you
> setting the ports in your slurm.conf and executing srun with --resv-ports?
>
> I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin
> to modify the slurm.conf.
>
> -Nathan
> HPC-3, LANL
>
> On Tue, 25 Jan 2011, Michael Di Domenico wrote:
>
>> Thanks.  We're only seeing it on machines with Ethernet only as the
>> interconnect.  fortunately for us that only equates to one small
>> machine, but it's still annoying.  unfortunately, i don't have enough
>> knowledge to dive into the code to help fix, but i can certainly help
>> test
>>
>> On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm  wrote:
>>>
>>> I am seeing similar issues on our slurm clusters. We are looking into the
>>> issue.
>>>
>>> -Nathan
>>> HPC-3, LANL
>>>
>>> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>>>
 Any ideas on what might be causing this one?  Or atleast what
 additional debug information someone might need?

 On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
  wrote:
>
> I'm still testing the slurm integration, which seems to work fine so
> far.  However, i just upgraded another cluster to openmpi-1.5 and
> slurm 2.1.15 but this machine has no infiniband
>
> if i salloc the nodes and mpirun the command it seems to run and
> complete
> fine
> however if i srun the command i get
>
> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
> unexpected prcoess identifier
>
> the job does not seem to run, but exhibits two behaviors
> running a single process per node the job runs and does not present
> the error (srun -N40 --ntasks-per-node=1)
> running multiple processes per node, the job spits out the error but
> does not run (srun -n40 --ntasks-per-node=8)
>
> I copied the configs from the other machine, so (i think) everything
> should be configured correctly (but i can't rule it out)
>
> I saw (and reported) a similar error to above with the 1.4-dev branch
> (see mailing list) and slurm, I can't say whether they're related or
> not though
>
>
> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres 
> wrote:
>>
>> Yo Ralph --
>>
>> I see this was committed
>> https://svn.open-mpi.org/trac/ompi/changeset/24197.  Do you want to
>> add a
>> blurb in README about it, and/or have this executable compiled as part
>> of
>> the PSM MTL and then installed into $bindir (maybe named
>> ompi-psm-keygen)?
>>
>> Right now, it's only compiled as part of "make check" and not
>> installed,
>> right?
>>
>>
>>
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>
>>> Run the program only once - it can be in the prolog of the job if you
>>> like. The output value needs to be in the env of every rank.
>>>
>>> You can reuse the value as many times as you like - it doesn't have
>>> to
>>> be unique for each job. There is nothing magic about the value
>>> itself.
>>>
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>
 How early does this need to run? Can I run it as part of a task
 prolog, or does it need to be the shell env for each rank?  And does
 it need to run on one node or all the nodes in the job?

 On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain 
 wrote:
>
> Well, I couldn't do it as a patch - proved too complicated as the
> psm
> system looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It
> outputs the envar required to run your program. So if you run the
> attached
> program and then export the output into your environment, you
> should be
> okay. Looks like this:
>
> $ ./psm_keygen
>
>
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph
>
>
>
>
> On Dec 30, 2010, at 

Re: [OMPI users] srun and openmpi

2011-01-25 Thread Nathan Hjelm

We are seeing a similar problem with our InfiniBand machines. After some
investigation I discovered that we were not setting up our Slurm environment
correctly (ref:
https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you
setting the ports in your slurm.conf and executing srun with --resv-ports?

I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin to 
modify the slurm.conf.
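For reference, the change amounts to reserving a port range in slurm.conf and
launching with --resv-ports (a minimal sketch; the range itself is arbitrary):

# slurm.conf (same on all nodes; reconfigure or restart the Slurm daemons after changing it)
MpiParams=ports=12000-12999

# then launch with
$ srun --resv-ports -n 16 ./my_mpi_app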

-Nathan
HPC-3, LANL

On Tue, 25 Jan 2011, Michael Di Domenico wrote:


Thanks.  We're only seeing it on machines with Ethernet only as the
interconnect.  fortunately for us that only equates to one small
machine, but it's still annoying.  unfortunately, i don't have enough
knowledge to dive into the code to help fix, but i can certainly help
test

On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm  wrote:

I am seeing similar issues on our slurm clusters. We are looking into the
issue.

-Nathan
HPC-3, LANL

On Tue, 11 Jan 2011, Michael Di Domenico wrote:


Any ideas on what might be causing this one?  Or atleast what
additional debug information someone might need?

On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
 wrote:


I'm still testing the slurm integration, which seems to work fine so
far.  However, i just upgraded another cluster to openmpi-1.5 and
slurm 2.1.15 but this machine has no infiniband

if i salloc the nodes and mpirun the command it seems to run and complete
fine
however if i srun the command i get

[btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
unexpected prcoess identifier

the job does not seem to run, but exhibits two behaviors
running a single process per node the job runs and does not present
the error (srun -N40 --ntasks-per-node=1)
running multiple processes per node, the job spits out the error but
does not run (srun -n40 --ntasks-per-node=8)

I copied the configs from the other machine, so (i think) everything
should be configured correctly (but i can't rule it out)

I saw (and reported) a similar error to above with the 1.4-dev branch
(see mailing list) and slurm, I can't say whether they're related or
not though


On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres  wrote:


Yo Ralph --

I see this was committed
https://svn.open-mpi.org/trac/ompi/changeset/24197.  Do you want to add a
blurb in README about it, and/or have this executable compiled as part of
the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?

Right now, it's only compiled as part of "make check" and not installed,
right?



On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:


Run the program only once - it can be in the prolog of the job if you
like. The output value needs to be in the env of every rank.

You can reuse the value as many times as you like - it doesn't have to
be unique for each job. There is nothing magic about the value itself.

On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:


How early does this need to run? Can I run it as part of a task
prolog, or does it need to be the shell env for each rank?  And does
it need to run on one node or all the nodes in the job?

On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain 
wrote:


Well, I couldn't do it as a patch - proved too complicated as the psm
system looks for the value early in the boot procedure.

What I can do is give you the attached key generator program. It
outputs the envar required to run your program. So if you run the attached
program and then export the output into your environment, you should be
okay. Looks like this:

$ ./psm_keygen

OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
$

You compile the program with the usual mpicc.

Let me know if this solves the problem (or not).
Ralph




On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:


Sure, i'll give it a go

On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain 
wrote:


Ah, yes - that is going to be a problem. The PSM key gets generated
by mpirun as it is shared info - i.e., every proc has to get the same value.

I can create a patch that will do this for the srun direct-launch
scenario, if you want to try it. Would be later today, though.


On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:


Well maybe not horray, yet.  I might have jumped the gun a bit,
it's
looking like srun works in general, but perhaps not with PSM

With PSM i get this error, (at least now i know what i changed)

Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in the environment)
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

Turn off PSM and srun works fine


On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain 
wrote:


Hooray!

On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:


I think i take it all back.  I just tried it again and it seems
to
work now.  I'm not sure what I changed (between my first and
this
msg), but it does appear 

Re: [OMPI users] srun and openmpi

2011-01-25 Thread Michael Di Domenico
Thanks. We're only seeing it on machines with Ethernet as the only
interconnect. Fortunately for us that only equates to one small
machine, but it's still annoying. Unfortunately, I don't have enough
knowledge to dive into the code to help fix it, but I can certainly help
test.

On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm  wrote:
> I am seeing similar issues on our slurm clusters. We are looking into the
> issue.
>
> -Nathan
> HPC-3, LANL
>
> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>
>> Any ideas on what might be causing this one?  Or atleast what
>> additional debug information someone might need?
>>
>> On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
>>  wrote:
>>>
>>> I'm still testing the slurm integration, which seems to work fine so
>>> far.  However, i just upgraded another cluster to openmpi-1.5 and
>>> slurm 2.1.15 but this machine has no infiniband
>>>
>>> if i salloc the nodes and mpirun the command it seems to run and complete
>>> fine
>>> however if i srun the command i get
>>>
>>> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
>>> unexpected prcoess identifier
>>>
>>> the job does not seem to run, but exhibits two behaviors
>>> running a single process per node the job runs and does not present
>>> the error (srun -N40 --ntasks-per-node=1)
>>> running multiple processes per node, the job spits out the error but
>>> does not run (srun -n40 --ntasks-per-node=8)
>>>
>>> I copied the configs from the other machine, so (i think) everything
>>> should be configured correctly (but i can't rule it out)
>>>
>>> I saw (and reported) a similar error to above with the 1.4-dev branch
>>> (see mailing list) and slurm, I can't say whether they're related or
>>> not though
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres  wrote:

 Yo Ralph --

 I see this was committed
 https://svn.open-mpi.org/trac/ompi/changeset/24197.  Do you want to add a
 blurb in README about it, and/or have this executable compiled as part of
 the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?

 Right now, it's only compiled as part of "make check" and not installed,
 right?



 On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:

> Run the program only once - it can be in the prolog of the job if you
> like. The output value needs to be in the env of every rank.
>
> You can reuse the value as many times as you like - it doesn't have to
> be unique for each job. There is nothing magic about the value itself.
>
> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>
>> How early does this need to run? Can I run it as part of a task
>> prolog, or does it need to be the shell env for each rank?  And does
>> it need to run on one node or all the nodes in the job?
>>
>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain 
>> wrote:
>>>
>>> Well, I couldn't do it as a patch - proved too complicated as the psm
>>> system looks for the value early in the boot procedure.
>>>
>>> What I can do is give you the attached key generator program. It
>>> outputs the envar required to run your program. So if you run the 
>>> attached
>>> program and then export the output into your environment, you should be
>>> okay. Looks like this:
>>>
>>> $ ./psm_keygen
>>>
>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>> $
>>>
>>> You compile the program with the usual mpicc.
>>>
>>> Let me know if this solves the problem (or not).
>>> Ralph
>>>
>>>
>>>
>>>
>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>
 Sure, i'll give it a go

 On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain 
 wrote:
>
> Ah, yes - that is going to be a problem. The PSM key gets generated
> by mpirun as it is shared info - i.e., every proc has to get the same 
> value.
>
> I can create a patch that will do this for the srun direct-launch
> scenario, if you want to try it. Would be later today, though.
>
>
> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>
>> Well maybe not horray, yet.  I might have jumped the gun a bit,
>> it's
>> looking like srun works in general, but perhaps not with PSM
>>
>> With PSM i get this error, (at least now i know what i changed)
>>
>> Error obtaining unique transport key from ORTE
>> (orte_precondition_transports not present in the environment)
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>>
>> Turn off PSM and srun works fine
>>
>>
>> On Thu, Dec 30, 2010 at 5:13 PM, 

Re: [OMPI users] srun and openmpi

2011-01-11 Thread Michael Di Domenico
Any ideas on what might be causing this one? Or at least, what
additional debug information might someone need?

On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
 wrote:
> I'm still testing the slurm integration, which seems to work fine so
> far.  However, i just upgraded another cluster to openmpi-1.5 and
> slurm 2.1.15 but this machine has no infiniband
>
> if i salloc the nodes and mpirun the command it seems to run and complete fine
> however if i srun the command i get
>
> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
> unexpected prcoess identifier
>
> the job does not seem to run, but exhibits two behaviors
> running a single process per node the job runs and does not present
> the error (srun -N40 --ntasks-per-node=1)
> running multiple processes per node, the job spits out the error but
> does not run (srun -n40 --ntasks-per-node=8)
>
> I copied the configs from the other machine, so (i think) everything
> should be configured correctly (but i can't rule it out)
>
> I saw (and reported) a similar error to above with the 1.4-dev branch
> (see mailing list) and slurm, I can't say whether they're related or
> not though
>
>
> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres  wrote:
>> Yo Ralph --
>>
>> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197. 
>>  Do you want to add a blurb in README about it, and/or have this executable 
>> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
>> ompi-psm-keygen)?
>>
>> Right now, it's only compiled as part of "make check" and not installed, 
>> right?
>>
>>
>>
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>
>>> Run the program only once - it can be in the prolog of the job if you like. 
>>> The output value needs to be in the env of every rank.
>>>
>>> You can reuse the value as many times as you like - it doesn't have to be 
>>> unique for each job. There is nothing magic about the value itself.
>>>
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>
 How early does this need to run? Can I run it as part of a task
 prolog, or does it need to be the shell env for each rank?  And does
 it need to run on one node or all the nodes in the job?

 On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm 
> system looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It outputs 
> the envar required to run your program. So if you run the attached 
> program and then export the output into your environment, you should be 
> okay. Looks like this:
>
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph
>
>
>
>
> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>
>> Sure, i'll give it a go
>>
>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>>> mpirun as it is shared info - i.e., every proc has to get the same 
>>> value.
>>>
>>> I can create a patch that will do this for the srun direct-launch 
>>> scenario, if you want to try it. Would be later today, though.
>>>
>>>
>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>
 Well maybe not horray, yet.  I might have jumped the gun a bit, it's
 looking like srun works in general, but perhaps not with PSM

 With PSM i get this error, (at least now i know what i changed)

 Error obtaining unique transport key from ORTE
 (orte_precondition_transports not present in the environment)
 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)

 Turn off PSM and srun works fine


 On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  
 wrote:
> Hooray!
>
> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>
>> I think i take it all back.  I just tried it again and it seems to
>> work now.  I'm not sure what I changed (between my first and this
>> msg), but it does appear to work now.
>>
>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>  wrote:
>>> Yes that's true, error messages help.  I was hoping there was some
>>> documentation to see what i've done wrong.  I can't easily cut and
>>> paste errors from my cluster.
>>>
>>> Here's a snippet (hand typed) of the error message, but it does look
>>> like a rank 

Re: [OMPI users] srun and openmpi

2011-01-07 Thread Michael Di Domenico
I'm still testing the Slurm integration, which seems to work fine so
far. However, I just upgraded another cluster to Open MPI 1.5 and
Slurm 2.1.15, but this machine has no InfiniBand.

If I salloc the nodes and mpirun the command, it seems to run and complete fine;
however, if I srun the command I get

[btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
unexpected prcoess identifier

The job does not seem to run, but it exhibits two behaviors:
running a single process per node, the job runs and does not present
the error (srun -N40 --ntasks-per-node=1);
running multiple processes per node, the job spits out the error but
does not run (srun -n40 --ntasks-per-node=8).

I copied the configs from the other machine, so (I think) everything
should be configured correctly (but I can't rule it out).

I saw (and reported) a similar error to the above with the 1.4-dev branch
(see mailing list) and Slurm; I can't say whether they're related or
not, though.
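If more detail would help, I can rerun with the TCP BTL's verbosity raised
through the environment (a sketch; the level is arbitrary and my_mpi_app stands
in for the real binary):

$ export OMPI_MCA_btl_base_verbose=30
$ srun --resv-ports -n 40 --ntasks-per-node=8 ./my_mpi_app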


On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres  wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.  
> Do you want to add a blurb in README about it, and/or have this executable 
> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed, 
> right?
>
>
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like. 
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be 
>> unique for each job. There is nothing magic about the value itself.
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run? Can I run it as part of a task
>>> prolog, or does it need to be the shell env for each rank?  And does
>>> it need to run on one node or all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
 Well, I couldn't do it as a patch - proved too complicated as the psm 
 system looks for the value early in the boot procedure.

 What I can do is give you the attached key generator program. It outputs 
 the envar required to run your program. So if you run the attached program 
 and then export the output into your environment, you should be okay. 
 Looks like this:

 $ ./psm_keygen
 OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
 $

 You compile the program with the usual mpicc.

 Let me know if this solves the problem (or not).
 Ralph




 On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:

> Sure, i'll give it a go
>
> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>
>> I can create a patch that will do this for the srun direct-launch 
>> scenario, if you want to try it. Would be later today, though.
>>
>>
>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>
>>> Well maybe not horray, yet.  I might have jumped the gun a bit, it's
>>> looking like srun works in general, but perhaps not with PSM
>>>
>>> With PSM i get this error, (at least now i know what i changed)
>>>
>>> Error obtaining unique transport key from ORTE
>>> (orte_precondition_transports not present in the environment)
>>> PML add procs failed
>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>
>>> Turn off PSM and srun works fine
>>>
>>>
>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  
>>> wrote:
 Hooray!

 On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:

> I think i take it all back.  I just tried it again and it seems to
> work now.  I'm not sure what I changed (between my first and this
> msg), but it does appear to work now.
>
> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>  wrote:
>> Yes that's true, error messages help.  I was hoping there was some
>> documentation to see what i've done wrong.  I can't easily cut and
>> paste errors from my cluster.
>>
>> Here's a snippet (hand typed) of the error message, but it does look
>> like a rank communications error
>>
>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>> contact information is unknown in file rml_oob_send.c at line 145.
>> *** MPI_INIT failure message (snipped) ***
>> orte_grpcomm_modex failed
>> --> Returned "A messages is attempting to be 

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Run the program only once - it can be in the prolog of the job if you like. The 
output value needs to be in the env of every rank.

You can reuse the value as many times as you like - it doesn't have to be 
unique for each job. There is nothing magic about the value itself.
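For example, inside an allocation it can be as simple as exporting the
generator's output before the srun, since srun (by default) forwards the
calling environment to every rank (a sketch, with my_mpi_app as a placeholder):

$ salloc -N 2
$ export $(./psm_keygen)       # sets OMPI_MCA_orte_precondition_transports in this shell
$ srun --resv-ports -n 16 ./my_mpi_app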

On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:

> How early does this need to run? Can I run it as part of a task
> prolog, or does it need to be the shell env for each rank?  And does
> it need to run on one node or all the nodes in the job?
> 
> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
>> Well, I couldn't do it as a patch - proved too complicated as the psm system 
>> looks for the value early in the boot procedure.
>> 
>> What I can do is give you the attached key generator program. It outputs the 
>> envar required to run your program. So if you run the attached program and 
>> then export the output into your environment, you should be okay. Looks like 
>> this:
>> 
>> $ ./psm_keygen
>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>> $
>> 
>> You compile the program with the usual mpicc.
>> 
>> Let me know if this solves the problem (or not).
>> Ralph
>> 
>> 
>> 
>> 
>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>> 
>>> Sure, i'll give it a go
>>> 
>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
 Ah, yes - that is going to be a problem. The PSM key gets generated by 
 mpirun as it is shared info - i.e., every proc has to get the same value.
 
 I can create a patch that will do this for the srun direct-launch 
 scenario, if you want to try it. Would be later today, though.
 
 
 On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
 
> Well maybe not horray, yet.  I might have jumped the gun a bit, it's
> looking like srun works in general, but perhaps not with PSM
> 
> With PSM i get this error, (at least now i know what i changed)
> 
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in the environment)
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> 
> Turn off PSM and srun works fine
> 
> 
> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
>> Hooray!
>> 
>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>> 
>>> I think i take it all back.  I just tried it again and it seems to
>>> work now.  I'm not sure what I changed (between my first and this
>>> msg), but it does appear to work now.
>>> 
>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>  wrote:
 Yes that's true, error messages help.  I was hoping there was some
 documentation to see what i've done wrong.  I can't easily cut and
 paste errors from my cluster.
 
 Here's a snippet (hand typed) of the error message, but it does look
 like a rank communications error
 
 ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
 contact information is unknown in file rml_oob_send.c at line 145.
 *** MPI_INIT failure message (snipped) ***
 orte_grpcomm_modex failed
 --> Returned "A messages is attempting to be sent to a process whose
 contact information us uknown" (-117) instead of "Success" (0)
 
 This msg repeats for each rank, an ultimately hangs the srun which i
 have to Ctrl-C and terminate
 
 I have mpiports defined in my slurm config and running srun with
 -resv-ports does show the SLURM_RESV_PORTS environment variable
 getting parts to the shell
 
 
 On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  
 wrote:
> I'm not sure there is any documentation yet - not much clamor for it. 
> :-/
> 
> It would really help if you included the error message. Otherwise, 
> all I can do is guess, which wastes both of our time :-(
> 
> My best guess is that the port reservation didn't get passed down to 
> the MPI procs properly - but that's just a guess.
> 
> 
> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
> 
>> Can anyone point me towards the most recent documentation for using
>> srun and openmpi?
>> 
>> I followed what i found on the web with enabling the MpiPorts config
>> in slurm and using the --resv-ports switch, but I'm getting an error
>> from openmpi during setup.
>> 
>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>> 
>> I'm sure I'm missing a step.
>> 
>> Thanks

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
How early does this need to run? Can I run it as part of a task
prolog, or does it need to be in the shell env for each rank? And does
it need to run on one node or on all the nodes in the job?

On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm system 
> looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It outputs the 
> envar required to run your program. So if you run the attached program and 
> then export the output into your environment, you should be okay. Looks like 
> this:
>
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph
>
>
>
>
> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>
>> Sure, i'll give it a go
>>
>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>>
>>> I can create a patch that will do this for the srun direct-launch scenario, 
>>> if you want to try it. Would be later today, though.
>>>
>>>
>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>
 Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
 looking like srun works in general, but perhaps not with PSM

 With PSM i get this error, (at least now i know what i changed)

 Error obtaining unique transport key from ORTE
 (orte_precondition_transports not present in the environment)
 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)

 Turn off PSM and srun works fine


 On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
> Hooray!
>
> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>
>> I think i take it all back.  I just tried it again and it seems to
>> work now.  I'm not sure what I changed (between my first and this
>> msg), but it does appear to work now.
>>
>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>  wrote:
>>> Yes that's true, error messages help.  I was hoping there was some
>>> documentation to see what i've done wrong.  I can't easily cut and
>>> paste errors from my cluster.
>>>
>>> Here's a snippet (hand typed) of the error message, but it does look
>>> like a rank communications error
>>>
>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>> contact information is unknown in file rml_oob_send.c at line 145.
>>> *** MPI_INIT failure message (snipped) ***
>>> orte_grpcomm_modex failed
>>> --> Returned "A messages is attempting to be sent to a process whose
>>> contact information us uknown" (-117) instead of "Success" (0)
>>>
>>> This msg repeats for each rank, and ultimately hangs the srun which i
>>> have to Ctrl-C and terminate
>>>
>>> I have mpiports defined in my slurm config and running srun with
>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>> getting passed to the shell
>>>
>>>
>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  
>>> wrote:
 I'm not sure there is any documentation yet - not much clamor for it. 
 :-/

 It would really help if you included the error message. Otherwise, all 
 I can do is guess, which wastes both of our time :-(

 My best guess is that the port reservation didn't get passed down to 
 the MPI procs properly - but that's just a guess.


 On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:

> Can anyone point me towards the most recent documentation for using
> srun and openmpi?
>
> I followed what i found on the web with enabling the MpiPorts config
> in slurm and using the --resv-ports switch, but I'm getting an error
> from openmpi during setup.
>
> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>
> I'm sure I'm missing a step.
>
> Thanks
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users

>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> 

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Should have also warned you: you'll need to configure OMPI --with-devel-headers 
to get this program to build/run.
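
A sketch of the build steps this implies (the install prefix and the --with-psm
flag are assumptions for a PSM-enabled site, adjust to your own setup):

$ ./configure --prefix=/opt/openmpi-1.5.3 --with-devel-headers --with-psm
$ make all install
$ mpicc psm_keygen.c -o psm_keygen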


On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote:

> Well, I couldn't do it as a patch - proved too complicated as the psm system 
> looks for the value early in the boot procedure.
> 
> What I can do is give you the attached key generator program. It outputs the 
> envar required to run your program. So if you run the attached program and 
> then export the output into your environment, you should be okay. Looks like 
> this:
> 
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $ 
> 
> You compile the program with the usual mpicc.
> 
> Let me know if this solves the problem (or not).
> Ralph
> 
> 
> 
> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
> 
>> Sure, i'll give it a go
>> 
>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>> 
>>> I can create a patch that will do this for the srun direct-launch scenario, 
>>> if you want to try it. Would be later today, though.
>>> 
>>> 
>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>> 
 Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
 looking like srun works in general, but perhaps not with PSM
 
 With PSM i get this error, (at least now i know what i changed)
 
 Error obtaining unique transport key from ORTE
 (orte_precondition_transports not present in the environment)
 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)
 
 Turn off PSM and srun works fine
 
 
 On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
> Hooray!
> 
> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
> 
>> I think i take it all back.  I just tried it again and it seems to
>> work now.  I'm not sure what I changed (between my first and this
>> msg), but it does appear to work now.
>> 
>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>  wrote:
>>> Yes that's true, error messages help.  I was hoping there was some
>>> documentation to see what i've done wrong.  I can't easily cut and
>>> paste errors from my cluster.
>>> 
>>> Here's a snippet (hand typed) of the error message, but it does look
>>> like a rank communications error
>>> 
>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>> contact information is unknown in file rml_oob_send.c at line 145.
>>> *** MPI_INIT failure message (snipped) ***
>>> orte_grpcomm_modex failed
>>> --> Returned "A messages is attempting to be sent to a process whose
>>> contact information us uknown" (-117) instead of "Success" (0)
>>> 
>>> This msg repeats for each rank, and ultimately hangs the srun which i
>>> have to Ctrl-C and terminate
>>> 
>>> I have mpiports defined in my slurm config and running srun with
>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>> getting passed to the shell
>>> 
>>> 
>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  
>>> wrote:
 I'm not sure there is any documentation yet - not much clamor for it. 
 :-/
 
 It would really help if you included the error message. Otherwise, all 
 I can do is guess, which wastes both of our time :-(
 
 My best guess is that the port reservation didn't get passed down to 
 the MPI procs properly - but that's just a guess.
 
 
 On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
 
> Can anyone point me towards the most recent documentation for using
> srun and openmpi?
> 
> I followed what i found on the web with enabling the MpiPorts config
> in slurm and using the --resv-ports switch, but I'm getting an error
> from openmpi during setup.
> 
> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
> 
> I'm sure I'm missing a step.
> 
> Thanks
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> 

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Well, I couldn't do it as a patch - proved too complicated as the psm system 
looks for the value early in the boot procedure.

What I can do is give you the attached key generator program. It outputs the 
envar required to run your program. So if you run the attached program and then 
export the output into your environment, you should be okay. Looks like this:

$ ./psm_keygen
OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
$ 

You compile the program with the usual mpicc.

Let me know if this solves the problem (or not).
Ralph



psm_keygen.c
Description: Binary data
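
The attachment itself is not preserved in this archive. As a rough illustration
only (not Ralph's actual psm_keygen.c, and without ORTE's internal generator),
here is a standalone sketch that prints a value in the same format, i.e. two
random 64-bit words rendered as 16 hex digits each:

/* keygen_sketch.c - illustrative only, not the attached psm_keygen.c.
 * Prints an envar assignment in the format the real tool shows:
 *   OMPI_MCA_orte_precondition_transports=<16 hex digits>-<16 hex digits>
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    uint64_t key[2];
    int i;

    /* seed from time and pid; good enough for an illustration */
    srand((unsigned int)(time(NULL) ^ getpid()));
    for (i = 0; i < 2; i++) {
        /* build a 64-bit word from two rand() draws */
        key[i] = ((uint64_t)rand() << 32) | ((uint64_t)rand() & 0xffffffffULL);
    }
    printf("OMPI_MCA_orte_precondition_transports=%016llx-%016llx\n",
           (unsigned long long)key[0], (unsigned long long)key[1]);
    return 0;
}

$ cc keygen_sketch.c -o keygen_sketch
$ export $(./keygen_sketch)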


On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:

> Sure, i'll give it a go
> 
> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>> Ah, yes - that is going to be a problem. The PSM key gets generated by 
>> mpirun as it is shared info - i.e., every proc has to get the same value.
>> 
>> I can create a patch that will do this for the srun direct-launch scenario, 
>> if you want to try it. Would be later today, though.
>> 
>> 
>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>> 
>>> Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
>>> looking like srun works in general, but perhaps not with PSM
>>> 
>>> With PSM i get this error, (at least now i know what i changed)
>>> 
>>> Error obtaining unique transport key from ORTE
>>> (orte_precondition_transports not present in the environment)
>>> PML add procs failed
>>> --> Returned "Error" (-1) instead of "Success" (0)
>>> 
>>> Turn off PSM and srun works fine
>>> 
>>> 
>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
 Hooray!
 
 On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
 
> I think i take it all back.  I just tried it again and it seems to
> work now.  I'm not sure what I changed (between my first and this
> msg), but it does appear to work now.
> 
> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>  wrote:
>> Yes that's true, error messages help.  I was hoping there was some
>> documentation to see what i've done wrong.  I can't easily cut and
>> paste errors from my cluster.
>> 
>> Here's a snippet (hand typed) of the error message, but it does look
>> like a rank communications error
>> 
>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>> contact information is unknown in file rml_oob_send.c at line 145.
>> *** MPI_INIT failure message (snipped) ***
>> orte_grpcomm_modex failed
>> --> Returned "A messages is attempting to be sent to a process whose
>> contact information us uknown" (-117) instead of "Success" (0)
>> 
>> This msg repeats for each rank, and ultimately hangs the srun which i
>> have to Ctrl-C and terminate
>> 
>> I have mpiports defined in my slurm config and running srun with
>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>> getting passed to the shell
>> 
>> 
>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
>>> I'm not sure there is any documentation yet - not much clamor for it. 
>>> :-/
>>> 
>>> It would really help if you included the error message. Otherwise, all 
>>> I can do is guess, which wastes both of our time :-(
>>> 
>>> My best guess is that the port reservation didn't get passed down to 
>>> the MPI procs properly - but that's just a guess.
>>> 
>>> 
>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>> 
 Can anyone point me towards the most recent documentation for using
 srun and openmpi?
 
 I followed what i found on the web with enabling the MpiPorts config
 in slurm and using the --resv-ports switch, but I'm getting an error
 from openmpi during setup.
 
 I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
 
 I'm sure I'm missing a step.
 
 Thanks
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> 

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Sure, i'll give it a go

On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun 
> as it is shared info - i.e., every proc has to get the same value.
>
> I can create a patch that will do this for the srun direct-launch scenario, 
> if you want to try it. Would be later today, though.
>
>
> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>
>> Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
>> looking like srun works in general, but perhaps not with PSM
>>
>> With PSM i get this error, (at least now i know what i changed)
>>
>> Error obtaining unique transport key from ORTE
>> (orte_precondition_transports not present in the environment)
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>>
>> Turn off PSM and srun works fine
>>
>>
>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
>>> Hooray!
>>>
>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>
 I think i take it all back.  I just tried it again and it seems to
 work now.  I'm not sure what I changed (between my first and this
 msg), but it does appear to work now.

 On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
  wrote:
> Yes that's true, error messages help.  I was hoping there was some
> documentation to see what i've done wrong.  I can't easily cut and
> paste errors from my cluster.
>
> Here's a snippet (hand typed) of the error message, but it does look
> like a rank communications error
>
> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
> contact information is unknown in file rml_oob_send.c at line 145.
> *** MPI_INIT failure message (snipped) ***
> orte_grpcomm_modex failed
> --> Returned "A messages is attempting to be sent to a process whose
> contact information us uknown" (-117) instead of "Success" (0)
>
> This msg repeats for each rank, and ultimately hangs the srun which i
> have to Ctrl-C and terminate
>
> I have mpiports defined in my slurm config and running srun with
> --resv-ports does show the SLURM_RESV_PORTS environment variable
> getting passed to the shell
>
>
> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>
>> It would really help if you included the error message. Otherwise, all I 
>> can do is guess, which wastes both of our time :-(
>>
>> My best guess is that the port reservation didn't get passed down to the 
>> MPI procs properly - but that's just a guess.
>>
>>
>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>
>>> Can anyone point me towards the most recent documentation for using
>>> srun and openmpi?
>>>
>>> I followed what i found on the web with enabling the MpiPorts config
>>> in slurm and using the --resv-ports switch, but I'm getting an error
>>> from openmpi during setup.
>>>
>>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>>>
>>> I'm sure I'm missing a step.
>>>
>>> Thanks
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>

 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun 
as it is shared info - i.e., every proc has to get the same value.

I can create a patch that will do this for the srun direct-launch scenario, if 
you want to try it. Would be later today, though.
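
A quick, hypothetical check of whether a shared value is actually reaching every
process under srun; each task should print an identical, non-empty key:

$ srun --resv-ports -n 2 -w node1 sh -c 'echo $(hostname): $OMPI_MCA_orte_precondition_transports'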


On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:

> Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
> looking like srun works in general, but perhaps not with PSM
> 
> With PSM i get this error, (at least now i know what i changed)
> 
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in the environment)
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> 
> Turn off PSM and srun works fine
> 
> 
> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
>> Hooray!
>> 
>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>> 
>>> I think i take it all back.  I just tried it again and it seems to
>>> work now.  I'm not sure what I changed (between my first and this
>>> msg), but it does appear to work now.
>>> 
>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>  wrote:
 Yes that's true, error messages help.  I was hoping there was some
 documentation to see what i've done wrong.  I can't easily cut and
 paste errors from my cluster.
 
 Here's a snippet (hand typed) of the error message, but it does look
 like a rank communications error
 
 ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
 contact information is unknown in file rml_oob_send.c at line 145.
 *** MPI_INIT failure message (snipped) ***
 orte_grpcomm_modex failed
 --> Returned "A messages is attempting to be sent to a process whose
 contact information us uknown" (-117) instead of "Success" (0)
 
 This msg repeats for each rank, and ultimately hangs the srun which i
 have to Ctrl-C and terminate
 
 I have mpiports defined in my slurm config and running srun with
 --resv-ports does show the SLURM_RESV_PORTS environment variable
 getting passed to the shell
 
 
 On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
> I'm not sure there is any documentation yet - not much clamor for it. :-/
> 
> It would really help if you included the error message. Otherwise, all I 
> can do is guess, which wastes both of our time :-(
> 
> My best guess is that the port reservation didn't get passed down to the 
> MPI procs properly - but that's just a guess.
> 
> 
> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
> 
>> Can anyone point me towards the most recent documentation for using
>> srun and openmpi?
>> 
>> I followed what i found on the web with enabling the MpiPorts config
>> in slurm and using the --resv-ports switch, but I'm getting an error
>> from openmpi during setup.
>> 
>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>> 
>> I'm sure I'm missing a step.
>> 
>> Thanks
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Well maybe not hooray, yet.  I might have jumped the gun a bit, it's
looking like srun works in general, but perhaps not with PSM

With PSM i get this error, (at least now i know what i changed)

Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in the environment)
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

Turn off PSM and srun works fine
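
For reference, one way to force the non-PSM path for a quick srun test is to
select the ob1 PML through Open MPI's MCA environment variables. A sketch, with
my_mpi_app as a placeholder:

$ export OMPI_MCA_pml=ob1
$ export OMPI_MCA_btl=tcp,sm,self
$ srun --resv-ports -n 2 -w node1 ./my_mpi_app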


On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
> Hooray!
>
> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>
>> I think i take it all back.  I just tried it again and it seems to
>> work now.  I'm not sure what I changed (between my first and this
>> msg), but it does appear to work now.
>>
>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>  wrote:
>>> Yes that's true, error messages help.  I was hoping there was some
>>> documentation to see what i've done wrong.  I can't easily cut and
>>> paste errors from my cluster.
>>>
>>> Here's a snippet (hand typed) of the error message, but it does look
>>> like a rank communications error
>>>
>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>> contact information is unknown in file rml_oob_send.c at line 145.
>>> *** MPI_INIT failure message (snipped) ***
>>> orte_grpcomm_modex failed
>>> --> Returned "A messages is attempting to be sent to a process whose
>>> contact information us uknown" (-117) instead of "Success" (0)
>>>
>>> This msg repeats for each rank, and ultimately hangs the srun which i
>>> have to Ctrl-C and terminate
>>>
>>> I have mpiports defined in my slurm config and running srun with
>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>> getting passed to the shell
>>>
>>>
>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
 I'm not sure there is any documentation yet - not much clamor for it. :-/

 It would really help if you included the error message. Otherwise, all I 
 can do is guess, which wastes both of our time :-(

 My best guess is that the port reservation didn't get passed down to the 
 MPI procs properly - but that's just a guess.


 On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:

> Can anyone point me towards the most recent documentation for using
> srun and openmpi?
>
> I followed what i found on the web with enabling the MpiPorts config
> in slurm and using the --resv-ports switch, but I'm getting an error
> from openmpi during setup.
>
> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>
> I'm sure I'm missing a step.
>
> Thanks
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users

>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Hooray!

On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:

> I think i take it all back.  I just tried it again and it seems to
> work now.  I'm not sure what I changed (between my first and this
> msg), but it does appear to work now.
> 
> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>  wrote:
>> Yes that's true, error messages help.  I was hoping there was some
>> documentation to see what i've done wrong.  I can't easily cut and
>> paste errors from my cluster.
>> 
>> Here's a snippet (hand typed) of the error message, but it does look
>> like a rank communications error
>> 
>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>> contact information is unknown in file rml_oob_send.c at line 145.
>> *** MPI_INIT failure message (snipped) ***
>> orte_grpcomm_modex failed
>> --> Returned "A messages is attempting to be sent to a process whose
>> contact information us uknown" (-117) instead of "Success" (0)
>> 
>> This msg repeats for each rank, and ultimately hangs the srun which i
>> have to Ctrl-C and terminate
>> 
>> I have mpiports defined in my slurm config and running srun with
>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>> getting passed to the shell
>> 
>> 
>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
>>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>> 
>>> It would really help if you included the error message. Otherwise, all I 
>>> can do is guess, which wastes both of our time :-(
>>> 
>>> My best guess is that the port reservation didn't get passed down to the 
>>> MPI procs properly - but that's just a guess.
>>> 
>>> 
>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>> 
 Can anyone point me towards the most recent documentation for using
 srun and openmpi?
 
 I followed what i found on the web with enabling the MpiPorts config
 in slurm and using the --resv-ports switch, but I'm getting an error
 from openmpi during setup.
 
 I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
 
 I'm sure I'm missing a step.
 
 Thanks
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Yes that's true, error messages help.  I was hoping there was some
documentation to see what i've done wrong.  I can't easily cut and
paste errors from my cluster.

Here's a snippet (hand typed) of the error message, but it does look
like a rank communications error

ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
contact information is unknown in file rml_oob_send.c at line 145.
*** MPI_INIT failure message (snipped) ***
orte_grpcomm_modex failed
--> Returned "A messages is attempting to be sent to a process whose
contact information us uknown" (-117) instead of "Success" (0)

This msg repeats for each rank, and ultimately hangs the srun which i
have to Ctrl-C and terminate

I have mpiports defined in my slurm config and running srun with
--resv-ports does show the SLURM_RESV_PORTS environment variable
getting passed to the shell
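
For reference, the two pieces that setup involves. The port range below is an
arbitrary example; if the reservation is reaching the tasks, the grep should
print a sub-range drawn from that pool:

In slurm.conf:
MpiParams=ports=12000-12999

$ srun --resv-ports -n 2 -w node1 env | grep SLURM_RESV_PORTS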


On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
> I'm not sure there is any documentation yet - not much clamor for it. :-/
>
> It would really help if you included the error message. Otherwise, all I can 
> do is guess, which wastes both of our time :-(
>
> My best guess is that the port reservation didn't get passed down to the MPI 
> procs properly - but that's just a guess.
>
>
> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>
>> Can anyone point me towards the most recent documentation for using
>> srun and openmpi?
>>
>> I followed what i found on the web with enabling the MpiPorts config
>> in slurm and using the --resv-ports switch, but I'm getting an error
>> from openmpi during setup.
>>
>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>>
>> I'm sure I'm missing a step.
>>
>> Thanks
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] srun and openmpi

2010-12-23 Thread Ralph Castain
I'm not sure there is any documentation yet - not much clamor for it. :-/

It would really help if you included the error message. Otherwise, all I can do 
is guess, which wastes both of our time :-(

My best guess is that the port reservation didn't get passed down to the MPI 
procs properly - but that's just a guess.


On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:

> Can anyone point me towards the most recent documentation for using
> srun and openmpi?
> 
> I followed what i found on the web with enabling the MpiPorts config
> in slurm and using the --resv-ports switch, but I'm getting an error
> from openmpi during setup.
> 
> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
> 
> I'm sure I'm missing a step.
> 
> Thanks
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users