Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Run the program only once - it can be in the prolog of the job if you like. The 
output value needs to be in the env of every rank.

You can reuse the value as many times as you like - it doesn't have to be 
unique for each job. There is nothing magic about the value itself.
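
A minimal sketch of that workflow in a job script, assuming the generator prints a single NAME=VALUE line as in the sample output quoted below. The helper name export_psm_key and the application name ./my_mpi_app are made up for illustration, and srun is assumed to hand the submitting shell's environment on to every rank, which is its default unless --export is restricted.

```shell
#!/bin/sh
# Hypothetical helper: export one NAME=VALUE line (as printed by psm_keygen)
# into the current shell so that child processes inherit it.
export_psm_key() {
    line=$1
    case $line in
        OMPI_MCA_orte_precondition_transports=?*)
            export "$line"
            ;;
        *)
            echo "unexpected keygen output: $line" >&2
            return 1
            ;;
    esac
}

# Intended use inside the job script (commented out here because it needs
# the compiled generator and a Slurm allocation):
#   export_psm_key "$(./psm_keygen)" || exit 1
#   srun --resv-ports -n 4 ./my_mpi_app
```

Because the generator runs once in the launching shell, every rank started by the same srun sees the same value, which is the only requirement stated above.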

On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:

> How early does this need to run? Can I run it as part of a task
> prolog, or does it need to be the shell env for each rank?  And does
> it need to run on one node or all the nodes in the job?

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
How early does this need to run? Can I run it as part of a task
prolog, or does it need to be the shell env for each rank?  And does
it need to run on one node or all the nodes in the job?

On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm system 
> looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It outputs the 
> envar required to run your program. So if you run the attached program and 
> then export the output into your environment, you should be okay. Looks like 
> this:
>
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Should have also warned you: you'll need to configure OMPI --with-devel-headers 
to get this program to build/run.


On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote:

> Well, I couldn't do it as a patch - proved too complicated as the psm system 
> looks for the value early in the boot procedure.
> 
> What I can do is give you the attached key generator program. It outputs the 
> envar required to run your program. So if you run the attached program and 
> then export the output into your environment, you should be okay. Looks like 
> this:
> 
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $ 
> 
> You compile the program with the usual mpicc.
> 
> Let me know if this solves the problem (or not).
> Ralph

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Well, I couldn't do it as a patch - proved too complicated as the psm system 
looks for the value early in the boot procedure.

What I can do is give you the attached key generator program. It outputs the 
envar required to run your program. So if you run the attached program and then 
export the output into your environment, you should be okay. Looks like this:

$ ./psm_keygen
OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
$ 

You compile the program with the usual mpicc.

Let me know if this solves the problem (or not).
Ralph



psm_keygen.c
Description: Binary data
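
The attachment itself is not preserved in this archive. As a rough stand-in, the sketch below prints a line of the same shape as the sample output: two 16-digit hexadecimal fields joined by a hyphen. It assumes, based only on that sample and not on the psm_keygen source, that any value of this shape is acceptable. The function names are made up, and it reads /dev/urandom with od rather than calling into ORTE.

```shell
#!/bin/sh
# Hypothetical stand-in for psm_keygen; NOT the attached program.
# Prints 8 random bytes as 16 lowercase hex digits.
rand_hex64() {
    od -An -N8 -tx1 /dev/urandom | tr -dc '0-9a-f'
}

# Prints NAME=VALUE in the shape shown in the sample output above.
gen_transport_key() {
    printf 'OMPI_MCA_orte_precondition_transports=%s-%s\n' \
        "$(rand_hex64)" "$(rand_hex64)"
}

gen_transport_key
```

Unlike the attached program, this needs neither mpicc nor the Open MPI development headers, but whether PSM accepts such a value has not been confirmed in this thread.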


On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:

> Sure, I'll give it a go
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Sure, I'll give it a go

On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun 
> as it is shared info - i.e., every proc has to get the same value.
>
> I can create a patch that will do this for the srun direct-launch scenario, 
> if you want to try it. Would be later today, though.



Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun 
as it is shared info - i.e., every proc has to get the same value.

I can create a patch that will do this for the srun direct-launch scenario, if 
you want to try it. Would be later today, though.


On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:

> Well, maybe not hooray yet. I might have jumped the gun a bit; it's
> looking like srun works in general, but perhaps not with PSM.
> 
> With PSM I get this error (at least now I know what I changed):
> 
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in the environment)
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> 
> With PSM turned off, srun works fine.




Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Well, maybe not hooray yet. I might have jumped the gun a bit; it's
looking like srun works in general, but perhaps not with PSM.

With PSM I get this error (at least now I know what I changed):

Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in the environment)
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

With PSM turned off, srun works fine.
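
The message says orte_precondition_transports was not found in the environment of the ranks. A quick way to confirm that, sketched here with a made-up helper name, is to have each task report whether OMPI_MCA_orte_precondition_transports is set before starting the real program.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the transport key is present and
# non-empty in the environment of the calling process.
check_transport_key() {
    if [ -n "${OMPI_MCA_orte_precondition_transports:-}" ]; then
        echo "transport key present: $OMPI_MCA_orte_precondition_transports"
    else
        echo "transport key missing on $(uname -n)" >&2
        return 1
    fi
}

# Possible use: save this as check_key.sh (file name made up), append a
# final line that calls check_transport_key, and run it under srun so that
# each task reports (commented out because it needs a Slurm allocation):
#   srun -n 4 sh ./check_key.sh
```

With a direct srun launch and no mpirun in the picture, every task would be expected to report the key as missing, which matches the error above.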


On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain  wrote:
> Hooray!



Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Hooray!

On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:

> I think I take it all back. I just tried it again and it seems to
> work now. I'm not sure what I changed (between my first message and
> this one), but it does appear to work now.




Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
I think I take it all back. I just tried it again and it seems to
work now. I'm not sure what I changed (between my first message and
this one), but it does appear to work now.

On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico wrote:
> Yes, that's true, error messages help. I was hoping there was some
> documentation to see what I've done wrong. I can't easily cut and
> paste errors from my cluster.
>
> Here's a snippet (hand-typed) of the error message; it does look
> like a rank communications error:
>
> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
> contact information is unknown in file rml_oob_send.c at line 145.
> *** MPI_INIT failure message (snipped) ***
> orte_grpcomm_modex failed
> --> Returned "A message is attempting to be sent to a process whose
> contact information is unknown" (-117) instead of "Success" (0)
>
> This message repeats for each rank and ultimately hangs the srun,
> which I have to terminate with Ctrl-C.
>
> I have MpiPorts defined in my Slurm config, and running srun with
> --resv-ports does show the SLURM_RESV_PORTS environment variable
> getting passed to the shell.
>
>
> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>
>> It would really help if you included the error message. Otherwise, all I can 
>> do is guess, which wastes both of our time :-(
>>
>> My best guess is that the port reservation didn't get passed down to the MPI 
>> procs properly - but that's just a guess.
>>
>>
>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>
>>> Can anyone point me towards the most recent documentation for using
>>> srun and openmpi?
>>>
>>> I followed what i found on the web with enabling the MpiPorts config
>>> in slurm and using the --resv-ports switch, but I'm getting an error
>>> from openmpi during setup.
>>>
>>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>>>
>>> I'm sure I'm missing a step.
>>>
>>> Thanks
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>



Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Yes that's true, error messages help.  I was hoping there was some
documentation to see what I've done wrong.  I can't easily cut and
paste errors from my cluster.

Here's a snippet (hand typed) of the error message, but it does look
like a rank communications error

ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
contact information is unknown in file rml_oob_send.c at line 145.
*** MPI_INIT failure message (snipped) ***
orte_grpcomm_modex failed
--> Returned "A message is attempting to be sent to a process whose
contact information is unknown" (-117) instead of "Success" (0)

This msg repeats for each rank, and ultimately hangs the srun, which I
have to Ctrl-C to terminate

I have MpiPorts defined in my slurm config, and running srun with
--resv-ports does show the SLURM_RESV_PORTS environment variable
getting passed to the shell
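
One way to check this from inside each task, rather than from the
interactive shell, is sketched below. The function name
check_resv_ports is made up for illustration; the only things taken
from this thread are the --resv-ports switch and the SLURM_RESV_PORTS
variable.

```shell
# Hypothetical helper (not part of Slurm or Open MPI): report whether the
# port reservation is visible in this process's environment.
check_resv_ports() {
    if [ -n "${SLURM_RESV_PORTS:-}" ]; then
        # Variable is set and non-empty: print it with the node name.
        echo "$(uname -n): SLURM_RESV_PORTS=${SLURM_RESV_PORTS}"
    else
        # Variable is missing or empty: complain on stderr and fail.
        echo "$(uname -n): SLURM_RESV_PORTS is not set" >&2
        return 1
    fi
}
```

Put in a small script that ends with a call to check_resv_ports and
launched through srun with --resv-ports, this should print one line per
task. A task that reports "not set" would suggest the reservation is
not reaching the MPI procs.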


On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain  wrote:
> I'm not sure there is any documentation yet - not much clamor for it. :-/
>
> It would really help if you included the error message. Otherwise, all I can 
> do is guess, which wastes both of our time :-(
>
> My best guess is that the port reservation didn't get passed down to the MPI 
> procs properly - but that's just a guess.
>
>
> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>
>> Can anyone point me towards the most recent documentation for using
>> srun and openmpi?
>>
>> I followed what i found on the web with enabling the MpiPorts config
>> in slurm and using the --resv-ports switch, but I'm getting an error
>> from openmpi during setup.
>>
>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>>
>> I'm sure I'm missing a step.
>>
>> Thanks
>


Re: [OMPI users] openmpi hangs when running on more than one node (unless i use --debug-daemons )

2010-12-30 Thread Advanced Computing Group University of Padova
Thank you, Ralph.
It works
:)
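
For anyone finding this later: the orte_leave_session_attached setting
from Ralph's reply quoted below can go in the per-user MCA parameter
file. What follows is only a minimal sketch. It assumes the usual
$HOME/.openmpi/mca-params.conf location, and set_mca_param is a made-up
helper name, not an Open MPI tool.

```shell
# Hypothetical helper: append "name = value" to an MCA parameter file
# unless a line for that name is already present.
set_mca_param() {
    name=$1
    value=$2
    # Third argument is optional; default to the per-user param file.
    file=${3:-"$HOME/.openmpi/mca-params.conf"}
    mkdir -p "$(dirname "$file")"
    touch "$file"
    if grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
        echo "$name already set in $file"
    else
        printf '%s = %s\n' "$name" "$value" >> "$file"
    fi
}

# Example call (commented out so that sourcing this file changes nothing):
# set_mca_param orte_leave_session_attached 1
```

Calling it twice leaves a single line in the file. The same parameter
can also be set for one shell by exporting
OMPI_MCA_orte_leave_session_attached=1 instead of editing the file.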

On Wed, Dec 29, 2010 at 4:23 PM, Ralph Castain  wrote:

> Both look perfectly right to me. The difference is only because your
> "success" one still has the ssh session active.
>
> It looks to me like something is preventing communication when the ssh
> session is terminated, but I have no clue why.
>
> Given the small cluster size, I would just add this to your default param
> file and not worry about it:
>
> orte_leave_session_attached = 1
>
>
> On Dec 29, 2010, at 2:10 AM, Advanced Computing Group University of Padova
> wrote:
>
>
>
> On Wed, Dec 29, 2010 at 10:10 AM, Advanced Computing Group University of
> Padova  wrote:
>
>> Thank you Ralph,
>> Your suspicion seems to be quite interesting :)
>> I tried to run the same program from node 192.168.1/2.11 using also
>> 192.168.2.12, "tracing" .12 activities.
>> I attach the two files (_succ: using --debug-daemons, _fail: without
>> --debug-daemons).
>> I notice that the orted daemon on the second node is called in a different
>> way.
>> Moreover, when I launch without --debug-daemons, a process called
>> orted.. remains active on the second node after I kill (ctrl+c) the
>> command on the first node.
>>
>> Can you continue to help me?
>>
>>
>> On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain  wrote:
>>
>>> All --debug-daemons really does is keep the ssh session open after
>>> launching the remote daemon and turn on some output. Otherwise, we close
>>> that session as most systems only allow a limited number of concurrent ssh
>>> sessions to be open.
>>>
>>> I suspect you have a system setting that kills any running job upon ssh
>>> close. It would be best if you removed that restriction. If you cannot, then
>>> you can always run your MPI jobs with --no-daemonize. This will keep the ssh
>>> session open, but without all the debug output.
>>>
>>> That flag is just shorthand for an MCA param, so you can set it in your
>>> environ or put it in your default MCA param file.
>>>
>>>
>>> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of
>>> Padova wrote:
>>>
>>> Yes, I've tested them.
>>> In fact, using the --debug-daemons switch everything works fine! (and I
>>> see that on the nodes a process called orted... is started whenever I launch
>>> a test application)
>>> I believe this is an environment variable problem
>>>
>>> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang wrote:
>>>
>>>> Have you tested your ssh key setup, firewall, and switch settings to
>>>> ensure all nodes are talking to each other?
>>>>
>>>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of
>>>> Padova  wrote:
>>>>
>>>>> using openmpi 1.4.2
>>>>>
>>>>>
>>>>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University
>>>>> of Padova  wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am building a small 16-node, Gentoo-based cluster.
>>>>>> I successfully installed openmpi and successfully ran some simple
>>>>>> small parallel test programs on a single host, but...
>>>>>> I can't run a parallel program on more than one node.
>>>>>>
>>>>>>
>>>>>> The nodes are cloned (so they are identical).
>>>>>> The mpiuser (and its ssh certificates) uses /home/mpiuser, which is an
>>>>>> NFS share.
>>>>>> I modified .bashrc
>>>>>>
>>>>>> -
>>>>>> PATH=/usr/bin:$PATH ; export PATH ;
>>>>>> LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>>
>>>>>> # already present below
>>>>>> if [[ $- != *i* ]] ; then
>>>>>> # Shell is non-interactive.  Be done now!
>>>>>> return
>>>>>> fi
>>>>>> -
>>>>>>
>>>>>> The very strange behaviour is that using --debug-daemons lets
>>>>>> my program run successfully.
>>>>>>
>>>>>> Thank you in advance and sorry for my bad English
>>>>>>
>>>>>>
>>>>>>
>>>>>>



>>>> --
>>>> David Zhang
>>>> University of California, San Diego


>>>
>