Re: [OMPI users] srun and openmpi
Run the program only once - it can be in the prolog of the job if you like. The output value needs to be in the env of every rank.

You can reuse the value as many times as you like - it doesn't have to be unique for each job. There is nothing magic about the value itself.

On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:

> How early does this need to run? Can I run it as part of a task
> prolog, or does it need to be the shell env for each rank? And does
> it need to run on one node or all the nodes in the job?
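A minimal sketch of the "run it once, put the value in the env of every rank" advice, in Python. It assumes the generator prints a single NAME=value line as in the sample output, and it relies on srun passing the caller's environment on to the tasks, which is an assumption about the site's srun setup. The function names, the -n 4 task count and ./my_mpi_app are placeholders, not anything from the thread.

```python
import os
import subprocess


def parse_envar_line(text):
    """Split the generator's single NAME=value output line into (name, value)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one NAME=value line, got {lines!r}")
    name, sep, value = lines[0].partition("=")
    if not sep or not name or not value:
        raise ValueError(f"expected NAME=value, got {lines[0]!r}")
    return name, value


def env_with_key(keygen_output, base_env=None):
    """Return a copy of the environment with the generated envar added."""
    env = dict(os.environ if base_env is None else base_env)
    name, value = parse_envar_line(keygen_output)
    env[name] = value
    return env


def launch_with_key(keygen_cmd, srun_cmd):
    """Run the key generator once, then start srun once with the key exported."""
    out = subprocess.run(keygen_cmd, capture_output=True, text=True, check=True).stdout
    # Every rank started by this single srun call sees the same value,
    # provided srun forwards the environment it was started with.
    return subprocess.run(srun_cmd, env=env_with_key(out)).returncode
```

A call might look like launch_with_key(["./psm_keygen"], ["srun", "--resv-ports", "-n", "4", "./my_mpi_app"]). Since the value can be reused, the output of one generator run could equally be saved and exported from a job prolog.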
Re: [OMPI users] srun and openmpi
How early does this need to run? Can I run it as part of a task prolog, or does it need to be the shell env for each rank? And does it need to run on one node or all the nodes in the job?

On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm system
> looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It outputs the
> envar required to run your program. So if you run the attached program and
> then export the output into your environment, you should be okay. Looks like
> this:
>
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] srun and openmpi
Should have also warned you: you'll need to configure OMPI --with-devel-headers to get this program to build/run.

On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote:

> Well, I couldn't do it as a patch - proved too complicated as the psm system
> looks for the value early in the boot procedure.
>
> What I can do is give you the attached key generator program. It outputs the
> envar required to run your program. So if you run the attached program and
> then export the output into your environment, you should be okay. Looks like
> this:
>
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
>
> You compile the program with the usual mpicc.
>
> Let me know if this solves the problem (or not).
> Ralph
Re: [OMPI users] srun and openmpi
Well, I couldn't do it as a patch - proved too complicated as the psm system looks for the value early in the boot procedure.

What I can do is give you the attached key generator program. It outputs the envar required to run your program. So if you run the attached program and then export the output into your environment, you should be okay. Looks like this:

$ ./psm_keygen
OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
$

You compile the program with the usual mpicc.

Let me know if this solves the problem (or not).
Ralph

psm_keygen.c
Description: Binary data

On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:

> Sure, I'll give it a go
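The attached psm_keygen.c did not survive in the archive, so the following is only a sketch of a stand-in, not Ralph's program. It assumes that the value just needs the shape seen in the sample output (two 16-digit lowercase hex fields joined by a hyphen); that shape is read off the example, not taken from Open MPI documentation. The names make_key and envar_line are made up for illustration.

```python
import re
import secrets

KEY_VAR = "OMPI_MCA_orte_precondition_transports"
# Shape copied from the sample output in the thread (assumed, not documented).
KEY_PATTERN = re.compile(r"[0-9a-f]{16}-[0-9a-f]{16}")


def make_key():
    """Return two random 64-bit values as zero-padded hex, joined by a hyphen."""
    return f"{secrets.randbits(64):016x}-{secrets.randbits(64):016x}"


def envar_line(key=None):
    """Return a NAME=value line in the same form the generator prints."""
    key = make_key() if key is None else key
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError(f"unexpected key shape: {key!r}")
    return f"{KEY_VAR}={key}"
```

Calling print(envar_line()) gives one line that can be exported the same way as the generator's output. Whether PSM accepts a value produced this way was not tested in the thread.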
Re: [OMPI users] srun and openmpi
Sure, I'll give it a go

On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain wrote:
> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun
> as it is shared info - i.e., every proc has to get the same value.
>
> I can create a patch that will do this for the srun direct-launch scenario,
> if you want to try it. Would be later today, though.
Re: [OMPI users] srun and openmpi
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value.

I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. Would be later today, though.

On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:

> Well, maybe not hooray yet. I might have jumped the gun a bit; it's
> looking like srun works in general, but perhaps not with PSM.
>
> With PSM I get this error (at least now I know what I changed):
>
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in the environment)
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
>
> Turn off PSM and srun works fine.
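Since the key is shared info and every proc has to get the same value, a small check like the one below could help confirm that a direct launch really delivered one identical value to all ranks. This is a sketch only: how the per-rank values are collected is left open, and the function name and the rank-to-value mapping are assumptions made for illustration.

```python
def check_shared_key(values_by_rank):
    """Check a mapping of rank -> key value (None or "" if unset).

    Returns (ok, problems), where problems is a list of readable findings.
    """
    problems = []
    missing = sorted(rank for rank, value in values_by_rank.items() if not value)
    if missing:
        problems.append(f"key missing on ranks {missing}")
    distinct = {value for value in values_by_rank.values() if value}
    if len(distinct) > 1:
        problems.append(f"{len(distinct)} different key values seen")
    return (not problems, problems)
```

The mapping could be filled in, for example, by having each task print its rank together with the value of OMPI_MCA_orte_precondition_transports and gathering the output.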
Re: [OMPI users] srun and openmpi
Well, maybe not hooray yet. I might have jumped the gun a bit; it's looking like srun works in general, but perhaps not with PSM.

With PSM I get this error (at least now I know what I changed):

Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in the environment)
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

Turn off PSM and srun works fine.

On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote:
> Hooray!
Re: [OMPI users] srun and openmpi
Hooray!

On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:

> I think I take it all back. I just tried it again and it seems to
> work now. I'm not sure what I changed (between my first and this
> msg), but it does appear to work now.
>
> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico wrote:
>> Yes, that's true, error messages help. I was hoping there was some
>> documentation to see what I've done wrong. I can't easily cut and
>> paste errors from my cluster.
>>
>> Here's a snippet (hand typed) of the error message, but it does look
>> like a rank communications error
>>
>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>> contact information is unknown in file rml_oob_send.c at line 145.
>> *** MPI_INIT failure message (snipped) ***
>> orte_grpcomm_modex failed
>> --> Returned "A message is attempting to be sent to a process whose
>> contact information is unknown" (-117) instead of "Success" (0)
>>
>> This msg repeats for each rank, and ultimately hangs the srun, which I
>> have to Ctrl-C and terminate.
>>
>> I have mpiports defined in my slurm config, and running srun with
>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>> getting passed to the shell.
Re: [OMPI users] srun and openmpi
Yes, that's true, error messages help. I was hoping there was some documentation to see what I've done wrong. I can't easily cut and paste errors from my cluster.

Here's a snippet (hand typed) of the error message, but it does look like a rank communications error

ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
contact information is unknown in file rml_oob_send.c at line 145.
*** MPI_INIT failure message (snipped) ***
orte_grpcomm_modex failed
--> Returned "A message is attempting to be sent to a process whose
contact information is unknown" (-117) instead of "Success" (0)

This msg repeats for each rank, and ultimately hangs the srun, which I have to Ctrl-C and terminate.

I have mpiports defined in my slurm config, and running srun with --resv-ports does show the SLURM_RESV_PORTS environment variable getting passed to the shell.

On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote:
> I'm not sure there is any documentation yet - not much clamor for it. :-/
>
> It would really help if you included the error message. Otherwise, all I can
> do is guess, which wastes both of our time :-(
>
> My best guess is that the port reservation didn't get passed down to the MPI
> procs properly - but that's just a guess.
>
>
> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>
>> Can anyone point me towards the most recent documentation for using
>> srun and openmpi?
>>
>> I followed what I found on the web with enabling the MpiPorts config
>> in slurm and using the --resv-ports switch, but I'm getting an error
>> from openmpi during setup.
>>
>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>>
>> I'm sure I'm missing a step.
>>
>> Thanks
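One way to test the guess that the port reservation never reached the MPI procs is to run a tiny script as the srun task itself and have every rank report what it sees. The sketch below assumes each task has SLURM_PROCID set by srun; the variable list, the function names and the file name check_env.py are illustrative choices. The transport key entry only matters when PSM is in use.

```python
import os

# Variables discussed in this thread; the second only matters with PSM.
LAUNCH_VARS = ("SLURM_RESV_PORTS", "OMPI_MCA_orte_precondition_transports")


def missing_vars(names=LAUNCH_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def report(env=None):
    """Return a one-line status for this task, tagged with its rank if known."""
    env = os.environ if env is None else env
    rank = env.get("SLURM_PROCID", "?")
    missing = missing_vars(env=env)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    return f"rank {rank}: {status}"
```

Saved as check_env.py with print(report()) added at the end, it could be started with something like srun --resv-ports -n 4 python3 check_env.py, giving one status line per rank.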