Re: [OMPI users] srun and openmpi
Certainly. I reached out to several contacts I have inside QLogic (I used to work there)...

On Fri, Apr 29, 2011 at 10:30 AM, Ralph Castain wrote:
> Hi Michael
>
> I'm told that the QLogic contacts we used to have are no longer there. Since
> you obviously are a customer, can you ping them and ask (a) what that error
> message means, and (b) what's wrong with the values I computed?
>
> You can also just send them my way, if that would help. We just need someone
> to explain the requirements on that precondition value.
>
> Thanks
> Ralph
Re: [OMPI users] srun and openmpi
On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico wrote:
> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote:
>> Hi Michael
>>
>> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd
>> the envar after adding it to the environ :-/
>
> The patch works great, I can now see the precondition environment
> variable if I do
>
> mpirun -n 2 -host node1
>
> and my run works just fine. However, if I do
>
> srun --resv-ports -n 2 -w node1
>
> I get
>
> [node1:16780] PSM EP connect error (unknown connect error):
> [node1:16780] node1
> [node1:16780] PSM EP connect error (Endpoint could not be reached):
> [node1:16780] node1
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
>
> I did notice a difference in the precondition env variable between the two runs
>
> mpirun -n 2 -host node1
>
> sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which
> changes with each run, i.e. it is random)
>
> srun --resv-ports -n 2 -w node1

This should have been "srun --resv-ports -n 1 -w node1" -- I can't run a 2-rank job, I get the PML error above.

> sets precondition_transports=1845-0001 (which
> doesn't seem to change from run to run)
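For reference, the value being compared here can be checked from inside a job. The small MPI program below is a sketch of such a check (it is not something posted in this thread): each rank prints the precondition key it inherited, so an mpirun launch and an srun --resv-ports launch can be compared side by side.

    /* Sketch: print the precondition key every rank sees.  Not part of Open
     * MPI; just a way to compare an mpirun launch against an srun launch. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        const char *key = getenv("OMPI_MCA_orte_precondition_transports");
        printf("rank %d on %s: %s\n", rank, host, key ? key : "(not set)");

        MPI_Finalize();
        return 0;
    }

Built with mpicc, every rank of a working job should report the same value, and that value should look like the two dash-separated, 16-hex-digit words mpirun generates; the short, unchanging 1845-0001 seen under srun is exactly the kind of mismatch being chased here.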
Re: [OMPI users] srun and openmpi
On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote:
> Hi Michael
>
> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd
> the envar after adding it to the environ :-/

The patch works great, I can now see the precondition environment variable if I do

mpirun -n 2 -host node1

and my run works just fine. However, if I do

srun --resv-ports -n 2 -w node1

I get

[node1:16780] PSM EP connect error (unknown connect error):
[node1:16780] node1
[node1:16780] PSM EP connect error (Endpoint could not be reached):
[node1:16780] node1

PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

I did notice a difference in the precondition env variable between the two runs:

mpirun -n 2 -host node1

sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which changes with each run, i.e. it is random)

srun --resv-ports -n 2 -w node1

sets precondition_transports=1845-0001 (which doesn't seem to change from run to run)
Re: [OMPI users] srun and openmpi
Hi Michael

Please see the attached updated patch to try for 1.5.3. I mistakenly free'd the envar after adding it to the environ :-/

Thanks
Ralph

[Attachment: slurmd.diff]

On Apr 28, 2011, at 2:31 PM, Michael Di Domenico wrote:
> I'm using OpenMPI v1.5.3 currently
Re: [OMPI users] srun and openmpi
On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain wrote:
> Just to save me looking back thru the thread - what OMPI version are you
> using? If it isn't the trunk, I'll send you a patch you can use.

I'm using OpenMPI v1.5.3 currently
Re: [OMPI users] srun and openmpi
Per earlier in the thread, it looks like you are using a 1.5 series release - so here is a patch that -should- fix the PSM setup problem. Please let me know if/how it works as I honestly have no way of testing it.

Ralph

[Attachment: slurmd.diff]

On Apr 28, 2011, at 7:03 AM, Ralph Castain wrote:
> Just to save me looking back thru the thread - what OMPI version are you
> using? If it isn't the trunk, I'll send you a patch you can use.
Re: [OMPI users] srun and openmpi
On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote:
>> Thought about this some more and I believe I have a soln to the problem.
>> Will try to commit something to the devel trunk by the end of the week.
>
> Thanks

Just to save me looking back thru the thread - what OMPI version are you using? If it isn't the trunk, I'll send you a patch you can use.
Re: [OMPI users] srun and openmpi
On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote:
> Thought about this some more and I believe I have a soln to the problem. Will
> try to commit something to the devel trunk by the end of the week.

Thanks
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
> Oh, okay, sorry...

Thought about this some more and I believe I have a soln to the problem. Will try to commit something to the devel trunk by the end of the week.

Ralph
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 3:39 PM, Ralph Castain wrote:
> Nope, nope nope...in this mode of operation, we are using -static- ports.

Er.. right. Sorry -- my bad for not reading the full context here... ignore what I said...

-- Jeff Squyres
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote:
> OMPI asks for whatever port the OS has open (i.e., we pass in 0 when asking
> for a specific port number, and the OS fills it in for us). If it gives us
> back a port that isn't actually available, that would be really surprising.

Nope, nope nope...in this mode of operation, we are using -static- ports. The problem here is that srun is incorrectly handing out the same port reservation to the next job, causing the port binding to fail because the last job's binding hasn't yet timed out.
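A minimal sketch of the failure mode described here, using a hypothetical port number and bearing no relation to Open MPI's actual code: with static, reserved ports there is exactly one port to try, and if a previous job step's socket on that port is still lingering (for example in TIME_WAIT), the single bind() attempt fails with EADDRINUSE and the job aborts.

    /* Sketch: one bind() attempt on a fixed, reserved port (12001 is a
     * hypothetical value).  If the port is still held by a previous job's
     * socket, bind() returns EADDRINUSE and there is no retry. */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(12001);               /* the one reserved port */

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
            fprintf(stderr, "bind(12001): %s\n", strerror(errno));
            close(fd);
            return 1;                             /* single attempt, then abort */
        }
        close(fd);
        return 0;
    }

Whether it fails depends on whether the old binding has timed out by the time the next step starts, which is why a stream of back-to-back short jobs hits it while widely spaced ones do not.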
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
> So if the OS hasn't released the port by the time a new job starts on that
> node, then it will indeed abort if the job was unfortunately given the same
> port reservation.

FWIW, the OS may be trying multiple times under the covers, but as far as OMPI is concerned, we're just trying once.

OMPI asks for whatever port the OS has open (i.e., we pass in 0 rather than asking for a specific port number, and the OS fills it in for us). If it gives us back a port that isn't actually available, that would be really surprising.

If you have a bajillion short jobs running, I wonder if there's some kind of race condition occurring where some MPI processes are getting messages from the wrong mpirun. And then things go downhill from there.

I can't immediately imagine how that would happen, but maybe there's some kind of weird race condition in there somewhere...? We pass specific IP addresses and ports around on the command line, though, so I don't quite see how that would happen...

-- Jeff Squyres
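By way of contrast, the "pass in 0" behaviour described above looks roughly like the sketch below (again, not Open MPI's code): binding port 0 lets the kernel pick any free ephemeral port, and getsockname() reports which one it chose, so collisions essentially never happen - unlike the static, srun-reserved ports discussed above.

    /* Sketch: ask the OS for any free port by binding port 0, then read back
     * the port it actually assigned with getsockname(). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        socklen_t len = sizeof(sa);

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = 0;                          /* let the kernel choose */

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
            perror("bind");
            return 1;
        }
        getsockname(fd, (struct sockaddr *)&sa, &len);
        printf("OS assigned port %d\n", ntohs(sa.sin_port));
        close(fd);
        return 0;
    }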
Re: [OMPI users] srun and openmpi
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote:
> Actually, I understood you correctly. I'm just saying that I find no evidence
> in the code that we try three times before giving up. What I see is a single
> attempt to bind the port - if it fails, then we abort. There is no parameter
> to control that behavior.
>
> So if the OS hasn't released the port by the time a new job starts on that
> node, then it will indeed abort if the job was unfortunately given the same
> port reservation.

Oh, okay, sorry...
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
> Perhaps i misstated, i don't believe you're trying to bind to the same
> port twice during the same session. i believe the code re-uses
> similar ports from session to session.

Actually, I understood you correctly. I'm just saying that I find no evidence in the code that we try three times before giving up. What I see is a single attempt to bind the port - if it fails, then we abort. There is no parameter to control that behavior.

So if the OS hasn't released the port by the time a new job starts on that node, then it will indeed abort if the job was unfortunately given the same port reservation.
Re: [OMPI users] srun and openmpi
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote:
> Had to look back at the code - I think you misread this. I can find no
> evidence in the code that we try to bind that port more than once.

Perhaps I misstated; I don't believe you're trying to bind to the same port twice during the same session. I believe the code re-uses similar ports from session to session. What I believe happens (but I could be totally wrong) is that the previous session releases the port, but Linux isn't quite done with it when the new session tries to bind to the port, in which case it tries three times and then fails the job.
Re: [OMPI users] srun and openmpi
On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?

Not that I know of - I don't think the PSM developers ever looked at it.

> Also, i recall reading somewhere that the --resv-ports parameter that
> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
> tries to lock a port from the pool three times before giving up.

Had to look back at the code - I think you misread this. I can find no evidence in the code that we try to bind that port more than once.
Re: [OMPI users] srun and openmpi
Was this ever committed to the OMPI src as something not having to be run outside of OpenMPI, but as part of the PSM setup that OpenMPI does?

I'm having some trouble getting Slurm/OpenMPI to play nice with the setup of this key. Namely, with slurm you cannot export variables from the --prolog of an srun, only from a --task-prolog; unfortunately, if you use a task-prolog each rank gets a different key, which doesn't work.

I'm also guessing that each unique mpirun needs its own psm key, not one for the whole system, so I can't just make it a permanent parameter somewhere else.

Also, I recall reading somewhere that the --resv-ports parameter that OMPI uses from slurm to choose a list of ports to use for TCP comm's tries to lock a port from the pool three times before giving up.

Can someone tell me where that parameter is set? I'd like to set it to a higher value. We're seeing issues where running a large number of short sruns sequentially is causing some of the mpiruns in the stream to be killed because they could not lock the ports.

I suspect that because the gap between when the port is actually closed in Linux and when ompi re-opens a new port is very short, we're trying three times and giving up. I have more than enough ports in the resv-ports list, 30k, but I suspect there is some random re-use being done and it's failing.

thanks

On Mon, Jan 3, 2011 at 10:00 AM, Jeff Squyres wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.
> Do you want to add a blurb in README about it, and/or have this executable
> compiled as part of the PSM MTL and then installed into $bindir (maybe named
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed,
> right?
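For context on the reservation being described: with a port range configured in slurm.conf (the MpiPorts/MpiParams setting referenced later in this thread) and srun --resv-ports, each job step sees the reserved range in its environment (it shows up later in this thread as SLURM_RESV_PORTS). The sketch below shows one plausible way a launcher could map that range onto local ranks; the variable names, the "low-high" format, and the rank-offset scheme are illustrative assumptions, not Open MPI's actual implementation.

    /* Sketch: read the port reservation srun exports and pick the Nth port
     * for the Nth local rank.  Variable names and the "low-high" format are
     * assumptions for illustration, not Open MPI's actual parsing. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *resv = getenv("SLURM_RESV_PORTS");   /* e.g. "12000-12015" */
        const char *id   = getenv("SLURM_LOCALID");      /* local rank on this node */
        int lo, hi, n;

        if (!resv || !id) {
            fprintf(stderr, "no port reservation in the environment\n");
            return 1;
        }
        if (sscanf(resv, "%d-%d", &lo, &hi) != 2 || lo > hi) {
            fprintf(stderr, "unexpected reservation format: %s\n", resv);
            return 1;
        }
        n = atoi(id);
        if (lo + n > hi) {
            fprintf(stderr, "reservation too small for local rank %d\n", n);
            return 1;
        }
        printf("local rank %d -> port %d\n", n, lo + n);
        return 0;
    }

If two back-to-back job steps are handed the same range, this is exactly where a port still held by the previous step would collide, which is the failure mode discussed earlier in the thread.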
Re: [OMPI users] srun and openmpi
Yes, I am setting the config correctly. Our IB machines seem to run just fine so far using srun and openmpi v1.5.

As another data point, we enabled mpi-threads in Openmpi and that also seems to trigger the srun/TCP behavior, but on the IB fabric. If we run the program within an salloc rather than a straight srun, the problem seems to go away.

On Tue, Jan 25, 2011 at 2:59 PM, Nathan Hjelm wrote:
> We are seeing a similar problem with our infiniband machines. After some
> investigation I discovered that we were not setting our slurm environment
> correctly (ref:
> https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you
> setting the ports in your slurm.conf and executing srun with --resv-ports?
>
> I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin
> to modify the slurm.conf.
>
> -Nathan
> HPC-3, LANL
Re: [OMPI users] srun and openmpi
We are seeing a similar problem with our infiniband machines. After some investigation I discovered that we were not setting our slurm environment correctly (ref: https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you setting the ports in your slurm.conf and executing srun with --resv-ports?

I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin to modify the slurm.conf.

-Nathan
HPC-3, LANL

On Tue, 25 Jan 2011, Michael Di Domenico wrote:
> Thanks. We're only seeing it on machines with Ethernet only as the
> interconnect. Fortunately for us that only equates to one small
> machine, but it's still annoying. Unfortunately, I don't have enough
> knowledge to dive into the code to help fix it, but I can certainly
> help test.
Re: [OMPI users] srun and openmpi
Thanks. We're only seeing it on machines with Ethernet only as the interconnect. Fortunately for us that only equates to one small machine, but it's still annoying. Unfortunately, I don't have enough knowledge to dive into the code to help fix it, but I can certainly help test.

On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm wrote:
> I am seeing similar issues on our slurm clusters. We are looking into the
> issue.
>
> -Nathan
> HPC-3, LANL
Re: [OMPI users] srun and openmpi
Any ideas on what might be causing this one? Or at least what additional debug information someone might need?

On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico wrote:
> I'm still testing the slurm integration, which seems to work fine so
> far. However, i just upgraded another cluster to openmpi-1.5 and
> slurm 2.1.15 but this machine has no infiniband
>
> if i salloc the nodes and mpirun the command it seems to run and complete fine
> however if i srun the command i get
>
> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
> unexpected prcoess identifier
Re: [OMPI users] srun and openmpi
I'm still testing the slurm integration, which seems to work fine so far. However, I just upgraded another cluster to openmpi-1.5 and slurm 2.1.15, but this machine has no infiniband.

If I salloc the nodes and mpirun the command, it seems to run and complete fine; however, if I srun the command I get

[btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received unexpected prcoess identifier

The job does not seem to run, but exhibits two behaviors:
running a single process per node, the job runs and does not present the error (srun -N40 --ntasks-per-node=1)
running multiple processes per node, the job spits out the error but does not run (srun -n40 --ntasks-per-node=8)

I copied the configs from the other machine, so (I think) everything should be configured correctly (but I can't rule it out).

I saw (and reported) a similar error to the above with the 1.4-dev branch (see mailing list) and slurm; I can't say whether they're related or not though.

On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.
> Do you want to add a blurb in README about it, and/or have this executable
> compiled as part of the PSM MTL and then installed into $bindir (maybe named
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed,
> right?
Re: [OMPI users] srun and openmpi
Run the program only once - it can be in the prolog of the job if you like. The output value needs to be in the env of every rank. You can reuse the value as many times as you like - it doesn't have to be unique for each job. There is nothing magic about the value itself. On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote: > How early does this need to run? Can I run it as part of a task > prolog, or does it need to be the shell env for each rank? And does > it need to run on one node or all the nodes in the job? > > On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castainwrote: >> Well, I couldn't do it as a patch - proved too complicated as the psm system >> looks for the value early in the boot procedure. >> >> What I can do is give you the attached key generator program. It outputs the >> envar required to run your program. So if you run the attached program and >> then export the output into your environment, you should be okay. Looks like >> this: >> >> $ ./psm_keygen >> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954 >> $ >> >> You compile the program with the usual mpicc. >> >> Let me know if this solves the problem (or not). >> Ralph >> >> >> >> >> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote: >> >>> Sure, i'll give it a go >>> >>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain wrote: Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value. I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. Would be later today, though. On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: > Well maybe not horray, yet. I might have jumped the gun a bit, it's > looking like srun works in general, but perhaps not with PSM > > With PSM i get this error, (at least now i know what i changed) > > Error obtaining unique transport key from ORTE > (orte_precondition_transports not present in the environment) > PML add procs failed > --> Returned "Error" (-1) instead of "Success" (0) > > Turn off PSM and srun works fine > > > On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote: >> Hooray! >> >> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: >> >>> I think i take it all back. I just tried it again and it seems to >>> work now. I'm not sure what I changed (between my first and this >>> msg), but it does appear to work now. >>> >>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >>> wrote: Yes that's true, error messages help. I was hoping there was some documentation to see what i've done wrong. I can't easily cut and paste errors from my cluster. Here's a snippet (hand typed) of the error message, but it does look like a rank communications error ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 145. *** MPI_INIT failure message (snipped) *** orte_grpcomm_modex failed --> Returned "A messages is attempting to be sent to a process whose contact information us uknown" (-117) instead of "Success" (0) This msg repeats for each rank, an ultimately hangs the srun which i have to Ctrl-C and terminate I have mpiports defined in my slurm config and running srun with -resv-ports does show the SLURM_RESV_PORTS environment variable getting parts to the shell On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: > I'm not sure there is any documentation yet - not much clamor for it. > :-/ > > It would really help if you included the error message. 
Otherwise, > all I can do is guess, which wastes both of our time :-( > > My best guess is that the port reservation didn't get passed down to > the MPI procs properly - but that's just a guess. > > > On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > >> Can anyone point me towards the most recent documentation for using >> srun and openmpi? >> >> I followed what i found on the web with enabling the MpiPorts config >> in slurm and using the --resv-ports switch, but I'm getting an error >> from openmpi during setup. >> >> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM >> >> I'm sure I'm missing a step. >> >> Thanks >> ___ >> users mailing list >>
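A stand-alone generator in the spirit of the psm_keygen program quoted above could look like the sketch below; this is not the actual psm_keygen source (which, per the thread, is built inside the Open MPI tree as part of "make check"), only something that reproduces the kind of output shown. Run it once per job - for example from the prolog, as described above - and export the printed assignment so every rank inherits the same value.

    /* Sketch of a stand-alone key generator: 16 random bytes printed as the
     * environment assignment shown above.  Not the actual psm_keygen source. */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t key[2];
        FILE *rnd = fopen("/dev/urandom", "rb");

        if (!rnd || fread(key, sizeof(key), 1, rnd) != 1) {
            fprintf(stderr, "could not read /dev/urandom\n");
            return 1;
        }
        fclose(rnd);
        printf("OMPI_MCA_orte_precondition_transports=%016" PRIx64 "-%016" PRIx64 "\n",
               key[0], key[1]);
        return 0;
    }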
Re: [OMPI users] srun and openmpi
How early does this need to run? Can I run it as part of a task prolog, or does it need to be the shell env for each rank? And does it need to run on one node or all the nodes in the job? On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castainwrote: > Well, I couldn't do it as a patch - proved too complicated as the psm system > looks for the value early in the boot procedure. > > What I can do is give you the attached key generator program. It outputs the > envar required to run your program. So if you run the attached program and > then export the output into your environment, you should be okay. Looks like > this: > > $ ./psm_keygen > OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954 > $ > > You compile the program with the usual mpicc. > > Let me know if this solves the problem (or not). > Ralph > > > > > On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote: > >> Sure, i'll give it a go >> >> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain wrote: >>> Ah, yes - that is going to be a problem. The PSM key gets generated by >>> mpirun as it is shared info - i.e., every proc has to get the same value. >>> >>> I can create a patch that will do this for the srun direct-launch scenario, >>> if you want to try it. Would be later today, though. >>> >>> >>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: >>> Well maybe not horray, yet. I might have jumped the gun a bit, it's looking like srun works in general, but perhaps not with PSM With PSM i get this error, (at least now i know what i changed) Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment) PML add procs failed --> Returned "Error" (-1) instead of "Success" (0) Turn off PSM and srun works fine On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote: > Hooray! > > On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > >> I think i take it all back. I just tried it again and it seems to >> work now. I'm not sure what I changed (between my first and this >> msg), but it does appear to work now. >> >> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >> wrote: >>> Yes that's true, error messages help. I was hoping there was some >>> documentation to see what i've done wrong. I can't easily cut and >>> paste errors from my cluster. >>> >>> Here's a snippet (hand typed) of the error message, but it does look >>> like a rank communications error >>> >>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose >>> contact information is unknown in file rml_oob_send.c at line 145. >>> *** MPI_INIT failure message (snipped) *** >>> orte_grpcomm_modex failed >>> --> Returned "A messages is attempting to be sent to a process whose >>> contact information us uknown" (-117) instead of "Success" (0) >>> >>> This msg repeats for each rank, an ultimately hangs the srun which i >>> have to Ctrl-C and terminate >>> >>> I have mpiports defined in my slurm config and running srun with >>> -resv-ports does show the SLURM_RESV_PORTS environment variable >>> getting parts to the shell >>> >>> >>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain >>> wrote: I'm not sure there is any documentation yet - not much clamor for it. :-/ It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-( My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess. 
On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > Can anyone point me towards the most recent documentation for using > srun and openmpi? > > I followed what i found on the web with enabling the MpiPorts config > in slurm and using the --resv-ports switch, but I'm getting an error > from openmpi during setup. > > I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM > > I'm sure I'm missing a step. > > Thanks > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > >
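A minimal sketch of the generate-once-then-launch arrangement being discussed, assuming srun propagates the launching environment to every task (the batch-script layout, node count, and application name my_mpi_app are illustrative, not from the thread):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 2
    # Generate the transport key once for the whole job; every rank then
    # inherits the same OMPI_MCA_orte_precondition_transports value through
    # srun's normal environment propagation.
    export $(./psm_keygen)
    srun --resv-ports -n 2 ./my_mpi_app

Running the generator from a per-task prolog instead would presumably hand each rank its own value, which is the opposite of what PSM needs here, since every proc has to see the same key.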
Re: [OMPI users] srun and openmpi
Should have also warned you: you'll need to configure OMPI --with-devel-headers to get this program to build/run. On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote: > Well, I couldn't do it as a patch - proved too complicated as the psm system > looks for the value early in the boot procedure. > > What I can do is give you the attached key generator program. It outputs the > envar required to run your program. So if you run the attached program and > then export the output into your environment, you should be okay. Looks like > this: > > $ ./psm_keygen > OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954 > $ > > You compile the program with the usual mpicc. > > Let me know if this solves the problem (or not). > Ralph > > > > On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote: > >> Sure, i'll give it a go >> >> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castainwrote: >>> Ah, yes - that is going to be a problem. The PSM key gets generated by >>> mpirun as it is shared info - i.e., every proc has to get the same value. >>> >>> I can create a patch that will do this for the srun direct-launch scenario, >>> if you want to try it. Would be later today, though. >>> >>> >>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: >>> Well maybe not horray, yet. I might have jumped the gun a bit, it's looking like srun works in general, but perhaps not with PSM With PSM i get this error, (at least now i know what i changed) Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment) PML add procs failed --> Returned "Error" (-1) instead of "Success" (0) Turn off PSM and srun works fine On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote: > Hooray! > > On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > >> I think i take it all back. I just tried it again and it seems to >> work now. I'm not sure what I changed (between my first and this >> msg), but it does appear to work now. >> >> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >> wrote: >>> Yes that's true, error messages help. I was hoping there was some >>> documentation to see what i've done wrong. I can't easily cut and >>> paste errors from my cluster. >>> >>> Here's a snippet (hand typed) of the error message, but it does look >>> like a rank communications error >>> >>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose >>> contact information is unknown in file rml_oob_send.c at line 145. >>> *** MPI_INIT failure message (snipped) *** >>> orte_grpcomm_modex failed >>> --> Returned "A messages is attempting to be sent to a process whose >>> contact information us uknown" (-117) instead of "Success" (0) >>> >>> This msg repeats for each rank, an ultimately hangs the srun which i >>> have to Ctrl-C and terminate >>> >>> I have mpiports defined in my slurm config and running srun with >>> -resv-ports does show the SLURM_RESV_PORTS environment variable >>> getting parts to the shell >>> >>> >>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain >>> wrote: I'm not sure there is any documentation yet - not much clamor for it. :-/ It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-( My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess. On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > Can anyone point me towards the most recent documentation for using > srun and openmpi? 
> > I followed what i found on the web with enabling the MpiPorts config > in slurm and using the --resv-ports switch, but I'm getting an error > from openmpi during setup. > > I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM > > I'm sure I'm missing a step. > > Thanks > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list >
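A sketch of the build steps being described, assuming a from-source install (the prefix and the --with-psm flag are illustrative additions; the flag Ralph is calling out is --with-devel-headers):

    $ ./configure --prefix=/opt/openmpi-1.5.3 --with-devel-headers --with-psm
    $ make all install
    $ mpicc -o psm_keygen psm_keygen.c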
Re: [OMPI users] srun and openmpi
Well, I couldn't do it as a patch - proved too complicated as the psm system looks for the value early in the boot procedure. What I can do is give you the attached key generator program. It outputs the envar required to run your program. So if you run the attached program and then export the output into your environment, you should be okay. Looks like this: $ ./psm_keygen OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954 $ You compile the program with the usual mpicc. Let me know if this solves the problem (or not). Ralph psm_keygen.c Description: Binary data On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote: > Sure, i'll give it a go > > On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castainwrote: >> Ah, yes - that is going to be a problem. The PSM key gets generated by >> mpirun as it is shared info - i.e., every proc has to get the same value. >> >> I can create a patch that will do this for the srun direct-launch scenario, >> if you want to try it. Would be later today, though. >> >> >> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: >> >>> Well maybe not horray, yet. I might have jumped the gun a bit, it's >>> looking like srun works in general, but perhaps not with PSM >>> >>> With PSM i get this error, (at least now i know what i changed) >>> >>> Error obtaining unique transport key from ORTE >>> (orte_precondition_transports not present in the environment) >>> PML add procs failed >>> --> Returned "Error" (-1) instead of "Success" (0) >>> >>> Turn off PSM and srun works fine >>> >>> >>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote: Hooray! On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > I think i take it all back. I just tried it again and it seems to > work now. I'm not sure what I changed (between my first and this > msg), but it does appear to work now. > > On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico > wrote: >> Yes that's true, error messages help. I was hoping there was some >> documentation to see what i've done wrong. I can't easily cut and >> paste errors from my cluster. >> >> Here's a snippet (hand typed) of the error message, but it does look >> like a rank communications error >> >> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose >> contact information is unknown in file rml_oob_send.c at line 145. >> *** MPI_INIT failure message (snipped) *** >> orte_grpcomm_modex failed >> --> Returned "A messages is attempting to be sent to a process whose >> contact information us uknown" (-117) instead of "Success" (0) >> >> This msg repeats for each rank, an ultimately hangs the srun which i >> have to Ctrl-C and terminate >> >> I have mpiports defined in my slurm config and running srun with >> -resv-ports does show the SLURM_RESV_PORTS environment variable >> getting parts to the shell >> >> >> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: >>> I'm not sure there is any documentation yet - not much clamor for it. >>> :-/ >>> >>> It would really help if you included the error message. Otherwise, all >>> I can do is guess, which wastes both of our time :-( >>> >>> My best guess is that the port reservation didn't get passed down to >>> the MPI procs properly - but that's just a guess. >>> >>> >>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: >>> Can anyone point me towards the most recent documentation for using srun and openmpi? 
I followed what i found on the web with enabling the MpiPorts config in slurm and using the --resv-ports switch, but I'm getting an error from openmpi during setup. I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM I'm sure I'm missing a step. Thanks ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >>
Re: [OMPI users] srun and openmpi
Sure, i'll give it a go On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castainwrote: > Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun > as it is shared info - i.e., every proc has to get the same value. > > I can create a patch that will do this for the srun direct-launch scenario, > if you want to try it. Would be later today, though. > > > On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: > >> Well maybe not horray, yet. I might have jumped the gun a bit, it's >> looking like srun works in general, but perhaps not with PSM >> >> With PSM i get this error, (at least now i know what i changed) >> >> Error obtaining unique transport key from ORTE >> (orte_precondition_transports not present in the environment) >> PML add procs failed >> --> Returned "Error" (-1) instead of "Success" (0) >> >> Turn off PSM and srun works fine >> >> >> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain wrote: >>> Hooray! >>> >>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: >>> I think i take it all back. I just tried it again and it seems to work now. I'm not sure what I changed (between my first and this msg), but it does appear to work now. On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico wrote: > Yes that's true, error messages help. I was hoping there was some > documentation to see what i've done wrong. I can't easily cut and > paste errors from my cluster. > > Here's a snippet (hand typed) of the error message, but it does look > like a rank communications error > > ORTE_ERROR_LOG: A message is attempting to be sent to a process whose > contact information is unknown in file rml_oob_send.c at line 145. > *** MPI_INIT failure message (snipped) *** > orte_grpcomm_modex failed > --> Returned "A messages is attempting to be sent to a process whose > contact information us uknown" (-117) instead of "Success" (0) > > This msg repeats for each rank, an ultimately hangs the srun which i > have to Ctrl-C and terminate > > I have mpiports defined in my slurm config and running srun with > -resv-ports does show the SLURM_RESV_PORTS environment variable > getting parts to the shell > > > On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: >> I'm not sure there is any documentation yet - not much clamor for it. :-/ >> >> It would really help if you included the error message. Otherwise, all I >> can do is guess, which wastes both of our time :-( >> >> My best guess is that the port reservation didn't get passed down to the >> MPI procs properly - but that's just a guess. >> >> >> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: >> >>> Can anyone point me towards the most recent documentation for using >>> srun and openmpi? >>> >>> I followed what i found on the web with enabling the MpiPorts config >>> in slurm and using the --resv-ports switch, but I'm getting an error >>> from openmpi during setup. >>> >>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM >>> >>> I'm sure I'm missing a step. 
>>> >>> Thanks >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] srun and openmpi
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value. I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. Would be later today, though. On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote: > Well maybe not horray, yet. I might have jumped the gun a bit, it's > looking like srun works in general, but perhaps not with PSM > > With PSM i get this error, (at least now i know what i changed) > > Error obtaining unique transport key from ORTE > (orte_precondition_transports not present in the environment) > PML add procs failed > --> Returned "Error" (-1) instead of "Success" (0) > > Turn off PSM and srun works fine > > > On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castainwrote: >> Hooray! >> >> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: >> >>> I think i take it all back. I just tried it again and it seems to >>> work now. I'm not sure what I changed (between my first and this >>> msg), but it does appear to work now. >>> >>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >>> wrote: Yes that's true, error messages help. I was hoping there was some documentation to see what i've done wrong. I can't easily cut and paste errors from my cluster. Here's a snippet (hand typed) of the error message, but it does look like a rank communications error ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 145. *** MPI_INIT failure message (snipped) *** orte_grpcomm_modex failed --> Returned "A messages is attempting to be sent to a process whose contact information us uknown" (-117) instead of "Success" (0) This msg repeats for each rank, an ultimately hangs the srun which i have to Ctrl-C and terminate I have mpiports defined in my slurm config and running srun with -resv-ports does show the SLURM_RESV_PORTS environment variable getting parts to the shell On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: > I'm not sure there is any documentation yet - not much clamor for it. :-/ > > It would really help if you included the error message. Otherwise, all I > can do is guess, which wastes both of our time :-( > > My best guess is that the port reservation didn't get passed down to the > MPI procs properly - but that's just a guess. > > > On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > >> Can anyone point me towards the most recent documentation for using >> srun and openmpi? >> >> I followed what i found on the web with enabling the MpiPorts config >> in slurm and using the --resv-ports switch, but I'm getting an error >> from openmpi during setup. >> >> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM >> >> I'm sure I'm missing a step. >> >> Thanks >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] srun and openmpi
Well maybe not horray, yet. I might have jumped the gun a bit, it's looking like srun works in general, but perhaps not with PSM With PSM i get this error, (at least now i know what i changed) Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment) PML add procs failed --> Returned "Error" (-1) instead of "Success" (0) Turn off PSM and srun works fine On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castainwrote: > Hooray! > > On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > >> I think i take it all back. I just tried it again and it seems to >> work now. I'm not sure what I changed (between my first and this >> msg), but it does appear to work now. >> >> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >> wrote: >>> Yes that's true, error messages help. I was hoping there was some >>> documentation to see what i've done wrong. I can't easily cut and >>> paste errors from my cluster. >>> >>> Here's a snippet (hand typed) of the error message, but it does look >>> like a rank communications error >>> >>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose >>> contact information is unknown in file rml_oob_send.c at line 145. >>> *** MPI_INIT failure message (snipped) *** >>> orte_grpcomm_modex failed >>> --> Returned "A messages is attempting to be sent to a process whose >>> contact information us uknown" (-117) instead of "Success" (0) >>> >>> This msg repeats for each rank, an ultimately hangs the srun which i >>> have to Ctrl-C and terminate >>> >>> I have mpiports defined in my slurm config and running srun with >>> -resv-ports does show the SLURM_RESV_PORTS environment variable >>> getting parts to the shell >>> >>> >>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: I'm not sure there is any documentation yet - not much clamor for it. :-/ It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-( My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess. On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > Can anyone point me towards the most recent documentation for using > srun and openmpi? > > I followed what i found on the web with enabling the MpiPorts config > in slurm and using the --resv-ports switch, but I'm getting an error > from openmpi during setup. > > I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM > > I'm sure I'm missing a step. > > Thanks > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
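One way to make that same with/without-PSM comparison without rebuilding is to steer Open MPI away from the PSM path for a single run via an MCA environment variable (a sketch; the application name is illustrative):

    # Force the ob1 PML so the PSM MTL is never selected for this run
    $ export OMPI_MCA_pml=ob1
    $ srun --resv-ports -n 2 -w node1 ./my_mpi_app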
Re: [OMPI users] srun and openmpi
Hooray! On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > I think i take it all back. I just tried it again and it seems to > work now. I'm not sure what I changed (between my first and this > msg), but it does appear to work now. > > On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico >wrote: >> Yes that's true, error messages help. I was hoping there was some >> documentation to see what i've done wrong. I can't easily cut and >> paste errors from my cluster. >> >> Here's a snippet (hand typed) of the error message, but it does look >> like a rank communications error >> >> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose >> contact information is unknown in file rml_oob_send.c at line 145. >> *** MPI_INIT failure message (snipped) *** >> orte_grpcomm_modex failed >> --> Returned "A messages is attempting to be sent to a process whose >> contact information us uknown" (-117) instead of "Success" (0) >> >> This msg repeats for each rank, an ultimately hangs the srun which i >> have to Ctrl-C and terminate >> >> I have mpiports defined in my slurm config and running srun with >> -resv-ports does show the SLURM_RESV_PORTS environment variable >> getting parts to the shell >> >> >> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: >>> I'm not sure there is any documentation yet - not much clamor for it. :-/ >>> >>> It would really help if you included the error message. Otherwise, all I >>> can do is guess, which wastes both of our time :-( >>> >>> My best guess is that the port reservation didn't get passed down to the >>> MPI procs properly - but that's just a guess. >>> >>> >>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: >>> Can anyone point me towards the most recent documentation for using srun and openmpi? I followed what i found on the web with enabling the MpiPorts config in slurm and using the --resv-ports switch, but I'm getting an error from openmpi during setup. I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM I'm sure I'm missing a step. Thanks ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] srun and openmpi
Yes that's true, error messages help. I was hoping there was some documentation to see what i've done wrong. I can't easily cut and paste errors from my cluster. Here's a snippet (hand typed) of the error message, but it does look like a rank communications error ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 145. *** MPI_INIT failure message (snipped) *** orte_grpcomm_modex failed --> Returned "A message is attempting to be sent to a process whose contact information is unknown" (-117) instead of "Success" (0) This msg repeats for each rank, and ultimately hangs the srun which i have to Ctrl-C and terminate I have mpiports defined in my slurm config and running srun with --resv-ports does show the SLURM_RESV_PORTS environment variable getting passed to the shell On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain wrote: > I'm not sure there is any documentation yet - not much clamor for it. :-/ > > It would really help if you included the error message. Otherwise, all I can > do is guess, which wastes both of our time :-( > > My best guess is that the port reservation didn't get passed down to the MPI > procs properly - but that's just a guess. > > > On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > >> Can anyone point me towards the most recent documentation for using >> srun and openmpi? >> >> I followed what i found on the web with enabling the MpiPorts config >> in slurm and using the --resv-ports switch, but I'm getting an error >> from openmpi during setup. >> >> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM >> >> I'm sure I'm missing a step. >> >> Thanks >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
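For reference, the Slurm side of that setup is a single slurm.conf line reserving a port range for MPI (the range itself is illustrative):

    # slurm.conf
    MpiParams=ports=12000-12999

With that in place, srun --resv-ports sets SLURM_RESV_PORTS in each task's environment.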
Re: [OMPI users] srun and openmpi
I'm not sure there is any documentation yet - not much clamor for it. :-/ It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-( My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess. On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote: > Can anyone point me towards the most recent documentation for using > srun and openmpi? > > I followed what i found on the web with enabling the MpiPorts config > in slurm and using the --resv-ports switch, but I'm getting an error > from openmpi during setup. > > I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM > > I'm sure I'm missing a step. > > Thanks > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
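A quick way to test that guess (node name and grep patterns illustrative) is to launch something trivial under the same reservation and look at what actually reaches each task's environment:

    $ srun --resv-ports -n 2 -w node1 env | egrep 'SLURM_RESV_PORTS|OMPI_MCA'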