Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
Certainly, I reached out to several contacts I have inside QLogic (I used to work there)... On Fri, Apr 29, 2011 at 10:30 AM, Ralph Castain wrote: > Hi Michael > > I'm told that the Qlogic contacts we used to have are no longer there. Since > you obviously are a customer, can

Re: [OMPI users] srun and openmpi

2011-04-29 Thread Ralph Castain
Hi Michael I'm told that the Qlogic contacts we used to have are no longer there. Since you obviously are a customer, can you ping them and ask (a) what that error message means, and (b) what's wrong with the values I computed? You can also just send them my way, if that would help. We just

Re: [OMPI users] srun and openmpi

2011-04-29 Thread Ralph Castain
On Apr 29, 2011, at 8:05 AM, Michael Di Domenico wrote: > On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico > wrote: >> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote: >>> Hi Michael >>> >>> Please see the attached updated patch to try for

Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico wrote: > On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote: >> Hi Michael >> >> Please see the attached updated patch to try for 1.5.3. I mistakenly free'd >> the envar after adding it to the

Re: [OMPI users] srun and openmpi

2011-04-29 Thread Michael Di Domenico
On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain wrote: > Hi Michael > > Please see the attached updated patch to try for 1.5.3. I mistakenly free'd > the envar after adding it to the environ :-/ The patch works great, I can now see the precondition environment variable if I do

Re: [OMPI users] srun and openmpi

2011-04-29 Thread Ralph Castain
Hi Michael Please see the attached updated patch to try for 1.5.3. I mistakenly free'd the envar after adding it to the environ :-/ Thanks Ralph slurmd.diff Description: Binary data On Apr 28, 2011, at 2:31 PM, Michael Di Domenico wrote: > On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain wrote: > > On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote: > >> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote: >>> >>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote: >>> On Wed, Apr

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain
Per earlier in the thread, it looks like you are using a 1.5 series release - so here is a patch that -should- fix the PSM setup problem. Please let me know if/how it works as I honestly have no way of testing it. Ralph slurmd.diff Description: Binary data On Apr 28, 2011, at 7:03 AM, Ralph

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain
On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote: > On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote: >> >> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote: >> >>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote: On Apr

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain wrote: > > On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote: > >> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote: >>> >>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote: >>> On Wed,

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain
On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote: > On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote: >> >> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote: >> >>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote: On Apr

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 3:39 PM, Ralph Castain wrote: > Nope, nope nope...in this mode of operation, we are using -static- ports. Er.. right. Sorry -- my bad for not reading the full context here... ignore what I said...

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain
On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote: > On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote: > >> Actually, I understood you correctly. I'm just saying that I find no >> evidence in the code that we try three times before giving up. What I see is >> a single attempt to bind the port -

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote: > Actually, I understood you correctly. I'm just saying that I find no evidence > in the code that we try three times before giving up. What I see is a single > attempt to bind the port - if it fails, then we abort. There is no parameter > to

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain wrote: > > On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote: > >> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote: >>> >>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote: >>> Was this

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain
On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote: > On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote: >> >> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote: >> >>> Was this ever committed to the OMPI src as something not having to be >>> run outside of

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain wrote: > > On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote: > >> Was this ever committed to the OMPI src as something not having to be >> run outside of OpenMPI, but as part of the PSM setup that OpenMPI >> does? > > Not

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain
On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote: > Was this ever committed to the OMPI src as something not having to be > run outside of OpenMPI, but as part of the PSM setup that OpenMPI > does? Not that I know of - I don't think the PSM developers ever looked at it. > > I'm having

Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
Was this ever committed to the OMPI src as something not having to be run outside of OpenMPI, but as part of the PSM setup that OpenMPI does? I'm having some trouble getting Slurm/OpenMPI to play nice with the setup of this key. Namely, with slurm you cannot export variables from the --prolog of

Re: [OMPI users] srun and openmpi

2011-01-25 Thread Michael Di Domenico
Yes, I am setting the config correctly. Our IB machines seem to run just fine so far using srun and openmpi v1.5. As another data point, we enabled mpi-threads in OpenMPI and that also seems to trigger the srun/TCP behavior, but on the IB fabric. Running the program within an salloc rather the

Re: [OMPI users] srun and openmpi

2011-01-25 Thread Nathan Hjelm
We are seeing a similar problem with our InfiniBand machines. After some investigation I discovered that we were not setting our slurm environment correctly (ref: https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you setting the ports in your slurm.conf and executing srun

Re: [OMPI users] srun and openmpi

2011-01-25 Thread Michael Di Domenico
Thanks. We're only seeing it on machines with Ethernet only as the interconnect. Fortunately for us that only equates to one small machine, but it's still annoying. Unfortunately, I don't have enough knowledge to dive into the code to help fix it, but I can certainly help test. On Mon, Jan 24,

Re: [OMPI users] srun and openmpi

2011-01-24 Thread Nathan Hjelm
I am seeing similar issues on our slurm clusters. We are looking into the issue. -Nathan HPC-3, LANL On Tue, 11 Jan 2011, Michael Di Domenico wrote: Any ideas on what might be causing this one? Or at least what additional debug information someone might need? On Fri, Jan 7, 2011 at 4:03 PM,

Re: [OMPI users] srun and openmpi

2011-01-11 Thread Michael Di Domenico
Any ideas on what might be causing this one? Or at least what additional debug information someone might need? On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico wrote: > I'm still testing the slurm integration, which seems to work fine so > far. However, i just

Re: [OMPI users] srun and openmpi

2011-01-07 Thread Michael Di Domenico
I'm still testing the slurm integration, which seems to work fine so far. However, I just upgraded another cluster to openmpi-1.5 and slurm 2.1.15, but this machine has no InfiniBand. If I salloc the nodes and mpirun the command, it seems to run and complete fine; however, if I srun the command I get

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Run the program only once - it can be in the prolog of the job if you like. The output value needs to be in the env of every rank. You can reuse the value as many times as you like - it doesn't have to be unique for each job. There is nothing magic about the value itself. On Dec 30, 2010, at
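Editorial sketch of the "generate once, reuse in every rank" idea described above. It is not the key generator attached to the original message, which is not reproduced in this archive. Two things are assumptions here: the environment variable name OMPI_MCA_orte_precondition_transports (the usual OMPI_MCA_ prefix applied to the parameter named in the PSM error message elsewhere in this thread) and the key layout of two 16-digit hex halves joined by a dash. The function names are hypothetical.

```python
import os
import secrets

# Assumed name: the MCA parameter from the PSM error message, with the
# OMPI_MCA_ prefix that Open MPI uses for parameters passed via the environment.
ENV_NAME = "OMPI_MCA_orte_precondition_transports"


def make_transport_key() -> str:
    """Return a random key as two 16-hex-digit halves joined by a dash.

    The layout is an assumption; per the thread, the value itself has no
    special meaning, it only has to be identical in every rank.
    """
    return f"{secrets.randbits(64):016x}-{secrets.randbits(64):016x}"


def job_environment(key: str, base=None) -> dict:
    """Copy an environment and add the shared key to it."""
    env = dict(os.environ if base is None else base)
    env[ENV_NAME] = key
    return env


if __name__ == "__main__":
    # Run once per job (or once ever, since the value can be reused) and
    # evaluate the printed line in the shell that will call srun.
    print(f"export {ENV_NAME}={make_transport_key()}")
```

The printed export line would be evaluated once in the submitting shell so that srun propagates the same value to every rank.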

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
How early does this need to run? Can I run it as part of a task prolog, or does it need to be the shell env for each rank? And does it need to run on one node or all the nodes in the job? On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain wrote: > Well, I couldn't do it as a

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Should have also warned you: you'll need to configure OMPI --with-devel-headers to get this program to build/run. On Dec 30, 2010, at 1:54 PM, Ralph Castain wrote: > Well, I couldn't do it as a patch - proved too complicated as the psm system > looks for the value early in the boot procedure.

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Well, I couldn't do it as a patch - proved too complicated as the psm system looks for the value early in the boot procedure. What I can do is give you the attached key generator program. It outputs the envar required to run your program. So if you run the attached program and then export the

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Sure, I'll give it a go. On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain wrote: > Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun > as it is shared info - i.e., every proc has to get the same value. > > I can create a patch that will do this for

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value. I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. Would be later today, though. On Dec 30, 2010, at

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Well, maybe not hooray yet. I might have jumped the gun a bit. It's looking like srun works in general, but perhaps not with PSM. With PSM I get this error (at least now I know what I changed): Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the
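Editorial sketch of one way to see whether the transport key reaches each rank under a direct srun launch. It only inspects the environment and does not touch MPI or PSM. The variable name OMPI_MCA_orte_precondition_transports is an assumption (OMPI_MCA_ prefix plus the parameter named in the error above), and the function name is hypothetical; SLURM_PROCID is the task rank that Slurm sets for each launched task.

```python
import os

# Assumed environment variable name for the shared transport key.
ENV_NAME = "OMPI_MCA_orte_precondition_transports"


def describe_transport_key(environ=None) -> str:
    """Report whether the key is present in the given environment."""
    env = os.environ if environ is None else environ
    # SLURM_PROCID is set by Slurm for each task; "?" when run outside srun.
    rank = env.get("SLURM_PROCID", "?")
    value = env.get(ENV_NAME)
    if value is None:
        return f"rank {rank}: {ENV_NAME} is NOT set"
    return f"rank {rank}: {ENV_NAME}={value}"


if __name__ == "__main__":
    print(describe_transport_key())
```

Saved under a hypothetical name such as check_key.py, it could be launched with something like `srun -n 4 python3 check_key.py`; every rank should report the same value.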

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Ralph Castain
Hooray! On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote: > I think i take it all back. I just tried it again and it seems to > work now. I'm not sure what I changed (between my first and this > msg), but it does appear to work now. > > On Thu, Dec 30, 2010 at 4:31 PM, Michael Di

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
I think I take it all back. I just tried it again and it seems to work now. I'm not sure what I changed (between my first and this msg), but it does appear to work now. On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico wrote: > Yes that's true, error messages help.

Re: [OMPI users] srun and openmpi

2010-12-30 Thread Michael Di Domenico
Yes, that's true, error messages help. I was hoping there was some documentation to see what I've done wrong. I can't easily cut and paste errors from my cluster. Here's a snippet (hand typed) of the error message, but it does look like a rank communications error: ORTE_ERROR_LOG: A message is

Re: [OMPI users] srun and openmpi

2010-12-23 Thread Ralph Castain
I'm not sure there is any documentation yet - not much clamor for it. :-/ It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-( My best guess is that the port reservation didn't get passed down to the MPI procs properly -