Thanks. We're only seeing it on machines that have Ethernet as their only interconnect. Fortunately for us that equates to just one small machine, but it's still annoying. Unfortunately, I don't know the code well enough to dive in and help fix it, but I can certainly help test.

On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> I am seeing similar issues on our slurm clusters. We are looking into the
> issue.
>
> -Nathan
> HPC-3, LANL
>
> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>
>> Any ideas on what might be causing this one? Or at least what
>> additional debug information someone might need?
>>
>> On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
>> <mdidomeni...@gmail.com> wrote:
>>>
>>> I'm still testing the slurm integration, which seems to work fine so
>>> far. However, I just upgraded another cluster to openmpi-1.5 and
>>> slurm 2.1.15, but this machine has no infiniband.
>>>
>>> If I salloc the nodes and mpirun the command, it seems to run and
>>> complete fine. However, if I srun the command I get:
>>>
>>> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
>>> unexpected prcoess identifier
>>>
>>> The job does not seem to run, but exhibits two behaviors:
>>> running a single process per node, the job runs and does not present
>>> the error (srun -N40 --ntasks-per-node=1);
>>> running multiple processes per node, the job spits out the error but
>>> does not run (srun -n40 --ntasks-per-node=8).
>>>
>>> I copied the configs from the other machine, so (I think) everything
>>> should be configured correctly (but I can't rule it out).
>>>
>>> I saw (and reported) a similar error to the above with the 1.4-dev
>>> branch and slurm (see mailing list); I can't say whether they're
>>> related or not, though.
>>>
>>> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>
>>>> Yo Ralph --
>>>>
>>>> I see this was committed:
>>>> https://svn.open-mpi.org/trac/ompi/changeset/24197. Do you want to add a
>>>> blurb in README about it, and/or have this executable compiled as part of
>>>> the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?
>>>>
>>>> Right now, it's only compiled as part of "make check" and not installed,
>>>> right?
>>>>
>>>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>>>
>>>>> Run the program only once - it can be in the prolog of the job if you
>>>>> like. The output value needs to be in the env of every rank.
>>>>>
>>>>> You can reuse the value as many times as you like - it doesn't have to
>>>>> be unique for each job. There is nothing magic about the value itself.
>>>>>
>>>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>>>
>>>>>> How early does this need to run? Can I run it as part of a task
>>>>>> prolog, or does it need to be in the shell env for each rank? And
>>>>>> does it need to run on one node or all the nodes in the job?
>>>>>>
>>>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Well, I couldn't do it as a patch - it proved too complicated, as the
>>>>>>> psm system looks for the value early in the boot procedure.
>>>>>>>
>>>>>>> What I can do is give you the attached key generator program. It
>>>>>>> outputs the envar required to run your program. So if you run the
>>>>>>> attached program and then export the output into your environment,
>>>>>>> you should be okay. Looks like this:
>>>>>>>
>>>>>>> $ ./psm_keygen
>>>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>>>> $
>>>>>>>
>>>>>>> You compile the program with the usual mpicc.
>>>>>>>
>>>>>>> Let me know if this solves the problem (or not).
>>>>>>> Ralph
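
For anyone else wiring this up: a minimal sketch of one way to satisfy "the output value needs to be in the env of every rank" is a Slurm task prolog that exports a fixed, pre-generated key. The script path below is made up, the key value is just the sample from the psm_keygen run quoted above, and this assumes Slurm's TaskProlog turns "export NAME=value" lines printed on stdout into task environment variables - treat it as a sketch, not a tested recipe.

  #!/bin/sh
  # Hypothetical task prolog; referenced from slurm.conf with a line like
  #   TaskProlog=/etc/slurm/task_prolog.sh
  # slurmd takes each "export NAME=value" line this script prints on stdout
  # and sets it in the task's environment, so every rank sees the same key.
  # The value is generated once ahead of time (psm_keygen, or - judging by
  # the sample - two 16-digit hex strings joined by a dash) and, per Ralph,
  # does not need to change per job.
  echo "export OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954"

If a prolog is more trouble than it's worth, exporting the same line from every user's shell profile should meet the same requirement.
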
>>>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>>>
>>>>>>>> Sure, I'll give it a go.
>>>>>>>>
>>>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated
>>>>>>>>> by mpirun as it is shared info - i.e., every proc has to get the
>>>>>>>>> same value.
>>>>>>>>>
>>>>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>>>>
>>>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>>>
>>>>>>>>>> Well, maybe not hooray yet. I might have jumped the gun a bit; it's
>>>>>>>>>> looking like srun works in general, but perhaps not with PSM.
>>>>>>>>>>
>>>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>>>
>>>>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>>>>> PML add procs failed
>>>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>>>
>>>>>>>>>> Turn off PSM and srun works fine.
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hooray!
>>>>>>>>>>>
>>>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think I take it all back. I just tried it again and it seems to
>>>>>>>>>>>> work now. I'm not sure what I changed (between my first and this
>>>>>>>>>>>> msg), but it does appear to work now.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>>>> <mdidomeni...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, that's true, error messages help. I was hoping there was
>>>>>>>>>>>>> some documentation to see what I've done wrong. I can't easily
>>>>>>>>>>>>> cut and paste errors from my cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's a snippet (hand typed) of the error message, but it does
>>>>>>>>>>>>> look like a rank communications error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>>>>>>>>>>> whose contact information is unknown in file rml_oob_send.c at
>>>>>>>>>>>>> line 145.
>>>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process
>>>>>>>>>>>>> whose contact information is unknown" (-117) instead of
>>>>>>>>>>>>> "Success" (0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This msg repeats for each rank and ultimately hangs the srun,
>>>>>>>>>>>>> which I have to Ctrl-C to terminate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have MpiPorts defined in my slurm config, and running srun
>>>>>>>>>>>>> with --resv-ports does show the SLURM_RESV_PORTS environment
>>>>>>>>>>>>> variable getting passed to the shell.
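
For completeness, the port-reservation pieces referred to above look roughly like this; the port range, node/task counts, and the ./my_mpi_app binary name are illustrative, not copied from a working config:

  # slurm.conf: reserve a block of ports that srun hands out to each job step
  MpiParams=ports=12000-12999

  # then launch the MPI job directly with srun, requesting reserved ports
  $ salloc -N 2
  $ srun --resv-ports -n 16 --ntasks-per-node=8 ./my_mpi_app

  # each task should then see something like
  #   SLURM_RESV_PORTS=12000-12007
  # which, as I understand it, Open MPI uses when wiring up its TCP connections
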
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain
>>>>>>>>>>>>> <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor
>>>>>>>>>>>>>> for it. :-/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It would really help if you included the error message.
>>>>>>>>>>>>>> Otherwise, all I can do is guess, which wastes both of our
>>>>>>>>>>>>>> time :-(
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My best guess is that the port reservation didn't get passed
>>>>>>>>>>>>>> down to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for
>>>>>>>>>>>>>>> using srun and openmpi?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I followed what I found on the web with enabling the MpiPorts
>>>>>>>>>>>>>>> config in slurm and using the --resv-ports switch, but I'm
>>>>>>>>>>>>>>> getting an error from openmpi during setup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks