Dear Tim and Scott

I followed the suggestions made:

> So you should either pass '-mca btl mx,sm,self', or just pass
> nothing at all.
> Open MPI is fairly smart at figuring out what components to
> use, so you really should not need to specify anything.

Using

node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi

connects to some of the mx ports, not all 4, but the program runs:

[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with status=20

It spawned 4 processes on node001. Passing nothing at all gave the same
problem.

> Also, could you try creating a host file named "hosts" with
> the names of your machines and then try:
>
> $ mpirun -np 2 --hostfile hosts ./cpi
>
> and then
>
> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi

node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx

works, but increasing to 4 cores again uses fewer than 4 ports.
Finally

node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx

is successful even for -np 4. From here I tried 2 nodes:

node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx

This gave:

orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Apparently orted is not started up properly. Is something missing in the
installation?

Thanks

Henk

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> Sent: 06 July 2007 15:59
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
>
> Henk,
>
> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> > Dear Tim
> >
> > I followed the use of "--mca btl mx,self" as suggested in the FAQ
> >
> > http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> Yeah, that FAQ is wrong. I am working right now to fix it up.
> It should be updated this afternoon.
>
> > When I use your extra mca value I get:
> >
> > >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >
> > --------------------------------------------------------------------------
> > WARNING: A user-supplied value attempted to override the read-only
> > MCA parameter named "btl_mx_shared_mem".
> >
> > The user-supplied value was ignored.
> Oops, on the 1.2 branch this is a read-only parameter. On the
> current trunk the user can change it. Sorry for the
> confusion. Oh well, you should probably use Open MPI's shared
> memory support instead anyways.
>
> So you should either pass '-mca btl mx,sm,self', or just pass
> nothing at all.
> Open MPI is fairly smart at figuring out what components to
> use, so you really should not need to specify anything.
>
> > followed by the same error messages as before.
> > Note that although I add "self" the error messages complain about it
> > missing:
> >
> > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > If you specified the use of a BTL component, you may have forgotten a
> > component (such as "self") in the list of usable components.
> >
> > I checked the output from mx_info for both the current node and
> > another; there seems not to be a problem.
> > I attach the output from ompi_info --all
> > Also
> >
> > >ompi_info | grep mx
> >
> > Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> > MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> > MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >
> > As a further check, I rebuilt the exe with mpich and that works fine
> > on the same node over myrinet. I wonder whether mx is properly included
> > in my openmpi build.
> > Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> > which is not the case for the ompi built exe, suggesting something is
> > missing?
> No, this is expected behavior. The Open MPI executables are
> not linked to libmyriexpress.so, only the mx components. So
> if you do a ldd on
> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
> this should show the Myrinet library.
>
> > I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> > and a listing of that directory is
> >
> > >ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> >
> > bin etc include lib lib32 lib64 sbin
> >
> > This should be sufficient, I don't need --with-mx-libdir?
> Correct.
>
> Hope this helps,
>
> Tim
>
> > Thanks
> >
> > Henk
> >
> > > -----Original Message-----
> > > From: users-boun...@open-mpi.org
> > > [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> > > Sent: 05 July 2007 18:16
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> > >
> > > Hi Henk,
> > >
> > > By specifying '--mca btl mx,self' you are telling Open MPI not to
> > > use its shared memory support. If you want to use Open MPI's shared
> > > memory support, you must add 'sm' to the list,
> > > i.e. '--mca btl mx,sm,self'. If you would rather use MX's shared memory
> > > support, instead use '--mca btl mx,self --mca btl_mx_shared_mem 1'.
> > > However, in most cases I believe Open MPI's shared memory support is
> > > a bit better.
> > >
> > > Alternatively, if you don't specify any btls, Open MPI should figure
> > > out what to use automatically.
> > >
> > > Hope this helps,
> > >
> > > Tim
> > >
> > > SLIM H.A. wrote:
> > > > Hello
> > > >
> > > > I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> > > > configuration and gcc compiler. On testing with 4-8 slots I get an
> > > > error message, the mx ports are busy:
> > > >
> > > >> mpirun --mca btl mx,self -np 4 ./cpi
> > > >
> > > > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > --------------------------------------------------------------------------
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have forgotten a
> > > > component (such as "self") in the list of usable components.
> > > > ... snipped
> > > > It looks like MPI_INIT failed for some reason; your parallel process
> > > > is likely to abort. There are many reasons that a parallel
> > > > process can fail during MPI_INIT; some of which are due to
> > > > configuration or environment problems. This failure appears to be
> > > > an internal failure; here's some additional information (which may
> > > > only be relevant to an Open MPI developer):
> > > >
> > > > PML add procs failed
> > > > --> Returned "Unreachable" (-12) instead of "Success" (0)
> > > > --------------------------------------------------------------------------
> > > > *** An error occurred in MPI_Init
> > > > *** before MPI was initialized
> > > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > > > mpirun noticed that job rank 0 with PID 10071 on node node001 exited
> > > > on signal 1 (Hangup).
> > > >
> > > > I would not expect mx messages as communication should not go through
> > > > the mx card? (This is a twin dual core shared memory node)
> > > > The same happens when testing on 2 nodes, using a hostfile.
> > > > I checked the state of the mx card with mx_endpoint_info and mx_info,
> > > > they are healthy and free.
> > > > What is missing here?
> > > >
> > > > Thanks
> > > >
> > > > Henk
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
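
A note on the "orted: Command not found." message from the two-node run:
Open MPI's rsh launcher starts the orted daemon on each remote node through
a remote shell, so orted has to be findable in the PATH that a
non-interactive login on node002 receives. The following is only a minimal
sketch for checking that. find_in_path is a hypothetical helper (not part
of Open MPI) that prints the first directory of a colon-separated search
path holding an executable of the given name.

```shell
#!/bin/sh
# Sketch only: find_in_path is a hypothetical helper, not an Open MPI tool.
# Usage: find_in_path NAME SEARCH_PATH
# Prints the first directory in the colon-separated SEARCH_PATH that holds
# an executable regular file called NAME; returns 1 if none is found.
find_in_path() {
    name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:                      # split the search path on colons
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

Assuming ssh is the remote shell in use, one might call it from node001 as
find_in_path orted "$(ssh node002 'echo $PATH')". If nothing is printed, the
remote PATH probably lacks the Open MPI bin directory; with the prefix
reported by ompi_info above, that would presumably be
/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/bin (an assumption based on
the usual install layout).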