Dear Tim

> So, you should just be able to run:
> mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi
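(As an aside, the same MCA settings can also be supplied as environment
variables instead of command-line flags, which keeps long command lines
readable -- a minimal sketch, assuming a bash shell; Open MPI picks up any
parameter exported as OMPI_MCA_<name>:

    # equivalent to --mca btl mx,sm,self -mca mtl ^mx
    export OMPI_MCA_btl=mx,sm,self
    export OMPI_MCA_mtl=^mx
    mpirun -np 4 -hostfile ompi_machinefile ./cpi
)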
I tried

node001>mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi

I put in a sleep call to keep it running for some time so I could monitor
the endpoints. None of the 4 were open, so it must have used tcp. Also,
when I look at the process table on node001 I find

orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename node001 --universe dcl0has@node001:default-universe-17750
--nsreplica "0.0.0;tcp://10.141.0.1:43640"
--gprreplica "0.0.0;tcp://10.141.0.1:43640" --set-sid

The argument "--num_procs 2" seems odd; I would expect 4?

Henk

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> Sent: 09 July 2007 16:34
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
>
> SLIM H.A. wrote:
> >
> > Dear Tim and Scott
> >
> > I followed the suggestions made:
> >
> >> So you should either pass '-mca btl mx,sm,self', or just pass
> >> nothing at all.
> >> Open MPI is fairly smart at figuring out what components to use,
> >> so you really should not need to specify anything.
> >
> > Using
> >
> > node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi
> >
> > connects to some of the mx ports, not all 4, but the program runs:
> >
> > [node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with status=20
>
> I finally figured out the problem here. What is happening is that Open
> MPI now has 2 different network stacks, only one of which can be used
> at a time: the mtl and the btl. Both the mx btl and the mx mtl are
> being opened, and each opens an endpoint. The mtl is then closed
> because it will not be used, which releases its endpoint; but in the
> meantime the endpoints have been exhausted while other processes were
> still trying to open theirs.
>
> There are two solutions:
> 1. Increase the number of available endpoints. According to the
> Myrinet documentation, upping the limit to 16 or so should have no
> performance impact.
>
> 2. Alternatively, you can tell the mx mtl not to run with -mca mtl ^mx
>
> So, you should just be able to run:
> mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi
>
> And it should work.
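(A quick way to verify which transport is actually in use while the job
runs -- a sketch along the lines of the sleep-and-inspect test above,
assuming the MX tools are in the PATH:

    # start the job in the background, give it time to open endpoints,
    # then inspect them; an MX-using job should show open endpoints
    mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi &
    sleep 5
    mx_endpoint_info
)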
> > It spawned 4 processes on node001. Passing nothing at all gave the
> > same problem.
> >
> >> Also, could you try creating a host file named "hosts" with the
> >> names of your machines and then try:
> >>
> >> $ mpirun -np 2 --hostfile hosts ./cpi
> >>
> >> and then
> >>
> >> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi
> >
> > node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx
> >
> > works, but increasing to 4 cores again uses less than 4 ports.
> > Finally,
> >
> > node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
> >
> > is successful even for -np 4. From here I tried 2 nodes:
> >
> > node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
> >
> > This gave:
> >
> > orted: Command not found.
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 275
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1164
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > errmgr_hnp.c at line 90
> > [node001:04585] ERROR: A daemon on node node002 failed to start as expected.
> > [node001:04585] ERROR: There may be more information available from
> > [node001:04585] ERROR: the remote shell (see above).
> > [node001:04585] ERROR: The daemon exited unexpectedly with status 1.
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 188
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1196
> > --------------------------------------------------------------------------
> > mpirun was unable to cleanly terminate the daemons for this job.
> > Returned value Timeout instead of ORTE_SUCCESS.
> > --------------------------------------------------------------------------
>
> The problem is that Open MPI cannot find the 'orted' executable on the
> remote node. Is the Open MPI install available on the remote node?
>
> Try:
> ssh remote_node which orted
>
> This should locate the 'orted' program. If it does not, you may need
> to modify your PATH, as described here:
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
>
> Hope this helps,
>
> Tim
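(The usual fix, as a minimal sketch -- assuming a bash login shell on the
compute nodes, and reusing the install prefix that appears later in this
thread; adjust both to the actual setup:

    # check whether the remote node can find orted at all
    ssh node002 which orted
    # if nothing is found, add the Open MPI bin directory to the PATH
    # that non-interactive ssh sessions see, e.g. in ~/.bashrc on node002:
    export PATH=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/bin:$PATH

The change must apply to non-interactive ssh sessions, since that is how
mpirun launches orted on remote nodes.)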
> > Apparently orted is not started up properly. Something missing in
> > the installation?
> >
> > Thanks
> >
> > Henk
> >
> >> -----Original Message-----
> >> From: users-boun...@open-mpi.org
> >> [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> >> Sent: 06 July 2007 15:59
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >>
> >> Henk,
> >>
> >> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> >>> Dear Tim
> >>>
> >>> I followed the use of "--mca btl mx,self" as suggested in the FAQ:
> >>> http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> >>
> >> Yeah, that FAQ is wrong. I am working right now to fix it up.
> >> It should be updated this afternoon.
> >>
> >>> When I use your extra mca value I get:
> >>>
> >>>> mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >>>> --------------------------------------------------------------------------
> >>>> WARNING: A user-supplied value attempted to override the read-only
> >>>> MCA parameter named "btl_mx_shared_mem".
> >>>>
> >>>> The user-supplied value was ignored.
> >>
> >> Oops, on the 1.2 branch this is a read-only parameter. On the
> >> current trunk the user can change it. Sorry for the confusion.
> >> Oh well, you should probably use Open MPI's shared memory support
> >> instead anyway.
> >>
> >> So you should either pass '-mca btl mx,sm,self', or just pass
> >> nothing at all.
> >> Open MPI is fairly smart at figuring out what components to use,
> >> so you really should not need to specify anything.
> >>
> >>> followed by the same error messages as before.
> >>>
> >>> Note that although I add "self", the error messages complain about
> >>> it missing:
> >>>
> >>>>> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> >>>>> If you specified the use of a BTL component, you may have
> >>>>> forgotten a component (such as "self") in the list of usable
> >>>>> components.
> >>>
> >>> I checked the output from mx_info for both the current node and
> >>> another; there seems not to be a problem.
> >>> I attach the output from ompi_info --all. Also:
> >>>
> >>>> ompi_info | grep mx
> >>> Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> >>> MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> >>> MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >>>
> >>> As a further check, I rebuilt the exe with mpich and that works
> >>> fine on the same node over myrinet. I wonder whether mx is
> >>> properly included in my openmpi build.
> >>> Use of ldd -v on the mpich exe gives references to
> >>> libmyriexpress.so, which is not the case for the ompi-built exe,
> >>> suggesting something is missing?
> >>
> >> No, this is expected behavior. The Open MPI executables are not
> >> linked to libmyriexpress.so, only the mx components are. So if you
> >> do an ldd on
> >> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
> >> this should show the Myrinet library.
> >>
> >>> I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> >>> and a listing of that directory is
> >>>
> >>>> ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> >>> bin etc include lib lib32 lib64 sbin
> >>>
> >>> This should be sufficient; I don't need --with-mx-libdir?
> >>
> >> Correct.
> >>
> >> Hope this helps,
> >>
> >> Tim
> >>
> >>> Thanks
> >>>
> >>> Henk
> >>>
> >>>> -----Original Message-----
> >>>> From: users-boun...@open-mpi.org
> >>>> [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> >>>> Sent: 05 July 2007 18:16
> >>>> To: Open MPI Users
> >>>> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >>>>
> >>>> Hi Henk,
> >>>>
> >>>> By specifying '--mca btl mx,self' you are telling Open MPI not to
> >>>> use its shared memory support. If you want to use Open MPI's
> >>>> shared memory support, you must add 'sm' to the list, i.e.
> >>>> '--mca btl mx,sm,self'. If you would rather use MX's shared
> >>>> memory support, instead use
> >>>> '--mca btl mx,self --mca btl_mx_shared_mem 1'.
> >>>> However, in most cases I believe Open MPI's shared memory support
> >>>> is a bit better.
> >>>>
> >>>> Alternatively, if you don't specify any btls, Open MPI should
> >>>> figure out what to use automatically.
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Tim
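(To summarize, the two shared-memory variants described above look like
this -- a sketch; note that btl_mx_shared_mem turned out to be read-only
on the 1.2 branch, as discussed earlier in the thread:

    # Open MPI's own shared memory support:
    mpirun --mca btl mx,sm,self -np 4 ./cpi
    # MX's shared memory support (only settable on the development trunk):
    mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
)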
> >>>> SLIM H.A. wrote:
> >>>>> Hello
> >>>>>
> >>>>> I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> >>>>> configuration and gcc compiler. On testing with 4-8 slots I get
> >>>>> an error message, the mx ports are busy:
> >>>>>
> >>>>>> mpirun --mca btl mx,self -np 4 ./cpi
> >>>>> [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>> [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>> [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>> --------------------------------------------------------------------------
> >>>>> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> >>>>> If you specified the use of a BTL component, you may have
> >>>>> forgotten a component (such as "self") in the list of usable
> >>>>> components.
> >>>>> ... snipped
> >>>>> It looks like MPI_INIT failed for some reason; your parallel
> >>>>> process is likely to abort. There are many reasons that a
> >>>>> parallel process can fail during MPI_INIT; some of which are due
> >>>>> to configuration or environment problems. This failure appears
> >>>>> to be an internal failure; here's some additional information
> >>>>> (which may only be relevant to an Open MPI developer):
> >>>>>
> >>>>> PML add procs failed
> >>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
> >>>>> --------------------------------------------------------------------------
> >>>>> *** An error occurred in MPI_Init
> >>>>> *** before MPI was initialized
> >>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
> >>>>> mpirun noticed that job rank 0 with PID 10071 on node node001
> >>>>> exited on signal 1 (Hangup).
> >>>>>
> >>>>> I would not expect mx messages, as communication should not go
> >>>>> through the mx card? (This is a twin dual-core shared memory
> >>>>> node.) The same happens when testing on 2 nodes, using a
> >>>>> hostfile.
> >>>>> I checked the state of the mx card with mx_endpoint_info and
> >>>>> mx_info; they are healthy and free.
> >>>>> What is missing here?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> Henk