Re: [OMPI users] Error while loading shared libraries
Thanks guys. Using the absolute path of mpirun fixes my problem. Cheers!

On Mon, Apr 2, 2012 at 6:24 PM, Reuti wrote:
> Am 02.04.2012 um 09:56 schrieb Rohan Deshpande:
>
> > Yes, I am trying to run the program using multiple hosts.
> >
> > The program executes fine but does not use any slaves when I run
> >
> >   mpirun -np 8 hello --hostfile slaves
> >
> > The program throws an error saying shared libraries not found when I run
> >
> >   mpirun --hostfile slaves -np 8
>
> a) As Rayson mentioned: are the libraries available on the slaves?
>
> b) It might be necessary to export an LD_LIBRARY_PATH in your .bashrc, or
> forward the variable by Open MPI, to point to the location of the libraries.
>
> c) It could also work to create a static version of Open MPI with
> --enable-static --disable-shared and recompile the application.
>
> -- Reuti
>
> > On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho wrote:
> > On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande wrote:
> > > error while loading shared libraries: libmpi.so.0: cannot open shared
> > > object file: No such file or directory.
> >
> > Were you trying to run the MPI program on a remote machine? If you
> > are, then make sure that each machine has the libraries installed (or
> > you can install Open MPI on an NFS directory).
> >
> > Rayson
> >
> > =
> > Open Grid Scheduler / Grid Engine
> > http://gridscheduler.sourceforge.net/
> >
> > Scalable Grid Engine Support Program
> > http://www.scalablelogic.com/
> >
> > > When I run `mpirun -np 1 ldd hello`, the following libraries are not
> > > found:
> > > 1. libmpi.so.0
> > > 2. libopen-rte.so.0
> > > 3. libopen-pal.so.0
> > >
> > > I am using Open MPI version 1.4.5. The PATH and LD_LIBRARY_PATH variables
> > > are correctly set, and `which mpicc` returns the correct path.
> > >
> > > Any help would be highly appreciated.
> > >
> > > Thanks

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Best Regards,
ROHAN DESHPANDE
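Rohan's fix (invoking mpirun by its absolute path rather than relying on PATH on each host) might look like the following sketch; the install prefix /opt/openmpi-1.4.5 is an assumed location, not taken from the thread:

```shell
# Assumed Open MPI install prefix -- adjust to wherever your 1.4.5
# build actually lives (this path is an assumption, not from the thread).
OMPI_PREFIX=/opt/openmpi-1.4.5

# Calling mpirun by absolute path pins exactly which installation
# launches the job, e.g.:
#   $OMPI_PREFIX/bin/mpirun --hostfile slaves -np 8 ./hello

# Sanity check: the prefix can be recovered from the mpirun path, which
# is one way to verify every host resolves the same installation.
MPIRUN="$OMPI_PREFIX/bin/mpirun"
echo "$(dirname "$(dirname "$MPIRUN")")"
```

This matters because a bare `mpirun` found via PATH on the head node can launch remote daemons that then fail to locate the matching shared libraries on the slaves.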
Re: [OMPI users] openmpi 1.5.5. build issue with cuda 4.1
The offending file, openmpi/contrib/vt/vt/vtlib/vt_cudartwrap.c, is easily fixed by placing a const in front of the void *ptr parameter in the cudaPointerGetAttributes wrapper code segments. With that change, the Open MPI 1.5.5 release compiles against CUDA 4.1.

Srinath

=
Srinath Vadlamani
=

On Mon, Apr 2, 2012 at 11:26 AM, Srinath Vadlamani <srinath.vadlam...@gmail.com> wrote:
> I have a build error with the newest release, Open MPI 1.5.5, building
> against CUDA 4.1:
>
> Making all in vtlib
> make[5]: Entering directory
> `/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/build/ompi/contrib/vt/vt/vtlib'
>   CC     vt_libwrap.lo
>   CC     vt_gpu.lo
>   CC     vt_cudartwrap.lo
>   CC     vt_cudart.lo
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:1378:14:
> error: conflicting types for 'cudaPointerGetAttributes'
> cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, void *ptr)
>             ^
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:13:
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.h:25:
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cuda_runtime_api.h:20:
> /usr/local/cuda/include/cuda_runtime_api.h:3899:39: note: previous
> declaration is here
> extern __host__ cudaError_t CUDARTAPI cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const void *ptr);
>                                       ^
> 1 error generated.
> make[5]: *** [vt_cudartwrap.lo] Error 1
>
> The error stems from the use of the CUDA 4.0 version of cudaPointerGetAttributes, from:
>
> http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__UNIFIED_gccb4831aa37562c0af3e6b6712e0f12c.html
>
> but:
>
>   cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const void *ptr);
>
> is the correct signature for CUDA 4.1.
>
> CUDA 4.1 is the current release, so I suggest a patch be made for openmpi
> 1.5.5 to detect the CUDA version and then use the appropriate signature.
>
> Srinath
>
> =
> Srinath Vadlamani
> =
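For anyone hitting this before a release patch lands, the one-word fix Srinath describes could be applied with GNU sed along these lines; the scratch file here stands in for the real vt_cudartwrap.c, and the exact pattern is an assumption about the 1.5.5 source:

```shell
# Recreate the offending declaration in a scratch file (assumption: this
# mirrors the relevant line of vt_cudartwrap.c in the openmpi-1.5.5 tarball).
cat > vt_cudartwrap_excerpt.c <<'EOF'
cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, void *ptr)
EOF

# Insert the `const` that CUDA 4.1's cuda_runtime_api.h declares.
# Note: `sed -i` with no suffix is GNU sed; BSD/macOS sed needs `-i ''`.
sed -i 's/attributes, void \*ptr/attributes, const void *ptr/' vt_cudartwrap_excerpt.c

cat vt_cudartwrap_excerpt.c
```

Applied to the real file (and to the matching declaration in vt_cuda_runtime_api.h, if present there), this makes the wrapper's prototype agree with the CUDA 4.1 header.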
Re: [OMPI users] configuration of openmpi-1.5.4 with visual studio
Hi Shiqing,

I haven't found a solution yet. For the record, I installed Open MPI from an executable installer on Windows 7 (I don't know if I mentioned that before). At first I had an error message while compiling the hello world application ("unresolved link", or something like that); then I added "OMPI_IMPORTS" to the Visual Studio configuration, and that is when I got the error message described before. The output of ompi_info is attached to this email.

Best regards,
Toufik

Package: Open MPI hpcfan@VISCLUSTER25 Distribution
Open MPI: 1.5.3
Open MPI SVN revision: r24532
Open MPI release date: Mar 16, 2011
Open RTE: 1.5.3
Open RTE SVN revision: r24532
Open RTE release date: Mar 16, 2011
OPAL: 1.5.3
OPAL SVN revision: r24532
OPAL release date: Mar 16, 2011
Ident string: 1.5.3
Prefix: C:\Program Files\OpenMPI_v1.5.3-win32
Configured architecture: x86 Windows-6.1
Configure host: VISCLUSTER25
Configured by: hpcfan
Configured on: 16:10 16.03.2011
Configure host: VISCLUSTER25
Built by: hpcfan
Built on: 16:10 16.03.2011
Built host: VISCLUSTER25
C bindings: yes
C++ bindings: yes
Fortran77 bindings: no
Fortran90 bindings: no
Fortran90 bindings size: na
C compiler: cl
C compiler absolute: c:/VSDev/VC/bin/cl.exe
C compiler family name: MICROSOFT
C compiler version: 1600
C++ compiler: cl
C++ compiler absolute: c:/VSDev/VC/bin/cl.exe
Fortran77 compiler: none
Fortran77 compiler abs: none
Fortran90 compiler: none
Fortran90 compiler abs: none
C profiling: yes
C++ profiling: yes
Fortran77 profiling: no
Fortran90 profiling: no
C++ exceptions: no
Thread support: no
Sparse Groups: no
Internal debug support: no
MPI interface warnings: no
MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
libltdl support: no
Heterogeneous support: no
mpirun default --prefix: yes
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
MPI extensions: none
FT Checkpoint support: yes (checkpoint thread: no)
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: none (MCA v2.0, API v2.0, Component v1.5.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
MCA timer: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.3)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA odls: process (MCA v2.0, API v2.0, Component v1.5.3)
MCA ras: ccp (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.3)
[OMPI users] openmpi 1.5.5. build issue with cuda 4.1
I have a build error with the newest release, Open MPI 1.5.5, building against CUDA 4.1:

Making all in vtlib
make[5]: Entering directory
`/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/build/ompi/contrib/vt/vt/vtlib'
  CC     vt_libwrap.lo
  CC     vt_gpu.lo
  CC     vt_cudartwrap.lo
  CC     vt_cudart.lo
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:1378:14:
error: conflicting types for 'cudaPointerGetAttributes'
cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, void *ptr)
            ^
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:13:
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.h:25:
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cuda_runtime_api.h:20:
/usr/local/cuda/include/cuda_runtime_api.h:3899:39: note: previous
declaration is here
extern __host__ cudaError_t CUDARTAPI cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const void *ptr);
                                      ^
1 error generated.
make[5]: *** [vt_cudartwrap.lo] Error 1

The error stems from the use of the CUDA 4.0 version of cudaPointerGetAttributes, from:

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__UNIFIED_gccb4831aa37562c0af3e6b6712e0f12c.html

but:

  cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const void *ptr);

is the correct signature for CUDA 4.1.

CUDA 4.1 is the current release, so I suggest a patch be made for openmpi 1.5.5 to detect the CUDA version and then use the appropriate signature.

Srinath

=
Srinath Vadlamani
=
Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
Am 02.04.2012 um 17:40 schrieb Ralph Castain:

> I'm not sure the 1.4 series can support that behavior. Each mpirun only knows
> about itself - it has no idea something else is going on.
>
> If you attempted to bind, all procs of same rank from each run would bind on
> the same CPU.
>
> All you can really do is use -host to tell the fourth run not to use the
> first node. Or use the devel trunk, which has more ability to separate runs.

Aha, this could be interesting, as I face a similar issue on one of the clusters we use: I want to run several `mpiexec &` under SLURM inside one job. But as none of them knows about the others, I had to disable the SLURM integration by unsetting the job id information. Despite the fact that I used a proper value for -np *and* worked on only one node, executing mpiexec twice (i.e. `mpiexec &` twice) claimed otherwise:

"All nodes which are allocated for this job are already filled."

What does 1.5 offer in detail in this area?

-- Reuti

> Sent from my iPad
>
> On Apr 2, 2012, at 6:53 AM, Rémi Palancher wrote:
>
>> Hi there,
>>
>> I'm encountering a problem when trying to run multiple mpiruns in parallel
>> inside one SLURM allocation on multiple nodes, using a QLogic interconnect
>> network with PSM.
>>
>> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>>
>> My cluster is composed of 12-core nodes.
>>
>> Here is how I'm able to reproduce the problem:
>>
>> Allocate 20 CPUs on 2 nodes:
>>
>> frontend $ salloc -N 2 -n 20
>> frontend $ srun hostname | sort | uniq -c
>>     12 cn1381
>>      8 cn1382
>>
>> My job allocates 12 CPUs on node cn1381 and 8 CPUs on cn1382.
>>
>> My test MPI program parses, for each task, the value of Cpus_allowed_list
>> in /proc/$PID/status and prints it.
>>
>> If I run it on all 20 allocated CPUs, it works well:
>>
>> frontend $ mpirun get-allowed-cpu-ompi 1
>> Launch 1 Task 00 of 20 (cn1381): 0
>> Launch 1 Task 01 of 20 (cn1381): 1
>> Launch 1 Task 02 of 20 (cn1381): 2
>> Launch 1 Task 03 of 20 (cn1381): 3
>> Launch 1 Task 04 of 20 (cn1381): 4
>> Launch 1 Task 05 of 20 (cn1381): 7
>> Launch 1 Task 06 of 20 (cn1381): 5
>> Launch 1 Task 07 of 20 (cn1381): 9
>> Launch 1 Task 08 of 20 (cn1381): 8
>> Launch 1 Task 09 of 20 (cn1381): 10
>> Launch 1 Task 10 of 20 (cn1381): 6
>> Launch 1 Task 11 of 20 (cn1381): 11
>> Launch 1 Task 12 of 20 (cn1382): 4
>> Launch 1 Task 13 of 20 (cn1382): 5
>> Launch 1 Task 14 of 20 (cn1382): 6
>> Launch 1 Task 15 of 20 (cn1382): 7
>> Launch 1 Task 16 of 20 (cn1382): 8
>> Launch 1 Task 17 of 20 (cn1382): 10
>> Launch 1 Task 18 of 20 (cn1382): 9
>> Launch 1 Task 19 of 20 (cn1382): 11
>>
>> Here you can see that Slurm gave me CPUs 0-11 on cn1381 and 4-11 on cn1382.
>>
>> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my
>> job.
>>
>> frontend $ cat params.txt
>> 1
>> 2
>> 3
>> 4
>> 5
>>
>> It works well when I launch 3 runs in parallel, where it only uses the 12
>> CPUs of the first node (3 runs x 4 tasks = 12 CPUs):
>>
>> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> Launch 2 Task 00 of 04 (cn1381): 1
>> Launch 2 Task 01 of 04 (cn1381): 2
>> Launch 2 Task 02 of 04 (cn1381): 4
>> Launch 2 Task 03 of 04 (cn1381): 7
>> Launch 1 Task 00 of 04 (cn1381): 0
>> Launch 1 Task 01 of 04 (cn1381): 3
>> Launch 1 Task 02 of 04 (cn1381): 5
>> Launch 1 Task 03 of 04 (cn1381): 6
>> Launch 3 Task 00 of 04 (cn1381): 9
>> Launch 3 Task 01 of 04 (cn1381): 8
>> Launch 3 Task 02 of 04 (cn1381): 10
>> Launch 3 Task 03 of 04 (cn1381): 11
>> Launch 4 Task 00 of 04 (cn1381): 0
>> Launch 4 Task 01 of 04 (cn1381): 3
>> Launch 4 Task 02 of 04 (cn1381): 1
>> Launch 4 Task 03 of 04 (cn1381): 5
>> Launch 5 Task 00 of 04 (cn1381): 2
>> Launch 5 Task 01 of 04 (cn1381): 4
>> Launch 5 Task 02 of 04 (cn1381): 7
>> Launch 5 Task 03 of 04 (cn1381): 6
>>
>> But when I try to launch 4 runs or more in parallel, where it needs to use
>> the CPUs of the other node as well, it fails:
>>
>> frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> cn1381.23245 ipath_userinit: assign_context command failed: Network is down
>> cn1381.23245 can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>>
>> Error: Could not detect network connectivity
>> --------------------------------------------------------------------------
>> cn1381.23248 ipath_userinit: assign_context command failed: Network is down
>> cn1381.23248 can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>>
>> Error: Could not detect network connectivity
Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
Sorry to hijack the thread, but I have a question regarding the failed PSM initialization. Some of our users oversubscribe a node with multiple mpiruns in order to run their regression tests. Recently, a user reported the same "Could not detect network connectivity" error.

My question: is there a way to allow this type of behavior? That is, oversubscribe a node with multiple mpiruns. For example, say I have a node with 16 processing elements and I want to run 8 instances of "mpirun -n 3 mpi_foo" on a single node simultaneously, and don't care about performance.

Please note that oversubscription within one node and a **single** mpirun works as expected. The error only shows up when another mpirun wants to join the party.

Thanks,
Lost in Los Alamos

On Apr 2, 2012, at 9:40 AM, Ralph Castain wrote:

> I'm not sure the 1.4 series can support that behavior. Each mpirun only knows
> about itself - it has no idea something else is going on.
>
> If you attempted to bind, all procs of same rank from each run would bind on
> the same CPU.
>
> All you can really do is use -host to tell the fourth run not to use the
> first node. Or use the devel trunk, which has more ability to separate runs.
>
> Sent from my iPad
>
> On Apr 2, 2012, at 6:53 AM, Rémi Palancher wrote:
>
>> Hi there,
>>
>> I'm encountering a problem when trying to run multiple mpiruns in parallel
>> inside one SLURM allocation on multiple nodes, using a QLogic interconnect
>> network with PSM.
>>
>> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>>
>> My cluster is composed of 12-core nodes.
>>
>> Here is how I'm able to reproduce the problem:
>>
>> Allocate 20 CPUs on 2 nodes:
>>
>> frontend $ salloc -N 2 -n 20
>> frontend $ srun hostname | sort | uniq -c
>>     12 cn1381
>>      8 cn1382
>>
>> My job allocates 12 CPUs on node cn1381 and 8 CPUs on cn1382.
>>
>> My test MPI program parses, for each task, the value of Cpus_allowed_list
>> in /proc/$PID/status and prints it.
>>
>> If I run it on all 20 allocated CPUs, it works well:
>>
>> frontend $ mpirun get-allowed-cpu-ompi 1
>> Launch 1 Task 00 of 20 (cn1381): 0
>> Launch 1 Task 01 of 20 (cn1381): 1
>> Launch 1 Task 02 of 20 (cn1381): 2
>> Launch 1 Task 03 of 20 (cn1381): 3
>> Launch 1 Task 04 of 20 (cn1381): 4
>> Launch 1 Task 05 of 20 (cn1381): 7
>> Launch 1 Task 06 of 20 (cn1381): 5
>> Launch 1 Task 07 of 20 (cn1381): 9
>> Launch 1 Task 08 of 20 (cn1381): 8
>> Launch 1 Task 09 of 20 (cn1381): 10
>> Launch 1 Task 10 of 20 (cn1381): 6
>> Launch 1 Task 11 of 20 (cn1381): 11
>> Launch 1 Task 12 of 20 (cn1382): 4
>> Launch 1 Task 13 of 20 (cn1382): 5
>> Launch 1 Task 14 of 20 (cn1382): 6
>> Launch 1 Task 15 of 20 (cn1382): 7
>> Launch 1 Task 16 of 20 (cn1382): 8
>> Launch 1 Task 17 of 20 (cn1382): 10
>> Launch 1 Task 18 of 20 (cn1382): 9
>> Launch 1 Task 19 of 20 (cn1382): 11
>>
>> Here you can see that Slurm gave me CPUs 0-11 on cn1381 and 4-11 on cn1382.
>>
>> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my
>> job.
>>
>> frontend $ cat params.txt
>> 1
>> 2
>> 3
>> 4
>> 5
>>
>> It works well when I launch 3 runs in parallel, where it only uses the 12
>> CPUs of the first node (3 runs x 4 tasks = 12 CPUs):
>>
>> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> Launch 2 Task 00 of 04 (cn1381): 1
>> Launch 2 Task 01 of 04 (cn1381): 2
>> Launch 2 Task 02 of 04 (cn1381): 4
>> Launch 2 Task 03 of 04 (cn1381): 7
>> Launch 1 Task 00 of 04 (cn1381): 0
>> Launch 1 Task 01 of 04 (cn1381): 3
>> Launch 1 Task 02 of 04 (cn1381): 5
>> Launch 1 Task 03 of 04 (cn1381): 6
>> Launch 3 Task 00 of 04 (cn1381): 9
>> Launch 3 Task 01 of 04 (cn1381): 8
>> Launch 3 Task 02 of 04 (cn1381): 10
>> Launch 3 Task 03 of 04 (cn1381): 11
>> Launch 4 Task 00 of 04 (cn1381): 0
>> Launch 4 Task 01 of 04 (cn1381): 3
>> Launch 4 Task 02 of 04 (cn1381): 1
>> Launch 4 Task 03 of 04 (cn1381): 5
>> Launch 5 Task 00 of 04 (cn1381): 2
>> Launch 5 Task 01 of 04 (cn1381): 4
>> Launch 5 Task 02 of 04 (cn1381): 7
>> Launch 5 Task 03 of 04 (cn1381): 6
>>
>> But when I try to launch 4 runs or more in parallel, where it needs to use
>> the CPUs of the other node as well, it fails:
>>
>> frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> cn1381.23245 ipath_userinit: assign_context command failed: Network is down
>> cn1381.23245 can't open /dev/ipath, network down (err=26)
>> --------------------------------------------------------------------------
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>>
>> Error: Could not detect network connectivity
>> --------------------------------------------------------------------------
>> cn1381.23248 ipath_userinit: assign_context command failed: Network is down
>> cn1381.23248 can't open /dev/ipath, network down (err=26)
Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
I'm not sure the 1.4 series can support that behavior. Each mpirun only knows about itself - it has no idea something else is going on.

If you attempted to bind, all procs of the same rank from each run would bind on the same CPU.

All you can really do is use -host to tell the fourth run not to use the first node. Or use the devel trunk, which has more ability to separate runs.

Sent from my iPad

On Apr 2, 2012, at 6:53 AM, Rémi Palancher wrote:

> Hi there,
>
> I'm encountering a problem when trying to run multiple mpiruns in parallel
> inside one SLURM allocation on multiple nodes, using a QLogic interconnect
> network with PSM.
>
> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>
> My cluster is composed of 12-core nodes.
>
> Here is how I'm able to reproduce the problem:
>
> Allocate 20 CPUs on 2 nodes:
>
> frontend $ salloc -N 2 -n 20
> frontend $ srun hostname | sort | uniq -c
>     12 cn1381
>      8 cn1382
>
> My job allocates 12 CPUs on node cn1381 and 8 CPUs on cn1382.
>
> My test MPI program parses, for each task, the value of Cpus_allowed_list
> in /proc/$PID/status and prints it.
>
> If I run it on all 20 allocated CPUs, it works well:
>
> frontend $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 00 of 20 (cn1381): 0
> Launch 1 Task 01 of 20 (cn1381): 1
> Launch 1 Task 02 of 20 (cn1381): 2
> Launch 1 Task 03 of 20 (cn1381): 3
> Launch 1 Task 04 of 20 (cn1381): 4
> Launch 1 Task 05 of 20 (cn1381): 7
> Launch 1 Task 06 of 20 (cn1381): 5
> Launch 1 Task 07 of 20 (cn1381): 9
> Launch 1 Task 08 of 20 (cn1381): 8
> Launch 1 Task 09 of 20 (cn1381): 10
> Launch 1 Task 10 of 20 (cn1381): 6
> Launch 1 Task 11 of 20 (cn1381): 11
> Launch 1 Task 12 of 20 (cn1382): 4
> Launch 1 Task 13 of 20 (cn1382): 5
> Launch 1 Task 14 of 20 (cn1382): 6
> Launch 1 Task 15 of 20 (cn1382): 7
> Launch 1 Task 16 of 20 (cn1382): 8
> Launch 1 Task 17 of 20 (cn1382): 10
> Launch 1 Task 18 of 20 (cn1382): 9
> Launch 1 Task 19 of 20 (cn1382): 11
>
> Here you can see that Slurm gave me CPUs 0-11 on cn1381 and 4-11 on cn1382.
>
> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my
> job.
>
> frontend $ cat params.txt
> 1
> 2
> 3
> 4
> 5
>
> It works well when I launch 3 runs in parallel, where it only uses the 12
> CPUs of the first node (3 runs x 4 tasks = 12 CPUs):
>
> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
> Launch 2 Task 00 of 04 (cn1381): 1
> Launch 2 Task 01 of 04 (cn1381): 2
> Launch 2 Task 02 of 04 (cn1381): 4
> Launch 2 Task 03 of 04 (cn1381): 7
> Launch 1 Task 00 of 04 (cn1381): 0
> Launch 1 Task 01 of 04 (cn1381): 3
> Launch 1 Task 02 of 04 (cn1381): 5
> Launch 1 Task 03 of 04 (cn1381): 6
> Launch 3 Task 00 of 04 (cn1381): 9
> Launch 3 Task 01 of 04 (cn1381): 8
> Launch 3 Task 02 of 04 (cn1381): 10
> Launch 3 Task 03 of 04 (cn1381): 11
> Launch 4 Task 00 of 04 (cn1381): 0
> Launch 4 Task 01 of 04 (cn1381): 3
> Launch 4 Task 02 of 04 (cn1381): 1
> Launch 4 Task 03 of 04 (cn1381): 5
> Launch 5 Task 00 of 04 (cn1381): 2
> Launch 5 Task 01 of 04 (cn1381): 4
> Launch 5 Task 02 of 04 (cn1381): 7
> Launch 5 Task 03 of 04 (cn1381): 6
>
> But when I try to launch 4 runs or more in parallel, where it needs to use
> the CPUs of the other node as well, it fails:
>
> frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
> cn1381.23245 ipath_userinit: assign_context command failed: Network is down
> cn1381.23245 can't open /dev/ipath, network down (err=26)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
>
> Error: Could not detect network connectivity
> --------------------------------------------------------------------------
> cn1381.23248 ipath_userinit: assign_context command failed: Network is down
> cn1381.23248 can't open /dev/ipath, network down (err=26)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
>
> Error: Could not detect network connectivity
> --------------------------------------------------------------------------
> cn1381.23247 ipath_userinit: assign_context command failed: Network is down
> cn1381.23247 can't open /dev/ipath, network down (err=26)
> cn1381.23249 ipath_userinit: assign_context command failed: Network is down
> cn1381.23249 can't open /dev/ipath, network down (err=26)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
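Ralph's `-host` workaround could be sketched as follows; the hostnames come from the allocation described in the thread, and the way the host list is derived here is purely illustrative:

```shell
# Node list from the Slurm allocation in the thread (cn1381 is already
# fully occupied by the first three runs).
NODES="cn1381 cn1382"
FIRST=$(echo $NODES | awk '{print $1}')

# Build a -host argument that excludes the first node, so the fourth
# mpirun is steered onto the remaining node; get-allowed-cpu-ompi is
# the test program from the thread.
REST=$(echo $NODES | tr ' ' '\n' | grep -v "^$FIRST$" | paste -sd, -)
echo "mpirun -host $REST -np 4 get-allowed-cpu-ompi"
```

The point is only to keep the fourth mpirun off the node whose PSM contexts are exhausted; each mpirun still knows nothing about the others.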
Re: [OMPI users] redirecting output
On 03/30/2012 11:12 AM, Tim Prince wrote:
> On 03/30/2012 10:41 AM, tyler.bal...@huskers.unl.edu wrote:
>>
>> I am using the command mpirun -np nprocs -machinefile machines.arch
>> Pcrystal, and my output scrolls across my terminal. I would like to
>> send this output to a file and I cannot figure out how to do so. I
>> have tried the general > FILENAME and > log &; these generate
>> files, however they are empty. Any help would be appreciated.

If you see the output on your screen but it's not being redirected to a file, it must be printing to STDERR and not STDOUT. The '>' by itself redirects STDOUT only, so it doesn't redirect error messages.

To redirect STDERR, you can use '2>', which says "redirect file handle #2", which is stderr:

  some_command 2> myerror.log

or

  some_command > myoutput.log 2> myerror.log

To redirect both STDOUT and STDERR to the same place, use the syntax "2>&1" to tie STDERR to STDOUT:

  some_command > myoutput.log 2>&1

I prefer to see the output on the screen at the same time I write it to a file. That way, if the command hangs for some reason, I know it immediately. I find the 'tee' command priceless for this:

  some_command 2>&1 | tee myoutput.log

Google for 'bash output redirection' and you'll find many helpful pages with better explanations and examples, like this one:

  http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

(If you don't use bash, those results will be much less helpful.)

I hope that helps, or at least gets you pointed in the right direction.

-- Prentice

> If you run under screen, your terminal output should be collected in
> screenlog. Beats me why some sysadmins don't see fit to install screen.
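As a quick, self-contained illustration of the redirections described above, using a stand-in command that writes to both streams (in place of the actual mpirun job):

```shell
# A command that writes "out" to stdout and "err" to stderr,
# standing in for `mpirun ... Pcrystal`.
# Split the two streams into separate files:
{ echo out; echo err >&2; } > myoutput.log 2> myerror.log

cat myoutput.log   # prints: out
cat myerror.log    # prints: err

# Send both streams to one file:
{ echo out; echo err >&2; } > combined.log 2>&1

# Watch the output live *and* log it, via tee:
{ echo out; echo err >&2; } 2>&1 | tee tee.log
```

The empty log files the original poster saw are exactly what `> FILENAME` produces when all the interesting output goes to stderr.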
[OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
Hi there,

I'm encountering a problem when trying to run multiple mpiruns in parallel inside one SLURM allocation on multiple nodes, using a QLogic interconnect network with PSM.

I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.

My cluster is composed of 12-core nodes.

Here is how I'm able to reproduce the problem:

Allocate 20 CPUs on 2 nodes:

frontend $ salloc -N 2 -n 20
frontend $ srun hostname | sort | uniq -c
    12 cn1381
     8 cn1382

My job allocates 12 CPUs on node cn1381 and 8 CPUs on cn1382.

My test MPI program parses, for each task, the value of Cpus_allowed_list in /proc/$PID/status and prints it.

If I run it on all 20 allocated CPUs, it works well:

frontend $ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 00 of 20 (cn1381): 0
Launch 1 Task 01 of 20 (cn1381): 1
Launch 1 Task 02 of 20 (cn1381): 2
Launch 1 Task 03 of 20 (cn1381): 3
Launch 1 Task 04 of 20 (cn1381): 4
Launch 1 Task 05 of 20 (cn1381): 7
Launch 1 Task 06 of 20 (cn1381): 5
Launch 1 Task 07 of 20 (cn1381): 9
Launch 1 Task 08 of 20 (cn1381): 8
Launch 1 Task 09 of 20 (cn1381): 10
Launch 1 Task 10 of 20 (cn1381): 6
Launch 1 Task 11 of 20 (cn1381): 11
Launch 1 Task 12 of 20 (cn1382): 4
Launch 1 Task 13 of 20 (cn1382): 5
Launch 1 Task 14 of 20 (cn1382): 6
Launch 1 Task 15 of 20 (cn1382): 7
Launch 1 Task 16 of 20 (cn1382): 8
Launch 1 Task 17 of 20 (cn1382): 10
Launch 1 Task 18 of 20 (cn1382): 9
Launch 1 Task 19 of 20 (cn1382): 11

Here you can see that Slurm gave me CPUs 0-11 on cn1381 and 4-11 on cn1382.

Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my job.

frontend $ cat params.txt
1
2
3
4
5

It works well when I launch 3 runs in parallel, where it only uses the 12 CPUs of the first node (3 runs x 4 tasks = 12 CPUs):

frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
Launch 2 Task 00 of 04 (cn1381): 1
Launch 2 Task 01 of 04 (cn1381): 2
Launch 2 Task 02 of 04 (cn1381): 4
Launch 2 Task 03 of 04 (cn1381): 7
Launch 1 Task 00 of 04 (cn1381): 0
Launch 1 Task 01 of 04 (cn1381): 3
Launch 1 Task 02 of 04 (cn1381): 5
Launch 1 Task 03 of 04 (cn1381): 6
Launch 3 Task 00 of 04 (cn1381): 9
Launch 3 Task 01 of 04 (cn1381): 8
Launch 3 Task 02 of 04 (cn1381): 10
Launch 3 Task 03 of 04 (cn1381): 11
Launch 4 Task 00 of 04 (cn1381): 0
Launch 4 Task 01 of 04 (cn1381): 3
Launch 4 Task 02 of 04 (cn1381): 1
Launch 4 Task 03 of 04 (cn1381): 5
Launch 5 Task 00 of 04 (cn1381): 2
Launch 5 Task 01 of 04 (cn1381): 4
Launch 5 Task 02 of 04 (cn1381): 7
Launch 5 Task 03 of 04 (cn1381): 6

But when I try to launch 4 runs or more in parallel, where it needs to use the CPUs of the other node as well, it fails:

frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
cn1381.23245 ipath_userinit: assign_context command failed: Network is down
cn1381.23245 can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

Error: Could not detect network connectivity
--------------------------------------------------------------------------
cn1381.23248 ipath_userinit: assign_context command failed: Network is down
cn1381.23248 can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

Error: Could not detect network connectivity
--------------------------------------------------------------------------
cn1381.23247 ipath_userinit: assign_context command failed: Network is down
cn1381.23247 can't open /dev/ipath, network down (err=26)
cn1381.23249 ipath_userinit: assign_context command failed: Network is down
cn1381.23249 can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[cn1381:23245]
Re: [OMPI users] Help with multicore AMD machine performance
Hi,

> I'm benchmarking our (well tested) parallel code on an AMD-based system,
> featuring 2x AMD Opteron(TM) Processor 6276, with 16 cores each for a total
> of 32 cores. The system is running Scientific Linux 6.1 and Open MPI 1.4.5.
> When I run a single-core job the performance is as expected. However, when
> I run with 32 processes the performance drops to about 60%.

Be aware that on AMD CPUs based on the Bulldozer/Interlagos technology, 2 cores share the FPU units of one module. There is also a problem with cross-cache invalidations [1] in earlier kernel versions - be sure to use an up-to-date kernel (2.6.32-220.7.1).

Cheers,
Nico

[1] http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
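A small sketch of checking the running kernel against the version Nico cites; the threshold itself is the post's claim, and the comparison assumes GNU coreutils' version sort (`sort -V`) is available:

```shell
# Kernel version carrying the cross-cache-invalidation fix, per the post.
REQUIRED=2.6.32-220.7.1
RUNNING=$(uname -r)

# sort -V orders version strings; if the required version sorts first,
# the running kernel is at least that new.
OLDEST=$(printf '%s\n%s\n' "$REQUIRED" "$RUNNING" | sort -V | head -n1)
if [ "$OLDEST" = "$REQUIRED" ]; then
  echo "kernel is at least $REQUIRED"
else
  echo "kernel is older than $REQUIRED -- consider updating"
fi
```

This only checks the version string; whether a vendor kernel actually carries the backported fix is something to confirm in the distribution's changelog.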
Re: [OMPI users] Error while loading shared libraries
Am 02.04.2012 um 09:56 schrieb Rohan Deshpande:

> Yes, I am trying to run the program using multiple hosts.
>
> The program executes fine but does not use any slaves when I run
>
>   mpirun -np 8 hello --hostfile slaves
>
> The program throws an error saying shared libraries not found when I run
>
>   mpirun --hostfile slaves -np 8

a) As Rayson mentioned: are the libraries available on the slaves?

b) It might be necessary to export an LD_LIBRARY_PATH in your .bashrc, or to have Open MPI forward the variable, so that it points to the location of the libraries.

c) It could also work to create a static version of Open MPI with --enable-static --disable-shared and recompile the application.

-- Reuti

> On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho wrote:
> > On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande wrote:
> > > error while loading shared libraries: libmpi.so.0: cannot open shared
> > > object file no such object file: No such file or directory.
> >
> > Were you trying to run the MPI program on a remote machine? If you
> > are, then make sure that each machine has the libraries installed (or
> > you can install Open MPI on an NFS directory).
> >
> > Rayson
> >
> > =
> > Open Grid Scheduler / Grid Engine
> > http://gridscheduler.sourceforge.net/
> >
> > Scalable Grid Engine Support Program
> > http://www.scalablelogic.com/
>
> > > When I run using - mpirun -np 1 ldd hello the following libraries are not
> > > found
> > > 1. libmpi.so.0
> > > 2. libopen-rte.so.0
> > > 3. libopen.pal.so.0
> > >
> > > I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> > > are correctly set and 'which mpicc' returns correct path
> > >
> > > Any help would be highly appreciated.
> > >
> > > Thanks
>
> --
> Best Regards,
> ROHAN DESHPANDE

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
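Suggestion (b) above can be sketched as follows - a minimal example, where /opt/openmpi is only an assumed install prefix (substitute whatever prefix your cluster actually uses):

```shell
# Assumed install prefix - replace with the real one on your cluster.
OMPI_PREFIX=/opt/openmpi

# Make the runtime libraries findable; put these lines in ~/.bashrc on
# every host so non-interactive SSH logins pick them up, too:
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Alternatively, forward the variable from the head node at launch time
# with mpirun's -x option:
#   mpirun -x LD_LIBRARY_PATH --hostfile slaves -np 8 hello
echo "$LD_LIBRARY_PATH"
```

The `${LD_LIBRARY_PATH:+:...}` expansion just avoids a trailing colon when the variable was previously unset.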
Re: [OMPI users] Error while loading shared libraries
Yes, I am trying to run the program using multiple hosts.

The program executes fine but *does not use any slaves* when I run

  *mpirun -np 8 hello --hostfile slaves*

The program throws an error saying *shared libraries not found* when I run

  *mpirun --hostfile slaves -np 8*

On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho wrote:
> On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande wrote:
> > error while loading shared libraries: libmpi.so.0: cannot open shared
> > object file no such object file: No such file or directory.
>
> Were you trying to run the MPI program on a remote machine? If you
> are, then make sure that each machine has the libraries installed (or
> you can install Open MPI on an NFS directory).
>
> Rayson
>
> =
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
> > When I run using - mpirun -np 1 ldd hello the following libraries are not
> > found
> > 1. libmpi.so.0
> > 2. libopen-rte.so.0
> > 3. libopen.pal.so.0
> >
> > I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> > are correctly set and 'which mpicc' returns correct path
> >
> > Any help would be highly appreciated.
> >
> > Thanks

--
Best Regards,
ROHAN DESHPANDE
Re: [OMPI users] Error while loading shared libraries
On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande wrote:
> error while loading shared libraries: libmpi.so.0: cannot open shared
> object file no such object file: No such file or directory.

Were you trying to run the MPI program on a remote machine? If you
are, then make sure that each machine has the libraries installed (or
you can install Open MPI on an NFS directory).

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

> When I run using - mpirun -np 1 ldd hello the following libraries are not
> found
> 1. libmpi.so.0
> 2. libopen-rte.so.0
> 3. libopen.pal.so.0
>
> I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> are correctly set and 'which mpicc' returns correct path
>
> Any help would be highly appreciated.
>
> Thanks
[OMPI users] Error while loading shared libraries
Hi,

I have installed MPI successfully and I am able to compile programs using mpicc. But when I run them using mpirun, I get the following error:

  *error while loading shared libraries: libmpi.so.0: cannot open shared
  object file no such object file: No such file or directory.*

When I run using - mpirun -np 1 ldd hello - the following libraries are not found:

1. *libmpi.so.0*
2. *libopen-rte.so.0*
3. *libopen.pal.so.0*

I am using openmpi version *1.4.5*. Also, PATH and LD_LIBRARY_PATH variables are correctly set, and 'which mpicc' returns the correct path.

Any help would be highly appreciated.

Thanks
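Since `which mpicc` already returns the correct path, one quick check is to derive the install prefix from it and point LD_LIBRARY_PATH at that prefix's lib directory, where libmpi.so.0 and friends normally live. A minimal sketch - the /opt/openmpi-1.4.5 path is only a placeholder for whatever `which mpicc` actually prints on your system:

```shell
# Placeholder standing in for the output of `which mpicc`;
# on a real system use: MPICC_PATH=$(which mpicc)
MPICC_PATH=/opt/openmpi-1.4.5/bin/mpicc

# The install prefix is two directories up from the mpicc binary...
MPI_PREFIX=$(dirname "$(dirname "$MPICC_PATH")")
# ...and the shared libraries should be under its lib/ directory:
export LD_LIBRARY_PATH="$MPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$MPI_PREFIX/lib"
```

If `ldd hello` then resolves all three libraries locally, the same export also has to be in effect on every remote host you launch on.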