Re: [OMPI users] Error while loading shared libraries

2012-04-02 Thread Rohan Deshpande
Thanks guys.

Using the absolute path of mpirun fixes my problem.

Cheers

On Mon, Apr 2, 2012 at 6:24 PM, Reuti  wrote:

> On 02.04.2012 at 09:56, Rohan Deshpande wrote:
>
> > Yes, I am trying to run the program using multiple hosts.
> >
> > The program executes fine but does not use any slaves when I run
> >
> >   mpirun -np 8 hello --hostfile slaves
> >
> > The program throws error saying shared libraries not found when I run
> >
> >   mpirun --hostfile slaves -np 8
>
> a) As Rayson mentioned: the libraries are available on the slaves?
>
> b) It might be necessary to export an LD_LIBRARY_PATH in your .bashrc or
> forward the variable by Open MPI to point to the location of the libraries.
>
> c) It could also work to create a static version of Open MPI by
> --enable-static --disable-shared and recompile the application.
>
> -- Reuti
>
>
> >
> >
> > On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho  wrote:
> > On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande 
> wrote:
> > >   error while loading shared libraries: libmpi.so.0: cannot open shared
> > > object file no such object file: No such file or directory.
> >
> > Were you trying to run the MPI program on a remote machine?? If you
> > are, then make sure that each machine has the libraries installed (or
> > you can install Open MPI on an NFS directory).
> >
> > Rayson
> >
> > =
> > Open Grid Scheduler / Grid Engine
> > http://gridscheduler.sourceforge.net/
> >
> > Scalable Grid Engine Support Program
> > http://www.scalablelogic.com/
> >
> >
> > >
> > > When I run using - mpirun -np 1 ldd hello the following libraries are
> not
> > > found
> > >   1. libmpi.so.0
> > >   2. libopen-rte.so.0
> > >   3. libopen.pal.so.0
> > >
> > > I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH
> variables
> > > are correctly set and 'which mpicc' returns correct path
> > >
> > > Any help would be highly appreciated.
> > >
> > > Thanks
> > >
> > >
> > >
> > >
> > >
> > >
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> > --
> >
> > Best Regards,
> >
> > ROHAN DESHPANDE
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 

Best Regards,

ROHAN DESHPANDE


Re: [OMPI users] openmpi 1.5.5. build issue with cuda 4.1

2012-04-02 Thread Srinath Vadlamani
The offending file, openmpi/contrib/vt/vt/vtlib/vt_cudartwrap.c, is easily fixed
by placing a const in front of the void *ptr parameter in the
cudaPointerGetAttributes wrapper code segments.

With that change, the openmpi 1.5.5 release compiles against CUDA 4.1.
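
For anyone hitting the same thing, here is the signature change side by side,
taken directly from the two conflicting declarations shown in the quoted build
output below - an illustration of the one-line fix, not a full patch:

  /* what vt_cudartwrap.c in 1.5.5 declares (matches CUDA 4.0): */
  cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes,
                                       void *ptr);

  /* what CUDA 4.1's cuda_runtime_api.h declares, and what the wrapper must be
     changed to in order to avoid the "conflicting types" error: */
  cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes,
                                       const void *ptr);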

Srinath
=
Srinath Vadlamani
=


On Mon, Apr 2, 2012 at 11:26 AM, Srinath Vadlamani <
srinath.vadlam...@gmail.com> wrote:

> I have a build error with the newest release, openmpi 1.5.5, building
> against CUDA 4.1:
>
> Making all in vtlib
> make[5]: Entering directory
> `/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/build/ompi/contrib/vt/vt/vtlib'
>   CC vt_libwrap.lo
>   CC vt_gpu.lo
>   CC vt_cudartwrap.lo
>   CC vt_cudart.lo
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:1378:14:
> error: conflicting types for 'cudaPointerGetAttributes'
> cudaError_t  cudaPointerGetAttributes(struct cudaPointerAttributes
> *attributes, void *ptr)
>  ^
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:13:
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.h:25:
> In file included from
> /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cuda_runtime_api.h:20:
> /usr/local/cuda/include/cuda_runtime_api.h:3899:39: note: previous
> declaration is here
> extern __host__ cudaError_t CUDARTAPI cudaPointerGetAttributes(struct
> cudaPointerAttributes *attributes, const void *ptr);
>   ^
> 1 error generated.
> make[5]: *** [vt_cudartwrap.lo] Error 1
>
> > The error stems from the use of the CUDA 4.0 version of
> > cudaPointerGetAttributes
> from:
>
> http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__UNIFIED_gccb4831aa37562c0af3e6b6712e0f12c.html
>
> but:
> cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const
> void *ptr);
>
> is the correct signature for Cuda 4.1
>
> > CUDA 4.1 is the current release, so I suggest a patch be made for openmpi
> > 1.5.5 to detect the CUDA version and then use the appropriate signature.
>
> > Srinath
>
>
> =
> Srinath Vadlamani
> =
>


Re: [OMPI users] configuration of openmpi-1.5.4 with visual studio

2012-04-02 Thread toufik hadjazi




Hi Shiqing,

I haven't found a solution yet. For the record, I installed Open MPI from an
executable on Windows 7 (I don't know if I mentioned that before). At first I got
an error message while compiling the hello world application (an unresolved link
or something like that); then I added "OMPI_IMPORTS" to the Visual Studio
configuration, and that is when I got the error message described before. The
output of ompi_info is attached to this email.

Best regards,
Toufik

Package: Open MPI hpcfan@VISCLUSTER25 Distribution
Open MPI: 1.5.3
   Open MPI SVN revision: r24532
   Open MPI release date: Mar 16, 2011
Open RTE: 1.5.3
   Open RTE SVN revision: r24532
   Open RTE release date: Mar 16, 2011
OPAL: 1.5.3
   OPAL SVN revision: r24532
   OPAL release date: Mar 16, 2011
Ident string: 1.5.3
  Prefix: C:\Program Files\OpenMPI_v1.5.3-win32
 Configured architecture: x86 Windows-6.1
  Configure host: VISCLUSTER25
   Configured by: hpcfan
   Configured on: 16:10 16.03.2011
  Configure host: VISCLUSTER25
Built by: hpcfan
Built on: 16:10 16.03.2011
  Built host: VISCLUSTER25
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: no
  Fortran90 bindings: no
 Fortran90 bindings size: na
  C compiler: cl
 C compiler absolute: c:/VSDev/VC/bin/cl.exe
  C compiler family name: MICROSOFT
  C compiler version: 1600
C++ compiler: cl
   C++ compiler absolute: c:/VSDev/VC/bin/cl.exe
  Fortran77 compiler: none
  Fortran77 compiler abs: none
  Fortran90 compiler: none
  Fortran90 compiler abs: none
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: no
 Fortran90 profiling: no
  C++ exceptions: no
  Thread support: no
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: no
 MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
 libltdl support: no
   Heterogeneous support: no
 mpirun default --prefix: yes
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
 Symbol vis. support: yes
  MPI extensions: none
   FT Checkpoint support: yes (checkpoint thread: no)
  MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
 MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
   MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
   MCA backtrace: none (MCA v2.0, API v2.0, Component v1.5.3)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
   MCA timer: windows (MCA v2.0, API v2.0, Component v1.5.3)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
 MCA installdirs: windows (MCA v2.0, API v2.0, Component v1.5.3)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
 MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
 MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.3)
 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.3)
 MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.3)
 MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.3)
 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.3)
 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA odls: process (MCA v2.0, API v2.0, Component v1.5.3)
 MCA ras: ccp (MCA v2.0, API v2.0, Component v1.5.3)
   MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.3)
   MCA 

[OMPI users] openmpi 1.5.5. build issue with cuda 4.1

2012-04-02 Thread Srinath Vadlamani
I have a build error with the newest release, openmpi 1.5.5, building
against CUDA 4.1:

Making all in vtlib
make[5]: Entering directory
`/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/build/ompi/contrib/vt/vt/vtlib'
  CC vt_libwrap.lo
  CC vt_gpu.lo
  CC vt_cudartwrap.lo
  CC vt_cudart.lo
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:1378:14:
error: conflicting types for 'cudaPointerGetAttributes'
cudaError_t  cudaPointerGetAttributes(struct cudaPointerAttributes
*attributes, void *ptr)
 ^
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.c:13:
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cudartwrap.h:25:
In file included from
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_science_openmpi/openmpi/work/openmpi-1.5.5/ompi/contrib/vt/vt/vtlib/vt_cuda_runtime_api.h:20:
/usr/local/cuda/include/cuda_runtime_api.h:3899:39: note: previous
declaration is here
extern __host__ cudaError_t CUDARTAPI cudaPointerGetAttributes(struct
cudaPointerAttributes *attributes, const void *ptr);
  ^
1 error generated.
make[5]: *** [vt_cudartwrap.lo] Error 1

The error stems from the use of the CUDA 4.0 version of cudaPointerGetAttributes
from:
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__UNIFIED_gccb4831aa37562c0af3e6b6712e0f12c.html

but:
cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const
void *ptr);

is the correct signature for Cuda 4.1

CUDA 4.1 is the current release, so I suggest a patch be made for openmpi
1.5.5 to detect the CUDA version and then use the appropriate signature.
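
A rough sketch of what such a guard could look like around the wrapper
definition - CUDART_VERSION comes from cuda_runtime_api.h (4010 for CUDA 4.1);
the exact placement inside vt_cudartwrap.c and the placeholder body are
assumptions, shown only to illustrate the idea:

  #include <cuda_runtime_api.h>

  #if CUDART_VERSION >= 4010   /* CUDA 4.1 and newer: const-qualified ptr */
  cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes,
                                       const void *ptr)
  #else                        /* CUDA 4.0 and older */
  cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes,
                                       void *ptr)
  #endif
  {
      /* placeholder: the real VampirTrace wrapper body stays unchanged */
      return cudaSuccess;
  }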

Srinath


=
Srinath Vadlamani
=


Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)

2012-04-02 Thread Reuti
On 02.04.2012 at 17:40, Ralph Castain wrote:

> I'm not sure the 1.4 series can support that behavior. Each mpirun only knows 
> about itself - it has no idea something else is going on.
> 
> If you attempted to bind, all procs of same rank from each run would bind on 
> the same CPU.
> 
> All you can really do is use -host to tell the fourth run not to use the 
> first node. Or use the devel trunk, which has more ability to separate runs.

Aha, this could be interesting, as I face a similar issue on one of the clusters
we use: I want to run several `mpiexec &` under SLURM inside one job. But as
none of them knows about the others, I had to disable the SLURM integration by
unsetting the job id information.

Despite the fact that I used a proper value for -np *and* worked only on one node,
launching mpiexec twice (i.e. `mpiexec &` twice) revealed otherwise:

"All nodes which are allocated for this job are already filled."

What does 1.5 offer in detail in this area?

-- Reuti


> Sent from my iPad
> 
> On Apr 2, 2012, at 6:53 AM, Rémi Palancher  wrote:
> 
>> Hi there,
>> 
>> I'm encountering a problem when trying to run multiple mpirun in parallel 
>> inside
>> one SLURM allocation on multiple nodes using a QLogic interconnect network 
>> with
>> PSM.
>> 
>> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>> 
>> My cluster is composed of 12 cores nodes.
>> 
>> Here is how I'm able to reproduce the problem:
>> 
>> Allocate 20 CPU on 2 nodes :
>> 
>> frontend $ salloc -N 2 -n 20
>> frontend $ srun hostname | sort | uniq -c
>>12 cn1381
>> 8 cn1382
>> 
>> My job allocates 12 CPU on node cn1381 and 8 CPU on cn1382.
>> 
>> My test MPI program parse for each task the value of Cpus_allowed_list in 
>> file
>> /proc/$PID/status and print it.
>> 
>> If I run it on all 20 allocated CPU, it works well:
>> 
>> frontend $ mpirun get-allowed-cpu-ompi 1
>> Launch 1 Task 00 of 20 (cn1381): 0
>> Launch 1 Task 01 of 20 (cn1381): 1
>> Launch 1 Task 02 of 20 (cn1381): 2
>> Launch 1 Task 03 of 20 (cn1381): 3
>> Launch 1 Task 04 of 20 (cn1381): 4
>> Launch 1 Task 05 of 20 (cn1381): 7
>> Launch 1 Task 06 of 20 (cn1381): 5
>> Launch 1 Task 07 of 20 (cn1381): 9
>> Launch 1 Task 08 of 20 (cn1381): 8
>> Launch 1 Task 09 of 20 (cn1381): 10
>> Launch 1 Task 10 of 20 (cn1381): 6
>> Launch 1 Task 11 of 20 (cn1381): 11
>> Launch 1 Task 12 of 20 (cn1382): 4
>> Launch 1 Task 13 of 20 (cn1382): 5
>> Launch 1 Task 14 of 20 (cn1382): 6
>> Launch 1 Task 15 of 20 (cn1382): 7
>> Launch 1 Task 16 of 20 (cn1382): 8
>> Launch 1 Task 17 of 20 (cn1382): 10
>> Launch 1 Task 18 of 20 (cn1382): 9
>> Launch 1 Task 19 of 20 (cn1382): 11
>> 
>> Here you can see that Slurm gave me CPU 0-11 on cn1381 and 4-11 on cn1382.
>> 
>> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my 
>> job.
>> 
>> frontend $ cat params.txt
>> 1
>> 2
>> 3
>> 4
>> 5
>> 
>> It works well when I launch 3 runs in parallel, where it only use the 12 CPU 
>> of
>> the first node (3 runs x 4 tasks = 12 CPU):
>> 
>> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> Launch 2 Task 00 of 04 (cn1381): 1
>> Launch 2 Task 01 of 04 (cn1381): 2
>> Launch 2 Task 02 of 04 (cn1381): 4
>> Launch 2 Task 03 of 04 (cn1381): 7
>> Launch 1 Task 00 of 04 (cn1381): 0
>> Launch 1 Task 01 of 04 (cn1381): 3
>> Launch 1 Task 02 of 04 (cn1381): 5
>> Launch 1 Task 03 of 04 (cn1381): 6
>> Launch 3 Task 00 of 04 (cn1381): 9
>> Launch 3 Task 01 of 04 (cn1381): 8
>> Launch 3 Task 02 of 04 (cn1381): 10
>> Launch 3 Task 03 of 04 (cn1381): 11
>> Launch 4 Task 00 of 04 (cn1381): 0
>> Launch 4 Task 01 of 04 (cn1381): 3
>> Launch 4 Task 02 of 04 (cn1381): 1
>> Launch 4 Task 03 of 04 (cn1381): 5
>> Launch 5 Task 00 of 04 (cn1381): 2
>> Launch 5 Task 01 of 04 (cn1381): 4
>> Launch 5 Task 02 of 04 (cn1381): 7
>> Launch 5 Task 03 of 04 (cn1381): 6
>> 
>> But when I try to launch 4 runs or more in parallel, where it needs to use 
>> the
>> CPU of the other node as well, it fails:
>> 
>> frontend $ $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> cn1381.23245ipath_userinit: assign_context command failed: Network is down
>> cn1381.23245can't open /dev/ipath, network down (err=26)
>> --
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>> 
>> Error: Could not detect network connectivity
>> --
>> cn1381.23248ipath_userinit: assign_context command failed: Network is down
>> cn1381.23248can't open /dev/ipath, network down (err=26)
>> --
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>> 
>> Error: Could not detect network connectivity
>> 

Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)

2012-04-02 Thread Gutierrez, Samuel K
Sorry to hijack the thread, but I have a question regarding the failed PSM 
initialization.

Some of our users oversubscribe a node with multiple mpiruns in order to run 
their regression tests.  Recently, a user reported the same "Could not detect 
network connectivity" error.

My question:  is there a way to allow this type of behavior?  That is, 
oversubscribe a node with multiple mpiruns.  For example, say I have a node 
with 16 processing elements and I want to run 8 instances of "mpirun -n 3 
mpi_foo" on a single node simultaneously and don't care about performance.

Please note that oversubscription within one node and a **single** mpirun works 
as expected.  The error only shows up when another mpirun wants to join the 
party.

Thanks,

Lost in Los Alamos


On Apr 2, 2012, at 9:40 AM, Ralph Castain wrote:

> I'm not sure the 1.4 series can support that behavior. Each mpirun only knows 
> about itself - it has no idea something else is going on.
> 
> If you attempted to bind, all procs of same rank from each run would bind on 
> the same CPU.
> 
> All you can really do is use -host to tell the fourth run not to use the 
> first node. Or use the devel trunk, which has more ability to separate runs.
> 
> Sent from my iPad
> 
> On Apr 2, 2012, at 6:53 AM, Rémi Palancher  wrote:
> 
>> Hi there,
>> 
>> I'm encountering a problem when trying to run multiple mpirun in parallel 
>> inside
>> one SLURM allocation on multiple nodes using a QLogic interconnect network 
>> with
>> PSM.
>> 
>> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
>> 
>> My cluster is composed of 12 cores nodes.
>> 
>> Here is how I'm able to reproduce the problem:
>> 
>> Allocate 20 CPU on 2 nodes :
>> 
>> frontend $ salloc -N 2 -n 20
>> frontend $ srun hostname | sort | uniq -c
>>12 cn1381
>> 8 cn1382
>> 
>> My job allocates 12 CPU on node cn1381 and 8 CPU on cn1382.
>> 
>> My test MPI program parse for each task the value of Cpus_allowed_list in 
>> file
>> /proc/$PID/status and print it.
>> 
>> If I run it on all 20 allocated CPU, it works well:
>> 
>> frontend $ mpirun get-allowed-cpu-ompi 1
>> Launch 1 Task 00 of 20 (cn1381): 0
>> Launch 1 Task 01 of 20 (cn1381): 1
>> Launch 1 Task 02 of 20 (cn1381): 2
>> Launch 1 Task 03 of 20 (cn1381): 3
>> Launch 1 Task 04 of 20 (cn1381): 4
>> Launch 1 Task 05 of 20 (cn1381): 7
>> Launch 1 Task 06 of 20 (cn1381): 5
>> Launch 1 Task 07 of 20 (cn1381): 9
>> Launch 1 Task 08 of 20 (cn1381): 8
>> Launch 1 Task 09 of 20 (cn1381): 10
>> Launch 1 Task 10 of 20 (cn1381): 6
>> Launch 1 Task 11 of 20 (cn1381): 11
>> Launch 1 Task 12 of 20 (cn1382): 4
>> Launch 1 Task 13 of 20 (cn1382): 5
>> Launch 1 Task 14 of 20 (cn1382): 6
>> Launch 1 Task 15 of 20 (cn1382): 7
>> Launch 1 Task 16 of 20 (cn1382): 8
>> Launch 1 Task 17 of 20 (cn1382): 10
>> Launch 1 Task 18 of 20 (cn1382): 9
>> Launch 1 Task 19 of 20 (cn1382): 11
>> 
>> Here you can see that Slurm gave me CPU 0-11 on cn1381 and 4-11 on cn1382.
>> 
>> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my 
>> job.
>> 
>> frontend $ cat params.txt
>> 1
>> 2
>> 3
>> 4
>> 5
>> 
>> It works well when I launch 3 runs in parallel, where it only use the 12 CPU 
>> of
>> the first node (3 runs x 4 tasks = 12 CPU):
>> 
>> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> Launch 2 Task 00 of 04 (cn1381): 1
>> Launch 2 Task 01 of 04 (cn1381): 2
>> Launch 2 Task 02 of 04 (cn1381): 4
>> Launch 2 Task 03 of 04 (cn1381): 7
>> Launch 1 Task 00 of 04 (cn1381): 0
>> Launch 1 Task 01 of 04 (cn1381): 3
>> Launch 1 Task 02 of 04 (cn1381): 5
>> Launch 1 Task 03 of 04 (cn1381): 6
>> Launch 3 Task 00 of 04 (cn1381): 9
>> Launch 3 Task 01 of 04 (cn1381): 8
>> Launch 3 Task 02 of 04 (cn1381): 10
>> Launch 3 Task 03 of 04 (cn1381): 11
>> Launch 4 Task 00 of 04 (cn1381): 0
>> Launch 4 Task 01 of 04 (cn1381): 3
>> Launch 4 Task 02 of 04 (cn1381): 1
>> Launch 4 Task 03 of 04 (cn1381): 5
>> Launch 5 Task 00 of 04 (cn1381): 2
>> Launch 5 Task 01 of 04 (cn1381): 4
>> Launch 5 Task 02 of 04 (cn1381): 7
>> Launch 5 Task 03 of 04 (cn1381): 6
>> 
>> But when I try to launch 4 runs or more in parallel, where it needs to use 
>> the
>> CPU of the other node as well, it fails:
>> 
>> frontend $ $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
>> cn1381.23245ipath_userinit: assign_context command failed: Network is down
>> cn1381.23245can't open /dev/ipath, network down (err=26)
>> --
>> PSM was unable to open an endpoint. Please make sure that the network link is
>> active on the node and the hardware is functioning.
>> 
>> Error: Could not detect network connectivity
>> --
>> cn1381.23248ipath_userinit: assign_context command failed: Network is down
>> cn1381.23248can't open /dev/ipath, network down (err=26)
>> 

Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)

2012-04-02 Thread Ralph Castain
I'm not sure the 1.4 series can support that behavior. Each mpirun only knows 
about itself - it has no idea something else is going on.

If you attempted to bind, all procs of same rank from each run would bind on 
the same CPU.

All you can really do is use -host to tell the fourth run not to use the first 
node. Or use the devel trunk, which has more ability to separate runs.
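
For instance, with the node names from the example quoted below, that would be
something along the lines of (-host / --host is a standard mpirun option; adjust
the name to your allocation):

  mpirun -np 4 -host cn1382 get-allowed-cpu-ompi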

Sent from my iPad

On Apr 2, 2012, at 6:53 AM, Rémi Palancher  wrote:

> Hi there,
> 
> I'm encountering a problem when trying to run multiple mpirun in parallel 
> inside
> one SLURM allocation on multiple nodes using a QLogic interconnect network 
> with
> PSM.
> 
> I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.
> 
> My cluster is composed of 12 cores nodes.
> 
> Here is how I'm able to reproduce the problem:
> 
> Allocate 20 CPU on 2 nodes :
> 
> frontend $ salloc -N 2 -n 20
> frontend $ srun hostname | sort | uniq -c
> 12 cn1381
>  8 cn1382
> 
> My job allocates 12 CPU on node cn1381 and 8 CPU on cn1382.
> 
> My test MPI program parse for each task the value of Cpus_allowed_list in file
> /proc/$PID/status and print it.
> 
> If I run it on all 20 allocated CPU, it works well:
> 
> frontend $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 00 of 20 (cn1381): 0
> Launch 1 Task 01 of 20 (cn1381): 1
> Launch 1 Task 02 of 20 (cn1381): 2
> Launch 1 Task 03 of 20 (cn1381): 3
> Launch 1 Task 04 of 20 (cn1381): 4
> Launch 1 Task 05 of 20 (cn1381): 7
> Launch 1 Task 06 of 20 (cn1381): 5
> Launch 1 Task 07 of 20 (cn1381): 9
> Launch 1 Task 08 of 20 (cn1381): 8
> Launch 1 Task 09 of 20 (cn1381): 10
> Launch 1 Task 10 of 20 (cn1381): 6
> Launch 1 Task 11 of 20 (cn1381): 11
> Launch 1 Task 12 of 20 (cn1382): 4
> Launch 1 Task 13 of 20 (cn1382): 5
> Launch 1 Task 14 of 20 (cn1382): 6
> Launch 1 Task 15 of 20 (cn1382): 7
> Launch 1 Task 16 of 20 (cn1382): 8
> Launch 1 Task 17 of 20 (cn1382): 10
> Launch 1 Task 18 of 20 (cn1382): 9
> Launch 1 Task 19 of 20 (cn1382): 11
> 
> Here you can see that Slurm gave me CPU 0-11 on cn1381 and 4-11 on cn1382.
> 
> Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my 
> job.
> 
> frontend $ cat params.txt
> 1
> 2
> 3
> 4
> 5
> 
> It works well when I launch 3 runs in parallel, where it only use the 12 CPU 
> of
> the first node (3 runs x 4 tasks = 12 CPU):
> 
> frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
> Launch 2 Task 00 of 04 (cn1381): 1
> Launch 2 Task 01 of 04 (cn1381): 2
> Launch 2 Task 02 of 04 (cn1381): 4
> Launch 2 Task 03 of 04 (cn1381): 7
> Launch 1 Task 00 of 04 (cn1381): 0
> Launch 1 Task 01 of 04 (cn1381): 3
> Launch 1 Task 02 of 04 (cn1381): 5
> Launch 1 Task 03 of 04 (cn1381): 6
> Launch 3 Task 00 of 04 (cn1381): 9
> Launch 3 Task 01 of 04 (cn1381): 8
> Launch 3 Task 02 of 04 (cn1381): 10
> Launch 3 Task 03 of 04 (cn1381): 11
> Launch 4 Task 00 of 04 (cn1381): 0
> Launch 4 Task 01 of 04 (cn1381): 3
> Launch 4 Task 02 of 04 (cn1381): 1
> Launch 4 Task 03 of 04 (cn1381): 5
> Launch 5 Task 00 of 04 (cn1381): 2
> Launch 5 Task 01 of 04 (cn1381): 4
> Launch 5 Task 02 of 04 (cn1381): 7
> Launch 5 Task 03 of 04 (cn1381): 6
> 
> But when I try to launch 4 runs or more in parallel, where it needs to use the
> CPU of the other node as well, it fails:
> 
> frontend $ $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
> cn1381.23245ipath_userinit: assign_context command failed: Network is down
> cn1381.23245can't open /dev/ipath, network down (err=26)
> --
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
> 
>  Error: Could not detect network connectivity
> --
> cn1381.23248ipath_userinit: assign_context command failed: Network is down
> cn1381.23248can't open /dev/ipath, network down (err=26)
> --
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
> 
>  Error: Could not detect network connectivity
> --
> cn1381.23247ipath_userinit: assign_context command failed: Network is down
> cn1381.23247can't open /dev/ipath, network down (err=26)
> cn1381.23249ipath_userinit: assign_context command failed: Network is down
> cn1381.23249can't open /dev/ipath, network down (err=26)
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> 

Re: [OMPI users] redirecting output

2012-04-02 Thread Prentice Bisbal
On 03/30/2012 11:12 AM, Tim Prince wrote:
>  On 03/30/2012 10:41 AM, tyler.bal...@huskers.unl.edu wrote:
>>
>>
>> I am using the command mpirun -np nprocs -machinefile machines.arch
>> Pcrystal and my output strolls across my terminal. I would like to
>> send this output to a file and I cannot figure out how to do so. I
>> have tried the general > FILENAME and > log &; these generate
>> files, however they are empty. Any help would be appreciated.

If you see the output on your screen, but it's not being redirected to a
file, it must be printing to STDERR and not STDOUT. The '>' by itself
redirects STDOUT only, so it doesn't redirect error messages. To
redirect STDERR, you can use '2>', which says redirect filehandle # 2,
which is stderr.

some_command 2> myerror.log

or

some_command >myoutput.log 2>myerror.log

 To redirect both STDOUT and STDERR to the same place, use the syntax
"2>&1" to tie STDERR to STDOUT:

some_command > myoutput.log 2>&1

I prefer to see the output on the screen at the same time I write it to a
file. That way, if the command hangs for some reason, I know it
immediately. I find the 'tee' command priceless for this:

some_command 2>&1 | tee myoutput.log

Google for 'bash output redirection' and you'll find many helpful pages
with better explanation and examples, like this one:

http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

If you don't use bash, those results will be much less helpful.

I hope that helps, or at least gets you pointed in the right direction.

--
Prentice

>
> If you run under screen your terminal output should be collected in
> screenlog.  Beats me why some sysadmins don't see fit to install screen.
>





[OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)

2012-04-02 Thread Rémi Palancher

Hi there,

I'm encountering a problem when trying to run multiple mpirun in parallel inside
one SLURM allocation on multiple nodes using a QLogic interconnect network with
PSM.

I'm using Open MPI version 1.4.5 compiled with GCC 4.4.5 on Debian Lenny.

My cluster is composed of 12-core nodes.

Here is how I'm able to reproduce the problem:

Allocate 20 CPU on 2 nodes :

frontend $ salloc -N 2 -n 20
frontend $ srun hostname | sort | uniq -c
 12 cn1381
  8 cn1382

My job allocates 12 CPU on node cn1381 and 8 CPU on cn1382.

My test MPI program parses, for each task, the value of Cpus_allowed_list in file
/proc/$PID/status and prints it.
If I run it on all 20 allocated CPU, it works well:

frontend $ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 00 of 20 (cn1381): 0
Launch 1 Task 01 of 20 (cn1381): 1
Launch 1 Task 02 of 20 (cn1381): 2
Launch 1 Task 03 of 20 (cn1381): 3
Launch 1 Task 04 of 20 (cn1381): 4
Launch 1 Task 05 of 20 (cn1381): 7
Launch 1 Task 06 of 20 (cn1381): 5
Launch 1 Task 07 of 20 (cn1381): 9
Launch 1 Task 08 of 20 (cn1381): 8
Launch 1 Task 09 of 20 (cn1381): 10
Launch 1 Task 10 of 20 (cn1381): 6
Launch 1 Task 11 of 20 (cn1381): 11
Launch 1 Task 12 of 20 (cn1382): 4
Launch 1 Task 13 of 20 (cn1382): 5
Launch 1 Task 14 of 20 (cn1382): 6
Launch 1 Task 15 of 20 (cn1382): 7
Launch 1 Task 16 of 20 (cn1382): 8
Launch 1 Task 17 of 20 (cn1382): 10
Launch 1 Task 18 of 20 (cn1382): 9
Launch 1 Task 19 of 20 (cn1382): 11

Here you can see that Slurm gave me CPU 0-11 on cn1381 and 4-11 on cn1382.

Now I'd like to run multiple MPI runs in parallel, 4 tasks each, inside my job.


frontend $ cat params.txt
1
2
3
4
5

It works well when I launch 3 runs in parallel, where it only uses the 12 CPU of
the first node (3 runs x 4 tasks = 12 CPU):

frontend $ xargs -P 3 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt

Launch 2 Task 00 of 04 (cn1381): 1
Launch 2 Task 01 of 04 (cn1381): 2
Launch 2 Task 02 of 04 (cn1381): 4
Launch 2 Task 03 of 04 (cn1381): 7
Launch 1 Task 00 of 04 (cn1381): 0
Launch 1 Task 01 of 04 (cn1381): 3
Launch 1 Task 02 of 04 (cn1381): 5
Launch 1 Task 03 of 04 (cn1381): 6
Launch 3 Task 00 of 04 (cn1381): 9
Launch 3 Task 01 of 04 (cn1381): 8
Launch 3 Task 02 of 04 (cn1381): 10
Launch 3 Task 03 of 04 (cn1381): 11
Launch 4 Task 00 of 04 (cn1381): 0
Launch 4 Task 01 of 04 (cn1381): 3
Launch 4 Task 02 of 04 (cn1381): 1
Launch 4 Task 03 of 04 (cn1381): 5
Launch 5 Task 00 of 04 (cn1381): 2
Launch 5 Task 01 of 04 (cn1381): 4
Launch 5 Task 02 of 04 (cn1381): 7
Launch 5 Task 03 of 04 (cn1381): 6

But when I try to launch 4 runs or more in parallel, where it needs to use the
CPU of the other node as well, it fails:

frontend $ xargs -P 4 -n 1 mpirun -n 4 get-allowed-cpu-ompi < params.txt
cn1381.23245ipath_userinit: assign_context command failed: Network is down
cn1381.23245can't open /dev/ipath, network down (err=26)
--
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--
cn1381.23248ipath_userinit: assign_context command failed: Network is down
cn1381.23248can't open /dev/ipath, network down (err=26)
--
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--
cn1381.23247ipath_userinit: assign_context command failed: Network is down
cn1381.23247can't open /dev/ipath, network down (err=26)
cn1381.23249ipath_userinit: assign_context command failed: Network is down
cn1381.23249can't open /dev/ipath, network down (err=26)
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[cn1381:23245] 

Re: [OMPI users] Help with multicore AMD machine performance

2012-04-02 Thread Nico Mittenzwey

Hi,


I'm benchmarking our (well tested) parallel code on an AMD-based system,
featuring 2x AMD Opteron(TM) Processor 6276, with 16 cores each for a total of
32 cores. The system is running Scientific Linux 6.1 and OpenMPI 1.4.5.

When I run a single-core job the performance is as expected. However, when I
run with 32 processes the performance drops to about 60%.


Be aware that on AMD CPUs based on the Bulldozer/Interlagos architecture, 2
cores share the FPU units of one module. There is also a problem with
cross-cache invalidations [1] in earlier kernel versions - be sure to
use an up-to-date kernel (2.6.32-220.7.1).
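
If FPU sharing is the suspect, one quick way to test it is to run only 16 ranks,
bound to one core per module. A sketch using an Open MPI rankfile - the hostname
"node01" is a placeholder, and the assumption that even/odd core pairs form a
module depends on your BIOS/kernel enumeration, so verify that first:

  # generate a rankfile that uses only even-numbered cores (one per module)
  for i in $(seq 0 15); do echo "rank $i=node01 slot=$((2*i))"; done > myrankfile
  mpirun -np 16 --rankfile myrankfile ./your_benchmark

If 16 ranks spread one per module run markedly faster than 16 ranks packed onto
8 modules, the shared FPUs are the likely cause of the slowdown.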


Cheers,
Nico

[1] http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf


Re: [OMPI users] Error while loading shared libraries

2012-04-02 Thread Reuti
On 02.04.2012 at 09:56, Rohan Deshpande wrote:

> Yes, I am trying to run the program using multiple hosts. 
> 
> The program executes fine but does not use any slaves when I run
> 
>   mpirun -np 8 hello --hostfile slaves
> 
> The program throws error saying shared libraries not found when I run
> 
>   mpirun --hostfile slaves -np 8

a) As Rayson mentioned: the libraries are available on the slaves?

b) It might be necessary to export an LD_LIBRARY_PATH in your .bashrc or forward 
the variable by Open MPI to point to the location of the libraries.

c) It could also work to create a static version of Open MPI by --enable-static 
--disable-shared and recompile the application.
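
For example, a quick sketch of (b) and (c), assuming Open MPI is installed under
/opt/openmpi-1.4.5 on every node (adjust the path to your installation):

  # b) let the runtime linker find the libraries on the slaves, e.g. in ~/.bashrc:
  export LD_LIBRARY_PATH=/opt/openmpi-1.4.5/lib:$LD_LIBRARY_PATH

  # or forward the variable from the head node just for this run:
  mpirun -x LD_LIBRARY_PATH --hostfile slaves -np 8 ./hello

  # mpirun's --prefix achieves much the same without touching .bashrc:
  mpirun --prefix /opt/openmpi-1.4.5 --hostfile slaves -np 8 ./hello

  # c) or rebuild Open MPI statically and relink the application:
  ./configure --prefix=/opt/openmpi-1.4.5 --enable-static --disable-shared
  make all install
  mpicc hello.c -o hello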

-- Reuti


>   
> 
> On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho  wrote:
> On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande  wrote:
> >   error while loading shared libraries: libmpi.so.0: cannot open shared
> > object file no such object file: No such file or directory.
> 
> Were you trying to run the MPI program on a remote machine?? If you
> are, then make sure that each machine has the libraries installed (or
> you can install Open MPI on an NFS directory).
> 
> Rayson
> 
> =
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
> 
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
> 
> 
> >
> > When I run using - mpirun -np 1 ldd hello the following libraries are not
> > found
> >   1. libmpi.so.0
> >   2. libopen-rte.so.0
> >   3. libopen.pal.so.0
> >
> > I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> > are correctly set and 'which mpicc' returns correct path
> >
> > Any help would be highly appreciated.
> >
> > Thanks
> >
> >
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> -- 
> 
> Best Regards,
> 
> ROHAN DESHPANDE  
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] (no subject)

2012-04-02 Thread vladimir marjanovic
http://whatbunny.org/web/app/_cache/02efpk.html

Re: [OMPI users] Error while loading shared libraries

2012-04-02 Thread Rohan Deshpande
Yes, I am trying to run the program using multiple hosts.

The program executes fine but *does not use any slaves* when I run

  *mpirun -np 8 hello --hostfile slaves*

The program throws error saying *shared libraries not found* when I run

 * mpirun --hostfile slaves -np 8*


On Mon, Apr 2, 2012 at 2:52 PM, Rayson Ho  wrote:

> On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande 
> wrote:
> >   error while loading shared libraries: libmpi.so.0: cannot open shared
> > object file no such object file: No such file or directory.
>
> Were you trying to run the MPI program on a remote machine?? If you
> are, then make sure that each machine has the libraries installed (or
> you can install Open MPI on an NFS directory).
>
> Rayson
>
> =
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
> >
> > When I run using - mpirun -np 1 ldd hello the following libraries are not
> > found
> >   1. libmpi.so.0
> >   2. libopen-rte.so.0
> >   3. libopen.pal.so.0
> >
> > I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> > are correctly set and 'which mpicc' returns correct path
> >
> > Any help would be highly appreciated.
> >
> > Thanks
> >
> >
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 

Best Regards,

ROHAN DESHPANDE


Re: [OMPI users] Error while loading shared libraries

2012-04-02 Thread Rayson Ho
On Sun, Apr 1, 2012 at 11:27 PM, Rohan Deshpande  wrote:
>   error while loading shared libraries: libmpi.so.0: cannot open shared
> object file no such object file: No such file or directory.

Were you trying to run the MPI program on a remote machine?? If you
are, then make sure that each machine has the libraries installed (or
you can install Open MPI on an NFS directory).

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


>
> When I run using - mpirun -np 1 ldd hello the following libraries are not
> found
>   1. libmpi.so.0
>   2. libopen-rte.so.0
>   3. libopen.pal.so.0
>
> I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
> are correctly set and 'which mpicc' returns correct path
>
> Any help would be highly appreciated.
>
> Thanks
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] Error while loading shared libraries

2012-04-02 Thread Rohan Deshpande
Hi ,

I have installed mpi successfully and I am able to compile the programs
using mpicc

But when I run using mpirun, I get following error

 * error while loading shared libraries: libmpi.so.0: cannot open shared
object file no such object file: No such file or directory. *

When I run using - mpirun -np 1 ldd hello the following libraries are not
found
  1. *libmpi.so.0*
  2.* libopen-rte.so.0*
  3. *libopen.pal.so.0*

I am using openmpi version 1.4.5. Also PATH and LD_LIBRARY_PATH variables
are correctly set and 'which mpicc' returns correct path

Any help would be highly appreciated.

Thanks