Re: [OMPI users] [EXTERNAL] hwloc support for Power9/IBM AC922 servers

2019-04-16 Thread Hammond, Simon David via users
Hi Prentice,

We are using OpenMPI and HWLOC on POWER9 servers. The topology information 
looks good from our initial use.

Let me know if you need anything specifically.
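If it helps, beyond eyeballing lstopo output, a few lines against the hwloc C API
will sanity-check the counts programmatically. This is just a minimal sketch (the
calls below exist in both the hwloc 1.x and 2.x APIs); on a POWER9 machine running
in SMT-4 mode you would expect the PU count to be four times the core count:

#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;

    /* Discover the topology of the local machine. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Count physical cores and hardware threads (PUs). */
    int cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int pus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    printf("cores: %d, hardware threads: %d\n", cores, pus);

    hwloc_topology_destroy(topo);
    return 0;
}

(Compile with something like: gcc check_topo.c -o check_topo $(pkg-config --cflags --libs hwloc).)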

S.

—
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM

> On Apr 16, 2019, at 11:28 AM, Prentice Bisbal via users 
>  wrote:
> 
> OpenMPI Users,
> 
> Are any of you using hwloc on Power9 hardware, specifically the IBM AC922 
> servers? If so, have you encountered any issues? I checked the documentation 
> for the latest version (2.0.3), and found this:
> 
>> Since it uses standard Operating System information, hwloc's support is 
>> mostly independent from the processor type
>> (x86, powerpc, ...) and just relies on the Operating System support.
> 
> and this:
> 
>> To check whether hwloc works on a particular machine, just try to build it 
>> and run lstopo or lstopo-no-graphics.
>> If some things do not look right (e.g. bogus or missing cache information
> 
> We haven't bought any AC922 nodes yet, so I can't try that just yet. We are 
> looking to purchase a small cluster, and want to make sure there are no known 
> issues between the hardware and software before we make a purchase.
> 
> Any feedback will be greatly appreciated.
> 
> Thanks,
> 
> Prentice
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-05 Thread Hammond, Simon David via users
Folks,

Thanks for your help and prompt replies. We appreciate all the support we get 
from the community.

S.

-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
 

On 12/4/18, 6:57 PM, "users on behalf of Gilles Gouaillardet" 
 wrote:

Thanks for the report.


As far as I am concerned, this is a bug in the IMB benchmark, and I 
issued a PR to fix that

https://github.com/intel/mpi-benchmarks/pull/11


Meanwhile, you can manually download and apply the patch at

https://github.com/intel/mpi-benchmarks/pull/11.patch



Cheers,


Gilles


On 12/4/2018 4:41 AM, Hammond, Simon David via users wrote:
> Hi Open MPI Users,
>
> Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0 
when using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1 cluster. 
The bug occurs when running the Github master src_c variant of the Intel MPI 
Benchmarks.
>
> Configuration:
>
> ./configure 
--prefix=/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144 
--with-slurm --with-psm2 
CC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc
 
CXX=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc
 
FC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/ifort
 --with-zlib=/home/projects/x86-64/zlib/1.2.11 
--with-valgrind=/home/projects/x86-64/valgrind/3.13.0
>
> Operating System is RedHat 7.4 release and we utilize a local build of 
GCC 7.2.0 for our Intel compiler (C++) header files. Everything makes 
correctly, and passes a make check without any issues.
>
> We then compile IMB and run IMB-MPI1 on 24 nodes and get the following:
>
> #
> # Benchmarking Reduce_scatter
> # #processes = 64
> # ( 1088 additional processes waiting in MPI_Barrier)
> #
> #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>       0         1000         0.18         0.19         0.18
>       4         1000         7.39        10.37         8.68
>       8         1000         7.84        11.14         9.23
>      16         1000         8.50        12.37        10.14
>      32         1000        10.37        14.66        12.15
>      64         1000        13.76        18.82        16.17
>     128         1000        21.63        27.61        24.87
>     256         1000        39.98        47.27        43.96
>     512         1000        72.93        78.59        75.15
>    1024         1000       147.21       152.98       149.94
>    2048         1000       413.41       426.90       420.15
>    4096         1000       421.28       442.58       434.52
>    8192         1000       418.31       450.20       438.51
>   16384         1000      1082.85      1221.44      1140.92
>   32768         1000      2434.11      2529.90      2476.72
>   65536          640      5469.57      6048.60      5687.08
>  131072          320     11702.94     12435.06     12075.07
>  262144          160     19214.42     20433.83     19883.80
>  524288           80     49462.22     53896.43     52101.56
> 1048576           40    119422.53    131922.20    126920.99
> 2097152           20    256345.97    288185.72    275767.05
> [node06:351648] *** Process received signal ***
> [node06:351648] Signal: Segmentation fault (11)
> [node06:351648] Signal code: Invalid permissions (2)
> [node06:351648] Failing at address: 0x7fdb6efc4000
> [node06:351648] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fdb8646c5e0]
> [node06:351648] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
> [node06:351648] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7fdb858d847a]
> [node06:351648] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7fdb86c43b29]
> [node06:351648] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7fdb86c1de67]
> [node06:351648] [ 5] ./IMB-MPI1[0x40d624]
> [node06:351648] [ 6] ./IMB-MPI1[0x407d16]
> [node06:351648] [ 7] ./IMB-MPI1[0x403356]
> [node06:351648] [ 8] 
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb860bbc05]

[OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-03 Thread Hammond, Simon David via users
Hi Open MPI Users,

Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0 when 
using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1 cluster. The 
bug occurs when running the Github master src_c variant of the Intel MPI 
Benchmarks.

Configuration: 

./configure --prefix=/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144 
--with-slurm --with-psm2 
CC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc
 
CXX=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc
 
FC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/ifort
 --with-zlib=/home/projects/x86-64/zlib/1.2.11 
--with-valgrind=/home/projects/x86-64/valgrind/3.13.0

The operating system is the RedHat 7.4 release, and we use a local build of GCC 
7.2.0 to provide the C++ header files for the Intel compiler. Everything builds 
correctly and passes a make check without any issues.

We then compile IMB and run IMB-MPI1 on 24 nodes and get the following:

#
# Benchmarking Reduce_scatter
# #processes = 64
# ( 1088 additional processes waiting in MPI_Barrier)
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.18         0.19         0.18
        4         1000         7.39        10.37         8.68
        8         1000         7.84        11.14         9.23
       16         1000         8.50        12.37        10.14
       32         1000        10.37        14.66        12.15
       64         1000        13.76        18.82        16.17
      128         1000        21.63        27.61        24.87
      256         1000        39.98        47.27        43.96
      512         1000        72.93        78.59        75.15
     1024         1000       147.21       152.98       149.94
     2048         1000       413.41       426.90       420.15
     4096         1000       421.28       442.58       434.52
     8192         1000       418.31       450.20       438.51
    16384         1000      1082.85      1221.44      1140.92
    32768         1000      2434.11      2529.90      2476.72
    65536          640      5469.57      6048.60      5687.08
   131072          320     11702.94     12435.06     12075.07
   262144          160     19214.42     20433.83     19883.80
   524288           80     49462.22     53896.43     52101.56
  1048576           40    119422.53    131922.20    126920.99
  2097152           20    256345.97    288185.72    275767.05
[node06:351648] *** Process received signal ***
[node06:351648] Signal: Segmentation fault (11)
[node06:351648] Signal code: Invalid permissions (2)
[node06:351648] Failing at address: 0x7fdb6efc4000
[node06:351648] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fdb8646c5e0]
[node06:351648] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351648] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7fdb858d847a]
[node06:351648] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7fdb86c43b29]
[node06:351648] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7fdb86c1de67]
[node06:351648] [ 5] ./IMB-MPI1[0x40d624]
[node06:351648] [ 6] ./IMB-MPI1[0x407d16]
[node06:351648] [ 7] ./IMB-MPI1[0x403356]
[node06:351648] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb860bbc05]
[node06:351648] [ 9] ./IMB-MPI1[0x402da9]
[node06:351648] *** End of error message ***
[node06:351649] *** Process received signal ***
[node06:351649] Signal: Segmentation fault (11)
[node06:351649] Signal code: Invalid permissions (2)
[node06:351649] Failing at address: 0x7f9b19c6f000
[node06:351649] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f9b311295e0]
[node06:351649] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351649] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7f9b3059547a]
[node06:351649] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7f9b31900b29]
[node06:351649] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7f9b318dae67]
[node06:351649] [ 5] ./IMB-MPI1[0x40d624]
[node06:351649] [ 6] ./IMB-MPI1[0x407d16]
[node06:351649] [node06:351657] *** Process received signal ***
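For context, the failing frames above are the ring reduce-scatter inside
PMPI_Reduce_scatter, driven by the IMB Reduce_scatter test. A minimal standalone
sketch of that call pattern (a hypothetical reproducer for illustration only, not
the IMB source; buffer sizes are arbitrary) looks like:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes 'count' ints per destination rank; IMB sweeps
       the message size upward, and the crash appears at the larger sizes. */
    const int count = 4096;
    int *sendbuf    = malloc((size_t)size * count * sizeof(int));
    int *recvbuf    = malloc((size_t)count * sizeof(int));
    int *recvcounts = malloc((size_t)size * sizeof(int));

    for (int i = 0; i < size * count; i++) sendbuf[i] = rank;
    for (int i = 0; i < size; i++) recvcounts[i] = count;

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_INT,
                       MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("recvbuf[0] = %d\n", recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
    MPI_Finalize();
    return 0;
}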


-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
 

___
users mailing list
users@lists.open-mpi.org

[OMPI users] Providing an Initial CPU Affinity List to mpirun

2018-11-20 Thread Hammond, Simon David via users
Hi OpenMPI Users,

I wonder if you can help us with a problem we are having when trying to force 
OpenMPI to use specific cores. We want to supply an initial CPU affinity list 
to mpirun and then have it select its appropriate binding from within that set. 
For instance, we would like to provide it with two cores and then have it bind-to/map-by 
core for two MPI processes. However, it doesn't appear that this works 
correctly with either OpenMPI 2.1.2 or 3.1.0 during our testing.

Example: POWER9 system with 32 cores running in SMT-4 mode.

$ numactl --physcpubind=4-63 mpirun -n 2 --map-by core --bind-to core 
./maskprinter-ppc64
System has: 128 logical cores.
Rank 0 : CPU Mask for Process 0 (total of 4 logical cores out of a max of 128 
cores)
Generating mask information...
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0
System has: 128 logical cores.
Rank : 1 CPU Mask for Process 0 (total of 4 logical cores out of a max of 128 
cores)
Generating mask information...
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0

But we do get the correct affinity if we don't use mpirun for a single process:

$ numactl --physcpubind=4-63 ./maskprinter-ppc64
System has: 128 logical cores.
CPU Mask for Process 0 (total of 60 logical cores out of a max of 128 cores)
Generating mask information...
0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0

In an ideal world, the above mpirun usage would shift the cores allocated to be 
within the 4-63 range.
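For reference, maskprinter-ppc64 is just a small internal utility; a roughly
equivalent sketch using the Linux sched_getaffinity interface, for anyone who
wants to reproduce the check, is:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <unistd.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);

    /* Query the affinity mask of the calling process. */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    long ncpus = sysconf(_SC_NPROCESSORS_CONF);
    printf("System has: %ld logical cores.\n", ncpus);

    /* Print a 0/1 flag per logical CPU, in the style of the output above. */
    for (long cpu = 0; cpu < ncpus; cpu++)
        printf("%d ", CPU_ISSET(cpu, &mask) ? 1 : 0);
    printf("\n");
    return 0;
}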

Is this possible with OpenMPI at all? I realize this is a fairly unusual request.

Thanks,

S. 

-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] ARM HPC Compiler 18.4.0 / OpenMPI 2.1.4 Hang for IMB All Reduce Test on 4 Ranks

2018-08-15 Thread Hammond, Simon David via users
Hi OpenMPI Users,

I am compiling OpenMPI 2.1.4 with the ARM 18.4.0 HPC Compiler on our ARM 
ThunderX2 system. Configuration options below. For now, I am using the simplest 
configuration test we can use on our system.

If I use the OpenMPI 2.1.4 that I have compiled and run a simple 4-rank run of 
the IMB MPI benchmark on a single node (so using shared memory for 
communication), the test hangs at the 4-rank test case (see below). All 
four processes seem to be spinning at 100% on a single core.

Configure Line: ./configure 
--prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 --with-slurm 
--enable-mpi-thread-multiple CC=`which armclang` CXX=`which armclang++` 
FC=`which armflang`

#
# Benchmarking Allreduce
# #processes = 4
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.02         0.02
        4         1000         2.31         2.31         2.31
        8         1000         2.37         2.37         2.37
       16         1000         2.46         2.46         2.46
       32         1000         2.46         2.46         2.46


When I use GDB to halt the code on one of the ranks and perform a backtrace, I 
seem to get the following stacks repeated (in a loop).

#0  0xbe3e765c in opal_timer_linux_get_cycles_sys_timer ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#1  0xbe36d910 in opal_progress ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#2  0xbe6f2568 in ompi_request_default_wait ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#3  0xbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0xbe703000 in PMPI_Barrier () from 
/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0x00402554 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0xbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
(gdb) bt
#0  0xbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
#1  0xb793f544 in btl_openib_component_progress ()
   from 
/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
#2  0xbe36d980 in opal_progress ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#3  0xbe6f2568 in ompi_request_default_wait ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0xbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0xbe703000 in PMPI_Barrier () from 
/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#6  0x00402554 in main ()

 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-07-01 Thread Hammond, Simon David via users
Nathan,

Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2.

S.

-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

On 6/16/18, 10:10 PM, "Nathan Hjelm"  wrote:

Try the latest nightly tarball for v3.1.x. Should be fixed. 

> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users 
 wrote:
> 
> The output from the test in question is:
> 
> Single thread test. Time: 0 s 10182 us 10 nsec/poppush
> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
> 
> 
> S.
> 
> -- 
> Si Hammond
> Scalable Computer Architectures
> Sandia National Laboratories, NM, USA
> [Sent from remote connection, excuse typos]
    > 
> 
> On 6/16/18, 5:45 PM, "Hammond, Simon David"  wrote:
> 
>Hi OpenMPI Team,
> 
>We have recently updated an install of OpenMPI on a POWER9 system 
(configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We 
seem to have a symptom where code that ran before is now locking up and making 
no progress, getting stuck in wait-all operations. While I think it's prudent 
for us to root cause this a little more, I have gone back and rebuilt MPI and 
re-run the "make check" tests. The opal_fifo test appears to hang forever. I am 
not sure if this is the cause of our issue but wanted to report that we are 
seeing this on our system.
> 
>OpenMPI 3.1.0 Configuration:
> 
>./configure 
--prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
--with-lsf=/opt/lsf/10.1 
--with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs
> 
>GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA 
for POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
>RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)
> 
>Output:
> 
>make[3]: Entering directory 
`/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
>make[4]: Entering directory 
`/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
>PASS: ompi_rb_tree
>PASS: opal_bitmap
>PASS: opal_hash_table
>PASS: opal_proc_table
>PASS: opal_tree
>PASS: opal_list
>PASS: opal_value_array
>PASS: opal_pointer_array
>PASS: opal_lifo
>
> 
>Output from Top:
> 
>20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
> 
>-- 
>Si Hammond
>Scalable Computer Architectures
>Sandia National Laboratories, NM, USA
>[Sent from remote connection, excuse typos]
> 
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-06-16 Thread Hammond, Simon David via users
The output from the test in question is:

Single thread test. Time: 0 s 10182 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush


S.
 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

On 6/16/18, 5:45 PM, "Hammond, Simon David"  wrote:

Hi OpenMPI Team,

We have recently updated an install of OpenMPI on a POWER9 system 
(configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We 
seem to have a symptom where code that ran before is now locking up and making 
no progress, getting stuck in wait-all operations. While I think it's prudent 
for us to root cause this a little more, I have gone back and rebuilt MPI and 
re-run the "make check" tests. The opal_fifo test appears to hang forever. I am 
not sure if this is the cause of our issue but wanted to report that we are 
seeing this on our system.

OpenMPI 3.1.0 Configuration:

./configure 
--prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
--with-lsf=/opt/lsf/10.1 
--with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs

GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for 
POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)

Output:

make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo


Output from Top:

20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-06-16 Thread Hammond, Simon David via users
Hi OpenMPI Team,

We have recently updated an install of OpenMPI on a POWER9 system (configuration 
details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We seem to have a 
symptom where code that ran before is now locking up and making no progress, 
getting stuck in wait-all operations. While I think it's prudent for us to root 
cause this a little more, I have gone back and rebuilt MPI and re-run the "make 
check" tests. The opal_fifo test appears to hang forever. I am not sure if this 
is the cause of our issue but wanted to report that we are seeing this on our 
system.

OpenMPI 3.1.0 Configuration:

./configure 
--prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
--with-lsf=/opt/lsf/10.1 
--with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs

GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for 
POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)

Output:

make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo


Output from Top:

20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] OpenMPI 3.0.1 and Power9

2018-04-10 Thread Hammond, Simon David
Steve,

We have been able to get OpenMPI 2.1 and 3.0 series working on our POWER9 
systems. It is my understanding that the IBM offering has performance 
improvements over the open-source variant. I would add the caveat that we are still 
in fairly early testing of the platform.

S.

—
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM

On Apr 10, 2018, at 11:08 AM, Steve Zehner 
> wrote:

Hello,

Does OpenMPI 3.0.1 run on IBM Power9 AC922?
One of the universities I work with is considering joining the IBM Spectrum 
MPI 10.2 beta but wants to know if there is an open-source alternative.

I have reviewed the presentation from SC17 
https://www.open-mpi.org/papers/sc-2017/

Sincerely,

Steve Zehner
Systems Architect
Office: 847-805-2098
Cell: 309-706-9931
e-mail: scze...@us.ibm.com
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM

2017-11-21 Thread Hammond, Simon David
Hi Howard/OpenMPI Users,

I have had a similar segfault this week using OpenMPI 2.1.1 with GCC 4.9.3, so 
I tried to compile the example code in the email below. I see similar behavior 
with a small benchmark we have in house (but using inc rather than finc).

When I run on a single node (both PEs on the same node) I get the error below. 
But if I run on multiple nodes (say, 2 nodes with one PE per node) the 
code runs fine. The same is true for my benchmark, which uses shmem_longlong_inc. For 
reference, we are using InfiniBand on our cluster and dual-socket Haswell 
processors.
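
In case it is useful, the in-house benchmark boils down to roughly the following
(a hedged sketch assuming the OpenSHMEM 1.2-style API, not the actual benchmark
source):

#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* The counter lives on the symmetric heap so every PE can target it. */
    long long *counter = shmem_malloc(sizeof(long long));
    *counter = 0;
    shmem_barrier_all();

    /* Every PE increments the counter that lives on PE 0. */
    shmem_longlong_inc(counter, 0);
    shmem_barrier_all();

    if (me == 0)
        printf("counter on PE 0: %lld (expected %d)\n", *counter, npes);

    shmem_free(counter);
    shmem_finalize();
    return 0;
}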

Hope that helps,

S.

$ shmemrun -n 2 ./testfinc
--
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: shepard-lsm1
--
[shepard-lsm1:49505] *** Process received signal ***
[shepard-lsm1:49505] Signal: Segmentation fault (11)
[shepard-lsm1:49505] Signal code: Address not mapped (1)
[shepard-lsm1:49505] Failing at address: 0x18
[shepard-lsm1:49505] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ffc4cd9e710]
[shepard-lsm1:49505] [ 1] 
/home/projects/x86-64-haswell/openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_spml_yoda.so(mca_spml_yoda_get+0x86d)[0x7ffc337cf37d]
[shepard-lsm1:49505] [ 2] 
/home/projects/x86-64-haswell/openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so(atomic_basic_lock+0x9a)[0x7ffc32f190aa]
[shepard-lsm1:49505] [ 3] 
/home/projects/x86-64-haswell/openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so(mca_atomic_basic_fadd+0x39)[0x7ffc32f19409]
[shepard-lsm1:49505] [ 4] 
/home/projects/x86-64-haswell/openmpi/2.1.1/gcc/4.9.3/lib/liboshmem.so.20(shmem_int_fadd+0x80)[0x7ffc4d2fc110]
[shepard-lsm1:49505] [ 5] ./testfinc[0x400888]
[shepard-lsm1:49505] [ 6] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffc4ca19d5d]
[shepard-lsm1:49505] [ 7] ./testfinc[0x400739]
[shepard-lsm1:49505] *** End of error message ***
--
shmemrun noticed that process rank 1 with PID 0 on node shepard-lsm1 exited on 
signal 11 (Segmentation fault).
--
[shepard-lsm1:49499] 1 more process has sent help message 
help-mpi-btl-openib.txt / no active ports found
[shepard-lsm1:49499] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA


From: users  on behalf of Howard Pritchard 

Reply-To: Open MPI Users 
Date: Monday, November 20, 2017 at 4:11 PM
To: Open MPI Users 
Subject: [EXTERNAL] Re: [OMPI users] Using shmem_int_fadd() in OpenMPI's SHMEM

HI Ben,

What version of Open MPI are you trying to use?

Also, could you describe something about your system.  If its a cluster
what sort of interconnect is being used.

Howard


2017-11-20 14:13 GMT-07:00 Benjamin Brock 
>:
What's the proper way to use shmem_int_fadd() in OpenMPI's SHMEM?

A minimal example seems to seg fault:

#include <stdio.h>
#include <stdlib.h>

#include <shmem.h>

int main(int argc, char **argv) {
  shmem_init();
  const size_t shared_segment_size = 1024;
  void *shared_segment = shmem_malloc(shared_segment_size);

  int *arr = (int *) shared_segment;
  int *local_arr = (int *) malloc(sizeof(int) * 10);

  if (shmem_my_pe() == 1) {
shmem_int_fadd((int *) shared_segment, 1, 0);
  }
  shmem_barrier_all();

  return 0;
}

Where am I going wrong here?  This sort of thing works in Cray SHMEM.

Ben Bock

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Segmentation Fault when using OpenMPI 1.10.6 and PGI 17.1.0 on POWER8

2017-02-21 Thread Hammond, Simon David (-EXP)
Hi OpenMPI Users,

Has anyone successfully tested OpenMPI 1.10.6 with PGI 17.1.0 on POWER8 with 
the LSF scheduler (--with-lsf=..)?

I am getting this error when the code hits MPI_Finalize. It causes the job to 
abort (i.e. exit the LSF session) when I am running interactively.

Are there any materials we can supply to aid debugging/problem isolation?

[white23:58788] *** Process received signal ***
[white23:58788] Signal: Segmentation fault (11)
[white23:58788] Signal code: Invalid permissions (2)
[white23:58788] Failing at address: 0x108e0810
[white23:58788] [ 0] [0x10050478]
[white23:58788] [ 1] [0x0]
[white23:58788] [ 2] 
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.6/pgi/17.1.0/cuda/none/lib/libopen-rte.so.12(+0x1b6b0)[0x1071b6b0]
[white23:58788] [ 3] 
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.6/pgi/17.1.0/cuda/none/lib/libopen-rte.so.12(orte_finalize+0x70)[0x1071b5b8]
[white23:58788] [ 4] 
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.6/pgi/17.1.0/cuda/none/lib/libmpi.so.12(ompi_mpi_finalize+0x760)[0x10121dc8]
[white23:58788] [ 5] 
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.6/pgi/17.1.0/cuda/none/lib/libmpi.so.12(PMPI_Finalize+0x6c)[0x10153154]
[white23:58788] [ 6] ./IMB-MPI1[0x100028dc]
[white23:58788] [ 7] /lib64/libc.so.6(+0x24700)[0x104b4700]
[white23:58788] [ 8] /lib64/libc.so.6(__libc_start_main+0xc4)[0x104b48f4]
[white23:58788] *** End of error message ***
[white22:73620] *** Process received signal ***
[white22:73620] Signal: Segmentation fault (11)
[white22:73620] Signal code: Invalid permissions (2)
[white22:73620] Failing at address: 0x108e0810


Thanks,

S.

—

Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA

[Sent from Remote Connection, Please excuse typos]




___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] MPI-Checker - Static Analyzer

2015-05-31 Thread Hammond, Simon David (-EXP)
Alex,

Do you have a paper on the tool we could look at?

Thanks

S



--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM
[Sent remotely, please excuse typing errors]

From: users  on behalf of Alexander Droste 

Sent: Saturday, May 30, 2015 5:47:36 AM
To: us...@open-mpi.org
Subject: [EXTERNAL] [OMPI users] MPI-Checker - Static Analyzer

Hi everyone,

I've written a Static Analyzer Checker for MPI code
which is published on GitHub https://github.com/0ax1/MPI-Checker.
I'd be excited to get any kind of feedback.

Best regards,
Alex
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/27000.php


Re: [OMPI users] [EXTERNAL] Re: Errors on POWER8 Ubuntu 14.04u2

2015-03-27 Thread Hammond, Simon David (-EXP)
Thanks guys,

I have tried two configure lines:

(1) ./configure 
--prefix=/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none 
--enable-mpi-thread-multiple CC=/usr/bin/gcc CXX=/usr/bin/g++ 
FC=/usr/bin/gfortran

(2) ./configure 
--prefix=/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none 
--enable-mpi-thread-multiple CC=/usr/bin/gcc CXX=/usr/bin/g++ 
FC=/usr/bin/gfortran --enable-shared --disable-static

The second was just to try and force the generation of shared libraries (I 
notice they are not in 
/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none/lib).

I also attached the config.log from (2) bzip2'd as requested on the help page.

Thanks for all of your help,


S.


--
Simon Hammond
Center for Computing Research (Scalable Computer Architectures)
Sandia National Laboratories, NM
[Sent from remote connection, please excuse typing errors]


From: users <users-boun...@open-mpi.org> on behalf of Jeff Squyres (jsquyres) 
<jsquy...@cisco.com>
Sent: Friday, March 27, 2015 11:15 AM
To: Open MPI User's List
Subject: [EXTERNAL] Re: [OMPI users] Errors on POWER8 Ubuntu 14.04u2

It might be helpful to send all the information listed here:

http://www.open-mpi.org/community/help/


> On Mar 26, 2015, at 10:55 PM, Ralph Castain <rhc.open...@gmail.com> wrote:
>
> Could you please send us your configure line?
>
>> On Mar 26, 2015, at 4:47 PM, Hammond, Simon David (-EXP) 
>> <sdha...@sandia.gov> wrote:
>>
>> Hi everyone,
>>
>> We are trying to compile custom installs of OpenMPI 1.8.4 on our POWER8 
>> Ubuntu system. We can configure and build correctly but when running 
>> ompi_info we see many errors like those listed below. It appears that all of 
>> the libraries in the ./lib are static (.a) files. It appears that we are 
>> unable to get our IB system working as a result.
>>
>> Can you recommend what we should be doing to ensure this works correctly?
>>
>> [node11:104711] mca: base: component_find: unable to open 
>> /home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none/lib/openmpi/mca_compress_bzip:
>>  lt_dlerror() returned NULL! (ignored)
>>
>> Thanks for your help,
>>
>>
>> --
>> Simon Hammond
>> Center for Computing Research (Scalable Computer Architectures)
>> Sandia National Laboratories, NM
>> [Sent from remote connection, please excuse typing errors]
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/03/26547.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/03/26550.php


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/03/26551.php

power8-config.log.bz2
Description: power8-config.log.bz2


[OMPI users] Errors on POWER8 Ubuntu 14.04u2

2015-03-26 Thread Hammond, Simon David (-EXP)
Hi everyone,

We are trying to compile custom installs of OpenMPI 1.8.4 on our POWER8 Ubuntu 
system. We can configure and build correctly but when running ompi_info we see 
many errors like those listed below. It appears that all of the libraries in 
the ./lib are static (.a) files. It appears that we are unable to get our IB 
system working as a result.

Can you recommend what we should be doing to ensure this works correctly? 

[node11:104711] mca: base: component_find: unable to open 
/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none/lib/openmpi/mca_compress_bzip:
 lt_dlerror() returned NULL! (ignored)

Thanks for your help,


--
Simon Hammond
Center for Computing Research (Scalable Computer Architectures)
Sandia National Laboratories, NM
[Sent from remote connection, please excuse typing errors]

[OMPI users] Compiling OpenMPI 1.8.1 for Cray XC30

2014-06-05 Thread Hammond, Simon David (-EXP)
Hi OpenMPI developers/users,

Does anyone have a working configure line for OpenMPI 1.8.1 on a Cray XC30?

When we compile the code, ALPS is located correctly, but when we run the compiled 
binaries using aprun we get n separate single-rank jobs rather than one job of n ranks.

Thank you.

S.

--
Simon Hammond
Scalable Computer Architectures (Org. 01422)
Sandia National Laboratories, NM
[Sent from remote connection, please excuse typing errors]


Re: [OMPI users] [EXTERNAL] Re: Planned support for Intel Phis

2014-02-02 Thread Hammond, Simon David (-EXP)
Will this support native execution? I.e. MIC only, no host involvement?

S



--
Si Hammond
Sandia National Laboratories
Remote Connection


-Original Message-
From: Ralph Castain [r...@open-mpi.org]
Sent: Sunday, February 02, 2014 09:02 AM Mountain Standard Time
To: Open MPI Users
Subject: [EXTERNAL] Re: [OMPI users] Planned support for Intel Phis


Support for the Phi is in the upcoming 1.7.4 release. It doesn't require any 
version of OFED as it uses the Phi's scif interface for communication to ranks 
on the local host. For communication off-host, OMPI will use whatever NICs are 
available


On Feb 2, 2014, at 6:44 AM, Michael Thomadakis  wrote:

> Hello OpenMPI,
>
> I was wondering what is the support that is being implemented for the Intel 
> Phi platforms. That is would we be able to run MPI code in "symmetric" 
> fashion, where some ranks run on the cores of the multicore hostst and some 
> on the cores of the Phis in a multinode cluster environment.
>
> Also is it based on OFED 1.5.4.1 or on which OFED?
>
> Best regards
> Michael
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [EXTERNAL] Re: open-mpi on Mac OS 10.9 (Mavericks)

2013-11-25 Thread Hammond, Simon David (-EXP)
We have occasionally had a problem like this when we set LD_LIBRARY_PATH only. 
On OSX you may need to set DYLD_LIBRARY_PATH instead (set it to the same lib 
directory).

Can you try that and see if it resolves the problem?
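
For example, matching the install prefix shown in your environment below:

export DYLD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib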



Si Hammond
Sandia National Laboratories
Remote Connection


-Original Message-
From: Meredith, Karl 
[karl.mered...@fmglobal.com]
Sent: Monday, November 25, 2013 06:25 AM Mountain Standard Time
To: Open MPI Users
Subject: [EXTERNAL] Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)


I do have these two environment variables set:

LD_LIBRARY_PATH=/Users/meredithk/tools/openmpi/lib
PATH=/Users/meredithk/tools/openmpi/bin

Running mpirun seems to work fine with a simple command, like hostname:

$ )mpirun -n 2 hostname
meredithk-mac.corp.fmglobal.com
meredithk-mac.corp.fmglobal.com

I am trying to run the simple hello_cxx example from the openmpi distribution, 
compiled as such:
mpic++ -g hello_cxx.cc -o hello_cxx

It compiles fine, without warning or error.  However, when I go to run the 
example, it stalls on the MPI::Init() command:
mpirun -np 1 hello_cxx
It never errors out or crashes.  It simply hangs.

I am using the same mpic++ and mpirun version:
$ )which mpirun
/Users/meredithk/tools/openmpi/bin/mpirun

$ )which mpic++
/Users/meredithk/tools/openmpi/bin/mpic++

Not quite sure what else to check.

Karl


On Nov 23, 2013, at 5:29 PM, Ralph Castain  wrote:

> Strange - I run on Mavericks now without problem. Can you run "mpirun -n 1 
> hostname"?
>
> You also might want to check your PATH and LD_LIBRARY_PATH to ensure you have 
> the prefix where you installed OMPI 1.6.5 at the front. Mac distributes a 
> very old version of OMPI with its software and you don't want to pick it up 
> by mistake.
>
>
> On Nov 22, 2013, at 1:45 PM, Meredith, Karl  
> wrote:
>
>> I recently upgraded my 2013 Macbook Pro (Retina display) from 10.8 to 10.9.  
>> I downloaded and installed openmpi-1.6.5 and compiled it with gcc 4.8 (gcc 
>> installed from macports).
>> openmpi compiled and installed without error.
>>
>> However, when I try to run any of the example test cases, the code gets 
>> stuck inside the first MPI::Init() call and never returns.
>>
>> Any thoughts on what might be going wrong?
>>
>> The same install on OS 10.8 works fine and the example test cases run 
>> without error.
>>
>> Karl
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [EXTERNAL] Re: Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Hammond, Simon David (-EXP)
Thanks Rolf, that seems to have made the code compile and link
successfully.

S.

-- 
Simon Hammond
Scalable Computer Architectures (CSRI/146, 01422)
Sandia National Laboratories, NM, USA






On 10/7/13 1:47 PM, "Rolf vandeVaart" <rvandeva...@nvidia.com> wrote:

>That might be a bug.  While I am checking, you could try configuring with
>this additional flag:
>
>--enable-mca-no-build=pml-bfo
>
>Rolf
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Hammond,
>>Simon David (-EXP)
>>Sent: Monday, October 07, 2013 3:30 PM
>>To: us...@open-mpi.org
>>Subject: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11
>>
>>Hey everyone,
>>
>>I am trying to build OpenMPI 1.7.2 with CUDA enabled, OpenMPI will
>>configure successfully but I am seeing a build error relating to the
>>inclusion of
>>the CUDA options (at least I think so). Do you guys know if this is a
>>bug or
>>whether something is wrong with how we are configuring OpenMPI for our
>>cluster.
>>
>>Configure Line: ./configure
>>--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared --enable-
>>static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
>>CC=`which gcc` CXX=`which g++` FC=`which gfortran`
>>
>>Running make V=1 gives:
>>
>>make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
>>/bin/sh ../../../libtool  --tag=CC   --mode=link
>>/home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>DOPAL_CONFIGURE_USER="\"\"" -
>>DOPAL_CONFIGURE_HOST="\"k20-0007\""
>>-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
>>-DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\""
>>-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -
>>DNDEBUG -finline-functions -fno-strict-aliasing -pthread\""
>>-DOMPI_BUILD_CPPFLAGS="\"-I../../..
>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>-I/usr/include/infiniband -I/usr/include/infiniband
>>-I/usr/include/infiniband -
>>I/usr/include/infiniband -I/usr/include/infiniband\"" -
>>DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -pthread\"" -
>>DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
>>-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
>>-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -DOMPI_BUILD_LIBS="\"-
>>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
>>-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
>>-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
>>param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl
>>-lutil -lm
>>libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>DOPAL_CONFIGURE_USER=\"\" -
>>DOPAL_CONFIGURE_HOST=\"k20-0007\"
>>"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
>>-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-0007\"
>>"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
>>"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -fno-strict-
>>aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>-I/usr/include/infiniband -I/usr/include/infiniband
>>-I/usr/include/infiniband -
>>I/usr/include/infiniband -I/usr/include/infiniband\"" "-
>>DOMPI_BUILD_CXXFLAGS=\"-O3 -DNDEBUG -finline-functions -pthread\"" "-
>>DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \""
>>-DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\"
>>"-DOMPI_BUILD_LDFLAGS=\"-export-dynamic  \"" "-DOMPI_BUILD_LIBS=\"-
>>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE=\"\" -
>>DOMPI_CXX_ABSOLUTE=\"none\"
>>-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o
>>.libs/ompi_info ompi_info.o param.o components.o version.o -Wl,--export-
>>dynamic  ../../../ompi/.libs/libmpi.so -L/usr/lib64 -lrdmacm -losmcomp -
>>libverbs /tmp

[OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Hammond, Simon David (-EXP)
Hey everyone,

I am trying to build OpenMPI 1.7.2 with CUDA enabled, OpenMPI will
configure successfully but I am seeing a build error relating to the
inclusion of the CUDA options (at least I think so). Do you guys know if
this is a bug or whether something is wrong with how we are configuring
OpenMPI for our cluster?

Configure Line: ./configure
--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared
--enable-static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
CC=`which gcc` CXX=`which g++` FC=`which gfortran`

Running make V=1 gives:

make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
/bin/sh ../../../libtool  --tag=CC   --mode=link
/home/projects/gcc/4.7.2/bin/gcc -std=gnu99
-DOPAL_CONFIGURE_USER="\"\"" -DOPAL_CONFIGURE_HOST="\"k20-0007\""
-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
-DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\""
-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -pthread\""
-DOMPI_BUILD_CPPFLAGS="\"-I../../..
-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
-I/usr/include/infiniband -I/usr/include/infiniband
-I/usr/include/infiniband -I/usr/include/infiniband
-I/usr/include/infiniband\"" -DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG
-finline-functions -pthread\"" -DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -DOMPI_BUILD_LIBS="\"-lrt
-lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl  -lutil
-lm
libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99
-DOPAL_CONFIGURE_USER=\"\" -DOPAL_CONFIGURE_HOST=\"k20-0007\"
"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-0007\"
"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
-I/usr/include/infiniband -I/usr/include/infiniband
-I/usr/include/infiniband -I/usr/include/infiniband
-I/usr/include/infiniband\"" "-DOMPI_BUILD_CXXFLAGS=\"-O3 -DNDEBUG
-finline-functions -pthread\"" "-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \""
-DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\"
"-DOMPI_BUILD_LDFLAGS=\"-export-dynamic  \"" "-DOMPI_BUILD_LIBS=\"-lrt
-lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\"
-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o
.libs/ompi_info ompi_info.o param.o components.o version.o
-Wl,--export-dynamic  ../../../ompi/.libs/libmpi.so -L/usr/lib64 -lrdmacm
-losmcomp -libverbs /tmp/openmpi-1.7.2/orte/.libs/libopen-rte.so
/tmp/openmpi-1.7.2/opal/.libs/libopen-pal.so -lcuda -lnuma -ldl -lrt -lnsl
-lutil -lm -pthread -Wl,-rpath
-Wl,/home/projects/openmpi/1.7.2/gnu/4.7.2/lib
../../../ompi/.libs/libmpi.so: undefined reference to
`mca_pml_bfo_send_request_start_cuda'
../../../ompi/.libs/libmpi.so: undefined reference to
`mca_pml_bfo_cuda_need_buffers'
collect2: error: ld returned 1 exit status



Thanks.

S.

-- 
Simon Hammond
Scalable Computer Architectures (CSRI/146, 01422)
Sandia National Laboratories, NM, USA






Re: [hwloc-users] [EXTERNAL] Re: Many queries creating slow performance

2013-03-05 Thread Hammond, Simon David (-EXP)
Hey Jeff,

It's not in OpenMPI or MPICH :(. It's a custom library which is not MPI-aware, 
making it difficult to share the topology query. I'll see if we can get a 
standalone piece of code.

From earlier posts it sounds like OpenMPI queries once per physical node, so it 
probably won't have this problem. I'm guessing MPICH does something similar?

S.



Sent with Good (www.good.com)


-Original Message-
From: Jeff Hammond [jhamm...@alcf.anl.gov]
Sent: Tuesday, March 05, 2013 07:17 PM Mountain Standard Time
To: Hardware locality user list
Subject: [EXTERNAL] Re: [hwloc-users] Many queries creating slow performance


Si - Is your code that calls hwloc part of MPICH or OpenMPI or
something that can be made standalone and shared?

Brice - Do you have access to a MIC system for testing?  Write me
offline if you don't and I'll see what I can do to help.

If this affects MPICH i.e. Hydra, then I'm sure Intel will be
committed to helping fix it since Intel MPI is using Hydra as the
launcher on systems like Stampede.

Best,

Jeff

On Tue, Mar 5, 2013 at 3:05 PM, Brice Goglin  wrote:
> Just tested on a 96-core shared-memory machine. Running OpenMPI 1.6 mpiexec
> lstopo, here's the execution time (mpiexec launch time is 0.2-0.4s)
>
> 1 rank :  0.2s
> 8 ranks:  0.3-0.5s depending on binding (packed or scatter)
> 24ranks:  0.8-3.7s depending on binding
> 48ranks:  2.8-8.0s depending on binding
> 96ranks: 14.2s
>
> 96ranks from a single XML file: 0.4s (negligible against mpiexec launch
> time)
>
> Brice
>
>
>
> On 05/03/2013 20:23, Simon Hammond wrote:
>
> Hi HWLOC users,
>
> We are seeing some significant performance problems using HWLOC 1.6.2 on
> Intel's MIC products. In one of our configurations we create 56 MPI ranks,
> each rank then queries the topology of the MIC card before creating threads.
> We are noticing that if we run 56 MPI ranks as opposed to one the calls to
> query the topology in HWLOC are very slow, runtime goes from seconds to
> minutes (and upwards).
>
> We guessed that this might be caused by the kernel serializing access to the
> /proc filesystem but this is just a hunch.
>
> Has anyone had this problem and found an easy way to change the library /
> calls to HWLOC so that the slow down is not experienced? Would you describe
> this as a bug?
>
> Thanks for your help.
>
>
> --
> Simon Hammond
>
> 1-(505)-845-7897 / MS-1319
> Scalable Computer Architectures
> Sandia National Laboratories, NM
>
>
>
>
>
>
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>
>
>
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users



--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhamm...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users