Hi,

The behaviour is reproducible on our systems:
* Linux Cluster (Intel Xeon E5-2660 v3, Scientific Linux release 6.8 (Carbon), 
Kernel 2.6.32, nightly 2.x branch). The error is independent of the BTL 
combination used on the cluster (tested 'sm,self,vader', 'sm,self,openib', 
'sm,self', 'vader,self', 'openib,self').
* Cray XC40 (GNU 6.3, Open MPI 2.0.1, Kernel 3.0.101)
The error always manifests within 50 iterations of the loop in the command line 
below.

The behaviour is not reproducible with either Open MPI 2.0.1 or 2.1.0rc2 on 
my notebook (Arch Linux, gcc 6.3.1, Kernel 4.9.11).

Best
Christoph


----- Original Message -----
From: "Howard Pritchard" <hpprit...@gmail.com>
To: "Open MPI Users" <users@lists.open-mpi.org>
Sent: Friday, March 3, 2017 9:02:22 PM
Subject: Re: [OMPI users] Shared Windows and MPI_Accumulate

Hello Joseph, 

I'm still unable to reproduce this issue on my SLES12 x86_64 node. 

Are you building with CFLAGS=-O3? 

If so, could you build without CFLAGS set and see if you still see the failure? 

Howard 


2017-03-02 2:34 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: 

Hi Howard, 

Thanks for trying to reproduce this. It seems that on master the issue occurs 
less frequently but is still there. I used the following bash one-liner on my 
laptop and on our Linux Cluster (single node, 4 processes): 

``` 
$ for i in $(seq 1 100) ; do echo $i && mpirun -n 4 ./mpi_shared_accumulate | 
grep \! && break ; done 
1 
2 
[0] baseptr[0]: 1004 (expected 1010) [!!!] 
[0] baseptr[1]: 1005 (expected 1011) [!!!] 
[0] baseptr[2]: 1006 (expected 1012) [!!!] 
[0] baseptr[3]: 1007 (expected 1013) [!!!] 
[0] baseptr[4]: 1008 (expected 1014) [!!!] 
``` 

Sometimes the error occurs after one or two iterations (like above), sometimes 
only at iteration 20 or later. However, I can reproduce it within the 100 runs 
every time I run the statement above. I am attaching the config.log and output 
of ompi_info of master on my laptop. Please let me know if I can help with 
anything else. 


Thanks, 
Joseph 

On 03/01/2017 11:24 PM, Howard Pritchard wrote: 

Hi Joseph, 

I built this test with cray-mpich (Cray MPI) and it passed. I also tried with 
Open MPI master and the test passed, and I can't seem to reproduce the problem 
with 2.0.2 on my system either. 

Could you post the output of config.log? 

Also, how intermittent is the problem? 


Thanks, 

Howard 




2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: 


Hi all, 

We are seeing issues in one of our applications, in which processes in a shared 
communicator allocate a shared MPI window and execute MPI_Accumulate 
simultaneously on it to iteratively update each process' values. The test boils 
down to the sample code attached. Sample output is as follows: 

``` 
$ mpirun -n 4 ./mpi_shared_accumulate 
[1] baseptr[0]: 1010 (expected 1010) 
[1] baseptr[1]: 1011 (expected 1011) 
[1] baseptr[2]: 1012 (expected 1012) 
[1] baseptr[3]: 1013 (expected 1013) 
[1] baseptr[4]: 1014 (expected 1014) 
[2] baseptr[0]: 1005 (expected 1010) [!!!] 
[2] baseptr[1]: 1006 (expected 1011) [!!!] 
[2] baseptr[2]: 1007 (expected 1012) [!!!] 
[2] baseptr[3]: 1008 (expected 1013) [!!!] 
[2] baseptr[4]: 1009 (expected 1014) [!!!] 
[3] baseptr[0]: 1010 (expected 1010) 
[0] baseptr[0]: 1010 (expected 1010) 
[0] baseptr[1]: 1011 (expected 1011) 
[0] baseptr[2]: 1012 (expected 1012) 
[0] baseptr[3]: 1013 (expected 1013) 
[0] baseptr[4]: 1014 (expected 1014) 
[3] baseptr[1]: 1011 (expected 1011) 
[3] baseptr[2]: 1012 (expected 1012) 
[3] baseptr[3]: 1013 (expected 1013) 
[3] baseptr[4]: 1014 (expected 1014) 
``` 

Each process should hold the same values, but sometimes (not on all executions) 
random processes diverge (marked with [!!!]). 

I made the following observations: 

1) The issue occurs with both Open MPI 1.10.6 and 2.0.2 but not with MPICH 3.2. 
2) The issue occurs only if the window is allocated through 
MPI_Win_allocate_shared; using MPI_Win_allocate works fine (see the sketch 
after this list). 
3) The code assumes that MPI_Accumulate atomically updates individual elements 
(please correct me if that is not covered by the MPI standard). 
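
Since the full test is only attached, here is a minimal sketch of the access 
pattern described above, for reference. It is not the attached code: the 
element count, initial values, iteration count, and the use of MPI_Win_fence 
for synchronization are assumptions on my part. 

``` 
/* Minimal sketch (assumptions noted above, not the attached test):
 * every rank accumulates +1 into each element of every other rank's
 * window segment and then checks that all contributions arrived.
 * All ranks are assumed to run on a single node, as required for
 * MPI_Win_allocate_shared on MPI_COMM_WORLD. */
#include <mpi.h>
#include <stdio.h>

#define NELEM 5   /* assumed element count */
#define NITER 10  /* assumed iteration count */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Shared-memory window; per observation 2, switching to
     * MPI_Win_allocate here reportedly makes the problem disappear. */
    MPI_Win win;
    int *baseptr;
    MPI_Win_allocate_shared(NELEM * sizeof(int), sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &baseptr, &win);

    for (int i = 0; i < NELEM; i++)
        baseptr[i] = 1000 + i;   /* assumed initial values */

    MPI_Win_fence(0, win);

    /* Each rank adds 1 to every element of every other rank's segment.
     * Per observation 3, this relies on per-element atomicity of
     * MPI_Accumulate with a predefined datatype and MPI_SUM. */
    const int one[NELEM] = {1, 1, 1, 1, 1};
    for (int iter = 0; iter < NITER; iter++) {
        for (int target = 0; target < size; target++) {
            if (target == rank) continue;
            MPI_Accumulate(one, NELEM, MPI_INT, target, 0, NELEM, MPI_INT,
                           MPI_SUM, win);
        }
        MPI_Win_fence(0, win);
    }

    /* Every element should have received (size-1)*NITER increments. */
    for (int i = 0; i < NELEM; i++) {
        int expected = 1000 + i + (size - 1) * NITER;
        printf("[%d] baseptr[%d]: %d (expected %d)%s\n", rank, i, baseptr[i],
               expected, baseptr[i] == expected ? "" : " [!!!]");
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
``` 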

Both Open MPI and the example code were compiled using GCC 5.4.1 and run on a 
Linux system (single node). Open MPI was configured with 
--enable-mpi-thread-multiple and --with-threads, but the application is not 
multi-threaded. Please let me know if you need any other information. 

Cheers 
Joseph 

-- 
Dipl.-Inf. Joseph Schuchart 
High Performance Computing Center Stuttgart (HLRS) 
Nobelstr. 19 
D-70569 Stuttgart 

Tel.: +49(0)711-68565890 
Fax: +49(0)711-6856832 
E-Mail: schuch...@hlrs.de 



-- 
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890 
Fax: +49(0)711-6856832 
E-Mail: schuch...@hlrs.de 
