Re: [OMPI users] Shared Windows and MPI_Accumulate
Well, that is embarrassing! Thank you so much for figuring this out and providing a detailed answer (also thanks to everyone else who tried to reproduce it). I guess I assumed some synchronization in lock_all even though I know that it is not collective. With an additional barrier between initialization and accumulate in our original application, things work smoothly.

Best
Joseph

On 03/09/2017 03:10 PM, Steffen Christgau wrote:
> [...]

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
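Joseph's fix, spelled out: below is a minimal, self-contained sketch of the corrected pattern, with a barrier between the buffer initialization and the accumulate phase. It is a reconstruction rather than the attached test code; the element count, initial values, and per-element increments are assumptions, only the synchronization structure follows the thread.

```c
#include <mpi.h>
#include <stdio.h>

#define NELEM 5

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* communicator of processes that can share memory */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int rank, size;
    MPI_Comm_rank(shmcomm, &rank);
    MPI_Comm_size(shmcomm, &size);

    int *baseptr;
    MPI_Win win;
    MPI_Win_allocate_shared(NELEM * sizeof(int), sizeof(int),
                            MPI_INFO_NULL, shmcomm, &baseptr, &win);

    /* initialize the locally exposed part of the window */
    for (int i = 0; i < NELEM; i++)
        baseptr[i] = 1000 + i;

    /* The fix: ensure every process has finished initializing before
     * anyone starts accumulating. lock_all is not collective and gives
     * no such guarantee; without this barrier a fast process can
     * accumulate into a segment whose owner initializes it later,
     * overwriting the accumulated values. */
    MPI_Barrier(shmcomm);

    /* passive-target epoch: every process adds 1 to every element of
     * every process (including itself) */
    MPI_Win_lock_all(0, win);
    int one = 1;
    for (int target = 0; target < size; target++)
        for (int i = 0; i < NELEM; i++)
            MPI_Accumulate(&one, 1, MPI_INT, target, i, 1, MPI_INT,
                           MPI_SUM, win);
    MPI_Win_unlock_all(win);  /* completes all outstanding accumulates */

    /* wait until *all* origins have completed their epochs */
    MPI_Barrier(shmcomm);

    for (int i = 0; i < NELEM; i++)
        printf("[%d] baseptr[%d]: %d (expected %d)\n",
               rank, i, baseptr[i], 1000 + i + size);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```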
Re: [OMPI users] Shared Windows and MPI_Accumulate
On 03/09/2017 03:10 PM, Steffen Christgau wrote:
> Since you are using the unified model, you can omit the proposed
> exclusive lock (see above) as well.

To be fair, you have to be cautious when doing that, even in the unified model. See Example 11.7 in the MPI-3.1 standard. In that context, you might also consider Example 11.9.

Regards, Steffen
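For readers without the standard at hand: Examples 11.7 and 11.9 deal with the ordering of plain loads/stores and RMA operations on the same window memory. The fragment below illustrates the kind of care Steffen means, assuming `win`, `baseptr`, and `shmcomm` set up as in the sketch earlier in the thread; it is my illustration of the idea, not the standard's verbatim example.

```c
MPI_Win_lock_all(0, win);

baseptr[0] = 42;       /* plain local store into the exposed window memory */
MPI_Win_sync(win);     /* memory barrier: separates the store from
                          subsequent RMA accesses, even in the unified model */
MPI_Barrier(shmcomm);  /* order the store with respect to other processes;
                          only after this may they touch element 0 */

/* ... remote MPI_Accumulate/MPI_Get on element 0 are now well-defined ... */

MPI_Win_unlock_all(win);
```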
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hi Joseph,

in your code you are updating the local buffer, which is also exposed via the window, right after the lock_all call. These stores (`baseptr[i] = 1000 + loffs++`; let's call them the buffer initialization) may overwrite the outcome of other concurrent operations, i.e. the accumulate calls in your case. Another process that has already advanced to the accumulate loop may change data in the local window while your local process has not yet completed the initialization. Thus, in case of process skew, the initialization overwrites the outcome of the accumulates and you lose their results.

I provoked process skew by adding an `if (comm_rank == 0) { sleep(1); }` before the initialization loop, which enables me to reproduce the wrong results using GCC 6.3 and Open MPI 2.0.2 and executing the program with two MPI processes.

The lock_all call after the buffer initialization gives you no collective synchronization in the window's communicator (as hinted at on p. 446 of the MPI-3.1 standard). That is, other processes may already have performed their accumulate phase while the local one is still (or not yet) in the initialization and overwrites the data (see above). You might consider an EXCLUSIVE lock around your initialization, but this won't solve the issue, because any other process may do its accumulate phase after the window creation but before you enter the buffer initialization loop.

As far as I understand your MWE code, the initialization must complete before the accumulate loop starts in order to get correct results. I suspect a missing MPI_Barrier before the accumulate loop. Since you are using the unified model, you can omit the proposed exclusive lock (see above) as well.

Hope this helps.

Regards, Steffen

On 03/01/2017 04:03 PM, Joseph Schuchart wrote:
> Hi all,
>
> We are seeing issues in one of our applications, in which processes in a
> shared communicator allocate a shared MPI window and execute
> MPI_Accumulate simultaneously on it to iteratively update each process'
> values. The test boils down to the attached sample code. Sample output
> is as follows:
>
> ```
> $ mpirun -n 4 ./mpi_shared_accumulate
> [1] baseptr[0]: 1010 (expected 1010)
> [1] baseptr[1]: 1011 (expected 1011)
> [1] baseptr[2]: 1012 (expected 1012)
> [1] baseptr[3]: 1013 (expected 1013)
> [1] baseptr[4]: 1014 (expected 1014)
> [2] baseptr[0]: 1005 (expected 1010) [!!!]
> [2] baseptr[1]: 1006 (expected 1011) [!!!]
> [2] baseptr[2]: 1007 (expected 1012) [!!!]
> [2] baseptr[3]: 1008 (expected 1013) [!!!]
> [2] baseptr[4]: 1009 (expected 1014) [!!!]
> [3] baseptr[0]: 1010 (expected 1010)
> [0] baseptr[0]: 1010 (expected 1010)
> [0] baseptr[1]: 1011 (expected 1011)
> [0] baseptr[2]: 1012 (expected 1012)
> [0] baseptr[3]: 1013 (expected 1013)
> [0] baseptr[4]: 1014 (expected 1014)
> [3] baseptr[1]: 1011 (expected 1011)
> [3] baseptr[2]: 1012 (expected 1012)
> [3] baseptr[3]: 1013 (expected 1013)
> [3] baseptr[4]: 1014 (expected 1014)
> ```
>
> Each process should hold the same values, but sometimes (not on all
> executions) random processes diverge (marked with [!!!]).
>
> I made the following observations:
>
> 1) The issue occurs with both Open MPI 1.10.6 and 2.0.2, but not with
> MPICH 3.2.
> 2) The issue occurs only if the window is allocated through
> MPI_Win_allocate_shared; using MPI_Win_allocate works fine.
> 3) The code assumes that MPI_Accumulate atomically updates individual
> elements (please correct me if that is not covered by the MPI standard).
>
> Both Open MPI and the example code were compiled using GCC 5.4.1 and run
> on a Linux system (single node). Open MPI was configured with
> --enable-mpi-thread-multiple and --with-threads, but the application is
> not multi-threaded. Please let me know if you need any other information.
>
> Cheers
> Joseph
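To make the race concrete, here is a sketch of the provocation Steffen describes. Only the conditional sleep is his; `comm_rank`, `loffs`, and `baseptr` come from the thread, while the loop bound `NELEM` is an assumed name.

```c
#include <unistd.h>  /* sleep() */

/* ... window memory obtained via MPI_Win_allocate_shared as in the MWE ... */

/* Artificially delay rank 0 so that the other ranks reach their
 * accumulate loop first. Their accumulates then land in rank 0's
 * window segment before rank 0 has run its initialization stores,
 * and the late stores silently overwrite the accumulated values:
 * exactly the corruption seen in the sample output. */
if (comm_rank == 0) {
    sleep(1);
}

for (int i = 0; i < NELEM; i++) {
    baseptr[i] = 1000 + loffs++;  /* buffer initialization */
}
```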
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hi,

The behaviour is reproducible on our systems:

* Linux Cluster (Intel Xeon E5-2660 v3, Scientific Linux release 6.8 (Carbon), Kernel 2.6.32, nightly 2.x branch). The error is independent of the btl combination used on the cluster (tested 'sm,self,vader', 'sm,self,openib', 'sm,self', 'vader,self', 'openib,self').
* Cray XC40 (using gnu 6.3 and Open MPI 2.0.1, Kernel 3.0.101).

The error always manifests within 50 iterations of the reproduction command line quoted below. The behaviour is not reproducible with either Open MPI 2.0.1 or 2.1.0rc2 on my notebook (Arch Linux, gcc 6.3.1, Kernel 4.9.11).

Best
Christoph

----- Original Message -----
From: "Howard Pritchard" <hpprit...@gmail.com>
To: "Open MPI Users" <users@lists.open-mpi.org>
Sent: Friday, March 3, 2017 9:02:22 PM
Subject: Re: [OMPI users] Shared Windows and MPI_Accumulate

[...]
2017-03-02 2:34 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>:
> $ for i in $(seq 1 100) ; do echo $i && mpirun -n 4 ./mpi_shared_accumulate | grep \! && break ; done
> [...]
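For completeness, btl combinations like the ones Christoph lists are chosen via mpirun's `--mca btl` option. A sketch of such a test run, combining one of his btl selections with the reproduction loop quoted above (the binary name follows the thread, the iteration bound is his observed 50):

```sh
# Run the reproducer repeatedly with an explicit btl selection; stop at
# the first iteration whose output contains a corrupted ([!!!]) element.
for i in $(seq 1 50); do
    echo "iteration $i"
    mpirun --mca btl sm,self,vader -n 4 ./mpi_shared_accumulate | grep '!!!' && break
done
```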
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hello Joseph,

I'm still unable to reproduce this issue on my SLES12 x86_64 node.

Are you building with CFLAGS=-O3? If so, could you build without CFLAGS set and see if you still see the failure?

Howard

2017-03-02 2:34 GMT-07:00 Joseph Schuchart:
> Hi Howard,
>
> Thanks for trying to reproduce this. It seems that on master the issue
> occurs less frequently but is still there. I used the following bash
> one-liner on my laptop and on our Linux Cluster (single node, 4 processes):
>
> ```
> $ for i in $(seq 1 100) ; do echo $i && mpirun -n 4 ./mpi_shared_accumulate | grep \! && break ; done
> 1
> 2
> [0] baseptr[0]: 1004 (expected 1010) [!!!]
> [0] baseptr[1]: 1005 (expected 1011) [!!!]
> [0] baseptr[2]: 1006 (expected 1012) [!!!]
> [0] baseptr[3]: 1007 (expected 1013) [!!!]
> [0] baseptr[4]: 1008 (expected 1014) [!!!]
> ```
>
> Sometimes the error occurs after one or two iterations (like above),
> sometimes only at iteration 20 or later. However, I can reproduce it
> within the 100 runs every time I run the statement above. I am attaching
> the config.log and output of ompi_info of master on my laptop. Please let
> me know if I can help with anything else.
>
> Thanks,
> Joseph
>
> On 03/01/2017 11:24 PM, Howard Pritchard wrote:
> > [...]
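Howard's suggestion, spelled out as commands. This is a sketch: his wording leaves open whether CFLAGS applies to the Open MPI build or to the test build, so both readings are shown, and the test's source file name is inferred from the binary name used above.

```sh
# Reading 1: reconfigure Open MPI without forcing CFLAGS (the two
# options below are the ones Joseph reports using), then rebuild:
./configure --enable-mpi-thread-multiple --with-threads
make && make install

# Reading 2: rebuild the test program itself without -O3:
mpicc -O0 -g -o mpi_shared_accumulate mpi_shared_accumulate.c
mpirun -n 4 ./mpi_shared_accumulate
```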
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hi Joseph,

I built this test with craypich (Cray MPI) and it passed. I also tried with Open MPI master and the test passed. I also tried with 2.0.2 and can't seem to reproduce on my system.

Could you post the output of config.log?

Also, how intermittent is the problem?

Thanks,
Howard

2017-03-01 8:03 GMT-07:00 Joseph Schuchart:
> Hi all,
>
> We are seeing issues in one of our applications, in which processes in a
> shared communicator allocate a shared MPI window and execute
> MPI_Accumulate simultaneously on it to iteratively update each process'
> values. The test boils down to the attached sample code.
> [...]