Hi Christof,

Thanks for trying out 2.0.1. Sorry that you're hitting problems. Could you try to run the tests using the 'ob1' PML in order to bypass PSM2? I.e.

mpirun --mca pml ob1 (all the rest of the args)
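For example, for the single-node case from your mail below, that would be something like:

mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr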
Could you see if you still observe the failures with that?

Howard

2016-11-18 9:32 GMT-07:00 Christof Köhler <christof.koeh...@bccms.uni-bremen.de>:

> Hello everybody,
>
> I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
> when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> failures are observed. Also, with mvapich2 2.2 no failures are observed.
> The other testers appear to be working with all MPIs mentioned (have to
> triple check again). I somehow overlooked the failures below at first.
>
> The system is an Intel OmniPath system (newest Intel driver release 10.2),
> i.e. we are using the PSM2 MTL, I believe.
>
> I built the OpenMPIs with gcc 6.2 and the following identical options:
>
> ./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1" --with-psm2 --with-tm --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
>
> The ScaLapack build also uses gcc 6.2 and openblas 0.2.19, with "-O1 -g"
> as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper compiler
> changes.
>
> With OpenMPI 1.10.4 I see on a single node
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
>
> 136 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 0 tests completed and failed.
>
> With OpenMPI 1.10.4 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
>
> 136 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 0 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on a single node
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
>
> 32 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 104 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
>
> 32 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 104 tests completed and failed.
>
> A typical failure looks like this in the output:
>
> IL, IU, VL or VU altered by PDSYEVR
> 500 1 1 1 8 Y 0.26 -1.00 0.19E-02 15. FAILED
> 500 1 2 1 8 Y 0.29 -1.00 0.79E-03 3.9 PASSED
> EVR
> IL, IU, VL or VU altered by PDSYEVR
> 500 1 1 2 8 Y 0.52 -1.00 0.82E-03 2.5 FAILED
> 500 1 2 2 8 Y 0.41 -1.00 0.79E-03 2.3 PASSED
> EVR
> 500 2 2 2 8 Y 0.18 -1.00 0.78E-03 3.0 PASSED
> EVR
> IL, IU, VL or VU altered by PDSYEVR
> 500 4 1 4 8 Y 0.09 -1.00 0.95E-03 4.1 FAILED
> 500 4 4 1 8 Y 0.11 -1.00 0.91E-03 2.8 PASSED
> EVR
>
> The variable OMP_NUM_THREADS is set to 1 to stop openblas from threading.
> We see similar problems with the Intel 2016 compilers, but I believe gcc is
> a good baseline.
>
> Any ideas? For us this is a real problem, because we do not know whether
> this indicates a network (transport) issue in the Intel software stack
> (libpsm2, hfi1 kernel module) which might affect our production codes, or
> whether this is an OpenMPI issue. We have some other problems I might ask
> about later on this list, but nothing that yields such a nice reproducer,
> and those other problems might well be application related.
>
> Best Regards
>
> Christof
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax:   +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/