Hi Christof,

Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
Could you try to run the tests using the 'ob1' PML in order to
bypass PSM2?

mpirun --mca pml ob1 (all the rest of the args)

and see if you still observe the failures?
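
In case it helps, a sketch of what that full invocation might look like, assembled from the flags in your report below (hostnames, exported variables, and the test binary are taken from your two-node run, so adjust as needed):

```shell
# Sketch only: flags and hostnames copied from the quoted report.
# "--mca pml ob1" forces the ob1 point-to-point layer, so the cm PML
# (and with it the PSM2 MTL) is never selected.
mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr
```

Adding "--mca pml_base_verbose 10" should also print which PML actually gets selected, if you want to confirm that PSM2 is being bypassed.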

Howard


2016-11-18 9:32 GMT-07:00 Christof Köhler <
christof.koeh...@bccms.uni-bremen.de>:

> Hello everybody,
>
> I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self-tests
> when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> failures are observed. Also, with mvapich2 2.2 no failures are observed.
> The other testers appear to be working with all MPIs mentioned (I have to
> triple-check again). I somehow overlooked the failures below at first.
>
> The system is an Intel OmniPath system (newest Intel driver release 10.2),
> i.e. we are using the PSM2 MTL, I believe.
>
> I built the OpenMPIs with gcc 6.2 and the following identical options:
> ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> --with-psm2 --with-tm --with-hwloc=internal --enable-static
> --enable-orterun-prefix-by-default
>
> The ScaLAPACK build also uses gcc 6.2 and OpenBLAS 0.2.19, with "-O1
> -g" as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper
> compiler changes.
>
> With OpenMPI 1.10.4 I see on a single node
>
>  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> ./xdsyevr
>   136 tests completed and passed residual checks.
>     0 tests completed without checking.
>     0 tests skipped for lack of memory.
>     0 tests completed and failed.
>
> With OpenMPI 1.10.4 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> ./xdsyevr
>   136 tests completed and passed residual checks.
>     0 tests completed without checking.
>     0 tests skipped for lack of memory.
>     0 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on a single node
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> ./xdsyevr
>    32 tests completed and passed residual checks.
>     0 tests completed without checking.
>     0 tests skipped for lack of memory.
>   104 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> ./xdsyevr
>    32 tests completed and passed residual checks.
>     0 tests completed without checking.
>     0 tests skipped for lack of memory.
>   104 tests completed and failed.
>
> A typical failure looks like this in the output
>
> IL, IU, VL or VU altered by PDSYEVR
>    500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
>    500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED
>  EVR
> IL, IU, VL or VU altered by PDSYEVR
>    500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
>    500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED
>  EVR
>    500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED
>  EVR
> IL, IU, VL or VU altered by PDSYEVR
>    500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
>    500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED
>  EVR
>
>
> The variable OMP_NUM_THREADS=1 is set to stop OpenBLAS from threading.
> We see similar problems with the Intel 2016 compilers, but I believe gcc
> is a good baseline.
>
> Any ideas? For us this is a real problem: we do not know whether this
> indicates a network (transport) issue in the Intel software stack (libpsm2,
> hfi1 kernel module), which might affect our production codes, or an
> OpenMPI issue. We have some other problems I might ask about later on
> this list, but nothing that yields such a nice reproducer, and those
> other problems might well be application-related.
>
> Best Regards
>
> Christof
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users