Hello again,

Please ignore the stack trace contained in my previous mail. It fails with 1.10.4 at the same point, so the check for IEEE arithmetic is apparently a red herring!

Best Regards

Christof

----- Message from Christof Köhler <christof.koeh...@bccms.uni-bremen.de> ---------
    Date: Sat, 19 Nov 2016 14:10:55 +0100
    From: Christof Köhler <christof.koeh...@bccms.uni-bremen.de>
Reply to: christof.koeh...@bccms.uni-bremen.de
 Subject: Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
      To: Howard Pritchard <hpprit...@gmail.com>
      Cc: Open MPI Users <users@lists.open-mpi.org>


Hello,

I tried

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr

This does not change anything.


I made an attempt to narrow down what happens. Sorry, this is a bit long. A stack trace is also included below.

Looking at the actual numbers (see the very bottom), I notice that the CHK and QTQ columns (the 9th and 10th columns, the maximum over all eigentests) are similar between the two Open MPI versions. What changes is the "IL, IU, VL or VU altered by PDSYEVR" line, which appears only in the 2.0.1 output, not with 1.10.4. Looking at pdseprsubtst.f, comment line 751, I see that this is (as far as I understand it) a sanity check.

I inserted my own print statements in pdseprsubtst.f (and changed the optimization to "-O0 -g"), i.e.

         IF( IL.NE.OLDIL .OR. IU.NE.OLDIU .OR. VL.NE.OLDVL .OR. VU.NE.
     $       OLDVU ) THEN
            IF( IAM.EQ.0 ) THEN
              WRITE( NOUT, FMT = 9982 )
              WRITE( NOUT, '(F8.3,F8.3,F8.3,F8.3)') VL,VU,OLDVL,OLDVU
              WRITE( NOUT, '(I10,I10,I10,I10)') IL,IU,OLDIL,OLDIU
            END IF
            RESULT = 1
         END IF

The result with 2.0.1 is

   500   2   2   2   8   Y     0.08    -1.00  0.81E-03   3.3     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
     NaN   0.000     NaN   0.000
        -1 132733856        -1 132733856
   500   4   1   4   8   Y     0.18    -1.00  0.84E-03   3.5     FAILED
   500   4   4   1   8   Y     0.17    -1.00  0.78E-03   2.9     PASSED   EVR

The values OLDVL and OLDVU are the saved values of VL and VU on entry in pdseprsubtst (line 253 and 254) _before_ the actual eigensolver pdsyevr is called.

Working upwards in the call tree and additionally inserting
         IF( IAM.EQ.0 ) THEN
            WRITE( NOUT, '(F8.3,F8.3)' ) VL, VU
         END IF
right before each call to PDSEPRSUBTST in pdseprtst.f gives with 2.0.1


   500   2   2   2   8   Y     0.07    -1.00  0.81E-03   3.3     PASSED   EVR
     NaN   0.000
IL, IU, VL or VU altered by PDSYEVR
     NaN   0.000     NaN   0.000
        -1 128725600        -1 128725600
   500   4   1   4   8   Y     0.16    -1.00  0.84E-03   3.5     FAILED
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.343   0.377
  -0.697   0.104
   500   4   4   1   8   Y     0.17    -1.00  0.76E-03   3.1     PASSED   EVR

With 1.10.4

   500   2   2   2   8   Y     0.07    -1.00  0.80E-03   4.4     PASSED   EVR
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.435   0.884
  -0.804   0.699
   500   4   1   4   8   Y     0.08    -1.00  0.91E-03   3.3     PASSED   EVR
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
  -0.437   0.253
  -0.603   0.220
   500   4   4   1   8   Y     0.17    -1.00  0.83E-03   3.7     PASSED   EVR


So something goes wrong early and it is probably not related to numerics.

I then set -ffpe-trap=invalid,zero,overflow in FCFLAGS (and NOOPT). This of course does nothing for the BLACS and C routines, although the stack trace below ends in a C routine (which might be spurious).

login 14:04 ~/src/scalapack/TESTING % mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
Check if overflow is handled in ieee default manner.
If this is the last output you see, you should assume
that overflow caused a floating point exception.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2b921971266f in ???
#0  0x2ade83c4966f in ???
#1  0x4316fd in pdlachkieee_
        at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2  0x40457b in pdseprdriver
#1  0x4316fd in pdlachkieee_
        at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#2  0x40457b in pdseprdriver
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0  0x2b414549566f in ???
#1  0x4316fd in pdlachkieee_
        at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2  0x40457b in pdseprdriver
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0  0x2b3701f4766f in ???
#1  0x4316fd in pdlachkieee_
        at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2  0x40457b in pdseprdriver
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
        at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257

I am not sure why pdlachkieee_ appears twice! Possibly the per-rank backtraces are simply interleaved.

Thank you for your help!

Best Regards

Christof



Original output without my inserted WRITE statements:

On a single node (node009) with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
   500   1   1   1   8   Y     0.23    -1.00  0.18E-02   26.     FAILED
   500   1   2   1   8   Y     0.09    -1.00  0.74E-03   3.2     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   1   1   2   8   Y     0.16    -1.00  0.83E-03   2.3     FAILED
   500   1   2   2   8   Y     0.07    -1.00  0.77E-03   2.2     PASSED   EVR
   500   2   2   2   8   Y     0.04    -1.00  0.81E-03   3.3     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   4   1   4   8   Y     0.05    -1.00  0.84E-03   3.5     FAILED
   500   4   4   1   8   Y     0.06    -1.00  0.74E-03   3.5     PASSED   EVR
'End of tests'
Finished    136 tests, with the following results:
   32 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
  104 tests completed and failed.

On node009 and node010 with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
   500   1   1   1   8   Y     0.23    -1.00  0.18E-02   26.     FAILED
   500   1   2   1   8   Y     0.10    -1.00  0.74E-03   3.2     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   1   1   2   8   Y     0.16    -1.00  0.83E-03   2.3     FAILED
   500   1   2   2   8   Y     0.09    -1.00  0.77E-03   2.2     PASSED   EVR
   500   2   2   2   8   Y     0.07    -1.00  0.81E-03   3.3     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   4   1   4   8   Y     0.17    -1.00  0.84E-03   3.5     FAILED
   500   4   4   1   8   Y     0.15    -1.00  0.77E-03   3.6     PASSED   EVR
'End of tests'
Finished    136 tests, with the following results:
   32 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
  104 tests completed and failed.

On node009 and node010 with 1.10.4
'TEST 10 - test one large matrix'
   500   1   1   1   8   Y     0.15    -1.00  0.18E-02   26.     PASSED   EVR
   500   1   2   1   8   Y     0.10    -1.00  0.81E-03   2.7     PASSED   EVR
   500   1   1   2   8   Y     0.09    -1.00  0.71E-03   3.5     PASSED   EVR
   500   1   2   2   8   Y     0.09    -1.00  0.82E-03   2.6     PASSED   EVR
   500   2   2   2   8   Y     0.06    -1.00  0.80E-03   4.4     PASSED   EVR
   500   4   1   4   8   Y     0.07    -1.00  0.91E-03   3.3     PASSED   EVR
   500   4   4   1   8   Y     0.16    -1.00  0.83E-03   3.7     PASSED   EVR
'End of tests'
Finished    136 tests, with the following results:
  136 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
    0 tests completed and failed.








----- Message from Howard Pritchard <hpprit...@gmail.com> ---------
   Date: Fri, 18 Nov 2016 11:25:06 -0700
   From: Howard Pritchard <hpprit...@gmail.com>
Subject: Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
     To: christof.koeh...@bccms.uni-bremen.de, Open MPI Users <users@lists.open-mpi.org>


Hi Christof,

Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
Could you try to run the tests using the 'ob1' PML in order to
bypass PSM2?

mpirun --mca pml ob1 (all the rest of the args)

and see if you still observe the failures?

Howard


2016-11-18 9:32 GMT-07:00 Christof Köhler <christof.koeh...@bccms.uni-bremen.de>:

Hello everybody,

I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
failures are observed. Also, with mvapich2 2.2 no failures are observed.
The other testers appear to be working with all MPIs mentioned (have to
triple check again). I somehow overlooked the failures below at first.

The system is an Intel Omni-Path system (newest Intel driver release 10.2), i.e. we are using the PSM2 MTL, I believe.

I built the OpenMPIs with gcc 6.2 and the following identical options:
./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
--with-psm2 --with-tm --with-hwloc=internal --enable-static
--enable-orterun-prefix-by-default

The ScaLapack build is also with gcc 6.2 and openblas 0.2.19, using "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper compiler changes.

With OpenMPI 1.10.4 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
136 tests completed and passed residual checks.
   0 tests completed without checking.
   0 tests skipped for lack of memory.
   0 tests completed and failed.

With OpenMPI 1.10.4 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
 136 tests completed and passed residual checks.
   0 tests completed without checking.
   0 tests skipped for lack of memory.
   0 tests completed and failed.

With OpenMPI 2.0.1 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
32 tests completed and passed residual checks.
   0 tests completed without checking.
   0 tests skipped for lack of memory.
 104 tests completed and failed.

With OpenMPI 2.0.1 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
  32 tests completed and passed residual checks.
   0 tests completed without checking.
   0 tests skipped for lack of memory.
 104 tests completed and failed.

A typical failure looks like this in the output

IL, IU, VL or VU altered by PDSYEVR
  500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
  500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
  500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
  500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED   EVR
  500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
  500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
  500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED   EVR


The variable OMP_NUM_THREADS is set to 1 to stop openblas from threading. We see similar problems with the Intel 2016 compilers, but I believe gcc is a good baseline.

Any ideas? For us this is a real problem, in that we do not know if this indicates a network (transport) issue in the Intel software stack (libpsm2, hfi1 kernel module) which might affect our production codes, or if this is
an OpenMPI issue. We have some other problems I might ask about later on
this list, but nothing which yields such a nice reproducer and especially
these other problems might well be application related.

Best Regards

Christof

--
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


----- End of message from Howard Pritchard <hpprit...@gmail.com> -----


----- End of message from Christof Köhler <christof.koeh...@bccms.uni-bremen.de> -----



--
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
