Hi Paolo,

Thanks very much for this. Just to clarify:
Version 5.2.0, no band parallelization - Works
Version 5.2.0, with band parallelization - Works
Version 5.3.0, new call to mp_start_diag, no band parallelization - Works
Version 5.3.0, old call to mp_start_diag, no band parallelization - Works
Version 5.3.0, new call to mp_start_diag, with band parallelization - fails with "Duplicate ranks in rank array" errors
Version 5.3.0, old call to mp_start_diag, with band parallelization - fails with "problems computing cholesky" errors

All of these tests were performed on a Cray XC. I have also appended two short sketches below the quoted thread: one showing the mp_start_diag switch in Modules/mp_global.f90, and one standalone toy illustrating the "Duplicate ranks" failure at the MPI level.

Best,
Taylor

On Tue, Jan 26, 2016 at 3:23 AM, Paolo Giannozzi <[email protected]> wrote:

> Recent changes to the way band parallelization is performed seem to be
> incompatible with Scalapack. The problem is related to the obscure hacks
> needed to convince Scalapack to work in a subgroup of processors. If you
> revert to the previous way of setting linear-algebra parallelization,
> things should work (or not work) as before, so the latter problem you
> mention may have other origins. You should verify whether you manage to run
> - with the new version, old call to mp_start_diag, no band parallelization
> - with an old version, with or without band parallelization
>
> BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading
> to small but non-negligible discrepancies with the results of other codes.
>
> Paolo
>
> On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <[email protected]> wrote:
>
>> Dear All,
>>
>> I have found that calculations involving band group parallelism that
>> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
>> for an example input file). In particular, when I run a PBE0 calculation
>> with either nbgrp or ndiag set to 1, everything runs correctly; however,
>> when I run a calculation with both nbgrp and ndiag set greater than 1, the
>> calculation immediately fails with the following error messages:
>>
>> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36, ranks=0x53a3c80, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index 12, has value 0 which is also the value at index 0
>> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36, ranks=0x538fdf0, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index 12, has value 0 which is also the value at index 0
>> etc...
>>
>> The error is apparently related to a change in Modules/mp_global.f90
>> on line 80. Here, the line previously read:
>>
>> CALL mp_start_diag ( ndiag_, intra_BGRP_comm )
>>
>> In QE 5.3.0, this has been changed to:
>>
>> CALL mp_start_diag ( ndiag_, intra_POOL_comm )
>>
>> The call using intra_BGRP_comm still exists in version 5.3.0 of the
>> code, but is commented out, and the surrounding comments indicate that it
>> should be possible to switch back to the old parallelization by
>> commenting/uncommenting as desired. When I do this, I find that instead of
>> the error messages described above, I get the following error messages:
>>
>> Error in routine cdiaghg(193):
>> problems computing cholesky
>>
>> Am I missing something, or are these errors the result of a bug?
>>
>> Best Regards,
>>
>> Dr. Taylor Barnes
>> Lawrence Berkeley National Laboratory
>>
>> =================
>> Run Command:
>> =================
>>
>> srun -n 96 pw.x -nbgrp 4 -in input > input.out
>>
>> =================
>> Input File:
>> =================
>>
>> &control
>> prefix = 'water'
>> calculation = 'scf'
>> restart_mode = 'from_scratch'
>> wf_collect = .true.
>> disk_io = 'none'
>> tstress = .false.
>> tprnfor = .false.
>> outdir = './'
>> wfcdir = './'
>> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
>> /
>> &system
>> ibrav = 1
>> celldm(1) = 15.249332837
>> nat = 48
>> ntyp = 2
>> ecutwfc = 130
>> input_dft = 'pbe0'
>> /
>> &electrons
>> diago_thr_init=5.0d-4
>> mixing_mode = 'plain'
>> mixing_beta = 0.7
>> mixing_ndim = 8
>> diagonalization = 'david'
>> diago_david_ndim = 4
>> diago_full_acc = .true.
>> electron_maxstep=3
>> scf_must_converge=.false.
>> /
>> ATOMIC_SPECIES
>> O 15.999 O.pbe-mt_fhi.UPF
>> H 1.008 H.pbe-mt_fhi.UPF
>> ATOMIC_POSITIONS alat
>> O 0.405369 0.567356 0.442192
>> H 0.471865 0.482160 0.381557
>> H 0.442867 0.572759 0.560178
>> O 0.584679 0.262476 0.215740
>> H 0.689058 0.204790 0.249459
>> H 0.503275 0.179176 0.173433
>> O 0.613936 0.468084 0.701359
>> H 0.720162 0.421081 0.658182
>> H 0.629377 0.503798 0.819016
>> O 0.692499 0.571474 0.008796
>> H 0.815865 0.562339 0.016182
>> H 0.640331 0.489132 0.085318
>> O 0.138542 0.767947 0.322270
>> H 0.052664 0.771819 0.411531
>> H 0.239736 0.710419 0.364788
>> O 0.127282 0.623278 0.765792
>> H 0.075781 0.693268 0.677441
>> H 0.243000 0.662182 0.787094
>> O 0.572799 0.844477 0.542529
>> H 0.556579 0.966998 0.533420
>> H 0.548297 0.791340 0.433292
>> O -0.007677 0.992860 0.095967
>> H 0.064148 1.011844 -0.003219
>> H 0.048026 0.913005 0.172625
>> O 0.035337 0.547318 0.085085
>> H 0.072732 0.625835 0.173379
>> H 0.089917 0.576762 -0.022194
>> O 0.666008 0.900155 0.183677
>> H 0.773299 0.937456 0.134145
>> H 0.609289 0.822407 0.105606
>> O 0.443447 0.737755 0.836152
>> H 0.526041 0.665651 0.893906
>> H 0.483300 0.762549 0.721464
>> O 0.934493 0.378765 0.627850
>> H 1.012721 0.449242 0.693201
>> H 0.955703 0.394823 0.506816
>> O 0.006386 0.270244 0.269327
>> H 0.021231 0.364797 0.190612
>> H 0.021863 0.163251 0.208755
>> O 0.936337 0.855942 0.611999
>> H 0.956610 0.972475 0.648965
>> H 0.815045 0.839173 0.592915
>> O 0.228881 0.037509 0.849634
>> H 0.263938 0.065862 0.734213
>> H 0.282576 -0.068680 0.884220
>> O 0.346187 0.176679 0.553828
>> H 0.247521 0.218347 0.491489
>> H 0.402671 0.271609 0.610010
>> K_POINTS automatic
>> 1 1 1 1 1 1
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
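=================
Appendix 1: the mp_start_diag switch (sketch)
=================

For anyone else who wants to reproduce the comparison above: the "new call" / "old call" distinction amounts to commenting/uncommenting two lines around line 80 of Modules/mp_global.f90. The two CALL statements are the ones quoted in the thread; the surrounding comments here are my own paraphrase, so the exact wording in your copy of 5.3.0 may differ. This is a sketch of the edit only, not the full routine:

! New default in QE 5.3.0: the diagonalization (Scalapack) group
! spans the whole pool.
CALL mp_start_diag ( ndiag_, intra_POOL_comm )
! Old (5.2.0) behaviour: the diagonalization group is confined to a
! single band group. To revert, comment out the call above and
! uncomment this one:
! CALL mp_start_diag ( ndiag_, intra_BGRP_comm )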
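=================
Appendix 2: what "Duplicate ranks in rank array" means (toy example)
=================

This is a minimal standalone toy, not QE code, just to show what the MPI layer is objecting to in the traceback above: MPI_Group_incl requires the rank array to contain distinct, valid ranks, so a duplicated entry makes MPICH-based libraries (including Cray's) abort with the same "Duplicate ranks in rank array" message. In other words, the rank list that 5.3.0 apparently builds for the Scalapack subgroup ends up with repeated entries when band parallelization is on. The program and file name below are made up for illustration; only MPI_Comm_group and MPI_Group_incl are real library calls.

! dup_ranks.f90 -- toy illustration only; run with at least 2 MPI
! processes, e.g.:  srun -n 2 ./dup_ranks
program dup_ranks
  use mpi
  implicit none
  integer :: ierr, world_group, new_group
  integer :: ranks(3)

  call MPI_Init(ierr)
  call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)

  ! Rank 0 appears twice; MPI_Group_incl requires distinct ranks, so
  ! with the default error handler this aborts with an error stack
  ! like the one quoted above ("Duplicate ranks in rank array ...").
  ranks = (/ 0, 1, 0 /)
  call MPI_Group_incl(world_group, 3, ranks, new_group, ierr)

  call MPI_Finalize(ierr)
end program dup_ranks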
_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum
