Recent changes to the way band parallelization is performed seem to be incompatible with Scalapack. The problem is related to the obscure hacks needed to convince Scalapack to work in a subgroup of processors. If you revert to the previous way of setting linear-algebra parallelization, things should work (or not work) as before, so the latter problem you mention may have other origins. You should verify whether you manage to run:
- with the new version, old call to mp_start_diag, no band parallelization;
- with an old version, with or without band parallelization.

BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading to small but non-negligible discrepancies with the results of other codes.

Paolo
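For reference, the revert described above amounts to choosing which communicator is passed to mp_start_diag in Modules/mp_global.f90, as quoted in the message below. A minimal sketch of the two alternatives, based only on the call names quoted there (the exact surrounding code depends on the QE version):

    ! QE 5.3.0 default: linear-algebra (diagonalization) group is built
    ! inside the k-point pool communicator
    CALL mp_start_diag ( ndiag_, intra_POOL_comm )
    ! Previous (5.2.0-style) behaviour: group built inside the band-group
    ! communicator; uncomment this call and comment out the one above
    ! CALL mp_start_diag ( ndiag_, intra_BGRP_comm )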
On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <[email protected]> wrote:

> Dear All,
>
> I have found that calculations involving band group parallelism that
> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
> for an example input file). In particular, when I run a PBE0 calculation
> with either nbgrp or ndiag set to 1, everything runs correctly; however,
> when I run a calculation with both nbgrp and ndiag set greater than 1, the
> calculation immediately fails with the following error messages:
>
> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> etc...
>
> The error is apparently related to a change in Modules/mp_global.f90 on
> line 80. Here, the line previously read:
>
> CALL mp_start_diag ( ndiag_, intra_BGRP_comm )
>
> In QE 5.3.0, this has been changed to:
>
> CALL mp_start_diag ( ndiag_, intra_POOL_comm )
>
> The call using intra_BGRP_comm still exists in version 5.3.0 of the
> code, but is commented out, and the surrounding comments indicate that it
> should be possible to switch back to the old parallelization by
> commenting/uncommenting as desired. When I do this, I find that instead of
> the error messages described above, I get the following error messages:
>
> Error in routine cdiaghg(193):
> problems computing cholesky
>
> Am I missing something, or are these errors the result of a bug?
>
> Best Regards,
>
> Dr. Taylor Barnes,
> Lawrence Berkeley National Laboratory
>
> =================
> Run Command:
> =================
>
> srun -n 96 pw.x -nbgrp 4 -in input > input.out
>
> =================
> Input File:
> =================
>
> &control
> prefix = 'water'
> calculation = 'scf'
> restart_mode = 'from_scratch'
> wf_collect = .true.
> disk_io = 'none'
> tstress = .false.
> tprnfor = .false.
> outdir = './'
> wfcdir = './'
> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
> /
> &system
> ibrav = 1
> celldm(1) = 15.249332837
> nat = 48
> ntyp = 2
> ecutwfc = 130
> input_dft = 'pbe0'
> /
> &electrons
> diago_thr_init=5.0d-4
> mixing_mode = 'plain'
> mixing_beta = 0.7
> mixing_ndim = 8
> diagonalization = 'david'
> diago_david_ndim = 4
> diago_full_acc = .true.
> electron_maxstep=3
> scf_must_converge=.false.
> /
> ATOMIC_SPECIES
> O 15.999 O.pbe-mt_fhi.UPF
> H 1.008 H.pbe-mt_fhi.UPF
> ATOMIC_POSITIONS alat
> O 0.405369 0.567356 0.442192
> H 0.471865 0.482160 0.381557
> H 0.442867 0.572759 0.560178
> O 0.584679 0.262476 0.215740
> H 0.689058 0.204790 0.249459
> H 0.503275 0.179176 0.173433
> O 0.613936 0.468084 0.701359
> H 0.720162 0.421081 0.658182
> H 0.629377 0.503798 0.819016
> O 0.692499 0.571474 0.008796
> H 0.815865 0.562339 0.016182
> H 0.640331 0.489132 0.085318
> O 0.138542 0.767947 0.322270
> H 0.052664 0.771819 0.411531
> H 0.239736 0.710419 0.364788
> O 0.127282 0.623278 0.765792
> H 0.075781 0.693268 0.677441
> H 0.243000 0.662182 0.787094
> O 0.572799 0.844477 0.542529
> H 0.556579 0.966998 0.533420
> H 0.548297 0.791340 0.433292
> O -0.007677 0.992860 0.095967
> H 0.064148 1.011844 -0.003219
> H 0.048026 0.913005 0.172625
> O 0.035337 0.547318 0.085085
> H 0.072732 0.625835 0.173379
> H 0.089917 0.576762 -0.022194
> O 0.666008 0.900155 0.183677
> H 0.773299 0.937456 0.134145
> H 0.609289 0.822407 0.105606
> O 0.443447 0.737755 0.836152
> H 0.526041 0.665651 0.893906
> H 0.483300 0.762549 0.721464
> O 0.934493 0.378765 0.627850
> H 1.012721 0.449242 0.693201
> H 0.955703 0.394823 0.506816
> O 0.006386 0.270244 0.269327
> H 0.021231 0.364797 0.190612
> H 0.021863 0.163251 0.208755
> O 0.936337 0.855942 0.611999
> H 0.956610 0.972475 0.648965
> H 0.815045 0.839173 0.592915
> O 0.228881 0.037509 0.849634
> H 0.263938 0.065862 0.734213
> H 0.282576 -0.068680 0.884220
> O 0.346187 0.176679 0.553828
> H 0.247521 0.218347 0.491489
> H 0.402671 0.271609 0.610010
> K_POINTS automatic
> 1 1 1 1 1 1

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
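As a side note on the first error quoted above: "Duplicate ranks in rank array" is the generic MPI failure raised when the rank list handed to MPI_Group_incl contains the same rank more than once. A minimal standalone Fortran sketch (hypothetical test code, not QE internals) that triggers the same class of error; with the default error handler the run aborts with a message of the same form as in the report:

    program duplicate_ranks_demo
      ! MPI_Group_incl requires all entries of the rank array to be
      ! distinct; listing a rank twice reproduces the class of error
      ! quoted above (exact behaviour on error depends on the MPI
      ! implementation and the error handler in use).
      use mpi
      implicit none
      integer :: ierr, world_group, new_group
      integer :: ranks(2)

      call MPI_Init(ierr)
      call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)

      ranks = (/ 0, 0 /)   ! rank 0 appears twice: invalid rank array
      call MPI_Group_incl(world_group, 2, ranks, new_group, ierr)

      call MPI_Group_free(world_group, ierr)
      call MPI_Finalize(ierr)
    end program duplicate_ranks_demo

In the QE case this would suggest that the list of ranks assembled for the diagonalization subgroup ends up with repeated entries when the group is built with band parallelization active, which would be consistent with Paolo's remark above about convincing Scalapack to work in a subgroup of processors.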
_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum
