Recent changes to the way band parallelization is performed seem to be
incompatible with ScaLAPACK. The problem is related to the obscure hacks
needed to convince ScaLAPACK to work on a subgroup of processors. If you
revert to the previous way of setting up linear-algebra parallelization
(sketched below), things should work (or not work) as before, so the second
problem you mention may have a different origin. You should verify whether
you can run:
- with the new version, the old call to mp_start_diag, and no band parallelization;
- with an old version, with or without band parallelization.
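
For reference, the revert amounts to swapping which of the two calls is active
in Modules/mp_global.f90 (around line 80 in 5.3.0; exact line numbers may
differ in your copy), roughly:

  ! 5.3.0 default: linear-algebra (ScaLAPACK) groups are built inside the pool communicator
  ! CALL mp_start_diag ( ndiag_, intra_POOL_comm )
  ! pre-5.3 behaviour: linear-algebra groups are built inside the band-group communicator
  CALL mp_start_diag ( ndiag_, intra_BGRP_comm )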
BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading to
small but non-negligible discrepancies with respect to the results of other codes.

Paolo

On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <[email protected]> wrote:

> Dear All,
>
>    I have found that calculations involving band group parallelism that
> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
> for an example input file).  In particular, when I run a PBE0 calculation
> with either nbgrp or ndiag set to 1, everything runs correctly; however,
> when I run a calculation with both nbgrp and ndiag set greater than 1, the
> calculation immediately fails with the following error messages:
>
> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> etc...
>
>    The error is apparently related to a change in Modules/mp_global.f90 on
> line 80.  Here, the line previously read:
>
> CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>
> In QE 5.3.0, this has been changed to:
>
> CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
>
>    The call using intra_BGRP_comm still exists in version 5.3.0 of the
> code, but is commented out, and the surrounding comments indicate that it
> should be possible to switch back to the old parallelization by
> commenting/uncommenting as desired.  When I do this, I find that instead of
> the error messages described above, I get the following error messages:
>
> Error in routine  cdiaghg(193):
>   problems computing cholesky
>
>    Am I missing something, or are these errors the result of a bug?
>
> Best Regards,
>
> Dr. Taylor Barnes,
> Lawrence Berkeley National Laboratory
>
>
> =================
> Run Command:
> =================
>
> srun -n 96 pw.x -nbgrp 4 -in input > input.out
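>
> (For completeness, a run that makes both settings explicit would look
> something like the line below; the -ndiag value of 16 is only an
> illustration, since any value greater than 1 reproduces the failing
> combination described above.)
>
> srun -n 96 pw.x -nbgrp 4 -ndiag 16 -in input > input.out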
>
>
>
> =================
> Input File:
> =================
>
> &control
> prefix = 'water'
> calculation = 'scf'
> restart_mode = 'from_scratch'
> wf_collect = .true.
> disk_io = 'none'
> tstress = .false.
> tprnfor = .false.
> outdir = './'
> wfcdir = './'
> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
> /
> &system
> ibrav = 1
> celldm(1) = 15.249332837
> nat = 48
> ntyp = 2
> ecutwfc = 130
> input_dft = 'pbe0'
> /
> &electrons
> diago_thr_init=5.0d-4
> mixing_mode = 'plain'
> mixing_beta = 0.7
> mixing_ndim = 8
> diagonalization = 'david'
> diago_david_ndim = 4
> diago_full_acc = .true.
> electron_maxstep=3
> scf_must_converge=.false.
> /
> ATOMIC_SPECIES
> O   15.999   O.pbe-mt_fhi.UPF
> H    1.008   H.pbe-mt_fhi.UPF
> ATOMIC_POSITIONS alat
>  O   0.405369   0.567356   0.442192
>  H   0.471865   0.482160   0.381557
>  H   0.442867   0.572759   0.560178
>  O   0.584679   0.262476   0.215740
>  H   0.689058   0.204790   0.249459
>  H   0.503275   0.179176   0.173433
>  O   0.613936   0.468084   0.701359
>  H   0.720162   0.421081   0.658182
>  H   0.629377   0.503798   0.819016
>  O   0.692499   0.571474   0.008796
>  H   0.815865   0.562339   0.016182
>  H   0.640331   0.489132   0.085318
>  O   0.138542   0.767947   0.322270
>  H   0.052664   0.771819   0.411531
>  H   0.239736   0.710419   0.364788
>  O   0.127282   0.623278   0.765792
>  H   0.075781   0.693268   0.677441
>  H   0.243000   0.662182   0.787094
>  O   0.572799   0.844477   0.542529
>  H   0.556579   0.966998   0.533420
>  H   0.548297   0.791340   0.433292
>  O  -0.007677   0.992860   0.095967
>  H   0.064148   1.011844  -0.003219
>  H   0.048026   0.913005   0.172625
>  O   0.035337   0.547318   0.085085
>  H   0.072732   0.625835   0.173379
>  H   0.089917   0.576762  -0.022194
>  O   0.666008   0.900155   0.183677
>  H   0.773299   0.937456   0.134145
>  H   0.609289   0.822407   0.105606
>  O   0.443447   0.737755   0.836152
>  H   0.526041   0.665651   0.893906
>  H   0.483300   0.762549   0.721464
>  O   0.934493   0.378765   0.627850
>  H   1.012721   0.449242   0.693201
>  H   0.955703   0.394823   0.506816
>  O   0.006386   0.270244   0.269327
>  H   0.021231   0.364797   0.190612
>  H   0.021863   0.163251   0.208755
>  O   0.936337   0.855942   0.611999
>  H   0.956610   0.972475   0.648965
>  H   0.815045   0.839173   0.592915
>  O   0.228881   0.037509   0.849634
>  H   0.263938   0.065862   0.734213
>  H   0.282576  -0.068680   0.884220
>  O   0.346187   0.176679   0.553828
>  H   0.247521   0.218347   0.491489
>  H   0.402671   0.271609   0.610010
> K_POINTS automatic
> 1 1 1 1 1 1
>
>
>
>
>



-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum
