Recent changes to the way band parallelization is performed seem to be
incompatible with ScaLAPACK. The problem is related to the obscure hacks
needed to convince ScaLAPACK to work on a subgroup of processors. If you
revert to the previous way of setting up linear-algebra parallelization
(sketched below), things should work (or not work) as before, so the second
problem you mention may have a different origin. You should verify whether
you can run:
- with the new version, the old call to mp_start_diag, and no band parallelization;
- with an old version, with or without band parallelization.
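
For reference, the revert amounts to swapping which of the two calls is active
in Modules/mp_global.f90 (around line 80 in 5.3.0; exact line numbers may
differ in your copy), roughly:

  ! 5.3.0 default: linear-algebra (ScaLAPACK) groups are built inside the pool communicator
  ! CALL mp_start_diag ( ndiag_, intra_POOL_comm )
  ! pre-5.3 behaviour: linear-algebra groups are built inside the band-group communicator
  CALL mp_start_diag ( ndiag_, intra_BGRP_comm )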
BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading to
small but non-negligible discrepancies with respect to the results of other codes.

Paolo

On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <[email protected]> wrote:

> Dear All,
>
>    I have found that calculations involving band group parallelism that
> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
> for an example input file).  In particular, when I run a PBE0 calculation
> with either nbgrp or ndiag set to 1, everything runs correctly; however,
> when I run a calculation with both nbgrp and ndiag set greater than 1, the
> calculation immediately fails with the following error messages:
>
> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> etc...
>
>    The error is apparently related to a change in Modules/mp_global.f90 on
> line 80.  Here, the line previously read:
>
> CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>
> In QE 5.3.0, this has been changed to:
>
> CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
>
>    The call using intra_BGRP_comm still exists in version 5.3.0 of the
> code, but is commented out, and the surrounding comments indicate that it
> should be possible to switch back to the old parallelization by
> commenting/uncommenting as desired.  When I do this, I find that instead of
> the error messages described above, I get the following error messages:
>
> Error in routine  cdiaghg(193):
>   problems computing cholesky
>
>    Am I missing something, or are these errors the result of a bug?
>
> Best Regards,
>
> Dr. Taylor Barnes,
> Lawrence Berkeley National Laboratory
>
>
> =================
> Run Command:
> =================
>
> srun -n 96 pw.x -nbgrp 4 -in input > input.out
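>
> (For completeness, a run that makes both settings explicit would look
> something like the line below; the -ndiag value of 16 is only an
> illustration, since any value greater than 1 reproduces the failing
> combination described above.)
>
> srun -n 96 pw.x -nbgrp 4 -ndiag 16 -in input > input.out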
>
>
>
> =================
> Input File:
> =================
>
> &control
> prefix = 'water'
> calculation = 'scf'
> restart_mode = 'from_scratch'
> wf_collect = .true.
> disk_io = 'none'
> tstress = .false.
> tprnfor = .false.
> outdir = './'
> wfcdir = './'
> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
> /
> &system
> ibrav = 1
> celldm(1) = 15.249332837
> nat = 48
> ntyp = 2
> ecutwfc = 130
> input_dft = 'pbe0'
> /
> &electrons
> diago_thr_init=5.0d-4
> mixing_mode = 'plain'
> mixing_beta = 0.7
> mixing_ndim = 8
> diagonalization = 'david'
> diago_david_ndim = 4
> diago_full_acc = .true.
> electron_maxstep=3
> scf_must_converge=.false.
> /
> ATOMIC_SPECIES
> O   15.999   O.pbe-mt_fhi.UPF
> H    1.008   H.pbe-mt_fhi.UPF
> ATOMIC_POSITIONS alat
>  O   0.405369   0.567356   0.442192
>  H   0.471865   0.482160   0.381557
>  H   0.442867   0.572759   0.560178
>  O   0.584679   0.262476   0.215740
>  H   0.689058   0.204790   0.249459
>  H   0.503275   0.179176   0.173433
>  O   0.613936   0.468084   0.701359
>  H   0.720162   0.421081   0.658182
>  H   0.629377   0.503798   0.819016
>  O   0.692499   0.571474   0.008796
>  H   0.815865   0.562339   0.016182
>  H   0.640331   0.489132   0.085318
>  O   0.138542   0.767947   0.322270
>  H   0.052664   0.771819   0.411531
>  H   0.239736   0.710419   0.364788
>  O   0.127282   0.623278   0.765792
>  H   0.075781   0.693268   0.677441
>  H   0.243000   0.662182   0.787094
>  O   0.572799   0.844477   0.542529
>  H   0.556579   0.966998   0.533420
>  H   0.548297   0.791340   0.433292
>  O  -0.007677   0.992860   0.095967
>  H   0.064148   1.011844  -0.003219
>  H   0.048026   0.913005   0.172625
>  O   0.035337   0.547318   0.085085
>  H   0.072732   0.625835   0.173379
>  H   0.089917   0.576762  -0.022194
>  O   0.666008   0.900155   0.183677
>  H   0.773299   0.937456   0.134145
>  H   0.609289   0.822407   0.105606
>  O   0.443447   0.737755   0.836152
>  H   0.526041   0.665651   0.893906
>  H   0.483300   0.762549   0.721464
>  O   0.934493   0.378765   0.627850
>  H   1.012721   0.449242   0.693201
>  H   0.955703   0.394823   0.506816
>  O   0.006386   0.270244   0.269327
>  H   0.021231   0.364797   0.190612
>  H   0.021863   0.163251   0.208755
>  O   0.936337   0.855942   0.611999
>  H   0.956610   0.972475   0.648965
>  H   0.815045   0.839173   0.592915
>  O   0.228881   0.037509   0.849634
>  H   0.263938   0.065862   0.734213
>  H   0.282576  -0.068680   0.884220
>  O   0.346187   0.176679   0.553828
>  H   0.247521   0.218347   0.491489
>  H   0.402671   0.271609   0.610010
> K_POINTS automatic
> 1 1 1 1 1 1
>
>
>
>
>



-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum
