"underflows"? They should never be a problem, unless you instruct the compiler (by activating some obscure flag) to catch them.
Paolo

On Thu, Dec 1, 2016 at 4:48 PM, Sergi Vela <sergi.v...@gmail.com> wrote:

> Dear Paolo,
>
> I have some more details on the problem with DFT+U. The problem arises
> from underflows somewhere in the QE code, hence the MPI_Bcast message
> described in previous emails. A systematic crash occurs for the attached
> input in, at least, versions 5.1.1, 5.2, 5.4 and 6.0.
>
> According to the support team of HPC-GRNET, the problem is not related
> to MPI (no matter whether IntelMPI or OpenMPI; various versions of both)
> and it is not related to the BLAS libraries (MKL, OpenBLAS). For Intel
> compilers, the flag "-fp-model precise" seems to be necessary (at least
> for 5.2 and 5.4). In turn, GNU compilers work: they also notice the
> underflow (a message appears in the job file after completion), but it
> seems that they can handle it.
>
> The attached input is just an example. Many other jobs for different
> systems have failed, whereas other closely related inputs have run
> without any problem. I have the impression that the underflow is not
> always occurring or, at least, is not always enough to crash the job.
>
> Right now I'm extensively using version 5.1.1 compiled with the GNU/4.9
> compiler and it seems to work well.
>
> That's all the info I can give you about the problem. I hope it may
> eventually help.
>
> Best,
> Sergi
>
> 2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.v...@gmail.com>:
>
>> Dear Paolo,
>>
>> Unfortunately, there's not much to report so far. Many "relax" jobs for
>> a system of ca. 500 atoms (including Fe) fail giving the same message
>> Davide reported a long time ago:
>> _________________
>>
>> Fatal error in PMPI_Bcast: Other MPI error, error stack:
>> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
>> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
>> MPIR_Bcast_impl(1807)...:
>> MPIR_Bcast(1835)........:
>> I_MPIR_Bcast_intra(2016): Failure during collective
>> MPIR_Bcast_intra(1665)..: Failure during collective
>> _________________
>>
>> It only occurs on some architectures. The same inputs work for me on
>> two other machines, so it seems to be related to the compilation. The
>> support team of the HPC center I'm working at is trying to identify the
>> problem. It also seems to occur randomly, in the sense that for some
>> DFT+U calculations of the same type (same cutoffs, pp's, system) there
>> is no problem at all.
>>
>> I'll try to be more helpful next time, and I'll keep you updated.
>>
>> Best,
>> Sergi
>>
>> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>
>>> Thank you, but unless an example demonstrating the problem is
>>> provided, or at least some information on where this message comes
>>> from is supplied, there is close to nothing that can be done.
>>>
>>> Paolo
>>>
>>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.v...@gmail.com>
>>> wrote:
>>>
>>>> Dear Colleagues,
>>>>
>>>> Just to report that I'm having exactly the same problem with DFT+U.
>>>> The same message appears randomly, and only when I use the Hubbard
>>>> term. I could test versions 5.2 and 6.0 and it occurs in both.
>>>>
>>>> All my best,
>>>> Sergi
>>>>
>>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>>>
>>>>> There are many well-known problems of DFT+U, but none that is known
>>>>> to crash jobs with an obscure message.
>>>>>
>>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>>
>>>>> This signals a mismatch between what is sent and what is received in
>>>>> a broadcast operation. It may be due to an obvious bug, which however
>>>>> should show up at the first iteration, not after XX. Apart from
>>>>> compiler or MPI library bugs, another reason is the one described in
>>>>> sec. 8.3 of the developer manual: different processes following
>>>>> different execution paths. From time to time, cases like this are
>>>>> found (the latest occurrence was in the band parallelization of
>>>>> exact exchange) and easily fixed. Unfortunately, finding them (that
>>>>> is: where this happens) typically requires painstaking parallel
>>>>> debugging.
>>>>>
>>>>> Paolo
>>>>> --
>>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>>
>>>>> _______________________________________________
>>>>> Pw_forum mailing list
>>>>> Pw_forum@pwscf.org
>>>>> http://pwscf.org/mailman/listinfo/pw_forum

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222