"underflows"? They should never be a problem, unless you instruct the compiler (by activating some obscure flag) to catch them.
Paolo

On Thu, Dec 1, 2016 at 4:48 PM, Sergi Vela <sergi.v...@gmail.com> wrote:

> Dear Paolo,
>
> I have some more details on the problem with DFT+U. The problem arises
> from underflows somewhere in the QE code, hence the MPI_Bcast message
> described in previous emails. A systematic crash occurs for the attached
> input in, at least, versions 5.1.1, 5.2, 5.4 and 6.0.
>
> According to the support team of HPC-GRNET, the problem is not related
> to MPI (no matter whether IntelMPI or OpenMPI; various versions of both)
> and it is not related to the BLAS libraries (MKL, OpenBLAS). For Intel
> compilers, the flag "-fp-model precise" seems to be necessary (at least
> for 5.2 and 5.4). In turn, GNU compilers work: they also notice the
> underflow (a message appears in the job file after completion), but it
> seems that they can handle it.
>
> The attached input is just an example. Many other jobs for different
> systems have failed, whereas other closely related inputs have run
> without any problem. I have the impression that the underflow is not
> always occurring or, at least, is not always enough to crash the job.
>
> Right now I'm extensively using version 5.1.1 compiled with the GNU/4.9
> compiler and it seems to work well.
>
> That's all the info I can give you about the problem. I hope it may
> eventually help.
>
> Best,
> Sergi
>
> 2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.v...@gmail.com>:
>
>> Dear Paolo,
>>
>> Unfortunately, there's not much to report so far. Many "relax" jobs for
>> a system of ca. 500 atoms (including Fe) fail giving the same message
>> Davide reported a long time ago:
>> _________________
>>
>> Fatal error in PMPI_Bcast: Other MPI error, error stack:
>> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
>> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
>> MPIR_Bcast_impl(1807)...:
>> MPIR_Bcast(1835)........:
>> I_MPIR_Bcast_intra(2016): Failure during collective
>> MPIR_Bcast_intra(1665)..: Failure during collective
>> _________________
>>
>> It only occurs on some architectures. The same inputs work for me on
>> two other machines, so it seems to be related to the compilation. The
>> support team of the HPC center I'm working at is trying to identify the
>> problem. It also seems to occur randomly, in the sense that for some
>> DFT+U calculations of the same type (same cutoffs, pp's, system) there
>> is no problem at all.
>>
>> I'll try to be more helpful next time, and I'll keep you updated.
>>
>> Best,
>> Sergi
>>
>> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>
>>> Thank you, but unless an example demonstrating the problem is
>>> provided, or at least some information on where this message comes
>>> from is supplied, there is close to nothing that can be done.
>>>
>>> Paolo
>>>
>>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.v...@gmail.com>
>>> wrote:
>>>
>>>> Dear Colleagues,
>>>>
>>>> Just to report that I'm having exactly the same problem with DFT+U.
>>>> The same message appears randomly, and only when I use the Hubbard
>>>> term. I could test versions 5.2 and 6.0 and it occurs in both.
>>>>
>>>> All my best,
>>>> Sergi
>>>>
>>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>>>
>>>>> There are many well-known problems of DFT+U, but none that is known
>>>>> to crash jobs with an obscure message.
>>>>>
>>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>>
>>>>> This signals a mismatch between what is sent and what is received in
>>>>> a broadcast operation. It may be due to an obvious bug, which however
>>>>> should show up at the first iteration, not after XX. Apart from
>>>>> compiler or MPI library bugs, another reason is the one described in
>>>>> sec. 8.3 of the developer manual: different processes following
>>>>> different execution paths. From time to time, cases like this are
>>>>> found (the latest occurrence was in the band parallelization of
>>>>> exact exchange) and easily fixed. Unfortunately, finding them (that
>>>>> is: where this happens) typically requires painstaking parallel
>>>>> debugging.
>>>>>
>>>>> Paolo
>>>>> --
>>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>>
>>>>> _______________________________________________
>>>>> Pw_forum mailing list
>>>>> Pw_forum@pwscf.org
>>>>> http://pwscf.org/mailman/listinfo/pw_forum

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222