Thanks Lorenzo, I hope so too. I think the best references are Examples 4 and
10. I have this tendency to just push ahead once I get something working; I
need to work on that :P
Indeed, I have reproduced almost exactly what you described. What I can
confirm when using bp_c_phase (no electric field):
- all gdir values work, but only gdir=3 shows a notable improvement in performance.
- when gdir=3, scaling is good up to 4 processors; on 8 it is terrible and the
run actually takes longer, with WALL time notably larger than CPU time.
- the call 'CALL mp_sum(aux_g(:), intra_bgrp_comm)' is made only when gdir != 3.
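For what it's worth, mp_sum in Quantum ESPRESSO is a wrapper around an MPI sum
reduction (MPI_Allreduce with MPI_SUM): it sums aux_g element-wise across all
ranks of the communicator, so every processor ends up holding the full reduced
array. A minimal conceptual sketch in Python (plain lists stand in for the
per-rank copies of aux_g; the function name is mine, not QE's):

```python
# Conceptual sketch of QE's mp_sum(aux_g, intra_bgrp_comm):
# an element-wise sum across all ranks of the communicator.
# Plain lists stand in for each rank's partial copy of aux_g.

def mp_sum_sketch(per_rank_arrays):
    """Return the element-wise sum that every rank would hold
    after the all-reduce."""
    n = len(per_rank_arrays[0])
    reduced = [0.0] * n
    for arr in per_rank_arrays:
        for i, v in enumerate(arr):
            reduced[i] += v
    return reduced

# Each of 4 "ranks" holds a partial contribution to aux_g:
rank_data = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0], [0.5, 1.0]]
print(mp_sum_sketch(rank_data))  # -> [2.0, 4.0]
```

Because every rank must participate, this call is a synchronisation point, which
is one place where parallel scaling can stall.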
My current understanding is that mp_sum sums 'aux_g' element-wise across all
processors in intra_bgrp_comm, whereas for gdir=3 significantly less code is
executed to build the matrix 'aux', which is finally used to build 'mat'. The
matrix 'evc' holds the wavefunctions expanded in plane waves, but 'evc' is used
in many files. Since bp_c_phase is executed last, 'evc' has already been built
and is only read in this file. Comparing the output, I notice that performance
when gdir=3 is better for almost all routines. I will continue debugging
tomorrow on the 8-processor machine, where the differences are much more
noticeable. Do you think I should contact Paolo Giannozzi directly to better
understand what is going on here?
Thanks so much!
Louis
________________________________
From: [email protected] <[email protected]> on behalf of
Lorenzo Paulatto <[email protected]>
Sent: 13 February 2017 13:04:22
To: PWSCF Forum
Subject: Re: [Pw_forum] PW.x homogeneous electric field berry phase calculation
in trigonal cell
On Monday, February 13, 2017 11:43:08 AM CET Louis Fry-Bouriaux wrote:
> Finally, when you were talking about the bottleneck, I suppose you were
> talking about the efield code. My impression so far is that this is not a
> problem using 4 processors; I will also test using 8 and compare the time
> taken. I have no idea how fast it 'should' be with proper parallelisation,
> assuming it is possible to parallelise.
When you increase the number of CPUs, you would expect the time to decrease
linearly. If, above a certain number of CPUs, it stops decreasing, or decreases
more slowly than linearly, that is a bottleneck. This will always happen
eventually, but with berry/lelfield it happens much sooner.
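This saturation is exactly what Amdahl's law predicts: if a fraction s of the
run is serial, the speedup can never exceed 1/s no matter how many CPUs you
add. A small sketch (the 10% serial fraction below is an illustrative
assumption, not a measurement of the berry/lelfield code):

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Ideal speedup on n_cpus when serial_fraction of the work
    cannot be parallelised (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

# With a 10% serial bottleneck, speedup flattens well below linear:
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.10, n), 2))
# 4 CPUs give ~3.1x, but 8 CPUs give only ~4.7x, not 8x.
```

Communication overhead (e.g. all-reduce calls) can even make the wall time grow
with CPU count, which would match the behaviour reported on 8 processors.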
Thank you for reporting back! I hope this information will be useful to future
users.
--
Dr. Lorenzo Paulatto
IdR @ IMPMC -- CNRS & Université Paris 6
phone: +33 (0)1 442 79822 / skype: paulatz
www: http://www-int.impmc.upmc.fr/~paulatto/
mail: 23-24/423 Boîte courrier 115, 4 place Jussieu 75252 Paris Cédex 05
_______________________________________________
Pw_forum mailing list
[email protected]
http://pwscf.org/mailman/listinfo/pw_forum