It is not easy to give a unique answer to this question, because the performance depends on:
a) your case (size of the problem)
b) your specific hardware (in particular the network speed)
c) your MPI and MKL software (versions).

In my experience (subject to the remarks above), and as is clearly written in the UG (user's guide) section on parallelization:

I run small cases (usually below 50 atoms/cell) on a simple local PC cluster with a Gigabit network on about 10-50 cores (depending on size and number of k-points). For these cases I use k-parallelism and OMP_NUM_THREADS=2. OMP_NUM_THREADS=4 gives me only a very small performance increase, so I do not use it (maybe it helps more with the latest MKL?). However, I have never experienced a "crash" after 2 cycles.

I run larger cases (where the matrix size is too big for a single computer) on a big cluster with 16-core nodes, Infiniband and a queuing system. The MINIMUM number of mpi processes is 16 (fewer is usually useless), but for cases with a couple of hundred atoms/cell I have also used up to 512 cores. Often I combine k-point parallelism (usually we have only 1-8 k-points for such large cells) with mpi-parallelism.
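The two setups described above differ mainly in the .machines file that is read when the calculation is started in parallel mode (run_lapw -p). The following is only a minimal sketch, assuming the .machines syntax from the UG (weight:host for a k-point job, weight:host:N for an mpi job with N processes). The helper machines_file and all host names are hypothetical, and the thread count is still set separately through OMP_NUM_THREADS in the environment.

```python
def machines_file(hosts, mpi_procs_per_job=1):
    """Build the text of a .machines file with one k-point job per host.

    hosts             -- list of host names (hypothetical placeholders)
    mpi_procs_per_job -- 1 gives plain k-point parallelism; a value > 1
                         turns each k-point job into an mpi job with that
                         many processes on the given host
    """
    lines = ["granularity:1"]
    for host in hosts:
        if mpi_procs_per_job > 1:
            # coupled k-point + mpi parallelism: weight:host:nprocs
            lines.append(f"1:{host}:{mpi_procs_per_job}")
        else:
            # pure k-point parallelism: weight:host
            lines.append(f"1:{host}")
    if mpi_procs_per_job > 1:
        # run lapw0 with mpi on the first host (an arbitrary choice here)
        lines.append(f"lapw0:{hosts[0]}:{mpi_procs_per_job}")
    lines.append("extrafine:1")
    return "\n".join(lines) + "\n"


# Small case on a local PC cluster: k-point parallelism only.
small = machines_file(["pc01", "pc02", "pc03"])

# Large case on 16-core nodes: one 16-process mpi job per k-point job.
large = machines_file(["node01", "node02"], mpi_procs_per_job=16)

print(small)
print(large)
```

On a cluster with a queuing system the host list is normally not known in advance, so in practice it would have to be taken from whatever node list the batch system provides for the job.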

Final remarks:
On a Gigabit network mpi-parallel is "useless".
The mpi-parallel version is about a factor of 2 "slower" and takes 2x as much memory as the sequential code. Thus you need a "sizable" number of cores, and mpi on a single "quadcore" cpu is therefore not very useful. And for large cases, ALWAYS use "iterative diagonalization" (and an "adapted (optimized)" RKMAX and k-point mesh); otherwise calculations will run "forever"!
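To see why the factor of 2 matters, here is a back-of-the-envelope estimate. It is only a sketch under an optimistic assumption that is not from the remarks above: that the mpi code scales perfectly with the number of cores. Real scaling will be worse and depends on the network. The function name and the core counts are just illustrative.

```python
def ideal_mpi_speedup(n_cores, mpi_overhead=2.0):
    """Upper bound on the speedup of the mpi code over the sequential code.

    Assumes perfect scaling with n_cores (an idealization) and divides by
    the overhead factor of about 2 quoted for the mpi-parallel version.
    """
    return n_cores / mpi_overhead


for n_cores in (4, 16, 64):
    speedup = ideal_mpi_speedup(n_cores)
    print(f"{n_cores:3d} cores: at most about {speedup:.0f}x the sequential speed")
```

Under this assumption a quadcore cpu reaches at most about twice the sequential speed with mpi, whereas k-point parallelism over the same 4 cores does not pay the factor-2 penalty as long as there are enough k-points to distribute.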


On 01/13/2016 08:25 AM, Hu, Wenhao wrote:
Hi, all:

I ran into some confusion when comparing the efficiency of MPI and
multi-threaded calculations. In the lapw1 stage of the same case, I found that
MPI takes twice as long as multi-threading. Beyond that, it even takes
longer than k-point parallelization without a multi-thread setup. Can anyone
tell me in which cases MPI performs better? Another question is about
the number of threads per job. When I increase OMP_NUM_THREADS from 2 to 4,
my process usually crashes after two cycles, although it does speed up
the finished cycle. Is this normal? Is there an optimal
number of threads?

Best,
Wenhao
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


--

                                      P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------