On Wed, 7 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAM> Hi,
EAM> Please find attached my best make.sys, to be run serially. Try this in your
EAM> system. My timings are close to yours. Below are the details. However, it
EAM> runs faster serially than using mpiexec -n 1.

ok, i tried running on my machine: with the intel fftw wrapper i get a wall
time of 13m43s, and using a multi-threaded fftw3 i need a wall time of 16m13s
to complete the job, but i have not yet added the additional tunings that i
added to CPMD and that finally made fftw3 faster there. in summary, it looks
as if on my hardware MPI is the winner. i would be interested to see whether
you get different timings with OpenMPI instead of MPICH.

[...]

EAM> > obviously, switching to the intel fft didn't help.
EAM>
EAM> FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.

on CPMD i found that using the multi-threaded fftw3 is actually _even_
faster. you will need to add one function call to tell the fftw3 planner
that all future plans should be generated for $OMP_NUM_THREADS threads
(a sketch of that call is included below, before the signature). the fact
that it helps only in the serial code is easily understood if you look at
what QE's FFT modules do differently when running in serial or in parallel:
if you run in serial, QE calls a 3d-FFT directly instead of a sequence of
1d/2d-FFTs. with the 3d-FFT you have the chance to parallelize in the same
way as with MPI by using threads. if you run in parallel, you already call
many small 1d-FFTs, and those don't parallelize well; instead, those calls
would have to be distributed across threads to get a similar gain.

EAM> > your system with many states and only gamma point
EAM> > is definitely a case that benefits the most from
EAM> > multi-threaded BLAS/LAPACK.
EAM>
EAM> TYPICAL FOR BO MOLECULAR DYNAMICS.
EAM> I WOULD SAY, AVOID MIXING MPI AND OPENMP. ALSO AVOID INTEL FFTW WRAPPERS
EAM> WITH MPI, EVEN IF OMP_NUM_THREADS=1.
EAM> USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS.

i don't think that this can be said in general, because your system is a
best-case scenario. in my experience, for plane-wave pseudopotential
calculations a serial executable is about 10% faster than a parallel one
running on a single task. the fact that you have a large system with only
the gamma point gives you the maximum benefit from parallel LAPACK/BLAS
and the multi-threaded FFT. however, if you want to do BO dynamics, i
suspect that you may lose the performance advantage, since the wavefunction
extrapolation will cut down the number of SCF cycles needed, and at the
same time the force calculation is not multi-threaded at all. to get a
real benefit from a multi-core machine, additional OpenMP directives would
need to be added to the QE code.

the fact that OpenMP libraries and MPI parallelization are somewhat
comparable could indicate that there is some more room to improve the MPI
parallelization. luckily, for most QE users the first, simple level of
parallelization across k-points will apply and give them a lot of speedup
without much effort, and only _then_ do the parallelization across G-space,
task groups, and finally threads/libraries/OpenMP directives come into play.

cheers,
   axel.

EAM>
EAM> ANYWAY, THE DIFFERENCE BETWEEN THE BEST MPI AND THE BEST OPENMP IS LESS THAN
EAM> 10% (11m30s vs 12m43s)
EAM>
EAM> >
EAM> >
EAM> > i'm curious to learn how these numbers match up
EAM> > with your performance measurements.
EAM> >
EAM> > cheers,
EAM> > axel.
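for reference, a minimal sketch of the planner setup referred to above,
written against FFTW3's C threading API (QE and CPMD would make the
equivalent calls from Fortran; the 64^3 grid and the in-place transform are
placeholders for illustration, not taken from either code):

  /* tell the fftw3 planner to create all subsequent plans for as many
     threads as OpenMP would use, i.e. $OMP_NUM_THREADS */
  #include <stdlib.h>
  #include <fftw3.h>
  #include <omp.h>

  int main(void)
  {
      const int n = 64;                      /* placeholder 3d grid dimension */

      fftw_init_threads();                   /* one-time setup of fftw's thread system */
      fftw_plan_with_nthreads(omp_get_max_threads());  /* honors OMP_NUM_THREADS */

      fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n * n * n);

      /* every plan created after fftw_plan_with_nthreads() is multi-threaded */
      fftw_plan p = fftw_plan_dft_3d(n, n, n, data, data,
                                     FFTW_FORWARD, FFTW_MEASURE);
      fftw_execute(p);

      fftw_destroy_plan(p);
      fftw_free(data);
      fftw_cleanup_threads();
      return 0;
  }

building this needs OpenMP enabled (e.g. -fopenmp) and the threaded fftw3
library on the link line (typically something like -lfftw3_threads -lfftw3
-lpthread -lm; the exact library name depends on how fftw3 was built).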
--
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S. 34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
