On Tue, 6 May 2008, Nicola Marzari wrote:

NM> Dear Eduardo,
hi nicola,

NM> 1) no improvements with the Intel fftw2 wrapper, as opposed to fftw2
NM> Q-E sources, when using mpi. I also never managed to successfully run
NM> with the Intel fftw3 wrapper (or with fftw3 - that probably says
NM> something about me).

no, it doesn't.

NM> 2) great improvements of a serial code (different from Q-E) when using
NM> the automatic parallelism of MKL in quad-cores.

nod. i just yesterday made some tests with different BLAS/LAPACK
implementations, and it turns out that the 10.0 MKL is pretty efficient
at parallelizing tasks like DGEMM. through the use of SSE2/3/4 and
multi-threading you can easily get a factor of 6 improvement on a
4-core node.

NM> 3) btw, MPICH has always been for us the slower protocol, compared with
NM> LAMMPI or OpenMPI
NM>
NM> I actually wonder if the best solution on a quad-core would be, say,
NM> to use two cores for MPI, and the other two for the openmp threads.

this is a _very_ tricky issue. usually, for plane-wave pseudopotential
codes, the distributed-data parallelization is pretty efficient, except
for the 3d fourier transforms across the whole data set, which are very
sensitive to network latencies. for jobs using k-points, you also have
the option to parallelize over k-points, which is very efficient even
on not-so-fast networks. with the CVS versions, you have yet another
level of parallelism added (parallelization over functions instead of
data = task groups). thus, given an ideal network, you first want to
exploit MPI parallelism maximally, and then what is left over is rather
small, and - sadly - OpenMP doesn't work very efficiently on that: the
overhead of spawning, synchronizing and joining threads is too high
compared to the gain from the parallelism.

but we live in a real world, and there are unexpected side effects and
non-ideal machines and networks. e.g. when using nodes with many cores,
say two-socket quad-core, you have to "squeeze" a lot of communication
through just one network card (be it infiniband, myrinet or ethernet),
which will serialize communication and add unwanted conflicts and
latencies. i've seen this happen particularly when using a very large
number of nodes, where you can run out of (physical) memory simply
because of the way the low-level communication was programmed. in that
case you may indeed be better off using only half or a quarter of the
cores for MPI and then setting OMP_NUM_THREADS to 2, or even keeping it
at 1 (because that will, provided you have an MPI with processor
affinity and optimal job placement, double the cpu cache per task).

a particularly interesting case to discuss from this perspective is
multi-core nodes connected by a high-latency TCP/IP network (e.g.
gigabit ethernet). here you reach the limit of scaling pretty fast with
one MPI task per node, and using multiple MPI tasks per node mostly
multiplies the latencies, which does not help. under those
circumstances the data set is still rather large, and then OpenMP
parallelism can help to get the most out of a given machine. as noted
before, it would be _even_ better if OpenMP directives were added to
the time-critical and multi-threadable parts of Q-E. i have experienced
this in CPMD, where i managed to get about 80% of the MPI performance
with the latest (extensively threaded) development sources and a fully
multi-threaded toolchain on a single node. however, running across
multiple nodes quickly reduces the effectiveness of the OpenMP support:
with just two nodes you are at only 60%.
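to make the "use only part of the cores for MPI and the rest for
threads" idea concrete, here is a minimal sketch of how such a hybrid
run could be launched. the '-ppn' (processes-per-node) flag, the node
counts and the job.in/job.out file names are just assumptions for
illustration - other MPI implementations (openmpi, lam, ...) use
different placement options, so adapt this to whatever you actually
have installed:

  # hybrid MPI + OpenMP launch on, say, 4 two-socket quad-core nodes
  # (8 cores per node, but only 4 MPI tasks per node = 16 tasks total).
  # NOTE: '-ppn' is an MPICH-style "processes per node" flag; other
  # MPIs use different options for process placement.
  NODES=4
  TASKS_PER_NODE=4                    # half of the 8 cores on each node
  NTASKS=$((NODES * TASKS_PER_NODE))

  export OMP_NUM_THREADS=2            # two threads per MPI task ...
  export MKL_NUM_THREADS=2            # ... for OpenMP regions and MKL

  mpiexec -n $NTASKS -ppn $TASKS_PER_NODE pw.x < job.in > job.out

  # alternative: keep OMP_NUM_THREADS=1 / MKL_NUM_THREADS=1 and still
  # run only half the cores, so that each task (with processor affinity
  # and sensible placement) effectively gets twice the cpu cache.

which of the two variants wins is something you have to benchmark on
your own machine, network and inputs.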
now, deciding on the best combination of options is a very tricky,
multi-dimensional optimization problem. you have to consider the
following:

- the size of the typical problem and the type of job
- whether you can benefit from k-point parallelism
- whether you prefer faster execution over cost efficiency and throughput
- the total amount of money you want to spend
- the skill set of the people who have to run the machine
- how many people have to share the machine
- how i/o-bound the jobs are
- how much memory you need, and how much money you are willing to
  invest in faster memory
- failure rates and the level of service (gigabit equipment is easily
  available)

also, some of those parameters are (non-linearly) coupled, which makes
the decision-making process even nastier.

cheers,
   axel.

NM> I eagerly await Axel's opinion.
NM>
NM> nicola
NM>
NM> Eduardo Ariel Menendez Proupin wrote:
NM> > Hi,
NM> > I have noted recently that I am able to obtain faster binaries of
NM> > pw.x using the OpenMP parallelism implemented in the Intel MKL
NM> > libraries of version 10.xxx than using MPICH on Intel cpus.
NM> > Previously I had always gotten better performance using MPI. I
NM> > would like to know of other experiences on how to make the machines
NM> > faster. Let me explain in more detail.
NM> >
NM> > Compiling using MPI means using mpif90 as linker and compiler,
NM> > linking against mkl_ia32 or mkl_em64t, and using the link flags
NM> > -i-static -openmp. This is just what appears in make.sys after
NM> > running configure in version 4cvs.
NM> >
NM> > At runtime, I set
NM> > export OMP_NUM_THREADS=1
NM> > export MKL_NUM_THREADS=1
NM> > and run using
NM> > mpiexec -n $NCPUs pw.x <input >output
NM> > where NCPUs is the number of cores available in the system.
NM> >
NM> > The second choice is
NM> > ./configure --disable-parallel
NM> >
NM> > and at runtime
NM> > export OMP_NUM_THREADS=$NCPU
NM> > export MKL_NUM_THREADS=$NCPU
NM> > and run using
NM> > pw.x <input >output
NM> >
NM> > I have tested it on quad-cores (NCPU=4) and on an old dual Xeon
NM> > B.C. (before cores) (NCPU=2).
NM> >
NM> > Before April 2007, the first choice had always worked faster. Since
NM> > I came to use MKL 10.xxx, the second choice works faster. I have
NM> > found no significant difference between versions 3.2.3 and 4cvs.
NM> >
NM> > A special comment is for the FFT library. The MKL has a wrapper to
NM> > FFTW that must be compiled after installation (it is very easy).
NM> > This creates additional libraries named like libfftw3xf_intel.a and
NM> > libfftw2xf_intel.a. This improves the performance of the second
NM> > choice, especially with libfftw3xf_intel.a.
NM> >
NM> > Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source
NM> > distributed with espresso, i.e., there is no gain in using
NM> > libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never
NM> > been able to run pw.x successfully, it just aborts.
NM> >
NM> > I would like to hear of your experiences.
NM> >
NM> > Best regards
NM> > Eduardo Menendez
NM> > University of Chile
NM> >
NM> > _______________________________________________
NM> > Pw_forum mailing list
NM> > Pw_forum at pwscf.org
NM> > http://www.democritos.it/mailman/listinfo/pw_forum

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
