Dear Eduardo,
our own experiences are summarized here:

http://quasiamore.mit.edu/pmwiki/index.php?n=Main.CP90Timings

It would be great if you could contribute your own data, either for pw.x
or cp.x, under the conditions you describe.

Indeed, I had informally noticed a few of the things you mention:

1) no improvement with the Intel fftw2 wrapper, as opposed to the fftw2
sources shipped with Q-E, when using MPI. I also never managed to run
successfully with the Intel fftw3 wrapper (or with fftw3 itself; that
probably says something about me). A sketch of the wrapper build is
appended below.

2) great improvement of a serial code (different from Q-E) when using the
automatic parallelism of MKL on quad-cores.

3) by the way, MPICH has always been the slowest MPI implementation for
us, compared with LAM/MPI or OpenMPI.

I actually wonder whether the best solution on a quad-core would be, say,
to use two cores for MPI and the other two for the OpenMP threads; a
sketch of what I mean follows.
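To be concrete, and with the caveat that I have not tested this myself: a
hybrid run of that kind would look something like the lines below, assuming
the MPI launcher passes the environment on to all processes ("input" and
"output" are placeholder file names):

  # 2 MPI processes x 2 OpenMP/MKL threads each = 4 cores
  export OMP_NUM_THREADS=2
  export MKL_NUM_THREADS=2
  mpiexec -n 2 pw.x < input > output

Whether the two levels of parallelism cooperate or fight over the cores
will depend on how the launcher places the processes, so treat this only
as a starting point for experiments.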
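And for the wrapper build mentioned in point 1: with MKL 10.x the wrapper
sources ship under $MKLROOT/interfaces/, and the build should be roughly
as follows; the exact make target depends on the architecture and MKL
version (libem64t is my guess for a 64-bit machine), so check the makefile
in that directory:

  cd $MKLROOT/interfaces/fftw2xf
  make libem64t     # should produce libfftw2xf_intel.a

and analogously under fftw3xf for libfftw3xf_intel.a.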
I eagerly await Axel's opinion.

			nicola

Eduardo Ariel Menendez Proupin wrote:
> Hi,
> I have noticed recently that I am able to obtain faster binaries of pw.x
> using the OpenMP parallelism implemented in the Intel MKL libraries of
> version 10.x than using MPICH on Intel CPUs. Previously I had always
> gotten better performance using MPI. I would like to know of other
> experiences on how to make the machines faster. Let me explain in more
> detail.
>
> Compiling with MPI means using mpif90 as linker and compiler, linking
> against mkl_ia32 or mkl_em64t, and using the link flags -i-static
> -openmp. This is just what appears in make.sys after running configure
> in version 4cvs.
>
> At runtime, I set
>   export OMP_NUM_THREADS=1
>   export MKL_NUM_THREADS=1
> and run using
>   mpiexec -n $NCPU pw.x < input > output
> where NCPU is the number of cores available in the system.
>
> The second choice is
>   ./configure --disable-parallel
> and, at runtime,
>   export OMP_NUM_THREADS=$NCPU
>   export MKL_NUM_THREADS=$NCPU
> and run using
>   pw.x < input > output
>
> I have tested this on quad-cores (NCPU=4) and on an old dual Xeon from
> B.C. (before cores) (NCPU=2).
>
> Before April 2007, the first choice had always worked faster. Since I
> started using MKL 10.x, the second choice works faster. I have found no
> significant difference between versions 3.2.3 and 4cvs.
>
> A special comment concerns the FFT library. MKL has a wrapper to FFTW
> that must be compiled after installation (it is very easy). This creates
> additional libraries with names like libfftw3xf_intel.a and
> libfftw2xf_intel.a, and using them improves performance in the second
> choice, especially with libfftw3xf_intel.a.
>
> Using MPI, libfftw2xf_intel.a is as fast as the FFTW source distributed
> with espresso, i.e., there is no gain in using libfftw2xf_intel.a. With
> libfftw3xf_intel.a and MPI, I have never been able to run pw.x
> successfully; it just aborts.
>
> I would like to hear of your experiences.
>
> Best regards,
> Eduardo Menendez
> University of Chile

--
---------------------------------------------------------------------
Prof Nicola Marzari
Department of Materials Science and Engineering
13-5066 MIT, 77 Massachusetts Avenue, Cambridge MA 02139-4307 USA
tel 617.4522758   fax 2586534
marzari at mit.edu   http://quasiamore.mit.edu