Hi Coiby,

Using -ntg and -ndiag is not the same as using a threaded library. Those options are useful when your calculation involves over 1k processors without image parallelization (-ni), and they need to be tested carefully with pw.x before performing any large production calculations. I don't think you need them.
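(For reference, on a run large enough to need them, those flags would sit on the pw.x command line roughly as below; the processor counts are made up purely for illustration, not a recommendation for your system:)

    # Illustration only: -ntg sets the number of FFT task groups and -ndiag
    # the size of the processor group used for parallel (ScaLAPACK)
    # diagonalization (ideally a square number).
    mpirun -np 1536 pw.x -nk 8 -ntg 4 -ndiag 64 -inp your_pw.input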
What I meant is using the threaded math library MKL. You need to build the QE suite with OpenMP (check make.sys to see if the threaded MKL is linked). I recommend using at least QE version 5.3.

If you have 32 nodes with 24 cores each, set

    export OMP_NUM_THREADS=4   # each MPI rank has 4 threads

and make sure each node gets only 6 MPI ranks. Then still run the following:

    mpirun -np 192 ph.x -ni 4 -nk 3 -inp your_ph.input

You don't even need to redo your pw calculation if only the threads and the total number of nodes are increased correspondingly.

For phonon calculations, I also often hit disk quota issues. ph.x eats a lot of disk even when reduce_io=.true. is set. If you have more than 3 k-points, try to maximize your -nk option and reduce the corresponding -ni. A larger -nk uses more disk, but fewer images (-ni) use less, so on balance you should need less disk without compromising parallel efficiency. If you still run into disk issues, use fewer images and more threads.

Ciao,
Ye

===================
Ye Luo, Ph.D.
Leadership Computing Facility
Argonne National Laboratory

2016-05-05 8:42 GMT-05:00 Coiby Xu <[email protected]>:

> Dear Dr. Luo,
>
> Thank you for your detailed reply!
>
> I'm sorry I had disabled mail delivery before, so I didn't receive the email
> until I checked the mailing list archive.
>
> I've successfully run the phonon calculation without using wf_collect=.true.,
> following your advice. This helps reduce the size of outdir from 142G to 48G.
>
> For threaded MKL and FFT, I tested one case (-nimage 48 -npool 3 -ntg 2
> -ndiag 4). To my surprise, it's marginally slower than the calculation
> without -ntg 2 -ndiag 4. In PHonon/examples/Image_example, I didn't find
> any useful info.
>
>> PH_IMAGE_COMMAND="$PARA_IMAGE_PREFIX $BIN_DIR/ph.x $PARA_IMAGE_POSTFIX"
>
> In the file environment_variables, no info about ntg and ndiag is given:
>
>> PARA_POSTFIX=" -nk 1 -nd 1 -nb 1 -nt 1 "
>> PARA_IMAGE_POSTFIX="-ni 2 $PARA_POSTFIX"
>> PARA_IMAGE_PREFIX="mpirun -np 4"
>
> I also checked the job log for the failed calculation ("Not diagonalizing
> because representation xx is not done"). Maybe ph.x crashes due to an I/O
> problem (the size of outdir was 142G).
>
>> forrtl: No such file or directory
>> forrtl: No such file or directory
>> forrtl: severe (28): CLOSE error, unit 20, file "Unknown"
>> Image       PC                Routine            Line     Source
>> ph.x        000000000088A00F  Unknown            Unknown  Unknown
>> ph.x        0000000000517B26  buffers_mp_close_  620      buffers.f90
>> ph.x        00000000004B85E8  close_phq_         39       close_phq.f90
>> ph.x        00000000004B7888  clean_pw_ph_       41       clean_pw_ph.f90
>> ph.x        000000000042E5EF  do_phonon_         126      do_phonon.f90
>> ph.x        000000000042A554  MAIN__             78       phonon.f90
>> ph.x        000000000042A4B6  Unknown            Unknown  Unknown
>> libc.so.6   0000003921A1ED1D  Unknown            Unknown  Unknown
>> ph.x        000000000042A3A9  Unknown            Unknown  Unknown
>> forrtl: severe (28): CLOSE error, unit 20, file "Unknown"
>
> Btw, I'm from the School of Earth and Space Science of USTC.
>
> On Wed, May 4, 2016 at 07:41:30 CEST, Ye Luo <[email protected]> wrote:
>
>> Hi Coiby,
>>
>> "it seems to be one requirement to let ph.x and pw.x have the same number
>> of processors."
>> This is not true.
>>
>> If you are using image parallelization in your phonon calculation, you need
>> to maintain the same number of processes per image as in your pw calculation.
>> In this way, wf_collect=.true. is not needed.
>>
>> Here is an example. I assume you use k-point parallelization (-nk).
>> 1. mpirun -np 48 pw.x -nk 12 -inp your_pw.input
>> 2. mpirun -np 192 ph.x -ni 4 -nk 12 -inp your_ph.input
>>    In this step, you might notice "Not diagonalizing because representation
>>    xx is not done", which is normal. The code should not abort because of this.
>> 3. After calculating all the representations belonging to a given q or q-mesh,
>>    just add "recover = .true." to your_ph.input and run
>>    mpirun -np 48 ph.x -nk 12 -inp your_ph.input
>>    The dynamical matrix will be computed for that q.
>>
>> If you are confident with threaded pw.x, ph.x also benefits from threaded
>> MKL and FFT, and the time to solution is further reduced.
>>
>> For more details, you can look into PHonon/examples/Image_example.
>>
>> P.S.
>> Your affiliation is missing.
>>
>> ===================
>> Ye Luo, Ph.D.
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> On Wed, May 4, 2016 at 11:33 AM, Coiby Xu <[email protected]> wrote:
>>
>>> Dear Quantum Espresso Developers and Users,
>>>
>>> I'm running a phonon calculation parallelized over the representations/q
>>> vectors. On my cluster, there are 24 cores per node. I want to use as many
>>> nodes as possible to speed up the calculation.
>>>
>>> I set the number of images to the number of nodes:
>>>
>>>> mpirun -np NUMBER_OF_NODESx24 ph.x -nimage NUMBER_OF_NODES
>>>
>>> If I only use 4 nodes (4 images) or 8 nodes (8 images), the calculation
>>> finishes successfully. However, when more than 8 nodes, say 16 or 32 nodes,
>>> are used, every run of the calculation gives this error:
>>>
>>>> Not diagonalizing because representation xx is not done
>>>
>>> Btw, I want to reduce I/O overhead by discarding the `wf_collect` option,
>>> but the following doesn't work (the number of processors and pools for the
>>> scf calculation is the same as in the phonon calculation):
>>>
>>>> mpirun -np NUMBER_OF_NODESx24 pw.x
>>>
>>> ph.x complains:
>>>
>>>> Error in routine phq_readin (1): pw.x run with a different number of
>>>> processors. Use wf_collect=.true.
>>>
>>> The beginning of the pw.x output:
>>>
>>>> Parallel version (MPI), running on 96 processors
>>>> R & G space division: proc/nbgrp/npool/nimage = 96
>>>> Waiting for input...
>>>> Reading input from standard input
>>>
>>> and the beginning of the ph.x output:
>>>
>>>> Parallel version (MPI), running on 96 processors
>>>> path-images division: nimage = 4
>>>> R & G space division: proc/nbgrp/npool/nimage = 24
>>>
>>> Am I missing something? I know it's inefficient to let pw.x use so many
>>> processors, but it seems to be a requirement that ph.x and pw.x use the
>>> same number of processors.
>>>
>>> Thank you!
>>>
>>> --
>>> Best regards,
>>> Coiby
>
> --
> Best regards,
> Coiby
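(Putting the advice in this thread together, a minimal job-script sketch, assuming the 32-node / 24-core-per-node machine, an OpenMP build of QE with threaded MKL, and a placeholder your_ph.input; the flag that pins 6 MPI ranks per node is launcher-specific, e.g. -ppn 6 with Intel MPI or --map-by ppr:6:node with Open MPI, so adjust it for your scheduler:)

    export OMP_NUM_THREADS=4   # 4 threads per rank; 6 ranks x 4 threads = 24 cores per node

    # Step 1: compute the representations, spread over 4 images
    #   32 nodes x 6 ranks/node = 192 ranks -> 48 ranks per image, 16 per pool
    mpirun -np 192 ph.x -ni 4 -nk 3 -inp your_ph.input

    # Step 2: once all representations of a q-point are done, set
    #   recover = .true. in your_ph.input and rerun with one image's worth
    #   of ranks to assemble the dynamical matrix
    mpirun -np 48 ph.x -nk 3 -inp your_ph.input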
