[Wien] How to reduce the number of energy bands being calculated
Dear Wien community,

I want to render the Fermi surface of a system with a few hundred atoms. Is there any way to force the calculation to find only the bands near the Fermi level? I have done some tests using copper: when I increased Emin in case.in1 from the default (-9.0) to -2.0, there was an error in finding the Fermi energy.

This was the default case.in1 file:

WFFIL  EF=.478187426925   (WFFIL, WFPRI, ENFIL, SUPWF)
  7.00       10    4   (R-MT*K-MAX; MAX L IN WF, V-NMT)
  0.30    4  0         (GLOBAL E-PARAMETER WITH n OTHER CHOICES, global APW/LAPW)
 1   0.30      0.000 CONT 1
 1  -5.36      0.001 STOP 1
 2   0.30      0.005 CONT 1
 0   0.30      0.000 CONT 1
K-VECTORS FROM UNIT:4   -9.0       0.519   emin / de (emax=Ef+de) / nband

Best,
Fermin
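(For reference, the energy window that lapw1 diagonalizes is controlled by the last line of case.in1; the edit described above changes only emin on that line, everything else in the file untouched:

K-VECTORS FROM UNIT:4   -2.0       0.519   emin / de (emax=Ef+de) / nband

)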
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
Thanks for the reply. Please see below.

As I asked before, did you give us all the error information in the case.dayfile and from standard output? It is not entirely clear in your previous posts, but it looks to me that you might have only provided information from the case.dayfile and the error files (cat *.error), but maybe not from the standard output. Are you still using the PBS script in your old post at http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ? In the script, I can see that the standard output is set to be written to a file called wien2k_output.

Sorry for the confusion. Yes, I still use the PBS script in the above link. The posts before are from the standard output (wien2k_output). When using 2 nodes with 32 cores for one k point, the standard output gives:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
cp: cannot stat `.in.tmp': No such file or directory

stop error

And the .dayfile reads:

on z1-17 with PID 29439
using WIEN2k_14.2 (Release 15/10/2014)

    start (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)

    cycle 1 (Thu Apr 30 17:36:59 2015) (40/99 to go)

>   lapw0 -p    (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59 2015
.machine0 : 32 processors
904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
>   lapw1 -p -c (17:38:01) starting parallel lapw1 at Thu Apr 30 17:38:01 2015
->  starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
     z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u 1680.003s 8:12:29.52 1595.1% 0+0k 204560+31265944io 366pf+0w
   Summary of lapw1para:
   z1-17         k=0     user=0          wallclock=0
469788.683u 1726.356s 8:12:31.33 1595.5% 0+0k 206128+31266512io 379pf+0w
>   lapw2 -p -c (01:50:32) running LAPW2 in parallel mode
      z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
   Summary of lapw2para:
   z1-17         user=0.034      wallclock=95.16
**  LAPW2 crashed!
4.645u 0.458s 1:42.01 4.9% 0+0k 74792+45008io 133pf+0w
error: command   /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def   failed

>   stop error

When it runs fine on a single node, does it always use the same node (say z1-17) or does it run fine on other nodes (like z1-18)?

Not really. The nodes were assigned randomly.
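(Since the thread keeps returning to where the standard output ends up, here is a minimal sketch of the relevant part of such a csh PBS script; the redirection target wien2k_output is from the linked post, everything else here is an assumption:

#!/bin/csh -f
#PBS -l nodes=2:ppn=16
cd $PBS_O_WORKDIR
# capture both stdout and stderr of the whole scf cycle (csh syntax)
run_lapw -p >& wien2k_output

)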
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
Thanks for all the information and suggestions.

I have tried to change -lmkl_blacs_intelmpi_lp64 to -lmkl_blacs_lp64 and recompile. However, I got the following error message in the screen output:

LAPW0 END
[cli_14]:
[cli_15]:
[cli_6]:
aborting job:
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
aborting job:
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
...
[z0-5:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 20. MPI process died?
[z0-5:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z0-5:mpispawn_0][child_handler] MPI process (rank: 14, pid: 11260) exited with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 3, pid: 11249) exited with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 6, pid: 11252) exited with status 1
...

Previously I compiled the program with -lmkl_blacs_intelmpi_lp64, and the mpi parallelization on a single node seems to be working. I notice that during the run the *.error files have finite sizes, but when I re-examined them after the job finished there were no errors written inside (and the files are 0 kB now). Does this indicate that the mpi is not running properly at all, even on a single node? But I have checked the output, and it is in agreement with the non-mpi results (for some simple cases).

I also tried changing mpirun to mpiexec as suggested by Prof. Marks, by setting

setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_

in parallel_options. In this case, the program does not run and also does not terminate (qstat on the cluster just gives 00:00:00 for the time, with a running status).

Regards,
Fermin
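(A side note, an observation of mine rather than something raised in the thread: parallel_options is sourced by csh, and a multi-word value for setenv has to be quoted, otherwise csh treats the extra words as an error:

setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_"

)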
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
I have checked that case.vsp/vns are up-to-date, so I guess lapw0_mpi runs properly. I compiled the source code with ifort; please find the linking options below:

current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -Dmkl_scalapack -traceback
current:FFTW_OPT:-DFFTW3 -I/usr/local/include
current:FFTW_LIBS:-lfftw3_mpi -lfftw3 -L/usr/local/lib
current:LDFLAGS:$(FOPT) -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)
current:MPIRUN:/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_
current:MKL_TARGET_ARCH:intel64

Is it ok to use -lmkl_blacs_intelmpi_lp64? Thanks a lot for all the suggestions.

Regards,
Fermin

-Original Message-
From: wien-boun...@zeus.theochem.tuwien.ac.at [mailto:wien-boun...@zeus.theochem.tuwien.ac.at] On Behalf Of Peter Blaha
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Error in mpi+k point parallelization across multiple nodes

It seems as if lapw0_mpi runs properly?? Please check if you have NEW (check the date with ls -als)!! valid case.vsp/vns files, which can be used in e.g. a sequential lapw1 step. This suggests that mpi and fftw are ok.

The problems seem to start in lapw1_mpi, and this program requires, in addition to mpi, also scalapack. I guess you compile with ifort and link with the mkl?? There is one crucial blacs library, which must be adapted to your mpi, since they are specific to a particular mpi (intelmpi, openmpi, ...): Which blacs library do you link? -lmkl_blacs_lp64 or another one?? Check out the docu for the mkl.

Am 04.05.2015 um 05:18 schrieb lung Fermin:
[...]
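(For what it's worth, my understanding rather than anything confirmed in this thread: MVAPICH2 is MPICH-derived, so among the MKL BLACS variants the Intel-MPI/MPICH-compatible layer is normally the right match, i.e. the setting that was already in use:

current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)

-lmkl_blacs_lp64 targets the old MPICH1 ABI, which would fit the "Invalid communicator" failures that appeared only after switching to it.)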
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for distributing the mpi job. However, the problem still persists... but the error message looks different this time:

$ cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
cp: cannot stat `.in.tmp': No such file or directory

stop error

Try setting setenv MPI_REMOTE 0 in parallel options.

Am 29.04.2015 um 09:44 schrieb lung Fermin:
[...]
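(A quick way to separate the MPI installation from WIEN2k, a sketch assuming a hostfile named hosts that lists both nodes; if even this fails across nodes, the problem lies with mvapich2 or the interconnect, not with WIEN2k:

/usr/local/mvapich2-icc/bin/mpirun -np 32 -hostfile hosts hostname

This should print 32 node names, 16 per node.)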
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
Thanks for your comment, Prof. Marks.

Each node on the cluster has 32GB memory, and each core (16 in total) on the node is limited to 2GB of memory usage. For the current system, I used RKMAX=6, and the smallest RMT=2.25. I have tested the calculation with a single k point and mpi on 16 cores within a node. The matrix size from

$ cat *.nmat_only

is 29138. Does this mean that the number of matrix elements is 29138 or (29138)^2? In general, how shall I estimate the memory required for a calculation?

I have also checked the memory usage with top on the node. Each core has used up ~5% of the memory, which adds up to ~5%*16 = ~80% on the node. Perhaps the problem is really caused by running out of memory. I am now queuing on the cluster to test the case of mpi over 32 cores (2 nodes). Thanks.

Regards,
Fermin

As an addendum, the calculation may be too big for a single node. How much memory does the node have, what is the RKMAX, the smallest RMT, the unit cell size? Maybe use in your machines file

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using

x lapw1 -c -p -nmat_only
cat *.nmat_only

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu / MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody else has thought" - Albert Szent-Gyorgyi

On Apr 28, 2015 10:45 PM, Laurence Marks <l-ma...@northwestern.edu> wrote:

Unfortunately it is hard to know what is going on. A google search on "Error while reading PMI socket." indicates that the message you have means it did not work, and is not specific. Some suggestions:

a) Try mpiexec (slightly different arguments). You just edit parallel_options. https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot for your system? https://wiki.calculquebec.ca/w/MVAPICH2/en
d) Talk to a sys_admin, particularly the one who set up mvapich.
e) Do cat *.error; maybe something else went wrong, or it is not mpi's fault but a user error.

On Apr 28, 2015 10:17 PM, lung Fermin <ferminl...@gmail.com> wrote:
[...]
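(Regarding the N versus N^2 question above, a back-of-the-envelope sketch, my arithmetic rather than anything from the thread: case.nmat_only reports the matrix dimension N, and lapw1c sets up dense double-complex H and S matrices with N^2 elements of 16 bytes each:

echo '29138^2 * 16 / 1024^3' | bc -l    # ~12.7 GiB per matrix, ~25 GiB for H plus S

~25 GiB on a 32GB node is consistent with the ~80% usage reported by top, so the out-of-memory suspicion looks plausible; spread over 32 cores on two nodes it drops to well under the 2GB-per-core limit.)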
Re: [Wien] Error in mpi+k point parallelization across multiple nodes
Thanks for Prof. Marks' comment.

1. In the previous email I missed copying the line

setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

It was in parallel_options. Sorry about that.

2. I have checked that the running program was lapw1c_mpi. Besides, when the mpi calculation was done on a single node for some other system, the results were consistent with the literature. So I believe that the mpi code has been set up and compiled properly.

Could there be something wrong with my options in siteconfig? Do I have to set some command to bind the job? Any other possible cause of the error? Any suggestions or comments would be appreciated. Thanks.

Regards,
Fermin

You appear to be missing the line setenv WIEN_MPIRUN=... This is set up when you run siteconfig, and provides the information on how mpi is run on your system. N.B., did you set up and compile the mpi code?

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu / MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody else has thought" - Albert Szent-Gyorgyi

On Apr 28, 2015 4:22 AM, lung Fermin <ferminl...@gmail.com> wrote:
[...]
[Wien] Error in mpi+k point parallelization across multiple nodes
Dear Wien2k community,

I am trying to perform a calculation on a system of ~100 inequivalent atoms using mpi+k point parallelization on a cluster. Everything goes fine when the program runs on a single node. However, if I perform the calculation across different nodes, the following error occurs. How can I solve this problem? I am a newbie to mpi programming; any help would be appreciated. Thanks.

The error message (MVAPICH2 2.0a):
---
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
number of processors: 32
LAPW0 END
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13 aborted: Error while reading a PMI socket (4)
[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546) terminated with signal 9 - abort job
[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454) terminated with signal 9 - abort job
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2 aborted: MPI process error (1)
[cli_15]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15

stop error
---

The .machines file:

#
1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
granularity:1
extrafine:1

The parallel_options:

setenv TASKSET no
setenv USE_REMOTE 0
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1

Thanks.
Regards,
Fermin
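(For comparison, the compact host:count syntax that Prof. Marks suggests later in this thread; note it is not equivalent to the file above, since it turns the two 16-process k-parallel jobs into a single mpi job spanning both nodes:

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

)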
Re: [Wien] Job distribution problem in MPI+k point parallelization
Thanks for all the help and comments. I tried Oleg's suggestion and it works. I will go on to compare the performance of different parallelization settings on my system.

Fermin

-Original Message-
From: wien-boun...@zeus.theochem.tuwien.ac.at [mailto:wien-boun...@zeus.theochem.tuwien.ac.at] On Behalf Of Peter Blaha
Sent: Wednesday, January 28, 2015 2:45 PM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Job distribution problem in MPI+k point parallelization

Now it is rather clear why you had 8 mpi jobs running previously. The new definition of WIEN_MPIRUN and also your pbs script seem ok, and the jobs are now distributed as expected.

I do not know why you get only 50% in this test. Maybe because the test is not suitable and requires so much communication that the cpu cannot run at full speed.

As I said before, a setup with two mpi jobs and 2 k-parallel jobs on a 4 core machine is a useless setup. Parallelization is not a task which works in an arbitrary way, but needs to be adapted to the hardware AND the physical problem. Your task is now to compare timings and find out the optimal setup for the specific problem and the available hardware. Run the same job with the .machines file

1:host:4

or

1:host
1:host
1:host
1:host

or (with OMP_NUM_THREADS set to 2 in the job script)

setenv OMP_NUM_THREADS 2
1:host
1:host

and check which run is the fastest.

-Original Message-
From: wien-boun...@zeus.theochem.tuwien.ac.at [mailto:wien-boun...@zeus.theochem.tuwien.ac.at] On Behalf Of Oleg Rubel
Sent: Wednesday, January 28, 2015 12:42 PM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Job distribution problem in MPI+k point parallelization

It might be unrelated, but worth a try. I had a similar problem once with MVAPICH2. It was solved by setting this environment variable in the submission script:

setenv MV2_ENABLE_AFFINITY 0

You can also check which core each process is bound to using the taskset command. The same command also allows you to change the affinity on the fly.

I hope this will help
Oleg
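(To illustrate Oleg's taskset tip, a minimal sketch; the PID here is hypothetical:

taskset -cp 12345          # show which cores the process with PID 12345 may run on
taskset -cp 0-3 12345      # re-pin it to cores 0-3 on the fly

With mvapich2's default affinity, ranks of independent jobs launched on the same node can end up pinned to the same core, which would show up here as identical single-core affinity lists; that is the situation MV2_ENABLE_AFFINITY 0 avoids.)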
[Wien] Error in compiling mpi-parallel version
Dear all,

Recently I have been trying to do a calculation with a supercell of about 100 atoms. Previously I tried to do it with k-point parallelization, but it failed due to insufficient virtual memory. So instead I am moving to mpi parallelization. I tried to compile the lapw0 program first as a test. The following are the settings in the Makefile:

.SUFFIXES:        .F
.SUFFIXES:        .F90
SHELL     = /bin/sh
FC        = ifort
MPF       = mpif90
CC        = cc
FOPT      = -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -r8
FPOPT     = -ffree-form -O2 -m64 -ffree-line-length-none -DFFTW3 -I/usr/local/fftw-3.3.3/mpi -I/usr/local/include
DParallel = '-DParallel'
FGEN      = $(PARALLEL)
LDFLAGS   = $(FOPT) -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
R_LIBS    = -L/usr/local/lib -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
RP_LIBS   = -lfftw3_mpi -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 $(R_LIBS)
---
The following error is reported:

touch .parallel
make PARALLEL='-DParallel' ./lapw0_mpi \
FORT=mpif90 FFLAGS=' -ffree-form -O2 -m64 -ffree-line-length-none -DFFTW3 -I/usr/local/fftw-3.3.3/mpi -I/usr/local/include '-DParallel''
make[1]: Entering directory `/home/stretch/WIEN2k/SRC_lapw0'
cc -c cputim.c
mpif90 -ffree-form -O2 -m64 -ffree-line-length-none -DFFTW3 -I/usr/local/fftw-3.3.3/mpi -I/usr/local/include -DParallel -c modules.F
mpif90 -ffree-form -O2 -m64 -ffree-line-length-none -DFFTW3 -I/usr/local/fftw-3.3.3/mpi -I/usr/local/include -DParallel -c fft_modules.F
fftw3-mpi.f03:33.14:
    Included at fft_modules.F:69:
      integer(), value :: comm
              1
Error: Expected initialization expression at (1)
fftw3-mpi.f03:47.14:
    Included at fft_modules.F:69:
      integer(), value :: comm
              1
Error: Expected initialization expression at (1)
fftw3-mpi.f03:57.14:
    Included at fft_modules.F:69:
      integer(), value :: comm
              1
-

What is the cause of these errors? Any suggestions for solving the issue would be appreciated.

P.S. Some details about the system:
* 45-node cluster formed by DELL R720/R620 servers (16 cores per node)
* OS: Rocks 6.1 (CentOS)
* MPI: mpif90 for MVAPICH2 2.0a
* FFTW ver 3.3.3
* MKL ver 11.1.0.080
* Scalapack ver 2.0.2
* WIEN2k ver 14.1

and the settings in the parallel_options file are:

setenv TASKSET no
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1

Thanks and Regards,
Fermin
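(The thread ends here, so a guess rather than a confirmed diagnosis: FPOPT contains gfortran-style flags (-ffree-form, -ffree-line-length-none) while FC is ifort, and the "file:line.column:" error format above is gfortran's, so the mpif90 wrapper is probably invoking gfortran rather than ifort. In addition, that declaration should carry a kind parameter in a correctly generated fftw3-mpi.f03; the empty integer() suggests the installed FFTW header was generated without proper MPI Fortran support. MPICH-derived wrappers such as MVAPICH2's print their backend compiler with -show:

# print the underlying compiler and flags the wrapper would use
/usr/local/mvapich2-icc/bin/mpif90 -show

If that shows gfortran, lapw0_mpi is being built with a different compiler stack than the ifort/MKL one used for the rest of WIEN2k.)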