Dear Professor Blaha,
Thanks a lot for your responses. I have performed some additional testing, which was delayed because I cannot run lapw0/1/2 from the command line due to memory issues; hence I have had to go through the queue for each test. On top of that, I have been unable to get information about our installation. However, I finally achieved ~99% CPU efficiency with the following setup:

CPUs: 2 nodes with 24 cores each (x073 and x082)

.machines:

dstart:x073:24 x082:24
lapw0:x073:24 x082:24
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3
# 16 lines total; 8 for each node
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3

After creating the .machines file, I call 'mpirun run_lapw -p'. The above .machines file is basically a combination of the two examples found on page 86 of the User's Guide (without using OMP, of course). From checking the case.klist_1-16 files, I have verified that each individual job works on a different subset of the k-points. Can anyone confirm whether this setup is correct, i.e. whether it is a proper way to parallelize the lapw1/lapw2 cycles, assuming the compilation of lapw0/1/2_mpi proceeded without errors (which seems to be the case)?

Best regards
Christian

________________________________
From: Wien <[email protected]> on behalf of Peter Blaha <[email protected]>
Sent: 13 October 2020 07:43:16
To: [email protected]
Subject: Re: [Wien] .machines for several nodes

To run a single program for testing, do:

x lapw0 -p     (after creation of .machines)

Then check all error files, but in particular also the slurm output (or whatever it is called on your machines). It probably gives some messages like "library xxxx not found" or so, which is needed for additional debugging.

AND: We still don't know how many cores your nodes have. We still don't know your compiler options (WIEN2k_OPTIONS, parallel_options), nor whether the compilation of e.g. lapw0_mpi worked at all (compile.msg in SRC_lapw0).
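For anyone setting up something similar: the .machines file above can also be generated from the slurm.hosts file instead of being typed by hand. The following is only a sketch under the assumptions of that particular setup (24 cores per node, 3 cores per k-point job, hence 8 '1:' lines per node); the two-line host list written at the top is a stand-in for the real output of 'srun hostname -s > slurm.hosts'.

```shell
#!/bin/sh
# Sketch only: build a WIEN2k .machines file from a SLURM host list.
# Assumptions (adjust to your cluster): 24 cores per node, 3 cores per
# k-point job, hence 8 '1:' lines per node.
CORES_PER_NODE=24
CORES_PER_JOB=3
JOBS_PER_NODE=$((CORES_PER_NODE / CORES_PER_JOB))

# Stand-in host list so the sketch runs on its own; in a real job this
# comes from:  srun hostname -s > slurm.hosts
printf 'x073\nx082\n' > slurm.hosts

NODES=$(sort -u slurm.hosts)   # unique node names

{
  # dstart and lapw0 run MPI-parallel over all cores of every node
  printf 'dstart:'
  for h in $NODES; do printf '%s:%d ' "$h" "$CORES_PER_NODE"; done
  printf '\nlapw0:'
  for h in $NODES; do printf '%s:%d ' "$h" "$CORES_PER_NODE"; done
  printf '\n'
  # one '1:' line per k-point job; here grouped per node rather than
  # interleaved, which should not matter
  for h in $NODES; do
    i=1
    while [ "$i" -le "$JOBS_PER_NODE" ]; do
      printf '1:%s:%d\n' "$h" "$CORES_PER_JOB"
      i=$((i + 1))
    done
  done
} > .machines

cat .machines
```

The resulting file has the same structure as the hand-written one above (two MPI lines plus 16 k-point lines), but adapts automatically if the scheduler assigns different nodes.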
Am 12.10.2020 um 22:17 schrieb Christian Søndergaard Pedersen:
> Dear everybody
>
> I am following up on this thread to report on two separate errors in my
> attempts to properly parallelize a calculation. For the first, a
> calculation utilized 0.00% of available CPU resources. My .machines file
> looks like this:
>
> #
> dstart:g004:8 g010:8 g011:8 g040:8
> lapw0:g004:8 g010:8 g011:8 g040:8
> 1:g004:16
> 1:g010:16
> 1:g011:16
> 1:g040:16
>
> With my submit script calling the following commands:
>
> srun hostname -s > slurm.hosts
> run_lapw -p
> x qtl -p -telnes
>
> Of course, the job didn't reach x qtl. The resultant case.dayfile is
> short, so I am dumping all of it here:
>
> Calculating test-machines in /path/to/directory
> on node.host.name.dtu.dk with PID XXXXX
> using WIEN2k_19.1 (Release 25/6/2019) in
> /path/to/installation/directory/WIEN2k/19.1-intel-2019a
>
> start (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
>
> cycle 1 (Mon Oct 12 19:04:06 CEST 2020) (40/99 to go)
>
>> lapw0 -p (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST 2020
> -------- .machine0 : 32 processors
> [1] 16095
>
> The .machine0 file displays the lines
>
> g004 [repeated for 8 lines]
> g010 [repeated for 8 lines]
> g011 [repeated for 8 lines]
> g040 [repeated for 8 lines]
>
> which tells me that the .machines file works as intended, and that the
> cause of the problem is located somewhere else. This brings me to the
> second error, which occurred when I tried calling mpirun explicitly, like so:
>
> srun hostname -s > slurm.hosts
> mpirun run_lapw -p
> mpirun qtl -p -telnes
>
> from within the job script. This crashed the job right away. The
> lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR
> FILES!" a number of times. The case.clmsum file is present and looks
> correct, and the .machines file looks like the one from before (with
> different node numbers).
> However, the .machine0 file now looks like:
>
> g094
> g094
> g094
> g081
> g081
> g08g094
> g094
> g094
> g094
> g094
> [...]
>
> I.e. there's an error on line 6, where a node is not properly named and
> a line break is missing. The dayfile repeatedly prints "> stop error" a
> total of sixteen times. I don't know whether the above .machine0 file is
> the culprit, but it seems the obvious conclusion. Any help in this
> matter will be much appreciated.
>
> Best regards
> Christian
>
> _______________________________________________
> Wien mailing list
> [email protected]
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/[email protected]/index.html

--
--------------------------------------------------------------------------
Peter BLAHA, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300  FAX: +43-1-58801-165982
Email: [email protected]  WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------

