To run a single program for testing, do:
x lapw0 -p
(after creation of .machines).
Then check all error files, but in particular also the SLURM output
(whatever it is called on your machines); it probably contains messages
like "library xxxx not found" or similar, which are needed for further
debugging.
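For instance, from the case directory (a minimal sketch, assuming the
default SLURM output naming slurm-<jobid>.out):

cat *.error      # all WIEN2k error files of this case
cat slurm-*.out  # scheduler output; look for "library ... not found" etc.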
AND:
We still don't know how many cores your nodes have.
We still don't know your compiler options (WIEN2k_OPTIONS,
parallel_options), nor whether the compilation of e.g. lapw0_mpi worked
at all (see compile.msg in SRC_lapw0).
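Something like the following sketch would gather that information
(assuming $WIENROOT points to your WIEN2k installation, and that the
first command is run inside a SLURM allocation on a compute node):

srun -N1 nproc                                 # cores per node
cat $WIENROOT/WIEN2k_OPTIONS                   # compiler/linker options
cat $WIENROOT/parallel_options                 # how the scripts invoke mpirun
grep -i error $WIENROOT/SRC_lapw0/compile.msg  # did lapw0_mpi build cleanly?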
On 12.10.2020 at 22:17, Christian Søndergaard Pedersen wrote:
Dear everybody,
I am following up on this thread to report two separate errors in my
attempts to properly parallelize a calculation. In the first case, the
calculation utilized 0.00% of the available CPU resources. My .machines
file looks like this:
#
dstart:g004:8 g010:8 g011:8 g040:8
lapw0:g004:8 g010:8 g011:8 g040:8
1:g004:16
1:g010:16
1:g011:16
1:g040:16
My submit script calls the following commands:
srun hostname -s > slurm.hosts
run_lapw -p
x qtl -p -telnes
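As an aside, a .machines file like the one above can be generated from
slurm.hosts inside the job script. The following bash fragment is only
an illustrative sketch, with the 8-core MPI and 16-core k-point split
hard-coded to mirror my file:

hosts=$(sort -u slurm.hosts)   # unique node names collected by srun
{
  echo '#'
  echo "dstart:$(for h in $hosts; do printf '%s:8 ' "$h"; done)"
  echo "lapw0:$(for h in $hosts; do printf '%s:8 ' "$h"; done)"
  for h in $hosts; do echo "1:$h:16"; done
} > .machines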
Of course, the job didn't reach x qtl. The resultant case.dayfile is
short, so I am dumping all of it here:
Calculating test-machines in /path/to/directory
on node.host.name.dtu.dk with PID XXXXX
using WIEN2k_19.1 (Release 25/6/2019) in
/path/to/installation/directory/WIEN2k/19.1-intel-2019a
start (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
cycle 1 (Mon Oct 12 19:04:06 CEST 2020) (40/99 to go)
lapw0 -p (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST
2020
-------- .machine0 : 32 processors
[1] 16095
The .machine0 file contains the lines
g004 [repeated for 8 lines]
g010 [repeated for 8 lines]
g011 [repeated for 8 lines]
g040 [repeated for 8 lines]
which tells me that the .machines file works as intended, and that the
cause of the problem lies somewhere else. This brings me to the second
error, which occurred when I tried calling mpirun explicitly, like so:
srun hostname -s > slurm.hosts
mpirun run_lapw -p
mpirun qtl -p -telnes
from within the job script. This crashed the job right away. The
lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR
FILES!" a number of times. The case.clmsum file is present and looks
correct, and the .machines file looks like the one from before (with
different node numbers). However, the .machine0 file now looks like:
g094
g094
g094
g081
g081
g08g094
g094
g094
g094
g094
[...]
That is, there is an error on line 6, where one node name is mangled
and a line break is missing. The dayfile prints "> stop error" a total
of sixteen times. I don't know whether the above .machine0 file is the
culprit, but that seems the obvious conclusion. Any help in this matter
will be much appreciated.
Best regards
Christian
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html