To run a single program for testing, do:
x lapw0 -p
(after creation of .machines).
Then check all error files, but in particular also the SLURM output
(whatever it is called on your machines); it probably contains messages
like "library xxxx not found" or similar, which are needed for further
debugging.
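For instance, from the case directory (a minimal sketch, assuming the
default SLURM output naming slurm-<jobid>.out):

cat *.error      # all WIEN2k error files of this case
cat slurm-*.out  # scheduler output; look for "library ... not found" etc.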
AND:
We still don't know how many cores your nodes have.
We still don't know your compiler options (WIEN2k_OPTIONS,
parallel_options), nor whether the compilation of e.g. lapw0_mpi worked
at all (see compile.msg in SRC_lapw0).
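Something like the following sketch would gather that information
(assuming $WIENROOT points to your WIEN2k installation, and that the
first command is run inside a SLURM allocation on a compute node):

srun -N1 nproc                                 # cores per node
cat $WIENROOT/WIEN2k_OPTIONS                   # compiler/linker options
cat $WIENROOT/parallel_options                 # how the scripts invoke mpirun
grep -i error $WIENROOT/SRC_lapw0/compile.msg  # did lapw0_mpi build cleanly?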
On 12.10.2020 at 22:17, Christian Søndergaard Pedersen wrote:
Dear everybody,
I am following up on this thread to report two separate errors in my
attempts to properly parallelize a calculation. In the first case, the
calculation utilized 0.00% of the available CPU resources. My .machines
file looks like this:
#
dstart:g004:8 g010:8 g011:8 g040:8
lapw0:g004:8 g010:8 g011:8 g040:8
1:g004:16
1:g010:16
1:g011:16
1:g040:16
My submit script calls the following commands:
srun hostname -s > slurm.hosts
run_lapw -p
x qtl -p -telnes
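As an aside, a .machines file like the one above can be generated from
slurm.hosts inside the job script. The following bash fragment is only
an illustrative sketch, with the 8-core MPI and 16-core k-point split
hard-coded to mirror my file:

hosts=$(sort -u slurm.hosts)   # unique node names collected by srun
{
  echo '#'
  echo "dstart:$(for h in $hosts; do printf '%s:8 ' "$h"; done)"
  echo "lapw0:$(for h in $hosts; do printf '%s:8 ' "$h"; done)"
  for h in $hosts; do echo "1:$h:16"; done
} > .machines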
Of course, the job didn't reach x qtl. The resultant case.dayfile is
short, so I am dumping all of it here:
Calculating test-machines in /path/to/directory
on node.host.name.dtu.dk with PID XXXXX
using WIEN2k_19.1 (Release 25/6/2019) in
/path/to/installation/directory/WIEN2k/19.1-intel-2019a
start (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
cycle 1 (Mon Oct 12 19:04:06 CEST 2020) (40/99 to go)
lapw0 -p (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST
2020
-------- .machine0 : 32 processors
[1] 16095
The .machine0 file contains the lines
g004 [repeated for 8 lines]
g010 [repeated for 8 lines]
g011 [repeated for 8 lines]
g040 [repeated for 8 lines]
which tells me that the .machines file works as intended, and that the
cause of the problem lies somewhere else. This brings me to the second
error, which occurred when I tried calling mpirun explicitly, like so:
srun hostname -s > slurm.hosts
mpirun run_lapw -p
mpirun qtl -p -telnes
from within the job script. This crashed the job right away. The
lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR
FILES!" a number of times. The case.clmsum file is present and looks
correct, and the .machines file looks like the one from before (with
different node numbers). However, the .machine0 file now looks like:
g094
g094
g094
g081
g081
g08g094
g094
g094
g094
g094
[...]
That is, there is an error on line 6, where one node name is mangled
and a line break is missing. The dayfile prints "> stop error" a total
of sixteen times. I don't know whether the above .machine0 file is the
culprit, but that seems the obvious conclusion. Any help in this matter
will be much appreciated.
Best regards
Christian
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html