Regarding [1], I did expect that you would have to submit the commands within a job script via the SLURM workload manager on your system, with something like [5,6]

     sbatch my_job_script.job


     or by whatever method you have to use on your system, where the commands at [7] are placed in the job file, for example:


     my_job_script.job

     -------------------------------------

     #!/bin/bash

     # ...

     run_lapw -p
     x qtl -p -telnes
     x telnes3

     -------------------------------------
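
     For reference, a fuller SLURM version of such a job file might look like the sketch below. This is only an assumed example: the #SBATCH values (job name, node/core counts, wall time) and the SCRATCH path are guesses that you would have to adapt from your cluster's documentation [5,6].

     my_slurm_job.job

     -------------------------------------

     #!/bin/bash
     # Assumed SLURM directives; adjust to your cluster
     #SBATCH --job-name=wien2k_telnes
     #SBATCH --nodes=1
     #SBATCH --ntasks-per-node=4
     #SBATCH --time=01:00:00

     # Assumed scratch location (the bash equivalent of the csh
     # "setenv SCRATCH /scratch/$USER" mentioned in your message)
     export SCRATCH=/scratch/$USER

     # Run from the directory the job was submitted from
     cd $SLURM_SUBMIT_DIR

     run_lapw -p
     x qtl -p -telnes
     x telnes3

     -------------------------------------

     On most SLURM clusters you would also have to generate a .machines file that matches the nodes SLURM actually assigned to the job (see [6] for an example of how that can be done).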


    In my case, I don't have SLURM, so I'm unable to do any testing in that environment.  Maybe someone else on the mailing list who has a SLURM system can check whether they encounter the same problem that you are having.


    [5] https://www.hpc2n.umu.se/documentation/batchsystem/basic-submit-example-scripts

    [6] https://doku.lrz.de/display/PUBLIC/WIEN2k

    [7] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20597.html


Regarding [2], it is good to read that mpi parallel with "x qtl -p -telnes" works fine on your system for Vanadium dioxide (VO2). If you have control over which nodes the calculation runs on, does the VO2 case run fine on your 1st node (e.g., x073 [8]) with multiple cores of a single CPU, and does it also run fine on the 2nd node (e.g., x082) with multiple cores of a single CPU?  I have read at [9] that some job schedulers assign the nodes automatically on the fly, so in some cases the user has no control over which nodes the job runs on.  If you are able to control it, does the VO2 case run fine with mpi parallel using 1 processor core on node 1 and 1 processor core on node 2?  That may help narrow down the problem; a guess at a .machines file for such a test is sketched after the links below.


    [8] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20617.html

    [9] http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html
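
    The following is only a guess at what a .machines file for that two-node test could look like, assuming the node names x073 and x082 from [8] and one core on each; the exact form depends on your allocation and on whether you want a single mpi job spanning both nodes:

    .machines

    -------------------------------------

    # one mpi job with 1 core on x073 and 1 core on x082
    1:x073:1 x082:1
    granularity:1
    extrafine:1

    -------------------------------------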


Regarding [3], the output you posted looks as expected, so there is nothing wrong there.


    In the past, I posted to the mailing list some things that I found helpful for troubleshooting parallel issues; you would have to search the mailing list to find them all, but I believe a couple of them may have been at the following two links:


  [10] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17973.html

  [11] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2018-April/027944.html


Lastly, I have now tried a WIEN2k 19.2 calculation using mpi parallel on my system with the struct file at https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20645.html .


It looks like it ran fine when it was set to run on two of the four processors on my system:


username@computername:~/wiendata/diamond$ ls ~/wiendata/scratch
username@computername:~/wiendata/diamond$ ls
diamond.struct
username@computername:~/wiendata/diamond$ init_lapw -b
...
username@computername:~/wiendata/diamond$ cat $WIENROOT/parallel_options
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
username@computername:~/wiendata/diamond$ cat .machines
1:localhost:2
granularity:1
extrafine:1
username@computername:~/wiendata/diamond$ run_lapw -p
...
in cycle 11    ETEST: .0001457550000000   CTEST: .0033029
hup: Command not found.
STOP  LAPW0 END
STOP  LAPW1 END

real    0m6.744s
user    0m12.679s
sys    0m0.511s
STOP LAPW2 - FERMI; weights written
STOP  LAPW2 END

real    0m1.123s
user    0m1.785s
sys    0m0.190s
STOP  SUMPARA END
STOP  CORE  END
STOP  MIXER END
ec cc and fc_conv 1 1 1

>   stop
username@computername:~/wiendata/diamond$ cp $WIENROOT/SRC_templates/case.innes diamond.innes
username@computername:~/wiendata/diamond$ x qtl -p -telnes
running QTL in parallel mode
calculating QTL's from parallel vectors
STOP  QTL END
6.5u 0.0s 0:06.77 98.3% 0+0k 928+8080io 4pf+0w
username@computername:~/wiendata/diamond$ cat diamond.inq
0 2.20000000000000000000
1
1 99 1 0
4 0 1 2 3
username@computername:~/wiendata/diamond$ x telnes3
STOP TELNES3 DONE
3.2u 0.0s 0:03.39 98.8% 0+0k 984+96io 3pf+0w
username@computername:~/wiendata/diamond$ ls -l ~/wiendata/scratch
total 624
-rw-rw-r-- 1 username username      0 Oct 24 15:40 diamond.vector
-rw-rw-r-- 1 username username 637094 Oct 24 15:43 diamond.vector_1
-rw-rw-r-- 1 username username      0 Oct 24 15:44 diamond.vectordn
-rw-rw-r-- 1 username username      0 Oct 24 15:44 diamond.vectordn_1


On 10/24/2020 2:30 PM, Christian Søndergaard Pedersen wrote:

Hello Gavin


Thanks for your reply, and apologies for my tardiness.


[1] All my calculations are run in MPI-parallel on our HPC cluster. I cannot execute any 'x lapw[0,1,2] -p' command in the terminal (on the cluster login node); this results in 'pbsssh: command not found'. However, submitting via the SLURM workload manager works fine. In all my submit scripts, I specify 'setenv SCRATCH /scratch/$USER', which is the proper location of scratch storage on our HPC cluster.


[2] Without having tried your example for diamond, I can report that 'run_lapw -p' followed by 'x qtl -p -telnes' works without problems for a single cell of Vanadium dioxide. However, for other systems I get the error I specified. The other systems (1) are larger, and (2) use two CPUs instead of a single CPU (the .machines file is modified accordingly).

Checking the qtl.def file for the calculation that _did_ work, I can see that the line specifying '/scratch/chrsop/VO2.vectordn' is _also_ present here, so this is not to blame. This leaves me baffled as to what the error can be - as far as I can tell, I am trying to perform the exact same calculation for different systems. I thought maybe insufficient scratch storage could be to blame, but this would most likely show up in the 'run_lapw' cycles (I believe).


[3] I am posting here the difference between qtlpara and lapw2para:

$ grep "single" $WIENROOT/qtlpara_lapw
testinput .processes single
$ grep "single" $WIENROOT/lapw2para_lapw
testinput .processes single
single:
echo "running in single mode"

... if this is wrong, I kindly request advice on how to fix it, so I can pass it on to our software maintenance guy. If there's anything else I can try please let me know.

Best regards
Christian