Regarding [1], I did expect that you would have to submit the commands within a job script via the SLURM workload manager on your system, with something like [5,6]

     sbatch my_job_script.job


     or by whatever method you have to use on your system, where the commands at [7] are placed in the job file, for example:


     my_job_script.job

     -------------------------------------

     #!/bin/bash

     # ...

     run_lapw -p
     x qtl -p -telnes
     x telnes3

     -------------------------------------
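
     For reference, a fuller SLURM version of such a job file might look like the sketch below. This is only an assumed example: the #SBATCH values (job name, node/core counts, wall time) and the SCRATCH path are guesses that you would have to adapt from your cluster's documentation [5,6].

     my_slurm_job.job

     -------------------------------------

     #!/bin/bash
     # Assumed SLURM directives; adjust to your cluster
     #SBATCH --job-name=wien2k_telnes
     #SBATCH --nodes=1
     #SBATCH --ntasks-per-node=4
     #SBATCH --time=01:00:00

     # Assumed scratch location (the bash equivalent of the csh
     # "setenv SCRATCH /scratch/$USER" mentioned in your message)
     export SCRATCH=/scratch/$USER

     # Run from the directory the job was submitted from
     cd $SLURM_SUBMIT_DIR

     run_lapw -p
     x qtl -p -telnes
     x telnes3

     -------------------------------------

     On most SLURM clusters you would also have to generate a .machines file that matches the nodes SLURM actually assigned to the job (see [6] for an example of how that can be done).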


    In my case, I don't have SLURM, so I'm unable to do any testing in that environment.  Maybe someone else on the mailing list who has a SLURM system can check whether they encounter the same problem that you are having.


    [5] https://www.hpc2n.umu.se/documentation/batchsystem/basic-submit-example-scripts

    [6] https://doku.lrz.de/display/PUBLIC/WIEN2k

    [7] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20597.html


Regarding [2], it is good to read that mpi parallel with "x qtl -p -telnes" works fine on your system for Vanadium dioxide (VO2). If you have control over which nodes the calculation runs on, does the VO2 case run fine on your 1st node (e.g., x073 [8]) with multiple cores of a single CPU, and does it also run fine on the 2nd node (e.g., x082) with multiple cores of a single CPU?  I have read at [9] that some job schedulers assign the nodes automatically on the fly, so in some cases the user has no control over which nodes the job runs on.  If you are able to control it, does the VO2 case run fine with mpi parallel using 1 processor core on node 1 and 1 processor core on node 2?  That may help narrow down the problem; a guess at a .machines file for such a test is sketched after the links below.


    [8] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20617.html

    [9] http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html
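
    The following is only a guess at what a .machines file for that two-node test could look like, assuming the node names x073 and x082 from [8] and one core on each; the exact form depends on your allocation and on whether you want a single mpi job spanning both nodes:

    .machines

    -------------------------------------

    # one mpi job with 1 core on x073 and 1 core on x082
    1:x073:1 x082:1
    granularity:1
    extrafine:1

    -------------------------------------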


Regarding [3], the output you posted looks as expected, so there is nothing wrong there.


    In the past, I posted to the mailing list some things that I found helpful for troubleshooting parallel issues; you would have to search the mailing list to find them all, but I believe a couple of them may have been at the following two links:


  [10] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17973.html

  [11] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2018-April/027944.html


Lastly, I have now tried a WIEN2k 19.2 calculation using mpi parallel on my system with the struct file at https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20645.html .


It looks like it ran fine when it was set to run on two of the four processors on my system:


username@computername:~/wiendata/diamond$ ls ~/wiendata/scratch
username@computername:~/wiendata/diamond$ ls
diamond.struct
username@computername:~/wiendata/diamond$ init_lapw -b
...
username@computername:~/wiendata/diamond$ cat $WIENROOT/parallel_options
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
username@computername:~/wiendata/diamond$ cat .machines
1:localhost:2
granularity:1
extrafine:1
username@computername:~/wiendata/diamond$ run_lapw -p
...
in cycle 11    ETEST: .0001457550000000   CTEST: .0033029
hup: Command not found.
STOP  LAPW0 END
STOP  LAPW1 END

real    0m6.744s
user    0m12.679s
sys    0m0.511s
STOP LAPW2 - FERMI; weights written
STOP  LAPW2 END

real    0m1.123s
user    0m1.785s
sys    0m0.190s
STOP  SUMPARA END
STOP  CORE  END
STOP  MIXER END
ec cc and fc_conv 1 1 1

>   stop
username@computername:~/wiendata/diamond$ cp $WIENROOT/SRC_templates/case.innes diamond.innes
username@computername:~/wiendata/diamond$ x qtl -p -telnes
running QTL in parallel mode
calculating QTL's from parallel vectors
STOP  QTL END
6.5u 0.0s 0:06.77 98.3% 0+0k 928+8080io 4pf+0w
username@computername:~/wiendata/diamond$ cat diamond.inq
0 2.20000000000000000000
1
1 99 1 0
4 0 1 2 3
username@computername:~/wiendata/diamond$ x telnes3
STOP TELNES3 DONE
3.2u 0.0s 0:03.39 98.8% 0+0k 984+96io 3pf+0w
username@computername:~/wiendata/diamond$ ls -l ~/wiendata/scratch
total 624
-rw-rw-r-- 1 username username      0 Oct 24 15:40 diamond.vector
-rw-rw-r-- 1 username username 637094 Oct 24 15:43 diamond.vector_1
-rw-rw-r-- 1 username username      0 Oct 24 15:44 diamond.vectordn
-rw-rw-r-- 1 username username      0 Oct 24 15:44 diamond.vectordn_1


On 10/24/2020 2:30 PM, Christian Søndergaard Pedersen wrote:

Hello Gavin


Thanks for your reply, and apologies for my tardiness.


[1] All my calculations are run in MPI-parallel on our HPC cluster. I cannot execute any 'x lapw[0,1,2] -p' command in the terminal (on the cluster login node); this results in 'pbsssh: command not found'. However, submitting via the SLURM workload manager works fine. In all my submit scripts, I specify 'setenv SCRATCH /scratch/$USER', which is the proper location of scratch storage on our HPC cluster.


[2] Without having tried your example for diamond, I can report that 'run_lapw -p' followed by 'x qtl -p -telnes' works without problems for a single cell of Vanadium dioxide. However, for other systems I get the error I specified. The other systems (1) are larger, and (2) use two CPUs instead of a single CPU (the .machines file is modified accordingly).

Checking the qtl.def file for the calculation that _did_ work, I can see that the line specifying '/scratch/chrsop/VO2.vectordn' is _also_ present here, so this is not to blame. This leaves me baffled as to what the error can be - as far as I can tell, I am trying to perform the exact same calculation for different systems. I thought maybe insufficient scratch storage could be to blame, but this would most likely show up in the 'run_lapw' cycles (I believe).


[3] I am posting here the difference between qtlpara and lapw2para:

$ grep "single" $WIENROOT/qtlpara_lapw
testinput .processes single
$ grep "single" $WIENROOT/lapw2para_lapw
testinput .processes single
single:
echo "running in single mode"

... if this is wrong, I kindly request advice on how to fix it, so I can pass it on to our software maintenance guy. If there's anything else I can try please let me know.

Best regards
Christian