What you describe doesn't seem to happen any longer in the development version. There have been a few changes since then, and now all operations on a file are performed only by the processor that reads or writes it. Note, however, that there may still be problems with k-point parallelization. In short: I/O on non-parallel file systems is not guaranteed.
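
As a minimal sketch of the kind of &CONTROL block this advice points to (the prefix and paths below are placeholders, not taken from the inputs in this thread): keep outdir on a path that every node can see, so that a later nscf or restart run finds the collected prefix.save directory.

    &CONTROL
      calculation  = 'nscf'
      restart_mode = 'from_scratch'
      prefix       = 'mysystem'                 ! placeholder: same prefix as the scf run
      outdir       = '/home/shared/tmp_qe'      ! placeholder: any directory NFS-mounted on all nodes
      pseudo_dir   = '/home/shared/upf_files'   ! placeholder: shared pseudopotential directory
    /

The point is only that outdir (and the prefix.save directory inside it) must be the same path, visible from every node, in the scf run and in any run that reads its results.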
Paolo

On Thu, Feb 20, 2020 at 7:48 PM janardhan H.L. <[email protected]> wrote:

> Dear prof. Giannozzi
>
> I am writing to the same thread as it may be relevant here.
>
> I am using QE 6.5 on a 3-node Linux cluster. When the calculation is
> performed, everything runs normally; when saving the wavefunctions,
> something unusual happens:
> 1) the calculation exits without the final time stamps and the "JOB DONE" stamp;
> 2) this happened due to an MPI exit from one of the nodes, which cannot
> write to outdir;
> 3) an scf run only starts after copying the files to the slave nodes,
> without which it terminates saying the files cannot be read;
> 4) after pointing outdir to a common path (via NFS), these errors
> disappeared.
>
> 1) My question is: if recent versions of QE collect all the wavefunctions
> on the head node, why do the slave nodes report an MPI abort when they
> have no access to the head node?
> 2) Is there any way to restart calculations without copying the files to
> the slave nodes?
>
> Thanks and regards
> Janardhan
>
> On Thursday, 20 February, 2020, 11:15:53 pm IST, Paolo Giannozzi
> <[email protected]> wrote:
>
> It's a long story. By default, recent versions of QE collect both the
> wavefunctions and the charge density into a single array on a single
> processor that writes them to file. Even if you do not have a parallel
> file system, your data is no longer spread over scratch directories that
> are not visible to the other processors. This means that in principle it
> is possible to restart, with several potential caveats:
> - there is no guarantee that a batch queuing system will distribute
> processes across processors in the same way as in the previous run;
> - pseudopotential files are in principle read from the data file, so they
> may still be a source of problems;
> - if you parallelize on k-points, with Nk pools, one process per pool
> writes wavefunctions, which will thus end up on Nk different processors.
>
> Paolo
>
> On Thu, Feb 20, 2020 at 4:54 PM alberto <[email protected]> wrote:
>
> Hi,
> I'm using QE for some single-point simulations; in particular I'm running
> scf/nscf calculations.
>
> In my input block:
>
>     calculation = 'nscf' ,
>     restart_mode = 'from_scratch' ,
>     outdir = './tmp_qe' ,
>     pseudo_dir = '/home/alberto/QUANTUM_ESPRESSO/BASIS/upf_files/' ,
>     prefix = 'BIS-IMID-PbI4_SR' ,
>     verbosity = 'high' ,
>     etot_conv_thr = 1.0D-8 ,
>     forc_conv_thr = 1.0D-7 ,
>     wf_collect = .true.
>
> The outdir is located in /home/alberto/, and I notice that the
> writing/reading time is very long.
> I would like to use the /tmp dir of the node where the job is running.
> (My cluster has Xeon nodes with 20 CPUs each.)
>
> This is my PBS script:
>
> ## Script for parallel Quantum Espresso job by Alberto
> ## Run script with 3 arguments:
> ## $1 = Name of input file, without extension
> ## $2 = Number of processes to use (ncpus = nodes*20)
> ## $3 = Module to run
>
> if [ -z "$1" -o -z "$2" -o -z "$3" ]; then
>     echo "Usage: $0 <input_file> <np> <module>"
> fi
>
> if [ $2 -ge 8 ]; then
>     NODES=$(($2/20))
>     CPUS=20
> else
>     NODES=1
>     CPUS=$2
> fi
>
> cat <<EOF > $1.job
> #!/bin/bash
> #PBS -l nodes=xeon1:ppn=$CPUS:xeon20+xeon2:ppn=$CPUS:xeon20+xeon3:ppn=$CPUS:xeon20+xeon4:ppn=$CPUS:xeon20+xeon5:ppn=$CPUS:xeon20+xeon6:ppn=$CPUS:xeon20
> #PBS -l walltime=9999:00:00
> #PBS -N $1
> #PBS -e $1.err
> #PBS -o $1.sum
> #PBS -j oe
> job=$1 # Name of input file, no extension
> project=\$PBS_O_WORKDIR
> cd \$project
> cat \$PBS_NODEFILE > \$PBS_O_WORKDIR/nodes.txt
>
> export OMP_NUM_THREADS=$(($2/40))
> time /opt/openmpi-1.4.5/bin/mpirun -machinefile \$PBS_NODEFILE -np $2 /opt/qe-6.4.1/bin/$3 -ntg $(($2/60)) -npool $(($2/60)) < $1.inp > $1.out
> EOF
>
> qsub $1.job
>
> How could I use the /tmp directory and avoid the nscf calculation stopping
> because no files are found? The files are actually present, but they are
> spread across different nodes.
>
> regards
>
> Alberto
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
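
As for the quoted question about staging outdir on node-local /tmp: a minimal sketch of the alternative this thread points to, keeping outdir on the NFS-shared work directory so every MPI rank sees the same prefix.save, might look like the fragment below. Node counts, file names, and paths are placeholders, not taken from the cluster above.

    #!/bin/bash
    #PBS -l nodes=2:ppn=20
    #PBS -j oe
    cd $PBS_O_WORKDIR                 # NFS-shared directory, visible to all nodes
    export OMP_NUM_THREADS=1          # pure MPI run, no accidental thread oversubscription
    # outdir in the input file points inside this shared path, e.g. outdir = './tmp_qe',
    # so the scf and nscf runs find the same prefix.save directory
    mpirun -machinefile $PBS_NODEFILE -np 40 \
        pw.x -npool 2 -input scf.inp > scf.out

Node-local /tmp only works if every node holds its own copy of the restart files, which is exactly the copying step the thread is trying to avoid.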
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
