Dear Prof. Peter Blaha,

Thank you for your reply.

> In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
> meaningful.

The computing node really has 16 cores (two AMD Opteron(tm) Processor 6136 CPUs) and 32 GB of memory. With the 36 k-points divided over 16 cores, 4 cores get 3 k-points each and the other 12 cores get 2 k-points each. As you suggest, if I use only 12 cores, lapw1 might take less time.

> ii) try to use a (local) $SCRATCH directory, which reduces the NFS load.
> But this works only
> if your k-list and .machines file is "compatible" as mentioned above.

Actually, the administrator just moved my /home directory to a local disk on the login node. Before this, heavy I/O had never happened through a network disk array. I guess this may be the reason for the crash.

Any comments will be appreciated.

Best,

On Fri, Feb 3, 2012 at 9:53 PM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:

> Clearly you should write your job script such that it divides the 36
> k-points in a "meaningful" way.
> In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
> meaningful.
>
> Furthermore, it seems that your cluster has problems with heavy I/O (NFS)
> and this is most likely the reason for the observed high load and the crash.
> Thus I would
> i) not use too many cores. Does one node of your cluster really have 16
> cores, or is this just due to "multithreading" and in fact it has only 8?
> Do you have enough memory per node?
> ii) try to use a (local) $SCRATCH directory, which reduces the NFS load.
> But this works only if your k-list and .machines file is "compatible" as
> mentioned above.
>
> It also seems to be a somewhat bigger calculation (lapw1 took nearly 2h),
> thus you may either need MPI or you should not use all cores on one node of
> your cluster because of memory restrictions.
>
>
> On 03.02.2012 13:56, Bin Shao wrote:
>
>> Dear all,
>>
>> I am running wien2k 11.1 on a cluster with CentOS 6 under a PBS queuing
>> system. The job is submitted in k-point parallel mode and the total of 36
>> k-points is divided over 16 CPUs.
>> But some errors occur in lapw2, and the dnlapw2_18/19/20.error files
>> are not empty. At the same time, the job in the PBS system seems dead and
>> cannot be killed by the PBS command. The administrator checked the
>> computing node, and the command top shows that the node is experiencing a
>> very heavy load, above 40. Further, ps aux shows that there are 16 lapw2
>> processes, but they are not running, or rather suspended. The jobs caused
>> a heavy load and triggered the self-protection mechanism of the OS, which
>> automatically suspends any running process, including ssh logins, except
>> for the root account.
>>
>> Any comments will be appreciated, and thanks in advance.
>>
>> The following are the error files and case.dayfile.
>> --------------------dnlapw2_18/19/20.error------------------
>> Error in LAPW2
>> ------------------------------------------------------------------------
>>
>> ---------------------case.output2dn_19------------------------
>> ...
>> KVEC( 73563) = -19 -5 9 9.1046 1
>> KVEC( 73564) = -19 24 -9 9.1046 1
>> KVEC( 73565) = -19 24 9 9.1046 1
>> KVEC( 73566) = 19 -24 -9 9.1046 1
>> KVEC( 73567) = 19 -24 9 9.1046 1
>> KVEC( 73568) = 19 5 -9 9.1046 1
>> KVEC( 73569) = 19 5 9 9.1046 1
>> KVE
>> ------------------------------------------------------------------------
>>
>> --------------------case.dayfile-----------------------------------
>> ...
>> [14] Done ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [9] Done ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [4] Done ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [4] 18809
>> ------------------------------------------------------------------------
>>
>> -----------------------------:log-------------------------------------
>> ...
>> Thu Feb 2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
>> Thu Feb 2 19:46:53 CST 2012> (x) lapw2 -c -up -p
>> Thu Feb 2 19:51:36 CST 2012> (x) sumpara -up -d
>> Thu Feb 2 19:52:07 CST 2012> (x) lapw2 -c -dn -p
>> ------------------------------------------------------------------------
>>
>> (If more information is needed, I will provide it.)
>>
>> Best,
>>
>> --
>> Bin Shao, Ph.D. Candidate
>> College of Information Technical Science, Nankai University
>> 94 Weijin Rd. Nankai Dist. Tianjin 300071, China
>> Email: bshao at mail.nankai.edu.cn
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------

--
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn
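
P.S. The k-point arithmetic discussed in this thread can be checked with a short Python sketch. It only illustrates an even split of k-points over parallel jobs. It is not how the WIEN2k parallel scripts actually schedule k-points, and the function names kpoint_distribution and balanced_job_counts are hypothetical helpers made up for this illustration.

```python
def kpoint_distribution(n_kpoints, n_jobs):
    """Return how many k-points each of n_jobs parallel jobs receives
    when n_kpoints are dealt out as evenly as possible.
    (Illustrative assumption: a plain even split, not WIEN2k's scheduler.)"""
    base, extra = divmod(n_kpoints, n_jobs)
    # The first `extra` jobs get one k-point more than the rest.
    return [base + 1] * extra + [base] * (n_jobs - extra)


def balanced_job_counts(n_kpoints, max_jobs):
    """Job counts up to max_jobs that divide n_kpoints without remainder."""
    return [n for n in range(1, max_jobs + 1) if n_kpoints % n == 0]


if __name__ == "__main__":
    n_kpoints, n_cores = 36, 16
    for n_jobs in (16, 12):
        shares = kpoint_distribution(n_kpoints, n_jobs)
        # Slots left unused while the most heavily loaded job finishes.
        idle = n_jobs * max(shares) - n_kpoints
        print(n_jobs, "jobs: at most", max(shares),
              "k-points per job,", idle, "idle k-point slots")
    print("evenly dividing job counts:",
          balanced_job_counts(n_kpoints, n_cores))
```

Under this even-split assumption, 16 jobs give 4 jobs with 3 k-points and 12 jobs with 2, so the most heavily loaded job still handles 3 k-points, the same as with 12 jobs of 3 k-points each. The four extra jobs would therefore add memory and I/O load without shortening the longest job.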