Dear Prof. Peter Blaha,

Thank you for your reply.

> In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
> meaningful.


The computing node really does have 16 cores (two AMD Opteron(tm) Processor
6136 CPUs) and 32 GB of memory. So the 36 k-points are divided among 16 cores:
3 k-points for each of 4 cores and 2 k-points for each of the other 12 cores.
As you suggest, if I use only 12 cores, lapw1 might take less time.
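
The arithmetic above can be checked with a small sketch. It assumes that the k-points are spread over the jobs as evenly as possible and that every k-point costs about the same time; the function names are made up for this illustration and are not part of WIEN2k.

```python
def split_kpoints(n_kpoints, n_jobs):
    """Spread n_kpoints over n_jobs as evenly as possible (assumed policy)."""
    base, extra = divmod(n_kpoints, n_jobs)
    # The first `extra` jobs receive one additional k-point.
    return [base + 1 if i < extra else base for i in range(n_jobs)]


def rounds_needed(n_kpoints, n_jobs):
    """K-points handled one after another by the busiest job."""
    return max(split_kpoints(n_kpoints, n_jobs))


def even_job_counts(n_kpoints, max_jobs):
    """Job counts up to max_jobs that divide n_kpoints without remainder."""
    return [n for n in range(1, max_jobs + 1) if n_kpoints % n == 0]


if __name__ == "__main__":
    for n_jobs in (16, 12):
        print(n_jobs, split_kpoints(36, n_jobs), rounds_needed(36, n_jobs))
    print(even_job_counts(36, 16))
```

Under these assumptions the busiest job handles 3 k-points with either 16 or 12 jobs, so the extra 4 cores would add load and memory pressure without shortening the run.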

> ii) try to use a (local) $SCRATCH directory, which reduces the NFS load.
>    But this works only if your k-list and .machines file are "compatible"
>    as mentioned above.


Actually, the administrator just moved my /home directory to a local disk
on the login node. Before this change, with a network disk array, the heavy
I/O never happened. I guess this may be the reason for the crash.

Any comments will be appreciated.

Best,


On Fri, Feb 3, 2012 at 9:53 PM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:

>
>
> Clearly you should write your job script such that it divides the 36
> k-points in a "meaningful" way.
> In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
> meaningful.
>
> Furthermore, it seems that your cluster has problems with heavy I/O (NFS),
> and this is most likely the reason for the observed high load and the crash.
> Thus I would
> i) not use too many cores. Does one node of your cluster really have 16
>    cores, or is this just due to "multithreading" and in fact it has only 8?
>    Do you have enough memory per node?
> ii) try to use a (local) $SCRATCH directory, which reduces the NFS load.
>    But this works only if your k-list and .machines file are "compatible"
>    as mentioned above.
>
> It also seems to be a rather big calculation (lapw1 took nearly 2h), thus
> you may either need MPI or you should not use all cores on one node of your
> cluster because of memory restrictions.
>
>
> On 03.02.2012 13:56, Bin Shao wrote:
>
>> Dear all,
>>
>> I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing
>> system. The job is submitted in k-point parallel mode, and the total of 36
>> k-points is divided among 16 CPUs.
>> However, some errors occur in lapw2, and the dnlapw2_18/19/20.error files
>> are not empty. At the same time, the job in the PBS system seems dead and
>> cannot be killed by the PBS command. The administrator checked the
>> computing node, and the top command shows that the node is under a very
>> heavy load, above 40. Further, ps aux shows that there are 16 lapw2
>> processes, but they are not running, or rather are suspended. The job
>> caused a heavy load and triggered the self-protection mechanism of the OS,
>> which automatically suspends any running process, including ssh logins,
>> except for the root account.
>>
>> Any comments will be appreciated, and thanks in advance.
>>
>> The following are the error files and case.dayfile.
>> --------------------dnlapw2_18/19/20.error------------------
>> Error in LAPW2
>> ------------------------------------------------------------------------
>>
>> ---------------------case.output2dn_19------------------------
>> ...
>>        KVEC(     73563) =   -19   -5    9    9.1046    1
>>        KVEC(     73564) =   -19   24   -9    9.1046    1
>>        KVEC(     73565) =   -19   24    9    9.1046    1
>>        KVEC(     73566) =    19  -24   -9    9.1046    1
>>        KVEC(     73567) =    19  -24    9    9.1046    1
>>        KVEC(     73568) =    19    5   -9    9.1046    1
>>        KVEC(     73569) =    19    5    9    9.1046    1
>>        KVE
>> ------------------------------------------------------------------------
>>
>> --------------------case.dayfile-----------------------------------
>> ...
>> [14]   Done                          ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [9]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [4]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t
>> $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
>> .lock_$lockfile[$p] ) >& .stdout2_$loop;
>> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop >
>> .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop
>> | perl -e "print stderr <STDIN>" )
>> [4] 18809
>> -----------------------------------------------------------------------------
>>
>> -----------------------------:log--------------------------------------------
>> ...
>> Thu Feb  2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
>> Thu Feb  2 19:46:53 CST 2012> (x) lapw2 -c -up -p
>> Thu Feb  2 19:51:36 CST 2012> (x) sumpara -up -d
>> Thu Feb  2 19:52:07 CST 2012> (x) lapw2 -c -dn -p
>> --------------------------------------------------------------------------------
>>
>> (If more information is needed, I will provide.)
>>
>> Best,
>>
>> --
>> Bin Shao, Ph.D. Candidate
>> College of Information Technical Science, Nankai University
>> 94 Weijin Rd. Nankai Dist. Tianjin 300071, China
>> Email: bshao at mail.nankai.edu.cn
>>
>>
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>
> --
>
>                                      P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>



-- 
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn