09.09.2020 00:01, Peter Blaha wrote:
alias   testerror       'if (! -z \!:1.error) goto error'
With this alias you can catch a problem.
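The alias above uses the csh file-inquiry operator `-z` (true for an empty file), so `! -z` fires on a non-empty error file. As a minimal sketch, the same non-empty-file test in plain POSIX sh uses `test -s` (the file names below are stand-ins, not real WIEN2k output):

```shell
#!/bin/sh
# Report a failed step when its WIEN2k error file is non-empty.
# [ -s FILE ] is true when FILE exists and has size > 0,
# i.e. the sh equivalent of csh's (! -z FILE).
check_error() {
    if [ -s "$1" ]; then
        echo "non-empty error file: $1" >&2
        return 1
    fi
    return 0
}

# demo with temporary stand-in files
tmp=$(mktemp -d)
: > "$tmp/lapw0.error"                      # empty -> step looks OK
echo "Error in LAPW1" > "$tmp/lapw1.error"  # non-empty -> caught

check_error "$tmp/lapw0.error" && echo "lapw0: OK"
check_error "$tmp/lapw1.error" 2>/dev/null || echo "lapw1: failed"
rm -rf "$tmp"
```

In a csh job script one would instead call the `testerror` alias after each `x lapw1` step, as run_lapw itself does.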

Am 08.09.2020 um 20:38 schrieb Yundi Quan:
The simplest way I can think of is to check whether the lapw1.error file is empty after executing x lapw1.

On Tue, Sep 8, 2020 at 2:23 PM Rubel, Oleg <[email protected] <mailto:[email protected]>> wrote:
    I wonder if there is a _simple_ alternative way of sensing an
    error? Also, the message is not always "XXXXX - Error". It can be

Just now I tried a calculation on a supercomputer with a random structure for testing. I have already worked through some problems, but I still sometimes hit errors, and there are no non-zero error files. I am attaching three files: 1. slurm*out, where the errors are shown. The first error, before lapw0, apparently had no effect (I do not know why): lapw0 ran and all its output files are good. lapw1 was not calculated.

2. *.dayfile. I can see that lapw1 was not calculated only from the too-small timings:
tesla46(6) 0.006u 0.010s 0.75 2.11%      0+0k 0+0io 0pf+0w
(the next lines are my additional output inserted into lapw1para:
1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def)

3. ls-l.output shows that all the *.error files are of zero size, and the files that should have been produced by lapw1 are absent.

It does not matter here why the task was not calculated; the question is why the lapw1*.error files are empty. I submitted it for testing with run -e lapw1; otherwise it would have continued to lapw2 without stopping.
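Since the *.error files stayed empty here even though lapw1 never ran, a more robust check is to also require that the step's expected output exists and is non-empty. A sketch in plain sh (the file names are taken from the listing below for a case named Gold_23l; this is an illustration, not the WIEN2k scripts' own logic):

```shell
#!/bin/sh
# A step is treated as successful only if
#   (a) its .error file is empty, AND
#   (b) the output it should have written exists and is non-empty.
step_ok() {
    err=$1
    out=$2
    if [ -s "$err" ]; then
        echo "step failed: $err is non-empty" >&2
        return 1
    fi
    if [ ! -s "$out" ]; then
        echo "step failed: expected output $out is missing or empty" >&2
        return 1
    fi
    return 0
}

# demo with stand-in files reproducing this situation
tmp=$(mktemp -d)
: > "$tmp/lapw1_1.error"   # empty, exactly as in the ls -l listing
# Gold_23l.scf1_1 is NOT created, so the check still catches the failure:
step_ok "$tmp/lapw1_1.error" "$tmp/Gold_23l.scf1_1" 2>/dev/null \
    || echo "lapw1_1: not completed"
rm -rf "$tmp"
```

Checking the output file catches exactly this failure mode: srun/execve died before lapw1 ever started, so nothing was written to the error file at all.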

Best regards
Lyudmila Dobysheva
------------------
http://ftiudm.ru/content/view/25/103/lang,english/
Physics-Techn.Institute,
Udmurt Federal Research Center, Ural Br. of Rus.Ac.Sci.
426000 Izhevsk Kirov str. 132
Russia
---
Tel. +7 (34I2)43-24-59 (office), +7 (9I2)OI9-795O (home)
Skype: lyuka18 (office), lyuka17 (home)
E-mail: [email protected] (office), [email protected] (home)

DIRECTORY = /misc/home4/u3104/work/orgFeZn/Gold_23l
WIENROOT = /misc/home4/u3104/BIN/WIEN2k-19
SCRATCH = ./
Got 16 cores
nodelist tesla46
tasks_per_node 16
slurmstepd: error: _is_a_lwp: open() /proc/408167/status failed: No such file 
or directory
jobs_per_node 4 because OMP_NUM_THREADS = 4
4 nodes for this job: tesla46 tesla46 tesla46 tesla46
 LAPW0 END
[1]    Done                          srun -K -N1 -n4 -r0 
/misc/home4/u3104/BIN/WIEN2k-19/lapw0_mpi lapw0.def >> .time00
slurmstepd: error: execve(): 0: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 2: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 1: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 3: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
[4]  - Done                          ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f 
.stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% 
.temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr 
<STDIN>" )
[3]  + Done                          ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f 
.stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% 
.temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr 
<STDIN>" )
[2]  + Done                          ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f 
.stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% 
.temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr 
<STDIN>" )
[1]  + Done                          ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f 
.stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% 
.temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr 
<STDIN>" )
Gold_23l.scf1_1: No such file or directory.

>   stop
Calculating Gold_23l in /misc/home4/u3104/work/orgFeZn/Gold_23l
on tesla46 with PID 408380
using WIEN2k_19.1 (Release 25/6/2019) in /misc/home4/u3104/BIN/WIEN2k-19


    start       (Tue Sep  8 18:57:18 +05 2020) with lapw0 (2/99 to go)

    cycle 1     (Tue Sep  8 18:57:18 +05 2020)  (2/99 to go)

>   lapw0   -p  (18:57:18) starting parallel lapw0 at Tue Sep  8 18:57:18 +05 
> 2020
-------- .machine0 : 4 processors
0.056u 0.082s 0:04.65 2.7%      0+0k 16+112io 0pf+0w
>   lapw1  -p           (18:57:23) starting parallel lapw1 at Tue Sep  8 
> 18:57:23 +05 2020
->  starting parallel LAPW1 jobs at Tue Sep  8 18:57:23 +05 2020
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def
1 t taskset0 exe def_loop.def time srun 1 lapw1 lapw1_2.def
1 t taskset0 exe def_loop.def time srun 2 lapw1 lapw1_3.def
1 t taskset0 exe def_loop.def time srun 3 lapw1 lapw1_4.def
     tesla46(6) 0.006u 0.010s 0.75 2.11%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.007u 0.009s 0.75 2.11%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.011u 0.005s 0.75 2.12%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.008u 0.007s 0.68 2.21%      0+0k 0+0io 0pf+0w
   Summary of lapw1para:
   tesla46       k=21    user=0.032      wallclock=184.35
0.268u 0.569s 0:03.29 24.9%     0+0k 6408+1120io 4pf+0w

>   stop
total 3088
-rw-r--r-- 1 u3104 users       0 Sep  9 16:24 aaa
-rw-r--r-- 1 u3104 users    1312 Sep  8 18:53 Gold_23l.dayfile
-rw-r--r-- 1 u3104 users     380 Sep  8 18:53 Gold_23l.klist_1
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_2
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_3
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_4
-rw-r--r-- 1 u3104 users    1220 Sep  8 18:53 Gold_23l.klist.tmp.u3104.408228
-rw-r--r-- 1 u3104 users     140 Sep  8 18:53 Gold_23l.mbjmix
-rw-r--r-- 1 u3104 users   76952 Sep  8 18:53 Gold_23l.output0000
-rw-r--r-- 1 u3104 users   49181 Sep  8 18:53 Gold_23l.output0001
-rw-r--r-- 1 u3104 users   49181 Sep  8 18:53 Gold_23l.output0002
-rw-r--r-- 1 u3104 users   46632 Sep  8 18:53 Gold_23l.output0003
-rw-r--r-- 1 u3104 users     280 Sep  8 18:53 Gold_23l.scf
-rw-r--r-- 1 u3104 users   17089 Sep  8 18:53 Gold_23l.scf0
-rw-r--r-- 1 u3104 users 2505132 Sep  8 18:53 Gold_23l.vns
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 Gold_23l.vnsdn
-rw-r--r-- 1 u3104 users  188433 Sep  8 18:53 Gold_23l.vsp
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 Gold_23l.vspdn
-rw-r--r-- 1 u3104 users      41 Sep  8 18:53 head.diff.u3104.408228
-rw-r--r-- 1 u3104 users    1313 Sep  8 18:53 lapw0.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw0.error
-rw-r--r-- 1 u3104 users     613 Sep  8 18:53 lapw1_1.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_1.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_2.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_2.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_3.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_3.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_4.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_4.error
-rw-r--r-- 1 u3104 users     601 Sep  8 18:53 lapw1.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1.error
-rw-r--r-- 1 u3104 users     199 Sep  8 18:53 :log
-rw-r--r-- 1 u3104 users     652 Sep  8 18:53 :parallel
-rw-r--r-- 1 u3104 users     162 Sep  8 18:53 :parallel_lapw0
-rw-r--r-- 1 u3104 users    2494 Sep  8 18:53 slurm-10784772.out
-rw-r--r-- 1 u3104 users     128 Sep  8 18:53 slurm.hosts
-rwxrwxr-x 1 u3104 users    3365 Sep  8 18:53 slurm.job
_______________________________________________
Wien mailing list
[email protected]
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/[email protected]/index.html