Unfortunately, the *.error convention (the files are zero length when the task runs correctly) can easily break when remote execution (ssh, MPI) does not work: if the program never starts, nothing is ever written to its error file. I think the cases you sent contain enough information to debug; I suspect a problem with directory names and/or mounts.
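To illustrate the failure mode, here is a minimal sketch of a more defensive check. It is written in POSIX sh rather than the csh used by the WIEN2k scripts, and it is not part of WIEN2k: the function name run_checked and the file names are hypothetical. It assumes the launched program empties its own error file once it has finished cleanly, so a marker written beforehand survives only if the program never started or failed.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k).
# usage: run_checked <error-file> <expected-output> <command> [args...]
run_checked() {
    errfile=$1
    outfile=$2
    shift 2

    # Seed the error file before launching. Assumption: a program that
    # starts and finishes cleanly truncates this file itself.
    echo "Startup Error" > "$errfile"

    # Run the command (locally here; in practice this could be an
    # ssh/srun/mpirun line).
    "$@"
    status=$?

    # Treat the step as failed if the launcher returned non-zero, the
    # error file is still non-empty, or the expected output is missing.
    if [ "$status" -ne 0 ] || [ -s "$errfile" ] || [ ! -e "$outfile" ]; then
        echo "FAILED: $* (exit status $status)" >&2
        return 1
    fi
    return 0
}
```

Checking for the expected output file as well as the error file is deliberate: an empty error file alone cannot distinguish "ran correctly" from "never ran".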
Suggestion to Peter: perhaps add an "echo Startup Error > lapw1[0-2].error" in lapw1[0-2]para to catch this?

_____
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Gyorgi
www.numis.northwestern.edu

On Wed, Sep 9, 2020, 06:48 Lyudmila Dobysheva <lyuk...@mail.ru> wrote:

> 09.09.2020 00:01, Peter Blaha wrote:
> > alias testerror 'if (! -z \!:1.error) goto error'
> > you can catch a problem.
> >
> > On 08.09.2020 at 20:38, Yundi Quan wrote:
> >> The simplest way that I can think of is to check whether the
> >> lapw1.error file is empty or not after executing x lapw1.
> >
> >> On Tue, Sep 8, 2020 at 2:23 PM Rubel, Oleg <rub...@mcmaster.ca
> >> <mailto:rub...@mcmaster.ca>> wrote:
> >>     I wonder if there is a _simple_ alternative way for sensing an
> >>     error? Also message is not always "XXXXX - Error". It can be
>
> Just now I am trying to run a calculation on a supercomputer with a random
> structure for testing. I have already got past some problems, but sometimes I
> still meet errors, and there are no nonzero files. I am attaching three
> files:
> 1. slurm*out, where the errors are shown. The first one, before lapw0, had
> no effect (I do not know why); lapw0 was calculated and all its output files
> are good. lapw1 was not calculated.
>
> 2. *.dayfile. I can see that lapw1 was not calculated only from the too small
> times:
> tesla46(6) 0.006u 0.010s 0.75 2.11% 0+0k 0+0io 0pf+0w
> (the next lines are my additional output inserted into lapw1para:
> 1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def)
>
> 3. ls-l.output shows that all the *.error files are zero, and the files
> that should be produced by lapw1 are absent.
>
> It does not matter why the task was not calculated, but why are the
> lapw1*.error files zero?
> I submitted run -e lapw1 for testing; otherwise it would have gone on to lapw2
> without stopping.
>
> Best regards
> Lyudmila Dobysheva
> ------------------
> http://ftiudm.ru/content/view/25/103/lang,english/
> Physics-Techn.Institute,
> Udmurt Federal Research Center, Ural Br. of Rus.Ac.Sci.
> 426000 Izhevsk Kirov str. 132
> Russia
> ---
> Tel. +7 (34I2)43-24-59 (office), +7 (9I2)OI9-795O (home)
> Skype: lyuka18 (office), lyuka17 (home)
> E-mail: lyuk...@mail.ru (office), lyuk...@gmail.com (home)
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html