On 26.04.2014 at 02:02, Ilya M wrote:

> Recently I have been seeing a situation when queue instances on almost all
> nodes were set to error while attempting to run array tasks. For each such
> occurrence I receive an email:
>
> ---------------------
> Job 1606586 caused action: Queue "all.q@node042" set to ERROR
> User = user
> Queue = all.q@node042
> Start Time = <unknown>
> End Time = <unknown>
> failed in prolog:04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4

What is your prolog doing, and is it defined globally or at the queue level?

-- Reuti

> Shepherd trace:
> 04/25/2014 22:46:32 [2055:51650]: shepherd called with uid = 0, euid = 2055
> 04/25/2014 22:46:32 [2055:51650]: starting up 6.2u5
> 04/25/2014 22:46:32 [2055:51650]: setpgid(51650, 51650) returned 0
> 04/25/2014 22:46:32 [2055:51650]: do_core_binding: "binding" parameter not
> found in config file
> 04/25/2014 22:46:32 [2055:51650]: parent: forked "prolog" with pid 51662
> 04/25/2014 22:46:32 [2055:51650]: using signal delivery delay of 120 seconds
> 04/25/2014 22:46:32 [2055:51650]: parent: prolog-pid: 51662
> 04/25/2014 22:46:32 [2055:51662]: child: starting son(prolog, /grid/prolog.sh
> user all.q 1606586, 0);
> 04/25/2014 22:46:32 [2055:51662]: pid=51662 pgrp=51662 sid=51662 old
> pgrp=51650 getlogin()=root
> 04/25/2014 22:46:32 [2055:51662]: reading passwd information for user 'user'
> 04/25/2014 22:46:32 [2055:51662]: setting limits
> 04/25/2014 22:46:32 [2055:51662]: setting environment
> 04/25/2014 22:46:32 [2055:51662]: Initializing error file
> 04/25/2014 22:46:32 [2055:51662]: switching to intermediate/target user
> 04/25/2014 22:46:32 [1757739906:51662]: closing all filedescriptors
> 04/25/2014 22:46:32 [1757739906:51662]: further messages are in "error" and
> "trace"
> 04/25/2014 22:46:32 [1757739906:51662]: using "/bin/bash" as shell of user
> "user"
> 04/25/2014 22:46:32 [1757739906:51662]: using stdout as stderr
> 04/25/2014 22:46:32 [1757739906:51662]: now running with uid=175773,
> euid=175773
> 04/25/2014 22:46:32 [1757739906:51662]: execvp(/grid/prolog.sh,
> "/grid/prolog.sh" "user" "all.q" "1606586")
> 04/25/2014 22:46:34 [2055:51650]: wait3 returned 51662 (status: 1024;
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 4)
> 04/25/2014 22:46:34 [2055:51650]: prolog exited with exit status 4
> 04/25/2014 22:46:34 [2055:51650]: reaped "prolog" with pid 51662
> 04/25/2014 22:46:34 [2055:51650]: prolog exited not due to signal
> 04/25/2014 22:46:34 [2055:51650]: prolog exited with status 4
> 04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
> 04/25/2014 22:46:34 [2055:51650]: parent: forked "epilog" with pid 51991
> 04/25/2014 22:46:34 [2055:51650]: using signal delivery delay of 120 seconds
> 04/25/2014 22:46:34 [2055:51650]: parent: epilog-pid: 51991
> 04/25/2014 22:46:34 [2055:51991]: child: starting son(epilog,
> /grid/epilog.sh, 0);
> 04/25/2014 22:46:34 [2055:51991]: pid=51991 pgrp=51991 sid=51991 old
> pgrp=51650 getlogin()=root
> 04/25/2014 22:46:34 [2055:51991]: reading passwd information for user 'user'
> 04/25/2014 22:46:34 [2055:51991]: setting limits
> 04/25/2014 22:46:34 [2055:51991]: setting environment
> 04/25/2014 22:46:34 [2055:51991]: Initializing error file
> 04/25/2014 22:46:34 [2055:51991]: switching to intermediate/target user
> 04/25/2014 22:46:34 [1757739906:51991]: closing all filedescriptors
> 04/25/2014 22:46:34 [1757739906:51991]: further messages are in "error" and
> "trace"
> 04/25/2014 22:46:34 [1757739906:51991]: using "/bin/bash" as shell of user
> "user"
> 04/25/2014 22:46:34 [1757739906:51991]: using stdout as stderr
> 04/25/2014 22:46:34 [1757739906:51991]: now running with uid=175773,
> euid=175773
> 04/25/2014 22:46:34 [1757739906:51991]: execvp(/grid/epilog.sh,
> "/grid/epilog.sh")
> 04/25/2014 22:46:34 [2055:51650]: wait3 returned 51991 (status: 0;
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 04/25/2014 22:46:34 [2055:51650]: epilog exited with exit status 0
> 04/25/2014 22:46:34 [2055:51650]: reaped "epilog" with pid 51991
> 04/25/2014 22:46:34 [2055:51650]: epilog exited not due to signal
> 04/25/2014 22:46:34 [2055:51650]: epilog exited with status 0
> 04/25/2014 22:46:34 [2055:51650]: no tasker to notify
>
> Shepherd error:
> 04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
> ---------------------
>
> execd messages file on all the nodes has the following:
>
> 04/25/2014 22:46:36| main|node245|E|shepherd of job 1606586.19 exited with
> exit status = 8
> 04/25/2014 22:46:36| main|node245|W|reaping job "1606586" ptf complains: Job
> does not exist
> 04/25/2014 22:46:36| main|node245|I|sending admin mail mail to user
> "grid-sys"|mailer "/bin/mail"|"GE 6.2u5: Job-array task 1606586.19 failed"
>
> qacct shows:
>
> start_time -/-
> end_time -/-
> granted_pe NONE
> slots 0
> failed 8 : in prolog
> exit_status 0
>
> So I wonder if this means the user deleted the job just as it was getting
> started on the nodes. And because it happened while prolog script was
> running, prolog exited with error, setting queue instance to error as well.
> Or is there another explanation?
>
> Thanks,
> Ilya.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
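
A note on reading the trace quoted above: the shepherd logs the raw wait status (1024) next to the decoded exit code (4), and the two agree, because on Linux the exit code conventionally sits in the high byte of the status. The following is only a minimal Python sketch of that decoding. The helper names (`parse_wait3`, `decode_wait_status`, `classify_prolog_exit`) are made up for illustration, and the mapping of prolog exit codes to actions (99 and 100 special, any other non-zero value putting the queue instance into error) is my reading of sge_conf(5), so please check it against the man page of your Grid Engine version.

```python
import re

# Matches the shepherd's "wait3 returned" trace line as quoted in this thread
# (with the mail client's line wrap removed).
WAIT3_RE = re.compile(
    r"wait3 returned (?P<pid>\d+) \(status: (?P<status>\d+);\s*"
    r"WIFSIGNALED: \d, WIFEXITED: \d, WEXITSTATUS: (?P<code>\d+)\)"
)


def parse_wait3(line):
    """Return (pid, raw_status, logged_exit_code), or None if the line does not match."""
    m = WAIT3_RE.search(line)
    if m is None:
        return None
    return int(m["pid"]), int(m["status"]), int(m["code"])


def decode_wait_status(status):
    """Decode a raw wait status, assuming the conventional Linux bit layout."""
    low = status & 0x7F                  # terminating signal, 0 for a normal exit
    exited = low == 0
    signaled = low not in (0, 0x7F)      # 0x7F marks a stopped process
    return {
        "exited": exited,
        "signaled": signaled,
        "exit_code": (status >> 8) & 0xFF if exited else None,
    }


def classify_prolog_exit(code):
    """Assumed mapping of prolog exit codes (see sge_conf(5) to confirm)."""
    if code == 0:
        return "continue"
    if code == 99:
        return "reschedule job"
    if code == 100:
        return "job error state"
    return "queue error state"


if __name__ == "__main__":
    line = ('04/25/2014 22:46:34 [2055:51650]: wait3 returned 51662 '
            '(status: 1024; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 4)')
    pid, status, logged_code = parse_wait3(line)
    decoded = decode_wait_status(status)
    print(pid, decoded, classify_prolog_exit(decoded["exit_code"]))
```

Under these assumptions, exit code 4 from /grid/prolog.sh falls into the "queue error state" case, which matches the admin mail. It does not say why the script returned 4; that still depends on what the prolog itself does.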
