Recently I have seen queue instances on almost all nodes set to the
error state while they were attempting to run array tasks. For each
such occurrence I receive an email like this:
---------------------
Job 1606586 caused action: Queue "all.q@node042" set to ERROR
User = user
Queue = all.q@node042
Start Time = <unknown>
End Time = <unknown>
failed in prolog:04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
Shepherd trace:
04/25/2014 22:46:32 [2055:51650]: shepherd called with uid = 0, euid = 2055
04/25/2014 22:46:32 [2055:51650]: starting up 6.2u5
04/25/2014 22:46:32 [2055:51650]: setpgid(51650, 51650) returned 0
04/25/2014 22:46:32 [2055:51650]: do_core_binding: "binding" parameter
not found in config file
04/25/2014 22:46:32 [2055:51650]: parent: forked "prolog" with pid 51662
04/25/2014 22:46:32 [2055:51650]: using signal delivery delay of 120 seconds
04/25/2014 22:46:32 [2055:51650]: parent: prolog-pid: 51662
04/25/2014 22:46:32 [2055:51662]: child: starting son(prolog,
/grid/prolog.sh user all.q 1606586, 0);
04/25/2014 22:46:32 [2055:51662]: pid=51662 pgrp=51662 sid=51662 old
pgrp=51650 getlogin()=root
04/25/2014 22:46:32 [2055:51662]: reading passwd information for user 'user'
04/25/2014 22:46:32 [2055:51662]: setting limits
04/25/2014 22:46:32 [2055:51662]: setting environment
04/25/2014 22:46:32 [2055:51662]: Initializing error file
04/25/2014 22:46:32 [2055:51662]: switching to intermediate/target user
04/25/2014 22:46:32 [1757739906:51662]: closing all filedescriptors
04/25/2014 22:46:32 [1757739906:51662]: further messages are in "error"
and "trace"
04/25/2014 22:46:32 [1757739906:51662]: using "/bin/bash" as shell of
user "user"
04/25/2014 22:46:32 [1757739906:51662]: using stdout as stderr
04/25/2014 22:46:32 [1757739906:51662]: now running with uid=175773,
euid=175773
04/25/2014 22:46:32 [1757739906:51662]: execvp(/grid/prolog.sh,
"/grid/prolog.sh" "user" "all.q" "1606586")
04/25/2014 22:46:34 [2055:51650]: wait3 returned 51662 (status: 1024;
WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 4)
04/25/2014 22:46:34 [2055:51650]: prolog exited with exit status 4
04/25/2014 22:46:34 [2055:51650]: reaped "prolog" with pid 51662
04/25/2014 22:46:34 [2055:51650]: prolog exited not due to signal
04/25/2014 22:46:34 [2055:51650]: prolog exited with status 4
04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
04/25/2014 22:46:34 [2055:51650]: parent: forked "epilog" with pid 51991
04/25/2014 22:46:34 [2055:51650]: using signal delivery delay of 120 seconds
04/25/2014 22:46:34 [2055:51650]: parent: epilog-pid: 51991
04/25/2014 22:46:34 [2055:51991]: child: starting son(epilog,
/grid/epilog.sh, 0);
04/25/2014 22:46:34 [2055:51991]: pid=51991 pgrp=51991 sid=51991 old
pgrp=51650 getlogin()=root
04/25/2014 22:46:34 [2055:51991]: reading passwd information for user 'user'
04/25/2014 22:46:34 [2055:51991]: setting limits
04/25/2014 22:46:34 [2055:51991]: setting environment
04/25/2014 22:46:34 [2055:51991]: Initializing error file
04/25/2014 22:46:34 [2055:51991]: switching to intermediate/target user
04/25/2014 22:46:34 [1757739906:51991]: closing all filedescriptors
04/25/2014 22:46:34 [1757739906:51991]: further messages are in "error"
and "trace"
04/25/2014 22:46:34 [1757739906:51991]: using "/bin/bash" as shell of
user "user"
04/25/2014 22:46:34 [1757739906:51991]: using stdout as stderr
04/25/2014 22:46:34 [1757739906:51991]: now running with uid=175773,
euid=175773
04/25/2014 22:46:34 [1757739906:51991]: execvp(/grid/epilog.sh,
"/grid/epilog.sh")
04/25/2014 22:46:34 [2055:51650]: wait3 returned 51991 (status: 0;
WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/25/2014 22:46:34 [2055:51650]: epilog exited with exit status 0
04/25/2014 22:46:34 [2055:51650]: reaped "epilog" with pid 51991
04/25/2014 22:46:34 [2055:51650]: epilog exited not due to signal
04/25/2014 22:46:34 [2055:51650]: epilog exited with status 0
04/25/2014 22:46:34 [2055:51650]: no tasker to notify
Shepherd error:
04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
---------------------
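The shepherd trace reports "status: 1024" from wait3 next to "WEXITSTATUS: 4". As a rough sketch of how those two numbers relate, the helper below decodes a raw wait status, assuming the conventional Linux encoding (exit code in bits 8-15, terminating signal in the low 7 bits). The name decode_wait_status is hypothetical and is not part of Grid Engine.

```python
def decode_wait_status(status: int) -> dict:
    """Decode a raw wait status, assuming the conventional Linux bit layout."""
    low_bits = status & 0x7F                 # low 7 bits: terminating signal, 0 if none
    exited = low_bits == 0                   # corresponds to WIFEXITED
    signaled = low_bits not in (0, 0x7F)     # corresponds to WIFSIGNALED (0x7F marks "stopped")
    exit_code = (status >> 8) & 0xFF if exited else None  # corresponds to WEXITSTATUS
    return {"exited": exited, "signaled": signaled, "exit_code": exit_code}


# The value reported for the prolog in the trace.
print(decode_wait_status(1024))
```

Under that assumption, 1024 is 4 shifted left by 8 bits, which agrees with the trace line "prolog exited with exit status 4" and confirms the prolog ended through a normal exit rather than a signal.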
The execd messages file on each of the nodes contains the following:
04/25/2014 22:46:36| main|node245|E|shepherd of job 1606586.19 exited
with exit status = 8
04/25/2014 22:46:36| main|node245|W|reaping job "1606586" ptf
complains: Job does not exist
04/25/2014 22:46:36| main|node245|I|sending admin mail mail to user
"grid-sys"|mailer "/bin/mail"|"GE 6.2u5: Job-array task 1606586.19 failed"
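To compare these entries across many nodes it may help to split them into fields. The sketch below is one possible way to do that, assuming each entry is a single unwrapped line of the form timestamp|thread|host|level|text, as the sample suggests. The field names, the ExecdMessage type, and parse_execd_line are my own labels, not Grid Engine terminology.

```python
from datetime import datetime
from typing import NamedTuple


class ExecdMessage(NamedTuple):
    timestamp: datetime
    thread: str   # assumed meaning of the second field ("main" in the sample)
    host: str
    level: str    # single letter as logged: E, W or I in the sample
    text: str


def parse_execd_line(line: str) -> ExecdMessage:
    """Split one unwrapped messages line into its five assumed fields."""
    # Split at most 4 times so that "|" inside the message text is preserved.
    stamp, thread, host, level, text = line.split("|", 4)
    return ExecdMessage(
        timestamp=datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S"),
        thread=thread.strip(),
        host=host.strip(),
        level=level.strip(),
        text=text.strip(),
    )


# The three entries quoted above, with the mail client's line wraps rejoined.
SAMPLE_LINES = [
    '04/25/2014 22:46:36| main|node245|E|shepherd of job 1606586.19 exited with exit status = 8',
    '04/25/2014 22:46:36| main|node245|W|reaping job "1606586" ptf complains: Job does not exist',
    '04/25/2014 22:46:36| main|node245|I|sending admin mail mail to user "grid-sys"|mailer "/bin/mail"|"GE 6.2u5: Job-array task 1606586.19 failed"',
]

# Keep only the entries logged at level "E".
errors = [m for m in map(parse_execd_line, SAMPLE_LINES) if m.level == "E"]
for message in errors:
    print(message.host, message.timestamp.isoformat(), message.text)
```

Run over the messages files of all affected nodes, a filter like this would show whether every queue instance failed on the same job and within the same few seconds.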
qacct shows:
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       8 : in prolog
exit_status  0
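The "failed" value of 8 here matches the shepherd exit status of 8 in the execd messages, while "exit_status" stays 0 because the job itself never started. A small sketch for pulling those fields out of qacct text follows. It assumes one "name value" pair per line, as in the excerpt above, and parse_qacct_fields is a hypothetical helper rather than anything shipped with Grid Engine.

```python
def parse_qacct_fields(text: str) -> dict:
    """Map each field name to its value, assuming one 'name value' pair per line."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # field name, then the rest of the line
        if len(parts) == 2:
            # Collapse runs of spaces so aligned and unaligned output parse the same.
            fields[parts[0]] = " ".join(parts[1].split())
    return fields


SAMPLE = """\
start_time -/-
end_time -/-
granted_pe NONE
slots 0
failed 8 : in prolog
exit_status 0
"""

record = parse_qacct_fields(SAMPLE)
# Separate the numeric failure code from its description.
failed_code, _, failed_reason = record["failed"].partition(" : ")
print(int(failed_code), failed_reason, record["exit_status"])
```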
So I wonder whether this means the user deleted the job just as it was
starting on the nodes. If that happened while the prolog script was
running, the prolog would have exited with an error, which in turn set
the queue instance to the error state.
Or is there another explanation?
Thanks,
Ilya.