[gridengine users] Queue instances set to error, exit_status of prolog = 4

Ilya M Fri, 25 Apr 2014 17:05:23 -0700

Recently I have been seeing a situation where queue instances on almost all nodes are set to error while attempting to run array tasks. For each such occurrence I receive an email:


---------------------
Job 1606586 caused action: Queue "all.q@node042" set to ERROR
User        = user
Queue       = all.q@node042
Start Time  = <unknown>
End Time    = <unknown>
failed in prolog:04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
Shepherd trace:
04/25/2014 22:46:32 [2055:51650]: shepherd called with uid = 0, euid = 2055
04/25/2014 22:46:32 [2055:51650]: starting up 6.2u5
04/25/2014 22:46:32 [2055:51650]: setpgid(51650, 51650) returned 0

04/25/2014 22:46:32 [2055:51650]: do_core_binding: "binding" parameter not found in config file

04/25/2014 22:46:32 [2055:51650]: parent: forked "prolog" with pid 51662
04/25/2014 22:46:32 [2055:51650]: using signal delivery delay of 120 seconds
04/25/2014 22:46:32 [2055:51650]: parent: prolog-pid: 51662

04/25/2014 22:46:32 [2055:51662]: child: starting son(prolog,/grid/prolog.sh user all.q 1606586, 0);
04/25/2014 22:46:32 [2055:51662]: pid=51662 pgrp=51662 sid=51662 oldpgrp=51650 getlogin()=root

04/25/2014 22:46:32 [2055:51662]: reading passwd information for user 'user'
04/25/2014 22:46:32 [2055:51662]: setting limits
04/25/2014 22:46:32 [2055:51662]: setting environment
04/25/2014 22:46:32 [2055:51662]: Initializing error file
04/25/2014 22:46:32 [2055:51662]: switching to intermediate/target user
04/25/2014 22:46:32 [1757739906:51662]: closing all filedescriptors

04/25/2014 22:46:32 [1757739906:51662]: further messages are in "error" and "trace"
04/25/2014 22:46:32 [1757739906:51662]: using "/bin/bash" as shell of user "user"

04/25/2014 22:46:32 [1757739906:51662]: using stdout as stderr

04/25/2014 22:46:32 [1757739906:51662]: now running with uid=175773,euid=175773
04/25/2014 22:46:32 [1757739906:51662]: execvp(/grid/prolog.sh,"/grid/prolog.sh" "user" "all.q" "1606586")
04/25/2014 22:46:34 [2055:51650]: wait3 returned 51662 (status: 1024;WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 4)

04/25/2014 22:46:34 [2055:51650]: prolog exited with exit status 4
04/25/2014 22:46:34 [2055:51650]: reaped "prolog" with pid 51662
04/25/2014 22:46:34 [2055:51650]: prolog exited not due to signal
04/25/2014 22:46:34 [2055:51650]: prolog exited with status 4
04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
04/25/2014 22:46:34 [2055:51650]: parent: forked "epilog" with pid 51991
04/25/2014 22:46:34 [2055:51650]: using signal delivery delay of 120 seconds
04/25/2014 22:46:34 [2055:51650]: parent: epilog-pid: 51991

04/25/2014 22:46:34 [2055:51991]: child: starting son(epilog,/grid/epilog.sh, 0);
04/25/2014 22:46:34 [2055:51991]: pid=51991 pgrp=51991 sid=51991 oldpgrp=51650 getlogin()=root

04/25/2014 22:46:34 [2055:51991]: reading passwd information for user 'user'
04/25/2014 22:46:34 [2055:51991]: setting limits
04/25/2014 22:46:34 [2055:51991]: setting environment
04/25/2014 22:46:34 [2055:51991]: Initializing error file
04/25/2014 22:46:34 [2055:51991]: switching to intermediate/target user
04/25/2014 22:46:34 [1757739906:51991]: closing all filedescriptors

04/25/2014 22:46:34 [1757739906:51991]: further messages are in "error" and "trace"
04/25/2014 22:46:34 [1757739906:51991]: using "/bin/bash" as shell of user "user"

04/25/2014 22:46:34 [1757739906:51991]: using stdout as stderr

04/25/2014 22:46:34 [1757739906:51991]: now running with uid=175773,euid=175773
04/25/2014 22:46:34 [1757739906:51991]: execvp(/grid/epilog.sh,"/grid/epilog.sh")
04/25/2014 22:46:34 [2055:51650]: wait3 returned 51991 (status: 0;WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)

04/25/2014 22:46:34 [2055:51650]: epilog exited with exit status 0
04/25/2014 22:46:34 [2055:51650]: reaped "epilog" with pid 51991
04/25/2014 22:46:34 [2055:51650]: epilog exited not due to signal
04/25/2014 22:46:34 [2055:51650]: epilog exited with status 0
04/25/2014 22:46:34 [2055:51650]: no tasker to notify


Shepherd error:
04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
---------------------
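
For reference, the raw status 1024 in the wait3 line of the trace is consistent with exit code 4 under the conventional Linux wait-status encoding, and its signal bits are zero, which matches "prolog exited not due to signal". A minimal sketch of that arithmetic follows. decode_wait_status is a hypothetical helper written for this illustration, not part of Grid Engine, and it assumes the usual layout (low 7 bits = signal, next 8 bits = exit code):

```shell
#!/bin/bash
# Decode a raw wait status such as the one printed in
# "wait3 returned ... (status: N; ...)".
# Assumption: conventional Linux encoding, where the low 7 bits hold the
# terminating signal and bits 8-15 hold the exit code.
# Stopped/continued states are deliberately not handled in this sketch.
decode_wait_status() {
    local status=$1
    local sig=$(( status & 127 ))          # low 7 bits: terminating signal
    local code=$(( (status >> 8) & 255 ))  # next 8 bits: exit code
    if [ "$sig" -eq 0 ]; then
        echo "exited code=$code"
    else
        echo "signaled sig=$sig"
    fi
}

decode_wait_status 1024   # status reported for the prolog
decode_wait_status 0      # status reported for the epilog
```

So the prolog script itself returned 4; it was not killed by a signal.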

execd messages file on all the nodes has the following:


qacct shows:

start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       8   : in prolog
exit_status  0

So I wonder if this means the user deleted the job just as it was getting started on the nodes. And because it happened while the prolog script was running, the prolog exited with an error, setting the queue instance to error as well.

Or is there another explanation?
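
One way to keep a single failing prolog from disabling queue instances while this is investigated might be to wrap the prolog and translate its exit status. The sketch below is only an illustration under an assumption that should be checked against sge_conf(5) for the installed version: that prolog exit status 0 means success, 99 requeues the job, 100 puts the job in error state, and any other value puts the queue instance in error state. The names map_prolog_status and run_wrapped are hypothetical, and the stand-in command simply exits with 4 like the prolog in the trace:

```shell
#!/bin/bash
# Hypothetical wrapper sketch: run the real prolog and translate its exit
# status so that an unexpected failure affects only the job, not the queue.
# Assumption (verify in sge_conf(5)): 0 = success, 99 = requeue the job,
# 100 = job error state, anything else = queue instance error state.

map_prolog_status() {
    case "$1" in
        0|99|100) echo "$1" ;;   # pass through codes with a defined meaning
        *)        echo 100 ;;    # demote "queue error" to "job error"
    esac
}

run_wrapped() {
    # "$@" is the real prolog command line,
    # e.g. /grid/prolog.sh user all.q 1606586
    "$@"
    local rc=$?
    local mapped
    mapped=$(map_prolog_status "$rc")
    if [ "$rc" -ne "$mapped" ]; then
        # Record the original status so the real cause can still be traced.
        echo "prolog '$1' exited with $rc, reporting $mapped" >&2
    fi
    return "$mapped"
}

# Demonstration with a stand-in command that exits 4.
run_wrapped sh -c 'exit 4' || echo "wrapper status: $?"
```

Note that this hides genuine node problems from the queue state, so the original exit status is written to stderr, and the wrapper would probably be removed once the cause of status 4 is understood.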

Thanks,
Ilya.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
