On 26.04.2014 at 02:02, Ilya M wrote:

> Recently I have been seeing queue instances on almost all nodes set to the 
> error state while attempting to run array tasks. For each such occurrence 
> I receive an email:
> 
> ---------------------
> Job 1606586 caused action: Queue "all.q@node042" set to ERROR
> User        = user
> Queue       = all.q@node042
> Start Time  = <unknown>
> End Time    = <unknown>
> failed in prolog:04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4

What is your prolog doing, and is it configured globally or at the queue level?
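
For reference, the raw "status: 1024" in the shepherd trace below is an ordinary wait status, and it decodes to the exit code 4 that the shepherd reports. The sketch below shows the decoding and maps the result to the action that, as far as I recall from sge_conf(5), the execd takes for a prolog (0 = success, 99 = reschedule job, 100 = job error state, anything else = queue error state). The function name and the table are only illustrative, so please check the mapping against the man page of your installed version.

```python
import os

# Exit-code meanings for prolog/epilog as described in sge_conf(5).
# Assumption: this matches your installed version; verify in the man page.
SPECIAL_CODES = {
    0: "success",
    99: "reschedule job",
    100: "put job in error state",
}


def describe_prolog_status(raw_status: int) -> str:
    """Decode a raw wait status (the number after 'status:' in the trace)."""
    if os.WIFSIGNALED(raw_status):
        return f"killed by signal {os.WTERMSIG(raw_status)}"
    if os.WIFEXITED(raw_status):
        code = os.WEXITSTATUS(raw_status)
        # Any code without a special meaning sets the queue to error.
        action = SPECIAL_CODES.get(code, "put queue in error state")
        return f"exit status {code}: {action}"
    return "unknown status"


print(describe_prolog_status(1024))
```

If the mapping above holds, the 4 returned by your prolog falls into the "anything else" case, which would explain why the queue instances end up in the error state. Having the prolog exit with 100 in that situation would put only the job into the error state instead.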

-- Reuti


> Shepherd trace:
> 04/25/2014 22:46:32 [2055:51650]: shepherd called with uid = 0, euid = 2055
> 04/25/2014 22:46:32 [2055:51650]: starting up 6.2u5
> 04/25/2014 22:46:32 [2055:51650]: setpgid(51650, 51650) returned 0
> 04/25/2014 22:46:32 [2055:51650]: do_core_binding: "binding" parameter not 
> found in config file
> 04/25/2014 22:46:32 [2055:51650]: parent: forked "prolog" with pid 51662
> 04/25/2014 22:46:32 [2055:51650]: using signal delivery delay of 120 seconds
> 04/25/2014 22:46:32 [2055:51650]: parent: prolog-pid: 51662
> 04/25/2014 22:46:32 [2055:51662]: child: starting son(prolog, /grid/prolog.sh 
> user all.q 1606586, 0);
> 04/25/2014 22:46:32 [2055:51662]: pid=51662 pgrp=51662 sid=51662 old 
> pgrp=51650 getlogin()=root
> 04/25/2014 22:46:32 [2055:51662]: reading passwd information for user 'user'
> 04/25/2014 22:46:32 [2055:51662]: setting limits
> 04/25/2014 22:46:32 [2055:51662]: setting environment
> 04/25/2014 22:46:32 [2055:51662]: Initializing error file
> 04/25/2014 22:46:32 [2055:51662]: switching to intermediate/target user
> 04/25/2014 22:46:32 [1757739906:51662]: closing all filedescriptors
> 04/25/2014 22:46:32 [1757739906:51662]: further messages are in "error" and 
> "trace"
> 04/25/2014 22:46:32 [1757739906:51662]: using "/bin/bash" as shell of user 
> "user"
> 04/25/2014 22:46:32 [1757739906:51662]: using stdout as stderr
> 04/25/2014 22:46:32 [1757739906:51662]: now running with uid=175773, 
> euid=175773
> 04/25/2014 22:46:32 [1757739906:51662]: execvp(/grid/prolog.sh, 
> "/grid/prolog.sh" "user" "all.q" "1606586")
> 04/25/2014 22:46:34 [2055:51650]: wait3 returned 51662 (status: 1024; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 4)
> 04/25/2014 22:46:34 [2055:51650]: prolog exited with exit status 4
> 04/25/2014 22:46:34 [2055:51650]: reaped "prolog" with pid 51662
> 04/25/2014 22:46:34 [2055:51650]: prolog exited not due to signal
> 04/25/2014 22:46:34 [2055:51650]: prolog exited with status 4
> 04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
> 04/25/2014 22:46:34 [2055:51650]: parent: forked "epilog" with pid 51991
> 04/25/2014 22:46:34 [2055:51650]: using signal delivery delay of 120 seconds
> 04/25/2014 22:46:34 [2055:51650]: parent: epilog-pid: 51991
> 04/25/2014 22:46:34 [2055:51991]: child: starting son(epilog, 
> /grid/epilog.sh, 0);
> 04/25/2014 22:46:34 [2055:51991]: pid=51991 pgrp=51991 sid=51991 old 
> pgrp=51650 getlogin()=root
> 04/25/2014 22:46:34 [2055:51991]: reading passwd information for user 'user'
> 04/25/2014 22:46:34 [2055:51991]: setting limits
> 04/25/2014 22:46:34 [2055:51991]: setting environment
> 04/25/2014 22:46:34 [2055:51991]: Initializing error file
> 04/25/2014 22:46:34 [2055:51991]: switching to intermediate/target user
> 04/25/2014 22:46:34 [1757739906:51991]: closing all filedescriptors
> 04/25/2014 22:46:34 [1757739906:51991]: further messages are in "error" and 
> "trace"
> 04/25/2014 22:46:34 [1757739906:51991]: using "/bin/bash" as shell of user 
> "user"
> 04/25/2014 22:46:34 [1757739906:51991]: using stdout as stderr
> 04/25/2014 22:46:34 [1757739906:51991]: now running with uid=175773, 
> euid=175773
> 04/25/2014 22:46:34 [1757739906:51991]: execvp(/grid/epilog.sh, 
> "/grid/epilog.sh")
> 04/25/2014 22:46:34 [2055:51650]: wait3 returned 51991 (status: 0; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 04/25/2014 22:46:34 [2055:51650]: epilog exited with exit status 0
> 04/25/2014 22:46:34 [2055:51650]: reaped "epilog" with pid 51991
> 04/25/2014 22:46:34 [2055:51650]: epilog exited not due to signal
> 04/25/2014 22:46:34 [2055:51650]: epilog exited with status 0
> 04/25/2014 22:46:34 [2055:51650]: no tasker to notify
> 
> Shepherd error:
> 04/25/2014 22:46:34 [2055:51650]: exit_status of prolog = 4
> ---------------------
> 
> execd messages file on all the nodes has the following:
> 
> 04/25/2014 22:46:36|  main|node245|E|shepherd of job 1606586.19 exited with 
> exit status = 8
> 04/25/2014 22:46:36|  main|node245|W|reaping job "1606586" ptf complains: Job 
> does not exist
> 04/25/2014 22:46:36|  main|node245|I|sending admin mail mail to user 
> "grid-sys"|mailer "/bin/mail"|"GE 6.2u5: Job-array task 1606586.19 failed"
> 
> qacct shows:
> 
> start_time   -/-
> end_time     -/-
> granted_pe   NONE
> slots        0
> failed       8   : in prolog
> exit_status  0
> 
> So I wonder whether this means the user deleted the job just as it was 
> starting on the nodes. Because that happened while the prolog script was 
> running, the prolog exited with an error, which also set the queue instance 
> to the error state. Or is there another explanation?
> 
> Thanks,
> Ilya.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

