Howdy.

We have a cluster running Rocks 6.1 with Grid Engine 8.1.2.

Every once in a while, a job fails because the user ID cannot be set (setuid fails).

The nodes have the correct /etc/passwd entry: many jobs from the same user run fine, and only a few fail. The user submits several hundred 1-core jobs at once, so I am not sure whether that contributes to the failures, but it should not. The failures appear random and occur at a rate of roughly 1 in every 300 jobs.

Any suggestions on what could be causing this?

Here is the Grid Engine failure report for one case:

Job 29405 caused action: Queue "q64@compute-1-7.local" set to ERROR
 User        = theuser
 Queue       = q64@compute-1-7.local
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job: 01/13/2013 21:01:25 [0:2897]: setuid(686) failed
Shepherd trace:
01/13/2013 21:01:25 [400:2685]: shepherd called with uid = 0, euid = 400
01/13/2013 21:01:25 [400:2685]: starting up 8.1.2
01/13/2013 21:01:25 [400:2685]: setpgid(2685, 2685) returned 0
01/13/2013 21:01:25 [400:2685]: do_core_binding: "binding" parameter not found in config file
01/13/2013 21:01:25 [400:2685]: parent: forked "prolog" with pid 2767
01/13/2013 21:01:25 [400:2685]: using signal delivery delay of 120 seconds
01/13/2013 21:01:25 [400:2685]: parent: prolog-pid: 2767
01/13/2013 21:01:25 [400:2767]: child: starting son(prolog, sge@/opt/gridengine/prolog.sh, 0, 10000);
01/13/2013 21:01:25 [400:2767]: pid=2767 pgrp=2767 sid=2767 old pgrp=2685 getlogin()=root
01/13/2013 21:01:25 [400:2767]: reading passwd information for user 'sge'
01/13/2013 21:01:25 [400:2767]: setting limits
01/13/2013 21:01:25 [400:2767]: setting environment
01/13/2013 21:01:25 [400:2767]: Initializing error file
01/13/2013 21:01:25 [400:2767]: switching to intermediate/target user
01/13/2013 21:01:25 [400:2767]: setting additional gid=0
01/13/2013 21:01:25 [686:2767]: closing all filedescriptors
01/13/2013 21:01:25 [686:2767]: further messages are in "error" and "trace"
01/13/2013 21:01:25 [686:2767]: using "/bin/true" as shell of user "sge"
01/13/2013 21:01:25 [400:2767]: now running with uid=400, euid=400
01/13/2013 21:01:25 [400:2767]: execvlp(/opt/gridengine/prolog.sh, "/opt/gridengine/prolog.sh")
01/13/2013 21:01:25 [400:2685]: wait3 returned 2767 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
01/13/2013 21:01:25 [400:2685]: prolog exited with exit status 0
01/13/2013 21:01:25 [400:2685]: reaped "prolog" with pid 2767
01/13/2013 21:01:25 [400:2685]: prolog exited not due to signal
01/13/2013 21:01:25 [400:2685]: prolog exited with status 0
01/13/2013 21:01:25 [400:2685]: no pe_start script to start
01/13/2013 21:01:25 [400:2685]: parent: forked "job" with pid 2897
01/13/2013 21:01:25 [400:2897]: child: starting son(job, /var/spool/sge/compute-1-7/job_scripts/29405, 0, 4096);
01/13/2013 21:01:25 [400:2685]: parent: job-pid: 2897
01/13/2013 21:01:25 [400:2897]: pid=2897 pgrp=2897 sid=2897 old pgrp=2685 getlogin()=root
01/13/2013 21:01:25 [400:2897]: reading passwd information for user 'theuser'
01/13/2013 21:01:25 [400:2897]: setosjobid: uid = 0, euid = 400
01/13/2013 21:01:25 [400:2897]: setting limits
01/13/2013 21:01:25 [400:2897]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_NOFILE setting: (soft 4096 hard 10240) resulting: (soft 4096 hard 10240)
01/13/2013 21:01:25 [400:2897]: RLIMIT_MEMLOCK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_VMEM/RLIMIT_AS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/13/2013 21:01:25 [400:2897]: setting environment
01/13/2013 21:01:25 [400:2897]: Initializing error file
01/13/2013 21:01:25 [400:2897]: switching to intermediate/target user
01/13/2013 21:01:25 [400:2897]: setting additional gid=20191
01/13/2013 21:01:25 [0:2897]: setuid(686) failed
01/13/2013 21:01:25 [400:2685]: wait3 returned 2897 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
01/13/2013 21:01:25 [400:2685]: job exited with exit status 11
01/13/2013 21:01:25 [400:2685]: reaped "job" with pid 2897
01/13/2013 21:01:25 [400:2685]: job exited not due to signal
01/13/2013 21:01:25 [400:2685]: job exited with status 11
01/13/2013 21:01:25 [400:2685]: now sending signal KILL to pid -2897
01/13/2013 21:01:25 [400:2685]: pdc_kill_addgrpid: 20191 9
01/13/2013 21:01:25 [400:2685]: failed starting job
01/13/2013 21:01:25 [400:2685]: no pe_stop script to start
01/13/2013 21:01:25 [400:2685]: parent: forked "epilog" with pid 2929
01/13/2013 21:01:25 [400:2685]: using signal delivery delay of 120 seconds
01/13/2013 21:01:25 [400:2685]: parent: epilog-pid: 2929
01/13/2013 21:01:25 [400:2929]: child: starting son(epilog, /opt/gridengine/epilog.sh, 0, 10000);
01/13/2013 21:01:25 [400:2929]: pid=2929 pgrp=2929 sid=2929 old pgrp=2685 getlogin()=root
01/13/2013 21:01:25 [400:2929]: reading passwd information for user 'theuser'
01/13/2013 21:01:25 [400:2929]: setting limits
01/13/2013 21:01:25 [400:2929]: setting environment
01/13/2013 21:01:25 [400:2929]: Initializing error file
01/13/2013 21:01:25 [400:2929]: switching to intermediate/target user
01/13/2013 21:01:25 [400:2929]: setting additional gid=0
01/13/2013 21:01:25 [0:2929]: setuid(686) failed
01/13/2013 21:01:25 [400:2685]: wait3 returned 2929 (status: 3584; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 14)
01/13/2013 21:01:25 [400:2685]: epilog exited with exit status 14
01/13/2013 21:01:25 [400:2685]: reaped "epilog" with pid 2929
01/13/2013 21:01:25 [400:2685]: epilog exited not due to signal
01/13/2013 21:01:25 [400:2685]: epilog exited with status 14
01/13/2013 21:01:25 [400:2685]: exit states increased from 1 to 2
01/13/2013 21:01:25 [400:2685]: failed starting epilog

Shepherd error:
01/13/2013 21:01:25 [0:2897]: setuid(686) failed
01/13/2013 21:01:25 [0:2929]: setuid(686) failed
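
(Aside, not part of the Grid Engine output: since only about 1 job in 300 fails, it may help to pull the setuid failures out of many of these reports automatically. Below is a minimal Python sketch that does this for lines in the format shown above. The function name find_setuid_failures and the field names are my own, and I am only assuming that the first number in the brackets is the uid the shepherd is running under at that moment.)

```python
import re

# Trace lines look like: "01/13/2013 21:01:25 [0:2897]: setuid(686) failed"
TRACE_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<uid>\d+):(?P<pid>\d+)\]: (?P<msg>.*)$"
)
SETUID_FAILED = re.compile(r"setuid\((?P<target>\d+)\) failed")


def find_setuid_failures(lines):
    """Return one dict per trace line that reports a failed setuid call."""
    failures = []
    for line in lines:
        m = TRACE_LINE.match(line.strip())
        if not m:
            continue  # wrapped or non-trace line, skip it
        f = SETUID_FAILED.search(m.group("msg"))
        if f:
            failures.append({
                "when": m.group("date") + " " + m.group("time"),
                # Assumption: first bracket field is the current uid.
                "bracket_uid": int(m.group("uid")),
                "pid": int(m.group("pid")),
                "target_uid": int(f.group("target")),
            })
    return failures


sample = [
    "01/13/2013 21:01:25 [400:2897]: setting additional gid=20191",
    "01/13/2013 21:01:25 [0:2897]: setuid(686) failed",
    "01/13/2013 21:01:25 [0:2929]: setuid(686) failed",
]
for failure in find_setuid_failures(sample):
    print(failure)
```

Grouping the results by node and timestamp across several failed jobs might show whether the failures cluster when many jobs start at once.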

Shepherd pe_hostfile:
compute-1-7.local 1 q64@compute-1-7.local UNDEFINED
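
For anyone reading the wait3 lines in the trace: the raw status values 2816 and 3584 are the usual Linux wait status encoding, with the exit code in the high byte and the terminating signal in the low 7 bits. Here is a small Python sketch of that arithmetic. The helper name decode_wait_status is hypothetical, and the sketch deliberately ignores stopped and continued states.

```python
def decode_wait_status(status):
    """Split a Linux wait status into (exited_normally, exit_code, signal).

    Simplified: stopped/continued states are not handled.
    """
    signal = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exited = signal == 0        # matches WIFEXITED for the cases handled here
    exit_code = (status >> 8) & 0xFF if exited else None
    return exited, exit_code, signal


# Status values taken from the trace above.
print(decode_wait_status(2816))  # job:    (True, 11, 0)
print(decode_wait_status(3584))  # epilog: (True, 14, 0)
print(decode_wait_status(0))     # prolog: (True, 0, 0)
```

So both the job and the epilog exited on their own (no signal), with exit codes 11 and 14, which matches the WEXITSTATUS values the shepherd printed.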