Hi,

On 14.01.2013 at 07:00, Joseph Farran wrote:

> We have a cluster running Rocks 6.1 with Grid Engine 8.1.2.
> 
> Every once in a while, jobs fail because the user ID cannot be set 
> (setuid fails).
> 
> The nodes have the correct /etc/passwd entry: many jobs from the same user 
> work, while a few fail every once in a while. The user submits several 
> hundred 1-core jobs at once, so I am not sure whether that contributes to the 
> failures, but it should not. The failures are random and occur at a rate of 
> about 1 failure every 300 jobs.
> 
> Any suggestions on what could be causing this?
> 
> Here is the Grid Engine failed log of one case:
> 
> <snip>
> 01/13/2013 21:01:25 [400:2897]: setting additional gid=20191
> 01/13/2013 21:01:25 [0:2897]: setuid(686) failed
> 01/13/2013 21:01:25 [400:2685]: wait3 returned 2897 (status: 2816; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)

Could the 11 be this errno value:

11: Resource temporarily unavailable
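
To make that reading easy to check, here is a small Python sketch that splits the raw wait3 status values from the trace into their parts and prints what errno 11 means on the local machine. The helper name decode_wait_status is made up for illustration, and it only handles the "exited" and "killed by signal" cases. Whether the shepherd's exit status really is an errno value is an assumption; the trace alone does not establish it.

```python
import errno
import os


def decode_wait_status(status: int) -> dict:
    """Split a raw wait status the way WIFEXITED/WEXITSTATUS do on Linux.

    Hypothetical helper for illustration; stopped/continued states are ignored.
    """
    signal_bits = status & 0x7F           # non-zero when a signal ended the process
    exited = signal_bits == 0             # corresponds to WIFEXITED
    return {
        "exited": exited,
        "exit_status": (status >> 8) & 0xFF if exited else None,  # WEXITSTATUS
        "signal": None if exited else signal_bits,
    }


# The two "status:" values reported by wait3 in the shepherd trace.
for raw in (2816, 3584):
    print(raw, decode_wait_status(raw))

# Symbolic name and message for errno 11 on this platform (EAGAIN on Linux).
print(errno.errorcode.get(11), "-", os.strerror(11))
```

The values 2816 and 3584 decode to exit statuses 11 and 14, matching the WEXITSTATUS fields in the trace.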

What are the ulimits of the running execd? Is anything there close to a limit?
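
One way to look at this on a compute node is sketched below in Python, assuming a Linux /proc filesystem. It reads the limits of a running process (for example the PID of sge_execd) and counts the processes owned by a given UID (686 in the trace). As far as I understand the setuid(2) man page, on kernels before Linux 3.1 setuid() can fail with EAGAIN when the target user would exceed RLIMIT_NPROC, so "Max processes" is the first value I would compare. The function names and the command-line interface are invented for this example.

```python
import re
import sys
from pathlib import Path


def parse_limits(text: str) -> dict:
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:              # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def count_processes(uid: int) -> int:
    """Count /proc entries owned by uid.

    This counts processes only; the kernel also counts threads against
    RLIMIT_NPROC, so treat the result as a lower bound.
    """
    count = 0
    for entry in Path("/proc").iterdir():
        if entry.name.isdigit():
            try:
                if entry.stat().st_uid == uid:
                    count += 1
            except OSError:                         # process exited mid-scan
                pass
    return count


def report(pid: int, uid: int) -> None:
    """Print the process limit of pid next to the current usage of uid."""
    limits = parse_limits(Path(f"/proc/{pid}/limits").read_text())
    soft, hard = limits.get("Max processes", ("?", "?"))
    print(f"pid {pid}: Max processes soft={soft} hard={hard}")
    print(f"uid {uid}: {count_processes(uid)} processes right now")


# Usage (hypothetical script name): python check_limits.py <execd_pid> <uid>
if __name__ == "__main__" and len(sys.argv) == 3:
    report(int(sys.argv[1]), int(sys.argv[2]))
```

Running it while the user's several hundred jobs are starting would show whether the process count gets close to the soft limit.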

-- Reuti


> 01/13/2013 21:01:25 [400:2685]: job exited with exit status 11
> 01/13/2013 21:01:25 [400:2685]: reaped "job" with pid 2897
> 01/13/2013 21:01:25 [400:2685]: job exited not due to signal
> 01/13/2013 21:01:25 [400:2685]: job exited with status 11
> 01/13/2013 21:01:25 [400:2685]: now sending signal KILL to pid -2897
> 01/13/2013 21:01:25 [400:2685]: pdc_kill_addgrpid: 20191 9
> 01/13/2013 21:01:25 [400:2685]: failed starting job
> 01/13/2013 21:01:25 [400:2685]: no pe_stop script to start
> 01/13/2013 21:01:25 [400:2685]: parent: forked "epilog" with pid 2929
> 01/13/2013 21:01:25 [400:2685]: using signal delivery delay of 120 seconds
> 01/13/2013 21:01:25 [400:2685]: parent: epilog-pid: 2929
> 01/13/2013 21:01:25 [400:2929]: child: starting son(epilog, 
> /opt/gridengine/epilog.sh, 0, 10000);
> 01/13/2013 21:01:25 [400:2929]: pid=2929 pgrp=2929 sid=2929 old pgrp=2685 
> getlogin()=root
> 01/13/2013 21:01:25 [400:2929]: reading passwd information for user 'theuser'
> 01/13/2013 21:01:25 [400:2929]: setting limits
> 01/13/2013 21:01:25 [400:2929]: setting environment
> 01/13/2013 21:01:25 [400:2929]: Initializing error file
> 01/13/2013 21:01:25 [400:2929]: switching to intermediate/target user
> 01/13/2013 21:01:25 [400:2929]: setting additional gid=0
> 01/13/2013 21:01:25 [0:2929]: setuid(686) failed
> 01/13/2013 21:01:25 [400:2685]: wait3 returned 2929 (status: 3584; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 14)
> 01/13/2013 21:01:25 [400:2685]: epilog exited with exit status 14
> 01/13/2013 21:01:25 [400:2685]: reaped "epilog" with pid 2929
> 01/13/2013 21:01:25 [400:2685]: epilog exited not due to signal
> 01/13/2013 21:01:25 [400:2685]: epilog exited with status 14
> 01/13/2013 21:01:25 [400:2685]: exit states increased from 1 to 2
> 01/13/2013 21:01:25 [400:2685]: failed starting epilog
> 
> Shepherd error:
> 01/13/2013 21:01:25 [0:2897]: setuid(686) failed
> 01/13/2013 21:01:25 [0:2929]: setuid(686) failed
> 
> Shepherd pe_hostfile:
> compute-1-7.local 1 q64@compute-1-7.local UNDEFINED
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

