Hi,

On 14.01.2013 at 07:00, Joseph Farran wrote:

> We have a cluster running Rocks 6.1 with Grid Engine 8.1.2.
>
> Every once in a while, we get jobs that fail not being able to set the user
> id ( setuid fails ).
>
> The nodes have the correct /etc/passwd entry as many jobs from the same user
> work while a few fail every once in a while. The user submits several
> hundred 1-core jobs at once so I am not sure if that is contributing to the
> failures - but it should not. The failures are random and are happening
> around 1 failure every 300 jobs or so.
>
> Any suggestions on what could be causing this?
>
> Here is the Grid Engine failed log of one case:
>
> <snip>
> 01/13/2013 21:01:25 [400:2897]: setting additional gid=20191
> 01/13/2013 21:01:25 [0:2897]: setuid(686) failed
> 01/13/2013 21:01:25 [400:2685]: wait3 returned 2897 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)

Could it be error 11: Resource temporarily unavailable?

What are the ulimits of the started execd - is anything therein near a limit?

-- Reuti

> 01/13/2013 21:01:25 [400:2685]: job exited with exit status 11
> 01/13/2013 21:01:25 [400:2685]: reaped "job" with pid 2897
> 01/13/2013 21:01:25 [400:2685]: job exited not due to signal
> 01/13/2013 21:01:25 [400:2685]: job exited with status 11
> 01/13/2013 21:01:25 [400:2685]: now sending signal KILL to pid -2897
> 01/13/2013 21:01:25 [400:2685]: pdc_kill_addgrpid: 20191 9
> 01/13/2013 21:01:25 [400:2685]: failed starting job
> 01/13/2013 21:01:25 [400:2685]: no pe_stop script to start
> 01/13/2013 21:01:25 [400:2685]: parent: forked "epilog" with pid 2929
> 01/13/2013 21:01:25 [400:2685]: using signal delivery delay of 120 seconds
> 01/13/2013 21:01:25 [400:2685]: parent: epilog-pid: 2929
> 01/13/2013 21:01:25 [400:2929]: child: starting son(epilog, /opt/gridengine/epilog.sh, 0, 10000);
> 01/13/2013 21:01:25 [400:2929]: pid=2929 pgrp=2929 sid=2929 old pgrp=2685 getlogin()=root
> 01/13/2013 21:01:25 [400:2929]: reading passwd information for user 'theuser'
> 01/13/2013 21:01:25 [400:2929]: setting limits
> 01/13/2013 21:01:25 [400:2929]: setting environment
> 01/13/2013 21:01:25 [400:2929]: Initializing error file
> 01/13/2013 21:01:25 [400:2929]: switching to intermediate/target user
> 01/13/2013 21:01:25 [400:2929]: setting additional gid=0
> 01/13/2013 21:01:25 [0:2929]: setuid(686) failed
> 01/13/2013 21:01:25 [400:2685]: wait3 returned 2929 (status: 3584; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 14)
> 01/13/2013 21:01:25 [400:2685]: epilog exited with exit status 14
> 01/13/2013 21:01:25 [400:2685]: reaped "epilog" with pid 2929
> 01/13/2013 21:01:25 [400:2685]: epilog exited not due to signal
> 01/13/2013 21:01:25 [400:2685]: epilog exited with status 14
> 01/13/2013 21:01:25 [400:2685]: exit states increased from 1 to 2
> 01/13/2013 21:01:25 [400:2685]: failed starting epilog
>
> Shepherd error:
> 01/13/2013 21:01:25 [0:2897]: setuid(686) failed
> 01/13/2013 21:01:25 [0:2929]: setuid(686) failed
>
> Shepherd pe_hostfile:
> compute-1-7.local 1 q64@compute-1-7.local UNDEFINED
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
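
A minimal Python sketch for checking the two points raised above: it decodes the raw wait3 status values from the log (2816 and 3584) and parses the limits of a running process from /proc/<pid>/limits. It assumes the Linux wait-status encoding and the Linux-only /proc/<pid>/limits file; the function names are hypothetical, and the pid of the running execd has to be passed in by hand. Note that the shepherd's exit status 11 only matches EAGAIN numerically, so treating it as that errno remains a guess.

```python
import re
import sys


def decode_wait_status(status):
    """Split a raw wait status into (exited_normally, exit_code, signal).

    Assumes the Linux encoding: low 7 bits hold the terminating signal,
    bits 8-15 hold the exit code.
    """
    signal = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return signal == 0, exit_code, signal


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Status values taken from the shepherd log quoted above.
    for raw in (2816, 3584):
        print(raw, decode_wait_status(raw))

    # sys.argv[1] is assumed to be the pid of the running execd.
    with open(f"/proc/{sys.argv[1]}/limits") as fh:
        for name, (soft, hard) in parse_limits(fh.read()).items():
            print(f"{name:<28} soft={soft:<12} hard={hard}")
```

On Linux, setuid() can fail with EAGAIN when the target user would exceed its process limit, so "Max processes" is the first row worth comparing against the number of processes that user already has on the node.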