Hi Brian,

Sun, Oracle, and the 3 forks based on SGE 6.2u5 have made a lot of fixes and enhancements since 6.2u5. However, it is hard to pinpoint the location of the crash from the strace log. Can you attach a debugger?
% gdb -q <location of qmaster>
(gdb) attach <pid of qmaster>
(gdb) cont

And when qmaster crashes again, gdb will give you the stack trace. You may need to run gdb as root.

Rayson

On Mon, May 2, 2011 at 1:23 PM, Murphy, Brian (E IT F 45) <[email protected]> wrote:
> Running 6.2u5.
> qmaster running on RHEL 5.4. Exec host machines running on 5.5/5.6.
> (Currently in upgrade process to 5.6)
> Qmaster keeps dying seemingly randomly (9 times since Friday afternoon.)
> Have not experienced this issue since installing a year ago.
> Problem started a month or so ago and has increased in frequency.
> Currently running a crontab every 2 minutes to check if qmaster is down
> and if so, do a restart.
> I can't find any indication anywhere, e.g., log files etc., as to why it is
> dying.
> So I did an strace on the qmaster PID.
> It shows a segmentation fault (last few lines below.)
> Any ideas?
>
> [pid 24778] futex(0x7375e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
> [pid 24774] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 24753] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24744] gettimeofday( <unfinished ...>
> [pid 24743] futex(0x2b662bd40c24, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647,
> 0x2b662bd40bc0, 7404026 <unfinished ...>
> [pid 24778] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
> unavailable)
> [pid 24776] <... futex resumed> ) = 0
> [pid 24774] <... clock_gettime resumed> {1304038113, 8112000}) = 0
> [pid 24753] <... futex resumed> ) = 0
> [pid 24744] <... gettimeofday resumed> {1304038113, 8320}, NULL) = 0
> [pid 24743] <... futex resumed> ) = 2
> [pid 24778] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24776] futex(0x2b662bd40bc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
> ...>
> [pid 24774] futex(0x2aaaabc5aa0c, FUTEX_WAIT_PRIVATE, 2519512, {0,
> 998853000} <unfinished ...>
> [pid 24753] gettimeofday( <unfinished ...>
> [pid 24744] futex(0x2b662bd409e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24743] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24778] <... futex resumed> ) = 0
> [pid 24777] <... futex resumed> ) = 0
> [pid 24776] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
> unavailable)
> [pid 24753] <... gettimeofday resumed> {1304038113, 9573}, {0, 1304038113})
> = 0
> [pid 24744] <... futex resumed> ) = 0
> [pid 24743] <... futex resumed> ) = 1
> [pid 24778] gettimeofday( <unfinished ...>
> [pid 24777] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24776] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24753] gettimeofday( <unfinished ...>
> [pid 24744] poll([{fd=38, events=POLLOUT}], 1, 5 <unfinished ...>
> [pid 24743] gettimeofday( <unfinished ...>
> [pid 24778] <... gettimeofday resumed> {1304038113, 10670}, {0, 1304038113})
> = 0
> [pid 24777] <... futex resumed> ) = 0
> [pid 24776] <... futex resumed> ) = 0
> [pid 24753] <... gettimeofday resumed> {1304038113, 11054}, NULL) = 0
> [pid 24744] <... poll resumed> ) = 1 ([{fd=38, revents=POLLOUT}])
> [pid 24743] <... gettimeofday resumed> {1304038113, 11228}, NULL) = 0
> [pid 24778] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> Process 24778 detached
> [pid 24794] +++ killed by SIGSEGV +++
> [pid 24793] +++ killed by SIGSEGV +++
> [pid 24790] +++ killed by SIGSEGV +++
> [pid 24789] +++ killed by SIGSEGV +++
> [pid 24788] +++ killed by SIGSEGV +++
> [pid 24787] +++ killed by SIGSEGV +++
> [pid 24786] +++ killed by SIGSEGV +++
> [pid 24785] +++ killed by SIGSEGV +++
> [pid 24784] +++ killed by SIGSEGV +++
> [pid 24783] +++ killed by SIGSEGV +++
> [pid 24782] +++ killed by SIGSEGV +++
> [pid 24781] +++ killed by SIGSEGV +++
> [pid 24780] +++ killed by SIGSEGV +++
> [pid 24779] +++ killed by SIGSEGV +++
> [pid 24777] +++ killed by SIGSEGV +++
> [pid 24776] +++ killed by SIGSEGV +++
> [pid 24774] +++ killed by SIGSEGV +++
> [pid 24755] +++ killed by SIGSEGV +++
> [pid 24754] +++ killed by SIGSEGV +++
> [pid 24753] +++ killed by SIGSEGV +++
> [pid 24752] +++ killed by SIGSEGV +++
> [pid 24744] +++ killed by SIGSEGV +++
> [pid 24743] +++ killed by SIGSEGV +++
> [pid 24742] +++ killed by SIGSEGV +++
> [pid 24740] +++ killed by SIGSEGV +++
> +++ killed by SIGSEGV +++
>
>
> Best Regards,
> Brian Murphy
> ________________________________________
> Siemens Energy, Inc.
> Global Engineering Computing Operations
> Engineering Applications Administrator
> Compute Grid Administrator
> Orlando, Florida, USA
> 407.736.5215
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
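
Since the crashes are intermittent, the gdb session suggested in the reply could also be left running unattended. The following is only a sketch under assumptions: the helper name `qmaster_gdb_cmd` is hypothetical, and it merely prints a gdb command line rather than running it. The options used (`-q`, `-batch`, `-p`, `-ex`) are standard gdb options; how useful the resulting backtrace is depends on whether the qmaster binary has symbols.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print a non-interactive gdb command
# that attaches to the given PID, lets the process continue, and dumps a
# backtrace of every thread once the process stops (e.g. on SIGSEGV).
qmaster_gdb_cmd() {
    pid="$1"
    # Reject empty or non-numeric PIDs so a bad argument never reaches gdb.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: qmaster_gdb_cmd <pid of qmaster>" >&2
            return 1
            ;;
    esac
    # -batch makes gdb exit after the last -ex command has run.
    printf 'gdb -q -batch -p %s -ex continue -ex "thread apply all bt"\n' "$pid"
}
```

To use it, substitute the real qmaster PID and run the printed command as root with output redirected to a log file of your choice, for example `eval "$(qmaster_gdb_cmd <pid of qmaster>)" > qmaster-bt.log 2>&1`. The log file name here is just a placeholder.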
