We have about 550 nodes. Here is what the messages file contains right before the switch-over:
06/14/2016 04:31:43|worker|mtlxsge001|E|ex...@mtlx168.yok.mtl.com reports running job (89975.1/master) in queue "idl...@mtlx168.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:44|worker|mtlxsge001|I|execd on mtlxvm262.yok.mtl.com registered
06/14/2016 04:31:44|worker|mtlxsge001|E|ex...@mtlxvm346.yok.mtl.com reports running job (209460.1/master) in queue "al...@mtlxvm346.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:44|worker|mtlxsge001|E|ex...@mtlx071.yok.mtl.com reports running job (181095.1/master) in queue "al...@mtlx071.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:45|worker|mtlxsge001|E|ex...@mtlxvm062.yok.mtl.com reports running job (170856.1/master) in queue "al...@mtlxvm062.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:45|worker|mtlxsge001|E|ex...@mtlxvm115.yok.mtl.com reports running job (6228587.1/master) in queue "al...@mtlxvm115.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (86091.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1495714.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1495710.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1102562.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1044702.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1041885.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1041889.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm422.yok.mtl.com reports running job (142528.1/master) in queue "al...@mtlxvm422.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:47|worker|mtlxsge001|E|ex...@mtlxvm427.yok.mtl.com reports running job (9200278.1/master) in queue "al...@mtlxvm427.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:35:39| main|mtlxsge002|I|using "/local/sge_spool" for execd_spool_dir
06/14/2016 04:35:39| main|mtlxsge002|I|using "/bin/mail" for mailer

You can see mtlxsge002 taking over from mtlxsge001. I've reviewed the parallel environments we have; in some cases we use a very high number for the slots limit, such as 999999 - can this be a problem?

Thanks.
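
As a side note, a minimal sketch of how one could pull only the E-level lines around the switch-over window out of a messages file is shown below (the file path, the time window and the script itself are only illustrative, not something taken from our setup):

#!/usr/bin/env python
# Sketch: print only E (error) level lines from a qmaster "messages" file
# that fall inside a time window around the switch-over.
# Line format as shown above: MM/DD/YYYY HH:MM:SS|thread|host|LEVEL|text
import sys
from datetime import datetime

logfile = sys.argv[1] if len(sys.argv) > 1 else "messages"  # example path
start = datetime(2016, 6, 14, 4, 25, 0)                     # example window
end = datetime(2016, 6, 14, 4, 40, 0)

with open(logfile) as fh:
    for line in fh:
        fields = line.split("|", 4)
        if len(fields) < 5:
            continue
        try:
            ts = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
        if fields[3] == "E" and start <= ts <= end:
            sys.stdout.write(line)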
-----Original Message-----
From: Jesse Becker [mailto:becke...@mail.nih.gov]
Sent: Tuesday, June 21, 2016 6:49 PM
To: Yuri Burmachenko <yur...@mellanox.com>
Cc: Jesse Becker <becke...@mail.nih.gov>; William Hay <w....@ucl.ac.uk>; users@gridengine.org
Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently and fail-over occurs quite often - best practices debugging and resolving the issue.

On Tue, Jun 21, 2016 at 03:25:25PM +0000, Yuri Burmachenko wrote:
>The dmesg is clean, with no indication of oom_killer or segfaults.
>The only thing we have in dmesg & /var/log/messages is this line:
>
>possible SYN flooding on port 7444. Sending cookies.
>
>7444 is the sge_qmaster port.

How many exec nodes do you have? While surprising, if the exec nodes all hammer the qmaster host at the same time, I can imagine the kernel thinking it's under attack.
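
If you want to rule that out, it might be worth comparing the number of exec hosts with the connection backlog limits on the qmaster host. A minimal sketch that just reads the usual Linux sysctls (nothing SGE-specific here, and the values should only be tuned if this really turns out to be the bottleneck):

#!/usr/bin/env python
# Sketch: show the Linux backlog/SYN-cookie settings that matter when many
# execds reconnect to the qmaster at the same time.
for path in ("/proc/sys/net/ipv4/tcp_max_syn_backlog",
             "/proc/sys/net/core/somaxconn",
             "/proc/sys/net/ipv4/tcp_syncookies"):
    try:
        with open(path) as fh:
            print("%s = %s" % (path, fh.read().strip()))
    except IOError:
        print("%s not readable" % path)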
>Didn't find any files in /tmp, and $SGE_ROOT/$SGE_CELL/spool/qmaster has the
>following files & folders (looks regular):
>admin_hosts  centry  heartbeat  jobseqnum  messages.2.gz  messages_shadowd.mtlxsge02  qinstances  shadowd_mtlxsge002.yok.mtl.com.pid  users
>advance_reservations  ckpt  hostgroups  managers  messages_shadowd.mtlxsge001  operators  qmaster.pid  shadowd_mtlxsge01.yok.mtl.com.pid  usersets
>arseqnum  cqueues  jobs  messages  messages_shadowd.mtlxsge002  pe  resource_quotas  shadowd_mtlxsge02.yok.mtl.com.pid  zombies
>calendars  exec_hosts  job_scripts  messages.1.gz  messages_shadowd.mtlxsge01  projects  shadowd_mtlxsge001.yok.mtl.com.pid  submit_hosts

The 'messages' file would be where to look--shortly before a failover event...

>
>Previously we ran strace and these are the last several lines before it was
>detached due to the process failure:
>[pid 42490] <... futex resumed> ) = 0
>[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103507, {1459997595, 48056000}, ffffffff <unfinished ...>
>[pid 42490] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103508, {1459997595, 48126000}, ffffffff <unfinished ...>
>[pid 42489] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103508, {1459997595, 48056000}, ffffffff <unfinished ...>
>[pid 42487] mkdir("jobs", 0755) = -1 EEXIST (File exists)
>[pid 42487] mkdir("jobs/00", 0755) = -1 EEXIST (File exists)
>[pid 42487] mkdir("jobs/00/0133", 0755) = -1 EEXIST (File exists)
>[pid 42487] open("jobs/00/0133/.1144", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
>[pid 42491] write(2, "======= Backtrace: =========\n", 29) = 29
>[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3354076166", 10}, {"]\n", 2}], 4) = 31
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_free", 8}, {"+0x", 3}, {"16", 2}, {")", 1}, {"[0x", 3}, {"5e3e76", 6}, {"]\n", 2}], 9) = 59
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"lSetString", 10}, {"+0x", 3}, {"d8", 2}, {")", 1}, {"[0x", 3}, {"590238", 6}, {"]\n", 2}], 9) = 61
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, {"4cbfbf", 6}, {"]\n", 2}], 4) = 44
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, {"4cd565", 6}, {"]\n", 2}], 4) = 44
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_select_parallel_environment", 31}, {"+0x", 3}, {"939", 3}, {")", 1}, {"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"scheduler_method", 16}, {"+0x", 3}, {"3150", 4}, {")", 1}, {"[0x", 3}, {"43f470", 6}, {"]\n", 2}], 9) = 69
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_scheduler_main", 18}, {"+0x", 3}, {"f7c", 3}, {")", 1}, {"[0x", 3}, {"4342fc", 6}, {"]\n", 2}], 9) = 70
>[pid 42491] writev(2, [{"/lib64/libpthread.so.0", 22}, {"[0x", 3}, {"33548079d1", 10}, {"]\n", 2}], 4) = 37
>[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"clone", 5}, {"+0x", 3}, {"6d", 2}, {")", 1}, {"[0x", 3}, {"33540e8b6d", 10}, {"]\n", 2}], 9) = 43
>[pid 42491] write(2, "======= Memory map: ========\n", 29) = 29
>[pid 42491] open("/proc/self/maps", O_RDONLY) = 385
>[pid 42491] read(385, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
>[pid 42491] write(2, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
>[pid 42491] read(385, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
>[pid 42491] write(2, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
>[pid 42491] read(385, "17000 08:03 16646600 "..., 1024) = 1024
>[pid 42491] write(2, "17000 08:03 16646600 "..., 1024) = 1024
>[pid 42491] read(385, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
>[pid 42491] write(2, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
>[pid 42491] read(385, " /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
>[pid 42491] write(2, " /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
>[pid 42491] read(385, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
>[pid 42491] write(2, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
>[pid 42491] read(385, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
>[pid 42491] write(2, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
>[pid 42491] read(385, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
>[pid 42491] write(2, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
>[pid 42491] read(385, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
>[pid 42491] write(2, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
>[pid 42491] read(385, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
>[pid 42491] write(2, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
>[pid 42491] read(385, "6646172 /lib64"..., 1024) = 1024
>[pid 42491] write(2, "6646172 /lib64"..., 1024) = 1024
>[pid 42491] read(385, "faef024e000 ---p 00138000 08:03 "..., 1024) = 889
>[pid 42491] write(2, "faef024e000 ---p 00138000 08:03 "..., 889) = 889
>[pid 42491] read(385, "", 1024) = 0
>[pid 42491] close(385) = 0
>[pid 42491] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
>[pid 42491] tgkill(42474, 42491, SIGABRT) = 0
>[pid 42491] --- SIGABRT (Aborted) @ 0 (0) ---

Now, this is kind of interesting. It looks like the qmaster is killing part of itself:

  tgkill() sends the signal sig to the thread with the thread ID tid in the
  thread group tgid. (By contrast, kill(2) can only be used to send a signal
  to a process (i.e., thread group) as a whole, and the signal will be
  delivered to an arbitrary thread within that process.)

It's not clear what thread 42474 is doing, but that's part of the same group as the process calling tgkill.

Also, and this is possibly unrelated, one of the calls in the backtrace is:

>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_select_parallel_environment", 31}, {"+0x", 3}, {"939", 3}, {")", 1}, {"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83

Perhaps check your PE configurations to make sure they do not have any strange characters or settings in them? You may also want to consider removing them and reconfiguring them if possible.
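
One rough way to do that check, assuming qconf is available on an admin host, is to loop over the PE list and flag any configuration line containing non-printable characters (only a sketch, not a definitive test):

#!/usr/bin/env python
# Sketch: dump every parallel environment with qconf and flag any
# configuration line that contains non-printable characters.
import string
import subprocess

printable = set(string.printable)
pe_names = subprocess.check_output(["qconf", "-spl"]).decode().split()
for pe in pe_names:
    config = subprocess.check_output(["qconf", "-sp", pe]).decode()
    for line in config.splitlines():
        if any(ch not in printable for ch in line):
            print("%s: suspicious line: %r" % (pe, line))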
>Process 42491 detached
>[pid 42490] +++ killed by SIGABRT +++
>[pid 42489] +++ killed by SIGABRT +++
>[pid 42488] +++ killed by SIGABRT +++
>[pid 42487] +++ killed by SIGABRT +++
>[pid 42486] +++ killed by SIGABRT +++
>[pid 42485] +++ killed by SIGABRT +++
>[pid 42484] +++ killed by SIGABRT +++
>[pid 42478] +++ killed by SIGABRT +++
>[pid 42476] +++ killed by SIGABRT +++
>[pid 42475] +++ killed by SIGABRT +++
>[pid 42477] +++ killed by SIGABRT +++
>+++ killed by SIGABRT +++
>
>
>Regarding GDB - what are the best practices for using it? By default it stops the
>process once attached (we are running sge_qmaster in production) - how can we
>attach to sge_qmaster so it won't stop its normal operation?

Once you connect, you should be able to type "run" to continue running the process.

>Also, what will happen when sge_qmaster fails - AFAIK gdb holds the process
>in "T" state - will fail-over occur or not if gdb is attached to the process
>during its failure?

Failover will (probably) still occur, since the shadowmaster will notice that the heartbeat file has not been updated.

>-----Original Message-----
>From: Jesse Becker [mailto:becke...@mail.nih.gov]
>Sent: Tuesday, June 21, 2016 5:09 PM
>To: William Hay <w....@ucl.ac.uk>
>Cc: Yuri Burmachenko <yur...@mellanox.com>; users@gridengine.org
>Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently
>and fail-over occurs quite often - best practices debugging and resolving the
>issue.
>
>On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
>>On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
>>>
>>> We have noticed that our sge_qmaster process fails inconsistently and
>>> jumps between shadow and master servers.
>>>
>>> Issue occurs every 2-5 days.
>>One possibility that occurs to me is that you might be suffering from
>>a memory leak that causes the oom_killer to target the qmaster.
>
>That will be logged via syslog, and in the kernel dmesg buffer. It should be
>fairly obvious if this is the case.
>
>
>>> We don't understand the root cause and the qmaster messages file does not
>>> indicate any issue.
>>
>>I would suggest increasing the loglevel and also checking to see if
>>there is anything that immediately precedes the failure repeatedly
>>(the qmaster starting up again should be fairly obvious).
>
>
>It's been a while (years), but I've seen cases where bad configuration
>settings will cause the qmaster to segfault. This very annoyingly happened
>during an upgrade, which made for a Bad Day.
>
>The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster
>
>However, in extreme cases, including startup, there may be logs put into /tmp
>(for cases where writing into $SGE_ROOT fails for some reason).
>
>
>
>>> What are the best practices for debugging this issue and resolving the
>>> problem without interrupting normal operation of sge_qmaster?
>>
>>There is also running the qmaster with debugging turned up, but that
>>could easily generate an excessive volume of messages,
>>especially if you don't know what you are looking for.
>
>
>Try running the qmaster under strace or GDB. You'll likely have to either
>modify the init script, or run it by hand.
>
>
>--
>Jesse Becker (Contractor)

--
Jesse Becker (Contractor)
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users