We have about 550 nodes.

Here is what the messages file contains right before the switch-over:

06/14/2016 04:31:43|worker|mtlxsge001|E|ex...@mtlx168.yok.mtl.com reports running job (89975.1/master) in queue "idl...@mtlx168.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:44|worker|mtlxsge001|I|execd on mtlxvm262.yok.mtl.com registered
06/14/2016 04:31:44|worker|mtlxsge001|E|ex...@mtlxvm346.yok.mtl.com reports running job (209460.1/master) in queue "al...@mtlxvm346.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:44|worker|mtlxsge001|E|ex...@mtlx071.yok.mtl.com reports running job (181095.1/master) in queue "al...@mtlx071.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:45|worker|mtlxsge001|E|ex...@mtlxvm062.yok.mtl.com reports running job (170856.1/master) in queue "al...@mtlxvm062.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:45|worker|mtlxsge001|E|ex...@mtlxvm115.yok.mtl.com reports running job (6228587.1/master) in queue "al...@mtlxvm115.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (86091.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1495714.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1495710.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1102562.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1044702.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1041885.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm128.yok.mtl.com reports running job (1041889.1/master) in queue "al...@mtlxvm128.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:46|worker|mtlxsge001|E|ex...@mtlxvm422.yok.mtl.com reports running job (142528.1/master) in queue "al...@mtlxvm422.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:31:47|worker|mtlxsge001|E|ex...@mtlxvm427.yok.mtl.com reports running job (9200278.1/master) in queue "al...@mtlxvm427.yok.mtl.com" that was not supposed to be there - killing
06/14/2016 04:35:39|  main|mtlxsge002|I|using "/local/sge_spool" for execd_spool_dir
06/14/2016 04:35:39|  main|mtlxsge002|I|using "/bin/mail" for mailer

You can see mtlxsge002 taking over from mtlxsge001.

I've reviewed the parallel environments we have; in some cases we use a very
high number for the slots limit, such as 999999 - could this be a problem?

Thanks.

-----Original Message-----
From: Jesse Becker [mailto:becke...@mail.nih.gov] 
Sent: Tuesday, June 21, 2016 6:49 PM
To: Yuri Burmachenko <yur...@mellanox.com>
Cc: Jesse Becker <becke...@mail.nih.gov>; William Hay <w....@ucl.ac.uk>; 
users@gridengine.org
Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently 
and fail-over occurs quite often - best practices debugging and resolving the 
issue.

On Tue, Jun 21, 2016 at 03:25:25PM +0000, Yuri Burmachenko wrote:
>The dmesg is clean and no indication on oom_killer or segfaults.
>The only thing we have in dmesg & /var/log/messages is this line:
>
>possible SYN flooding on port 7444. Sending cookies.
>
>7444 is the sge_qmaster port.

How many exec nodes do you have?  It would be surprising, but if the exec nodes
all hammer the qmaster host at the same time, I can imagine the kernel thinking
it's under attack.
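
If it really is ~550 execds hammering the port at once, the kernel's SYN
backlog may simply be too small for the burst. As a rough illustration (the
/proc/sys paths below are standard Linux, nothing SoGE-specific), something
like this shows whether syncookies and the backlog are still at their defaults:

    #!/usr/bin/env python3
    """Rough sketch: dump kernel settings relevant to 'possible SYN flooding'.

    Assumes a Linux qmaster host; the notes are illustrative, not SoGE advice.
    """

    SETTINGS = {
        # syncookies are sent when the SYN (half-open) backlog overflows
        "/proc/sys/net/ipv4/tcp_syncookies": "1 means cookies are sent on overflow",
        # per-socket SYN backlog size
        "/proc/sys/net/ipv4/tcp_max_syn_backlog": "raise if many execds reconnect at once",
        # upper bound on the accept backlog a process may request with listen()
        "/proc/sys/net/core/somaxconn": "also caps the qmaster's listen() backlog",
    }

    for path, note in SETTINGS.items():
        try:
            with open(path) as fh:
                value = fh.read().strip()
            print(f"{path} = {value}    # {note}")
        except OSError as exc:
            print(f"{path}: unreadable ({exc})")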

>Didn't find any files in /tmp, and $SGE_ROOT/$SGE_CELL/spool/qmaster has the
>following files & folders (looks normal):
>
>admin_hosts  advance_reservations  arseqnum  calendars  centry  ckpt  cqueues
>exec_hosts  heartbeat  hostgroups  job_scripts  jobs  jobseqnum  managers
>messages  messages.1.gz  messages.2.gz  messages_shadowd.mtlxsge001
>messages_shadowd.mtlxsge002  messages_shadowd.mtlxsge01  messages_shadowd.mtlxsge02
>operators  pe  projects  qinstances  qmaster.pid  resource_quotas
>shadowd_mtlxsge001.yok.mtl.com.pid  shadowd_mtlxsge002.yok.mtl.com.pid
>shadowd_mtlxsge01.yok.mtl.com.pid  shadowd_mtlxsge02.yok.mtl.com.pid
>submit_hosts  users  usersets  zombies

The 'messages' file would be where to look; check the entries shortly before a failover event.

>
>Previously we ran strace, and these are the last several lines before it was
>detached due to the process failure:
>[pid 42490] <... futex resumed> )       = 0
>[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103507, {1459997595, 48056000}, ffffffff <unfinished ...>
>[pid 42490] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103508, {1459997595, 48126000}, ffffffff <unfinished ...>
>[pid 42489] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
>[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 327103508, {1459997595, 48056000}, ffffffff <unfinished ...>
>[pid 42487] mkdir("jobs", 0755)         = -1 EEXIST (File exists)
>[pid 42487] mkdir("jobs/00", 0755)      = -1 EEXIST (File exists)
>[pid 42487] mkdir("jobs/00/0133", 0755) = -1 EEXIST (File exists)
>[pid 42487] open("jobs/00/0133/.1144", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
>[pid 42491] write(2, "======= Backtrace: =========\n", 29) = 29
>[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3354076166", 10}, {"]\n", 2}], 4) = 31
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_free", 8}, {"+0x", 3}, {"16", 2}, {")", 1}, {"[0x", 3}, {"5e3e76", 6}, {"]\n", 2}], 9) = 59
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"lSetString", 10}, {"+0x", 3}, {"d8", 2}, {")", 1}, {"[0x", 3}, {"590238", 6}, {"]\n", 2}], 9) = 61
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, {"4cbfbf", 6}, {"]\n", 2}], 4) = 44
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, {"4cd565", 6}, {"]\n", 2}], 4) = 44
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_select_parallel_environment", 31}, {"+0x", 3}, {"939", 3}, {")", 1}, {"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"scheduler_method", 16}, {"+0x", 3}, {"3150", 4}, {")", 1}, {"[0x", 3}, {"43f470", 6}, {"]\n", 2}], 9) = 69
>[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_scheduler_main", 18}, {"+0x", 3}, {"f7c", 3}, {")", 1}, {"[0x", 3}, {"4342fc", 6}, {"]\n", 2}], 9) = 70
>[pid 42491] writev(2, [{"/lib64/libpthread.so.0", 22}, {"[0x", 3}, {"33548079d1", 10}, {"]\n", 2}], 4) = 37
>[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"clone", 5}, {"+0x", 3}, {"6d", 2}, {")", 1}, {"[0x", 3}, {"33540e8b6d", 10}, {"]\n", 2}], 9) = 43
>[pid 42491] write(2, "======= Memory map: ========\n", 29) = 29
>[pid 42491] open("/proc/self/maps", O_RDONLY) = 385
>[pid 42491] read(385, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
>[pid 42491] write(2, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
>[pid 42491] read(385, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
>[pid 42491] write(2, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
>[pid 42491] read(385, "17000 08:03 16646600            "..., 1024) = 1024
>[pid 42491] write(2, "17000 08:03 16646600            "..., 1024) = 1024
>[pid 42491] read(385, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
>[pid 42491] write(2, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
>[pid 42491] read(385, "   /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
>[pid 42491] write(2, "   /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
>[pid 42491] read(385, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
>[pid 42491] write(2, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
>[pid 42491] read(385, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
>[pid 42491] write(2, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
>[pid 42491] read(385, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
>[pid 42491] write(2, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
>[pid 42491] read(385, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
>[pid 42491] write(2, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
>[pid 42491] read(385, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
>[pid 42491] write(2, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
>[pid 42491] read(385, "6646172                   /lib64"..., 1024) = 1024
>[pid 42491] write(2, "6646172                   /lib64"..., 1024) = 1024
>[pid 42491] read(385, "faef024e000 ---p 00138000 08:03 "..., 1024) = 889
>[pid 42491] write(2, "faef024e000 ---p 00138000 08:03 "..., 889) = 889
>[pid 42491] read(385, "", 1024)         = 0
>[pid 42491] close(385)                  = 0
>[pid 42491] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
>[pid 42491] tgkill(42474, 42491, SIGABRT) = 0
>[pid 42491] --- SIGABRT (Aborted) @ 0 (0) ---


Now, this is kind of interesting.  It looks like the qmaster is killing part of 
itself:

       tgkill()  sends  the  signal  sig to the thread with the
       thread ID tid in the thread group tgid.  (By contrast, kill(2)
       can only be used to send a signal to a process (i.e., thread
       group) as a whole, and the signal will be delivered to an
       arbitrary thread within that process.)

Here the first argument (42474) is the thread group ID, i.e. the sge_qmaster 
process itself, and the target tid (42491) is the very thread making the call, 
so one qmaster thread is deliberately aborting itself.  That matches glibc's 
abort() path: the "======= Backtrace: =========" and "======= Memory map: 
========" output above is what glibc prints when malloc detects heap 
corruption, and the sge_free/lSetString frames point in the same direction.
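
For what it's worth, while the process is still up you can see which TIDs
belong to that thread group, and what they call themselves, by walking /proc.
A small sketch (the default PID below just re-uses 42474 from the strace as an
example; per-thread names only show up if the program actually sets them):

    #!/usr/bin/env python3
    """List the threads (TIDs) of a process and their names via /proc.

    Illustrative only: 42474 stands in for the sge_qmaster PID from the strace
    above; thread names come from /proc/<pid>/task/<tid>/comm if set.
    """
    import os
    import sys

    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 42474  # example PID

    task_dir = f"/proc/{pid}/task"
    try:
        tids = sorted(int(t) for t in os.listdir(task_dir))
    except OSError as exc:
        sys.exit(f"cannot read {task_dir}: {exc}")

    for tid in tids:
        try:
            with open(f"{task_dir}/{tid}/comm") as fh:
                name = fh.read().strip()
        except OSError:
            name = "?"
        print(f"tgid={pid} tid={tid} name={name}")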

Also, and this is possibly unrelated, one of the calls in the backtrace
is:

  >[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 
1}, {"sge_select_parallel_environment", 31}, {"+0x", 3}, {"939", 3}, {")", 1}, 
{"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83

Perhaps check your PE configurations to make sure they do not have any strange 
characters or settings in them?  You may also want to consider removing them 
and reconfiguring them if possible.
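
A quick way to eyeball all of them at once is to dump each PE with qconf and
flag anything odd.  A rough sketch (it assumes qconf is on PATH on an admin
host; the slot threshold is an arbitrary number aimed at values like 999999):

    #!/usr/bin/env python3
    """Dump every parallel environment with qconf and flag oddities.

    Sketch only: assumes 'qconf' is on PATH; the threshold is arbitrary.
    """
    import subprocess

    SLOT_WARN = 100_000  # arbitrary cut-off for "suspiciously large"

    pe_names = subprocess.run(
        ["qconf", "-spl"], capture_output=True, text=True, check=True
    ).stdout.split()

    for pe in pe_names:
        conf = subprocess.run(
            ["qconf", "-sp", pe], capture_output=True, text=True, check=True
        ).stdout
        # Flag non-printable or non-ASCII characters that may have crept in.
        if any(ord(c) > 126 or (ord(c) < 32 and c not in "\n\t") for c in conf):
            print(f"{pe}: contains non-printable or non-ASCII characters")
        for line in conf.splitlines():
            if line.startswith("slots"):
                try:
                    slots = int(line.split()[1])
                except (IndexError, ValueError):
                    print(f"{pe}: unparsable slots line: {line!r}")
                    continue
                if slots >= SLOT_WARN:
                    print(f"{pe}: very large slots value ({slots})")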


>Process 42491 detached
>[pid 42490] +++ killed by SIGABRT +++
>[pid 42489] +++ killed by SIGABRT +++
>[pid 42488] +++ killed by SIGABRT +++
>[pid 42487] +++ killed by SIGABRT +++
>[pid 42486] +++ killed by SIGABRT +++
>[pid 42485] +++ killed by SIGABRT +++
>[pid 42484] +++ killed by SIGABRT +++
>[pid 42478] +++ killed by SIGABRT +++
>[pid 42476] +++ killed by SIGABRT +++
>[pid 42475] +++ killed by SIGABRT +++
>[pid 42477] +++ killed by SIGABRT +++
>+++ killed by SIGABRT +++
>
>
>Regarding GDB - what are the best practices for using it? By default it stops the
>process once attached (we are running sge_qmaster in production) - how can we
>attach to sge_qmaster so it won't stop its normal operation?

Once you attach, you should be able to type "continue" (rather than "run", 
which would restart the program) to let the process keep running.
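
If even a brief stop is a concern, another option is to run gdb in batch mode
so it attaches, dumps all thread backtraces, and detaches right away.  A sketch
(the spool path used as a fallback for the PID is an assumption for your cell):

    #!/usr/bin/env python3
    """Snapshot all sge_qmaster thread backtraces with gdb in batch mode.

    Sketch only: gdb attaches, runs 'thread apply all bt', and detaches, so
    the qmaster is paused only for the duration of the dump.  The default
    qmaster.pid location below is an assumption; adjust for your SGE_CELL.
    """
    import pathlib
    import subprocess
    import sys

    if len(sys.argv) > 1:
        pid = sys.argv[1]
    else:
        # Assumed spool location; the qmaster writes its PID into qmaster.pid.
        pid = pathlib.Path(
            "/opt/sge/default/spool/qmaster/qmaster.pid"
        ).read_text().strip()

    result = subprocess.run(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"],
        capture_output=True, text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        sys.stderr.write(result.stderr)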

>Also, what will happen when sge_qmaster fails - AFAIK gdb holds the process
>in "T" state - will fail-over occur or not if gdb is attached to the
>process at the time of the failure?

Failover will (probably) still occur, since the shadowmaster will notice that 
the heartbeat file has not been updated.  
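
That heartbeat is a counter in the "heartbeat" file under the qmaster spool
directory, which the qmaster bumps at a fixed interval.  If you want to
sanity-check the failover mechanics yourself, you can watch it advance; a
sketch (the spool path and the 60-second window are assumptions to adjust):

    #!/usr/bin/env python3
    """Watch the qmaster heartbeat file advance (what sge_shadowd relies on).

    Sketch only: the spool path and observation window are assumptions; the
    qmaster normally bumps the counter in 'heartbeat' periodically.
    """
    import pathlib
    import time

    HEARTBEAT = pathlib.Path("/opt/sge/default/spool/qmaster/heartbeat")  # adjust

    before = HEARTBEAT.read_text().strip()
    time.sleep(60)  # wait longer than the qmaster's heartbeat interval
    after = HEARTBEAT.read_text().strip()

    if before == after:
        print(f"heartbeat stuck at {before} - shadowd would treat the qmaster as dead")
    else:
        print(f"heartbeat advanced {before} -> {after} - qmaster looks alive")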



>-----Original Message-----
>From: Jesse Becker [mailto:becke...@mail.nih.gov]
>Sent: Tuesday, June 21, 2016 5:09 PM
>To: William Hay <w....@ucl.ac.uk>
>Cc: Yuri Burmachenko <yur...@mellanox.com>; users@gridengine.org
>Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently 
>and fail-over occurs quite often - best practices debugging and resolving the 
>issue.
>
>On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
>>On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
>>>
>>>    We have noticed that our sge_qmaster process fails inconsistently and
>>>    jumps between shadow and master servers.
>>>
>>>    Issue occurs every 2-5 days.
>>One possibility that occurs to me is that you might be suffering from 
>>a memory leak that causes the oom_killer to target the qmaster.
>
>That will be logged via syslog, and in the kernel dmesg buffer.  It should be
>fairly obvious if this is the case.
>
>
>>>    We don't understand the root cause and the qmaster messages file does not
>>>    indicate any issue.
>>
>>I would suggest increasing the loglevel and also checking to see if 
>>there is anything that immediately precedes the failure repeatedly 
>>(the qmaster starting up again should be fairly obvious).
>
>
>It's been a while (years), but I've seen cases where bad configuration 
>settings will cause the qmaster to segfault.  This very annoyingly happened 
>during an upgrade, which made for a Bad Day.
>
>The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster
>
>However, in extreme cases, including startup, there may be logs put into /tmp
>(for cases where writing into $SGE_ROOT fails for some reason).
>
>
>
>>>    What are the best practices debugging this issue and resolving the 
>>> problem
>>>    without interrupting normal operation of sge_qmaster?
>>
>>There is also running the qmaster with debugging turned up, but that
>>could easily generate an excessive volume of messages,
>>especially if you don't know what you are looking for.
>
>
>Try running the qmaster under strace or GDB.  You'll likely have to either 
>modify the init script, or run it by hand.
>
>
>--
>Jesse Becker (Contractor)

--
Jesse Becker (Contractor)

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
