Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Nick Fisk
On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster  wrote:
 
> Hi Nick,
> 
> We saw the exact same problem yesterday after a network outage -- a few of
> our down OSDs were stuck down until we restarted their processes.
> 
> -- Dan
> 
> 
> On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk  wrote:
> 
> > Hi All,
> >
> > Running 14.2.5, currently experiencing some network blips isolated to a
> > single rack which is under investigation. However, it appears that following a
> > network blip, random OSDs in unaffected racks are sometimes not recovering
> > from the incident and are left running in a zombie state. The OSDs
> > appear to be running from a process perspective, but the cluster thinks
> > they are down and they will not rejoin the cluster until the OSD process is
> > restarted, which incidentally takes a lot longer than usual (the systemctl
> > command takes a couple of minutes to complete).
> >
> > If the OSD is left in this state, CPU and memory usage of the process
> > appears to climb, but it never rejoins, at least for the several hours that I
> > have left them. I'm not exactly sure what the OSD is trying to do during this
> > period. There's nothing in the logs during this hung state to indicate that
> > anything is happening, but I will try to inject more verbose logging the next
> > time it occurs.
> >
> > Not sure if anybody has come across this before or has any ideas? In the past,
> > as long as OSDs have been running they have always rejoined following any
> > network issues.
> >
> > Nick
> >
> > Sample from OSD and cluster logs below. Blip happened at 12:06, I
> > restarted OSD at 12:26
> >
> > OSD logs from the OSD that hung (note that this OSD was not directly affected
> > by the network outage):
> > 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> > 2020-01-15 12:06:34.034 7f419480a700  1 osd.43 2342992 start_waiting_for_healthy
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216

Re: [ceph-users] OSD's hang after network blip

2020-01-15 Thread Nick Fisk
On Wednesday, January 15, 2020 14:37 GMT, "Nick Fisk"  wrote: 
 
> Hi All,
> 
> Running 14.2.5, currently experiencing some network blips isolated to a
> single rack which is under investigation. However, it appears that following a
> network blip, random OSDs in unaffected racks are sometimes not recovering
> from the incident and are left running in a zombie state. The OSDs
> appear to be running from a process perspective, but the cluster thinks they
> are down and they will not rejoin the cluster until the OSD process is restarted,
> which incidentally takes a lot longer than usual (the systemctl command takes a
> couple of minutes to complete).
> 
> If the OSD is left in this state, CPU and memory usage of the process appears
> to climb, but it never rejoins, at least for the several hours that I have left
> them. I'm not exactly sure what the OSD is trying to do during this period.
> There's nothing in the logs during this hung state to indicate that anything
> is happening, but I will try to inject more verbose logging the next time it
> occurs.
> 
> Not sure if anybody has come across this before or has any ideas? In the past,
> as long as OSDs have been running they have always rejoined following any
> network issues.
> 
> Nick
> 
> Sample from OSD and cluster logs below. Blip happened at 12:06, I restarted 
> OSD at 12:26
> 
> OSD logs from the OSD that hung (note that this OSD was not directly affected by
> the network outage):
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:1


 
It's just happened again and I managed to pull this out of debug_osd 20:

2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.182 v2:[2a03:25e0:253:5::76]:6839/8394683 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.184 v2:[2a03:25e0:253:5::76]:6814/7394522 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.190 v2:[2a03:25e0:253:5::76]:6860/5986687 says i am down in 2343138
2020-01-15 16:29:01.668 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.19 v2:[2a03:25e0:253:5::12]:6815/5153900 says i am down in 2343138
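For reference, the extra verbosity was injected at runtime rather than by restarting the daemon, with something along these lines (exact invocation from memory, so treat it as approximate):

# on the OSD host, via the admin socket
sudo ceph daemon osd.87 config set debug_osd 20/20
# or remotely from an admin node
ceph tell osd.87 injectargs '--debug_osd 20/20'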

And this from the daemon status output:
sudo ceph daemon osd.87 status
{
    "cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
    "osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
    "whoami": 87,
    "state": "waiting_for_healthy",
    "oldest_map": 2342407,
    "newest_map": 2343121,
    "num_pgs": 218
}

So the OSD doesn't seem to be getting the latest map from the mons. Map 2343138
obviously has osd.87 marked down, hence the error messages from the osd_pings,
but I'm guessing the latest map the OSD has (2343121) still has it marked up, so it
never tries to "re-connect"?
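
One way to sanity-check that would be to compare how osd.87 is recorded in those two epochs, assuming the mons haven't trimmed them yet (epoch numbers taken from the output above):

ceph osd dump | grep '^epoch'              # current epoch according to the mons
ceph osd dump 2343121 | grep '^osd.87 '    # the map the stuck daemon is sitting on
ceph osd dump 2343138 | grep '^osd.87 '    # the newer map its peers are reporting

If 2343138 shows it down while 2343121 still shows it up, that would fit the theory.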

Seems similar to this post from a few years back, which didn't seem to end with
any form of resolution:
https://www.spinics.net/lists/ceph-devel/msg31788.html

Also found this PR for Nautilus which suggests it might be a fix for the
issue, but it should already be part of the release I'm running:
https://github.com/ceph/ceph/pull/23958
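
For what it's worth, whether the daemons really are running 14.2.5 can be double-checked with something like:

ceph versions                 # per-daemon-type version summary across the cluster
ceph tell osd.87 version      # version reported by the individual OSD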

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com