Re: [ceph-users] OSD's hang after network blip
We upgraded to 14.2.4 back in October and this week to v14.2.6. But I don't
think the cluster had a network outage until yesterday, so I wouldn't have
thought this is a .6 regression. If it happens again I'll look for the
waiting for map message.

-- dan

On Thu, Jan 16, 2020 at 12:08 PM Nick Fisk wrote:
> On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster <d...@vanderster.com> wrote:
>
> > Hi Nick,
> >
> > We saw the exact same problem yesterday after a network outage -- a few of
> > our down OSDs were stuck down until we restarted their processes.
> >
> > -- Dan
> >
> > On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk wrote:
> > > Hi All,
> > >
> > > Running 14.2.5, currently experiencing some network blips isolated to a
> > > single rack which is under investigation. However, it appears that
> > > following a network blip, random OSD's in unaffected racks are sometimes
> > > not recovering from the incident and are left running in a zombie state.
> > > The OSD's appear to be running from a process perspective, but the
> > > cluster thinks they are down and they will not rejoin the cluster until
> > > the OSD process is restarted, which incidentally takes a lot longer than
> > > usual (the systemctl command takes a couple of minutes to complete).
> > >
> > > If the OSD is left in this state, CPU and memory usage of the process
> > > appears to climb, but it never rejoins, at least for the several hours
> > > that I have left them. Not exactly sure what the OSD is trying to do
> > > during this period. There's nothing in the logs during this hung state to
> > > indicate that anything is happening, but I will try and inject more
> > > verbose logging next time it occurs.
> > >
> > > Not sure if anybody has come across this before or has any ideas? In the
> > > past, as long as OSD's have been running they have always re-joined
> > > following any network issues.
> > >
> > > Nick
> > >
> > > Sample from OSD and cluster logs below. Blip happened at 12:06, I
> > > restarted the OSD at 12:26.
> > >
> > > OSD logs from the OSD that hung (note this OSD was not directly affected
> > > by the network outage):
> > > 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> > > 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> > > 2020-01-15 12:06:34.034 7f419480a700 1 osd.43 2342992 start_waiting_for_healthy
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or
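For reference, a minimal sketch of the two checks mentioned above -- grepping the OSD log for the "waiting for map" message and injecting more verbose logging into a still-running daemon. It assumes the default log path and uses osd.43 from the log sample; substitute the affected OSD id:

    # check whether the hung OSD is stuck waiting for a newer osdmap
    grep -i 'waiting for map' /var/log/ceph/ceph-osd.43.log

    # raise debug levels on the running daemon to capture more detail next time
    ceph tell osd.43 injectargs '--debug_osd 20 --debug_ms 1'

    # or the same via the admin socket, run on the host where osd.43 lives
    ceph daemon osd.43 config set debug_osd 20
    ceph daemon osd.43 config set debug_ms 1

Note that ceph tell goes through the cluster, so while the daemon is marked down the admin-socket form may be the one that actually reaches it.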
Re: [ceph-users] OSD's hang after network blip
Hi Nick,

We saw the exact same problem yesterday after a network outage -- a few of
our down OSDs were stuck down until we restarted their processes.

-- Dan

On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk wrote:
> Hi All,
>
> Running 14.2.5, currently experiencing some network blips isolated to a
> single rack which is under investigation. However, it appears that
> following a network blip, random OSD's in unaffected racks are sometimes
> not recovering from the incident and are left running in a zombie state.
> The OSD's appear to be running from a process perspective, but the
> cluster thinks they are down and they will not rejoin the cluster until
> the OSD process is restarted, which incidentally takes a lot longer than
> usual (the systemctl command takes a couple of minutes to complete).
>
> If the OSD is left in this state, CPU and memory usage of the process
> appears to climb, but it never rejoins, at least for the several hours
> that I have left them. Not exactly sure what the OSD is trying to do
> during this period. There's nothing in the logs during this hung state to
> indicate that anything is happening, but I will try and inject more
> verbose logging next time it occurs.
>
> Not sure if anybody has come across this before or has any ideas? In the
> past, as long as OSD's have been running they have always re-joined
> following any network issues.
>
> Nick
>
> Sample from OSD and cluster logs below. Blip happened at 12:06, I
> restarted the OSD at 12:26.
>
> OSD logs from the OSD that hung (note this OSD was not directly affected
> by the network outage):
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> 2020-01-15 12:06:34.034 7f419480a700 1 osd.43 2342992 start_waiting_for_healthy
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline
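For completeness, a minimal sketch of the recovery step described in the thread -- confirming the daemon is still running but marked down, then restarting it -- assuming a systemd-managed deployment and osd.43 from the logs above:

    # cluster view: which OSDs are currently marked down
    ceph osd tree down

    # process view: the daemon is still active despite being marked down
    systemctl status ceph-osd@43

    # restart the hung daemon (in this thread the restart itself took a couple of minutes)
    systemctl restart ceph-osd@43

    # watch the cluster log for the OSD to boot and be marked up again
    ceph -w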