Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Dan van der Ster
We upgraded to 14.2.4 back in October and this week to v14.2.6.
But I don't think the cluster had a network outage until yesterday, so I
wouldn't have thought this is a .6 regression.

If it happens again I'll look for the "waiting for map" message.
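
Something along these lines should catch it next time (osd.43 and the default
log path are just examples):

  ceph tell osd.43 injectargs '--debug_osd 10 --debug_ms 1'
  grep -E 'waiting for map|wrongly marked' /var/log/ceph/ceph-osd.43.log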

-- dan


On Thu, Jan 16, 2020 at 12:08 PM Nick Fisk  wrote:

> On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster <d...@vanderster.com> wrote:
>
> > Hi Nick,
> >
> > We saw the exact same problem yesterday after a network outage -- a few of
> > our down OSDs were stuck down until we restarted their processes.
> >
> > -- Dan
> >
> >
> > On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk  wrote:
> >
> > > Hi All,
> > >
> > > Running 14.2.5, currently experiencing some network blips isolated to a
> > > single rack, which is under investigation. However, it appears that
> > > following a network blip, random OSDs in unaffected racks sometimes do not
> > > recover from the incident and are left running in a zombie state. The OSDs
> > > appear to be running from a process perspective, but the cluster thinks
> > > they are down, and they will not rejoin the cluster until the OSD process
> > > is restarted, which incidentally takes a lot longer than usual (the
> > > systemctl command takes a couple of minutes to complete).
> > >
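
Next time one is stuck like this, a quick way to confirm the mismatch between
the process view and the cluster view might be something like the following
(osd.43 is just an example id):

  ceph osd tree down            # what the cluster currently thinks is down
  systemctl status ceph-osd@43  # is the process actually still running?
  ceph daemon osd.43 status     # state the daemon reports over its admin socket
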
> > > If the OSD is left in this state, CPU and memory usage of the process
> > > appears to climb, but it never rejoins, at least not in the several hours
> > > I have left them. I'm not exactly sure what the OSD is trying to do during
> > > this period. There's nothing in the logs during this hung state to indicate
> > > that anything is happening, but I will try to inject more verbose logging
> > > next time it occurs.
> > >
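
For the verbose logging, the admin socket is probably the safer route while the
daemon is in this state, since it is local to the host and does not depend on
the daemon's connection to the cluster (again, osd.43 is just an example id,
and debug 20 is very chatty):

  ceph daemon osd.43 config set debug_osd 20
  ceph daemon osd.43 config set debug_ms 1
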
> > > Not sure if anybody has come across this before or has any ideas? In the
> > > past, as long as OSDs have been running they have always rejoined after
> > > any network issues.
> > >
> > > Nick
> > >
> > > Sample from OSD and cluster logs below. The blip happened at 12:06; I
> > > restarted the OSD at 12:26.
> > >
> > > OSD logs from the OSD that hung (note: this OSD was not directly affected
> > > by the network outage):
> > > 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> > > 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> > > 2020-01-15 12:06:34.034 7f419480a700  1 osd.43 2342992 start_waiting_for_healthy
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or
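
For reference, the ~20 second window in the heartbeat lines above ("first ping
sent ... 12:06:11" vs "oldest deadline ... 12:06:31") matches the default
osd_heartbeat_grace of 20 seconds; the value in effect on a given OSD can be
checked with something like:

  ceph daemon osd.43 config get osd_heartbeat_grace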
