[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Thanks everyone for your feedback. I created a ticket and added most of our internal post-mortem research: https://tracker.ceph.com/issues/63636 Cheers, Denis > On 24 Nov 2023, at 09:01, Denis Krienbühl wrote: > > Hi > > We’ve recently had a serious outage at work, after a host had a

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Hi Frank. > On 24 Nov 2023, at 14:27, Frank Schilder wrote: > > I have to ask a clarifying question. If I understand the intent of > osd_fast_fail_on_connection_refused correctly, an OSD that receives a > connection_refused should get marked down fast to avoid unnecessarily long > wait

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
Hi Denis. > The mon then propagates that failure, without taking any other reports into > consideration: Exactly. I cannot imagine that this change of behavior is intended. The c

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
> On 24 Nov 2023, at 11:49, Burkhard Linke wrote: > > This should not be the case in the reported situation unless setting > osd_fast_fail_on_connection_refused<

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
> On 24 Nov 2023, at 11:49, Burkhard Linke wrote: > > This should not be the case in the reported situation unless setting > osd_fast_fail_on_connection_refused=true > changes this

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Burkhard Linke
Hi, I think this is why the mon-osd interaction requires a certain number of osd to report another osd as down/unavailable: https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#osds-report-down-osds The default value for mon_osd_reporter_subtree_level is host, and the
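
The reporter rule discussed in this thread can be pictured with a small model. This is only a sketch under stated assumptions, not Ceph's implementation: `FailureReport`, `should_mark_down`, and the `immediate` field are hypothetical names. The threshold of 2 follows the documented default of mon_osd_min_down_reporters, and reporters are grouped by host, matching the default mon_osd_reporter_subtree_level mentioned above. Heartbeat grace timing is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical model of one OSD reporting a peer as failed."""
    reporter_osd: int
    reporter_host: str       # subtree of the reporter at the "host" level
    immediate: bool = False  # assumption: models a fast-fail (connection refused) report


def should_mark_down(reports, min_down_reporters=2):
    """Simplified decision: should the monitor mark the target OSD down?

    Assumption: an immediate report bypasses the reporter count, which is
    the behavior the thread describes. Otherwise, reports must come from
    at least `min_down_reporters` distinct hosts.
    """
    if any(r.immediate for r in reports):
        return True
    distinct_hosts = {r.reporter_host for r in reports}
    return len(distinct_hosts) >= min_down_reporters


# Two reporters on the same host count as a single subtree.
same_host = [FailureReport(1, "host-a"), FailureReport(2, "host-a")]
# Reporters on two different hosts satisfy the default threshold.
two_hosts = [FailureReport(1, "host-a"), FailureReport(7, "host-b")]
# A single immediate report is enough in this model.
fast_fail = [FailureReport(1, "host-a", immediate=True)]
```

In this model a single misbehaving host cannot get a peer marked down through ordinary reports, but one immediate report can, which is the asymmetry the thread is concerned with.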

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Janne Johansson
On Fri, 24 Nov 2023 at 10:25, Frank Schilder wrote: > > Hi Denis, > > I would agree with you that a single misconfigured host should not take out > healthy hosts under any circumstances. I'm not sure if your incident is > actually covered by the devs' comments; it is quite possible that you

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Thanks Frank. I see it the same way. I’ll be sure to create a ticket with all the details and steps to reproduce the issue. Denis > On 24 Nov 2023, at 10:24, Frank Schilder wrote: > > Hi Denis, > > I would agree with you that a single misconfigured host should not take out > healthy hosts

[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
Hi Denis, I would agree with you that a single misconfigured host should not take out healthy hosts under any circumstances. I'm not sure if your incident is actually covered by the devs' comments; it is quite possible that you observed an unintended side effect that is a bug in handling the