Re: [ceph-users] Whole cluster flapping

Will Marley Wed, 08 Aug 2018 07:14:08 -0700

Hi again Frederic,

It may be worth looking at a recovery sleep.
osd recovery sleep
  Description: Time in seconds to sleep before next recovery or backfill op.
               Increasing this value will slow down recovery operation while
               client operations will be less impacted.
  Type:        Float
  Default:     0

osd recovery sleep hdd
  Description: Time in seconds to sleep before next recovery or backfill op
               for HDDs.
  Type:        Float
  Default:     0.1

osd recovery sleep ssd
  Description: Time in seconds to sleep before next recovery or backfill op
               for SSDs.
  Type:        Float
  Default:     0

osd recovery sleep hybrid
  Description: Time in seconds to sleep before next recovery or backfill op
               when osd data is on HDD and osd journal is on SSD.
  Type:        Float
  Default:     0.025


(Pulled from 
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/)

When we faced similar issues, running the command
ceph tell osd.* injectargs '--osd-recovery-sleep 2'
allowed the OSDs to respond to heartbeats while taking a break between
recovery operations. I’d suggest tweaking the sleep time to find a sweet spot.
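
A minimal sketch of how that tweaking could be stepped through. It only prints the commands so each one can be reviewed before it is run; the helper name recovery_sleep_cmd and the candidate values 2, 1, 0.5 and 0.1 are illustrative assumptions, not recommendations.

```shell
#!/bin/sh
# Print (do not run) the injectargs command for a given recovery sleep value.
# recovery_sleep_cmd is a hypothetical helper name used for this sketch.
recovery_sleep_cmd() {
    printf "ceph tell 'osd.*' injectargs '--osd-recovery-sleep %s'\n" "$1"
}

# Candidate values are placeholders: start high, then lower the sleep
# step by step while watching whether OSDs still get marked down.
for sleep_seconds in 2 1 0.5 0.1; do
    recovery_sleep_cmd "$sleep_seconds"
done
```

Values set with injectargs are not kept across an OSD restart. To confirm what a running OSD is using, `ceph daemon osd.0 config get osd_recovery_sleep` can be run on that OSD's host (osd.0 is a placeholder).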

This may be worth a try, so let us know how you get on.

Regards,
Will

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de 
Souza Lima
Sent: 08 August 2018 15:06
To: frederic.c...@sib.fr
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

So your OSDs are really too busy to respond to heartbeats.
You'll be facing this for some time, until the cluster load gets lower.

I would run `ceph osd set nodeep-scrub` until the heavy disk IO stops.
Maybe you can schedule deep scrubbing to be enabled during the night and
disabled in the morning.
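
One way that schedule could be sketched, assuming a night window of 22:00 to 06:00 (the window and the function name deep_scrub_cmd_for_hour are assumptions made for illustration). The function only prints the matching flag command for a given hour.

```shell
#!/bin/sh
# Print the flag command that matches a given hour (0-23).
# Assumed window: deep scrubs are allowed from 22:00 up to, but not
# including, 06:00, and blocked for the rest of the day.
deep_scrub_cmd_for_hour() {
    hour=$1
    if [ "$hour" -ge 22 ] || [ "$hour" -lt 6 ]; then
        echo "ceph osd unset nodeep-scrub"   # night: allow deep scrubs
    else
        echo "ceph osd set nodeep-scrub"     # day: no new deep scrubs
    fi
}
```

In practice the same effect can come from two cron entries, one running `ceph osd unset nodeep-scrub` at 22:00 and one running `ceph osd set nodeep-scrub` at 06:00. Ceph also has `osd scrub begin hour` and `osd scrub end hour` options that restrict scheduled scrubs to a time window, which may be simpler than toggling the flag.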

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
Thanks for the command line. I did take a look at it, but I don’t really know
what to search for, my bad.
All this flapping is due to deep-scrub: when it starts on an OSD, things start
to go bad.

I marked out the OSDs that were flapping the most (one at a time, waiting for
rebalancing in between), and it looks better, even though some OSDs keep going
down/up with the same message in the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had timed out after 90

(I raised the timeout to 90 s instead of 15 s.)

Regards,



From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de
Souza Lima
Sent: 07 August 2018 16:28
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

Oops, my bad, you're right.

I don't know how much you can see, but maybe you can dig around the performance
counters to see what's happening on those OSDs. Try these:

~# ceph daemonperf osd.XX
~# ceph daemon osd.XX perf dump

Replace XX with your OSD numbers.

Regards,

Webert Lima


On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
The pool is already deleted and no longer present in the stats.

Regards,

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de
Souza Lima
Sent: 07 August 2018 15:08
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

Frédéric,

Check whether the number of objects in the pool is decreasing with
`ceph df [detail]`.

Regards,

Webert Lima


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
It’s been over a week now and the whole cluster keeps flapping; it is never the
same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted
has been gone for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de
Souza Lima
Sent: 31 July 2018 16:25
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of IO operations on the disks, and
the OSD processes might be too busy to respond to heartbeats, so the mons mark
them as down due to no response.
Also check the OSD logs to see whether they are actually crashing and
restarting, and check disk IO usage (e.g. with iostat).
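
A small sketch of the log check, assuming the osd_op_tp timeout message format quoted further down in this thread. The function name count_op_timeouts is an assumption; it reads an OSD log on stdin and prints how many of those timeout lines it contains.

```shell
#!/bin/sh
# Count osd_op_tp heartbeat timeout lines in an OSD log read from stdin.
# The pattern is taken from the log excerpt quoted in this thread.
count_op_timeouts() {
    grep -c "heartbeat_map is_healthy 'OSD::osd_op_tp thread .*' had timed out"
}
```

For example, `count_op_timeouts < /var/log/ceph/ceph-osd.18.log` (the path assumes the usual default log location, and osd.18 is just one of the OSDs from the thread). For the disk side, `iostat -x 5` from the sysstat package prints extended per-device statistics every 5 seconds.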

Regards,

Webert Lima


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each with 12 OSDs (1 HDD -> 1 OSD), and we
have SSDs for the journals.

After I deleted the large pool, the cluster started flapping on all OSDs.
OSDs are marked down and then marked up, as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=xxxx,host=xxxx) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 
5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 
slow requests are blocked > 32 sec (REQUEST_SLOW)
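
To see which OSDs flap most often, an excerpt like the one above can be summarised per OSD. This is a rough sketch that assumes each event sits on a single line in the format shown (timestamp, monitor name, level, then the message); the function name count_osd_flaps is an assumption.

```shell
#!/bin/sh
# Summarise "boot" and "failed" events per OSD from cluster log lines on stdin.
# Expected fields: $1 date, $2 time, $3 monitor, $4 level, $5 osd.N, ...
count_osd_flaps() {
    awk '
        $4 == "[INF]" && $5 ~ /^osd\.[0-9]+$/ {
            if ($NF == "boot")       { boots[$5]++; seen[$5] = 1 }
            else if ($6 == "failed") { fails[$5]++; seen[$5] = 1 }
        }
        END {
            # One line per OSD that booted or failed at least once.
            for (o in seen)
                printf "%s boot=%d failed=%d\n", o, boots[o], fails[o]
        }
    ' | sort
}
```

Fed with saved cluster log output, it prints one line per OSD with its boot and failure counts, which makes it easier to check whether the flapping really does move around between OSDs.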

On the OSDs that failed, the logs are full of this kind of message:
2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

At first it looked like a network issue, but we haven’t changed a thing on the
network and this cluster has been okay for months.

I can’t figure out what is happening at the moment; any help would be greatly
appreciated!

Regards,
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
