Re: [ceph-users] ceph (jewel) unable to recover after node failure

2020-01-10 Thread Eugen Block

Hi,


A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.


If all OSDs come back up and stay stable, the recovery should eventually finish.


B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?


Yes, this is a reasonable assumption. Just a few weeks ago we saw this
in a customer cluster with EC pools. The OSDs were fully saturated, so
heartbeats to their peers failed; the OSDs were reported down, came back
up, and so on (flapping OSDs). At first the MON notices that the OSD
processes are still up although the peers report them as down, but
after 5 of these "down" reports by peers (config option
osd_max_markdown_count) within 10 minutes (config option
osd_max_markdown_period) the OSD is marked as out, which causes more
rebalancing and in turn a higher load.
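
The "5 reports within 10 minutes" rule can be pictured as a sliding-window counter. The following is only a toy sketch of that idea, not Ceph's implementation: the class and method names are made up for illustration, the defaults (5 and 600 seconds) are the figures quoted above, and whether the real check triggers on the fifth report or only after it is an assumption.

```python
from collections import deque


class MarkdownWindow:
    """Toy sliding-window counter; an illustration, not Ceph's implementation."""

    def __init__(self, max_count=5, period=600.0):
        # 5 reports within 10 minutes, as quoted above.
        self.max_count = max_count
        self.period = period
        self._times = deque()

    def record(self, now):
        """Record one markdown at time `now` (in seconds).

        Returns True once `max_count` markdowns fall inside the window
        (assumption: the limit is reached on the fifth report).
        """
        self._times.append(now)
        # Forget markdowns that are older than the window.
        while self._times[0] <= now - self.period:
            self._times.popleft()
        return len(self._times) >= self.max_count


window = MarkdownWindow()
# A flapping OSD that is marked down every 100 seconds.
print([window.record(t) for t in (0, 100, 200, 300, 400)])
# → [False, False, False, False, True]
```

An OSD that is marked down only every 200 seconds never has more than three markdowns inside the window, so the same counter stays below the limit.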


If there are no hints of a different root cause, you could set the
nodown flag ('ceph osd set nodown') to prevent the flapping. This should
help the cluster recover; it helped in the customer environment,
although there was also another issue there.
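
Since the flag stays set until it is removed again with 'ceph osd unset nodown', one way to avoid forgetting it is to wrap both commands together. This is a minimal sketch, assuming it runs on a node with a working ceph CLI and admin keyring; the helper names osd_flag_command and osd_flag are hypothetical, only the two ceph commands themselves are real.

```python
import subprocess
from contextlib import contextmanager


def osd_flag_command(action, flag):
    """Build the argv for 'ceph osd set <flag>' or 'ceph osd unset <flag>'."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return ["ceph", "osd", action, flag]


@contextmanager
def osd_flag(flag, run=subprocess.check_call):
    """Set a cluster flag for the duration of a block, then unset it again."""
    run(osd_flag_command("set", flag))
    try:
        yield
    finally:
        # Always remove the flag, even if the block raised an exception.
        run(osd_flag_command("unset", flag))


# Usage on an admin node (not executed here):
# with osd_flag("nodown"):
#     input("Watch 'ceph -s'; press Enter once recovery has settled. ")
```

The run parameter exists only so the helper can be tried without a cluster, for example by passing print instead of subprocess.check_call.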


Regards,
Eugen



[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
here is the output of ceph health detail: 

HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 134 pgs 
backfill_wait; 11 pgs backfilling; 69 pgs degraded; 14 pgs down; 2 pgs 
incomplete; 14 pgs peering; 6 pgs recovery_wait; 69 pgs stuck degraded; 16 pgs 
stuck inactive; 167 pgs stuck unclean; 63 pgs stuck undersized; 63 pgs 
undersized; 29 requests are blocked > 32 sec; 6 osds have slow requests; 
recovery 667605/71152293 objects degraded (0.938%); recovery 1564114/71152293 
objects misplaced (2.198%); too many PGs per OSD (314 > max 300)
pg 8.3ec is stuck inactive for 17320.100016, current state down+peering, last acting [22,40,49]
pg 9.3ac is stuck inactive since forever, current state down+remapped+peering, last acting [36]
pg 9.243 is stuck inactive for 17602.030517, current state incomplete, last acting [34,47,26]
pg 9.23e is stuck inactive since forever, current state down+remapped+peering, last acting [18]
pg 11.7a is stuck inactive since forever, current state down+remapped+peering, last acting [13,25]
pg 9.66 is stuck inactive since forever, current state down+remapped+peering, last acting [20]
pg 8.6c is stuck inactive for 17196.609471, current state down+peering, last acting [34,17,48]
pg 8.143 is stuck inactive for 17201.229429, current state down+remapped+peering, last acting [39,19]
pg 10.103 is stuck inactive for 17544.862477, current state down+peering, last acting [30,19,53]
pg 8.ae is stuck inactive for 17518.839339, current state down+peering, last acting [39,21,52]
pg 8.37 is stuck inactive for 17520.793755, current state down+peering, last acting [15,40,52]
pg 7.399 is stuck inactive since forever, current state down+remapped+peering, last acting [21]
pg 7.210 is stuck inactive for 17535.412721, current state incomplete, last acting [22,49,15]
pg 7.136 is stuck inactive for 40796.009480, current state down+remapped+peering, last acting [46]
pg 9.38 is stuck inactive since forever, current state down+remapped+peering, last acting [46]
pg 7.36 is stuck inactive since forever, current state down+remapped+peering, last acting [20]
pg 9.3ff is stuck unclean for 59505.890789, current state active+remapped+wait_backfill, last acting [48,53,33]
pg 9.3e8 is stuck unclean for 21312.446345, current state active+remapped+wait_backfill, last acting [28,53,27]
pg 9.3df is stuck unclean for 17346.719500, current state active+undersized+degraded+remapped+wait_backfill, last acting [28,46]
pg 7.3c8 is stuck unclean for 86528.672542, current state active+remapped+wait_backfill, last acting [30,35,40]
pg 9.3b1 is stuck unclean for 17859.207821, current state active+remapped+wait_backfill, last acting [35,40,14]
pg 7.3b8 is stuck unclean for 88517.511151, current state active+undersized+degraded+remapped+wait_backfill, last acting [42,14]
pg 9.398 is stuck unclean for 41016.001863, current state active+undersized+degraded+remapped+wait_backfill, last acting [32,12]
pg 7.38b is stuck unclean for 41003.853238, current state active+remapped+wait_backfill, last acting [13,34,42]
pg 7.36d is stuck unclean for 18780.388726, current state active+undersized+degraded+remapped+wait_backfill, last acting [32,29]
pg 9.363 is stuck unclean for 59589.647646, current state active+remapped+wait_backfill, last acting [40,16,32]
pg 7.369 is stuck unclean for 17601.998787, current state active+undersized+degraded+remapped+wait_backfill, last acting [31,15]
pg 9.368 is stuck unclean for 41558.892612, current state active+remapped+wait_backfill, last acting [21,25,19]
pg 7.34d is stuck unclean for 41015.946070, current state active+remapped+wait_backfill, last acting [48,14,22]
pg 9.3db is stuck unclean for 50487.572088, current state active+remapped+wait_backfill, last acting [40,33,52]
pg 7.30c is stuck unclean for 98943.868376, current state active+remapped+wait_backfill, last acting [12,39,16]
pg 7.3a5 is stuck unclean for 26487.349029, current state active+remapped+wait_backfill, last acting [36,28,33]
pg 8.2d3 is stuck unclean for 98535.669203, current state active+recovery_wait+degraded, last acting [30,33,52]
pg 7.2d6 is stuck unclean for 17769.739311, current state active+remapped+wait_backfill, last acting [16,15,36]
pg 9.2b2 is stuck unclean for 67277.008904, current state active+undersized+degraded+remapped+wait_backfill, last acting [40,19]
pg 9.2b5 is stuck unclean for 17510.383905, current state active+remapped+wait_backfill, last acting [32,29,33]
pg 9.2b8 is stuck unclean for 17601.978526, current state active+remapped+backfilling, last acting [18,21,50]
pg 9.2a1 is stuck unclean for 41018.243699, current state active+undersized+degraded+remapped+wait_backfill, last acting [28,49]
pg 9.2a8 is stuck unclean for 59129.277638, current state active+remapped+wait_backfill, last acting [15,17,44]
pg 7.295 is stuck unclean for 17859.207323, current state active+undersized+degraded+remapped+wait_backfill, last acting [38,21]
pg 7.28b is stuck unclean for 
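
A listing of this length is easier to read once it is grouped by state. The following is a rough sketch of how the per-PG entries could be parsed, assuming every entry sits on a single line (wrapped lines have to be joined first). The four sample entries are copied from the output above; the regular expression and the function name parse_stuck_pgs are illustrative and only cover the line format seen in this message.

```python
import re
from collections import Counter

# Four entries copied from the 'ceph health detail' output, one per line.
SAMPLE = """\
pg 8.3ec is stuck inactive for 17320.100016, current state down+peering, last acting [22,40,49]
pg 9.3ac is stuck inactive since forever, current state down+remapped+peering, last acting [36]
pg 9.243 is stuck inactive for 17602.030517, current state incomplete, last acting [34,47,26]
pg 9.3df is stuck unclean for 17346.719500, current state active+undersized+degraded+remapped+wait_backfill, last acting [28,46]
"""

PG_RE = re.compile(
    r"^pg (\S+) is stuck (\w+) (?:for [\d.]+|since forever), "
    r"current state ([\w+]+), last acting \[([\d,]*)\]$"
)


def parse_stuck_pgs(text):
    """Return one dict per 'pg ... is stuck ...' line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = PG_RE.match(line.strip())
        if match:
            pgid, stuck, state, acting = match.groups()
            osds = [int(osd) for osd in acting.split(",") if osd]
            rows.append({"pgid": pgid, "stuck": stuck, "state": state, "acting": osds})
    return rows


rows = parse_stuck_pgs(SAMPLE)
# How many PGs are in each state?
print(Counter(row["state"] for row in rows))
# Which PGs have only a single OSD left in their acting set?
print([row["pgid"] for row in rows if len(row["acting"]) == 1])
```

Fed with the complete output, the second query would list the down+remapped+peering PGs whose acting set has shrunk to one OSD.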

[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
Hi,

After a node failure, Ceph is unable to recover, i.e. unable to
reintegrate the failed node into the cluster.

What happened?
1. A node with 11 OSDs crashed. The remaining 4 nodes (also with 11
OSDs each) rebalanced, although reporting the following error
condition:

too many PGs per OSD (314 > max 300)

2. After we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the OSD nodes (not only on the one that had failed).
We saw that the CPU load on the nodes was very high (load average 120).

3. The situation seemed to get worse over time (more and more OSDs
going down, fewer coming back up), so we switched the node that had
failed off again.

4. After that, the cluster "calmed down" and the CPU load returned to
normal (load average ~4-5). We manually restarted the OSD daemons that
were still down, and one after the other these OSDs came back up.
Recovery is still running now, but it seems to me that 14 PGs are not
recoverable:

output of ceph -s:

 health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)
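
As a sanity check on the figures in this status output, the percentages and the PGs-per-OSD warning can be recomputed by hand. The sketch below makes one loud assumption: that the 314 is an average over the 44 OSDs of the four surviving nodes (11 OSDs each), i.e. that the OSDs of the failed node were not counted when the warning was raised. Nothing in the output confirms this, and the 314 itself is already a rounded figure, so the last number is only a rough estimate.

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage with three decimals."""
    return round(100.0 * part / total, 3)


total_objects = 71151951
print(percent(1788571, total_objects))  # degraded  → 2.514
print(percent(2342374, total_objects))  # misplaced → 3.292

# Assumption: 314 PG copies per OSD, averaged over 4 nodes * 11 OSDs = 44 OSDs.
pg_copies = 314 * 44
# The same PG copies spread over all 5 nodes * 11 OSDs = 55 OSDs.
per_osd_all_in = pg_copies / 55
print(round(per_osd_all_in, 1))  # → 251.2
```

Under that assumption the cluster would sit at roughly 251 PG copies per OSD with all five nodes in, which is below the limit of 300 shown in the warning.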

I have a few questions now:

A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?

C. If indeed all this was caused by such an overload is there a way to
make the recovery process less CPU intensive?

D. What would you advise me to do/try to recover to a healthy state?

In what follows I try to give some more background information
(configuration, log messages). 

ceph version: 10.2.11
OS version: debian jessie
[yes I know this is old]

cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSD per node, each OSD
daemon controls a 2 TB harddrive. The journals are written to an SSD. 

ceph.conf:
-
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
--

Log Messages (examples):

we see a lot of:

Jan  7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

however, all the networks were up (the machines could ping each other).
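
The timestamps in such a heartbeat_check message show how long the peer has been silent. The sketch below parses the message quoted above (wrapped lines joined, syslog prefix dropped) and computes the gaps. The regular expression and the function name heartbeat_gaps are illustrative and only match this one message format. Reading "back" and "front" as the heartbeats on the cluster and public network respectively is an assumption about Ceph's terminology, not something the log states.

```python
import re
from datetime import datetime

# The heartbeat_check message from above, joined into one line.
LINE = (
    "2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 "
    "heartbeat_check: no reply from 10.65.16.43:6822 osd.48 "
    "since back 2020-01-07 18:51:20.119784 "
    "front 2020-01-07 18:52:21.575852 "
    "(cutoff 2020-01-07 18:52:02.411330)"
)

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
PATTERN = re.compile(
    r"^" + TS + r" \S+ +-?\d+ (osd\.\d+) \d+ heartbeat_check: "
    r"no reply from (\S+) (osd\.\d+) since back " + TS
    + r" front " + TS + r" \(cutoff " + TS + r"\)$"
)


def parse_ts(text):
    """Parse a timestamp such as '2020-01-07 18:52:22.411377'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def heartbeat_gaps(line):
    """Return the seconds since the last 'back' and 'front' reply, and the grace."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("not a heartbeat_check line")
    logged, reporter, _addr, peer, back, front, cutoff = match.groups()
    now = parse_ts(logged)
    return {
        "reporter": reporter,
        "peer": peer,
        "back_gap": (now - parse_ts(back)).total_seconds(),
        "front_gap": (now - parse_ts(front)).total_seconds(),
        "grace": (now - parse_ts(cutoff)).total_seconds(),
    }


print(heartbeat_gaps(LINE))
```

For this message the cutoff lies about 20 seconds before the log time, the last "front" reply is less than a second old, and the last "back" reply is about 62 seconds old. So osd.29 complains because osd.48 has not answered on the "back" side for longer than the 20 second window.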

I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan  7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan  7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan  7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan  7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan  7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan  7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown

Best regards,
Hp
-- 
Hanspeter Kunz  University of Zurich
Systems Administrator   Department of Informatics
Email: hk...@ifi.uzh.ch Binzmühlestrasse 14
Tel: +41.(0)44.63-56714 Office 2.E.07
http://www.ifi.uzh.ch   CH-8050 Zurich, Switzerland

Spamtraps: hkunz.bo...@ailab.ch hkunz.bo...@ifi.uzh.ch
---
Rome wasn't burnt in a day.

