Hi All. I have a Ceph cluster that's partially upgraded to Luminous. Last night a host died, and since then the cluster has been failing to recover. It finished backfilling, but was left with hundreds of blocked requests and PGs degraded, inactive, or stale. To try to move past the issue, I set the noout, noscrub, and nodeep-scrub flags and restarted all services one by one.
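For reference, this is roughly the procedure I followed (sketch only, shown here in dry-run form with `echo` in front of each command; it assumes systemd-managed daemons, so adjust the unit names if your hosts differ):

```shell
# Set the recovery/scrub flags cluster-wide (drop "echo" to actually apply).
for flag in noout noscrub nodeep-scrub; do
  echo sudo ceph osd set "$flag"
done

# Then, one host at a time, restart the daemons (again, drop "echo" to apply):
echo sudo systemctl restart ceph-mon.target
echo sudo systemctl restart ceph-osd.target
```

The flags can be cleared afterwards with `ceph osd unset <flag>` once things settle.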
Here is the current state of the cluster. Any idea how to get past the stale and stuck PGs? Any help would be very appreciated. Thanks.

-Brett

## ceph -s output ###############

$ sudo ceph -s
  cluster:
    id:     <removed>
    health: HEALTH_ERR
            165 pgs are stuck inactive for more than 60 seconds
            243 pgs backfill_wait
            144 pgs backfilling
            332 pgs degraded
            5 pgs peering
            1 pgs recovery_wait
            22 pgs stale
            332 pgs stuck degraded
            143 pgs stuck inactive
            22 pgs stuck stale
            531 pgs stuck unclean
            330 pgs stuck undersized
            330 pgs undersized
            671 requests are blocked > 32 sec
            603 requests are blocked > 4096 sec
            recovery 3524906/412016682 objects degraded (0.856%)
            recovery 2462252/412016682 objects misplaced (0.598%)
            noout,noscrub,nodeep-scrub flag(s) set
            mon.ceph0rdi-mon1-1-prd store is getting too big! 17612 MB >= 15360 MB
            mon.ceph0rdi-mon2-1-prd store is getting too big! 17669 MB >= 15360 MB
            mon.ceph0rdi-mon3-1-prd store is getting too big! 17586 MB >= 15360 MB

  services:
    mon: 3 daemons, quorum ceph0rdi-mon1-1-prd,ceph0rdi-mon2-1-prd,ceph0rdi-mon3-1-prd
    mgr: ceph0rdi-mon3-1-prd(active), standbys: ceph0rdi-mon2-1-prd, ceph0rdi-mon1-1-prd
    osd: 222 osds: 218 up, 218 in; 428 remapped pgs
         flags noout,noscrub,nodeep-scrub

  data:
    pools:   35 pools, 38144 pgs
    objects: 130M objects, 172 TB
    usage:   538 TB used, 337 TB / 875 TB avail
    pgs:     0.375% pgs not active
             3524906/412016682 objects degraded (0.856%)
             2462252/412016682 objects misplaced (0.598%)
             37599 active+clean
             173   active+undersized+degraded+remapped+backfill_wait
             133   active+undersized+degraded+remapped+backfilling
             93    activating
             68    active+remapped+backfill_wait
             22    activating+undersized+degraded+remapped
             13    stale+active+clean
             11    active+remapped+backfilling
             9     activating+remapped
             5     remapped
             5     stale+activating+remapped
             3     remapped+peering
             2     stale+remapped
             2     stale+remapped+peering
             1     activating+degraded+remapped
             1     active+clean+remapped
             1     active+degraded+remapped+backfill_wait
             1     active+undersized+remapped+backfill_wait
             1     activating+degraded
             1     active+recovery_wait+undersized+degraded+remapped

  io:
    client:   187 kB/s rd, 2595 kB/s wr, 99 op/s rd, 343 op/s wr
    recovery: 1509 MB/s, 1541 objects/s

## ceph pg dump_stuck stale (this number doesn't seem to decrease) ########################################################

$ sudo ceph pg dump_stuck stale
ok
PG_STAT STATE                      UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
17.6d7  stale+remapped             [5,223,96]     5           [223,96,148]   223
2.5c5   stale+active+clean         [224,48,179]   224         [224,48,179]   224
17.64e  stale+active+clean         [224,84,109]   224         [224,84,109]   224
19.5b4  stale+activating+remapped  [124,130,20]   124         [124,20,11]    124
17.4c6  stale+active+clean         [224,216,95]   224         [224,216,95]   224
73.413  stale+activating+remapped  [117,130,189]  117         [117,189,137]  117
2.431   stale+remapped+peering     [5,180,142]    5           [180,142,40]   180
69.1dc  stale+active+clean         [62,36,54]     62          [62,36,54]     62
14.790  stale+active+clean         [81,114,19]    81          [81,114,19]    81
2.78e   stale+active+clean         [224,143,124]  224         [224,143,124]  224
73.37a  stale+active+clean         [224,84,38]    224         [224,84,38]    224
17.42d  stale+activating+remapped  [220,130,25]   220         [220,25,137]   220
72.263  stale+active+clean         [224,148,117]  224         [224,148,117]  224
67.40   stale+active+clean         [62,170,71]    62          [62,170,71]    62
67.16d  stale+remapped+peering     [3,147,22]     3           [147,22,29]    147
20.3de  stale+active+clean         [224,103,126]  224         [224,103,126]  224
19.721  stale+remapped             [3,34,179]     3           [34,179,128]   34
19.2f1  stale+activating+remapped  [126,130,178]  126         [126,178,72]   126
74.28b  stale+active+clean         [224,95,56]    224         [224,95,56]    224
20.6b6  stale+active+clean         [224,56,126]   224         [224,56,126]   224
2.2ac   stale+active+clean         [224,223,143]  224         [224,223,143]  224
73.11c  stale+activating+remapped  [91,130,201]   91          [91,201,137]   91

## Queries on the pg's don't seem to work ##################################

$ sudo ceph pg 2.5c5 query
Error ENOENT: i don't have pgid 2.5c5
$ sudo ceph pg 17.6d7 query
Error ENOENT: i don't have pgid 17.6d7

## Ceph versions (in case that helps) ##############################

$ sudo ceph versions
{
    "mon": {
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
    },
    "osd": {
        "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)": 60,
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 158
    },
    "mds": {},
    "overall": {
        "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)": 60,
        "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 164
    }
}
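One more data point: since `ceph pg <pgid> query` is returning ENOENT, I've been tallying which OSDs show up as UP_PRIMARY in the dump above instead. This is just a sketch that assumes the column layout shown in that paste (UP_PRIMARY is field 4, and PG ids are the only first-column values of the form `pool.hexid`):

```shell
# Tally how often each OSD appears as UP_PRIMARY among stale PGs.
tally_up_primary() {
  awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ { count[$4]++ }
       END { for (o in count) print count[o], "stale pgs with up_primary osd." o }' |
  sort -rn
}

# Usage against the live cluster:
#   sudo ceph pg dump_stuck stale | tally_up_primary
```

In my paste, osd.224 is the primary for most of the stale+active+clean rows, which is what made me want the tally in the first place.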
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com