[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-23 Thread David Tinker
Fiddling with the crush weights sorted this out and I was able to remove the OSD from the cluster. I set all the big weights down to 1, e.g. ceph osd crush reweight osd.7 1.0, etc. Tx for all the help. On Tue, Nov 23, 2021 at 9:35 AM Stefan Kooman wrote: > On 11/23/21 08:21, David Tinker wrote: > >
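
A minimal sketch of the reweight step described above, assuming osd.7 and osd.8 are the over-weighted OSDs; the actual IDs and target weights depend on the cluster:

# ceph osd tree                       # inspect current CRUSH weights
# ceph osd crush reweight osd.7 1.0   # pull the large weights down
# ceph osd crush reweight osd.8 1.0
# ceph -s                             # watch the stuck PG recover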

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
Yes, it recovered when I put the OSD back in. The issue is that the cluster fails to sort itself out when I remove that OSD, even though I have loads of space and 8 other OSDs in 4 different zones to choose from. The weights are very different (some 3.2, others 0.36) and that post I found suggested that this
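
To see how uneven the weights are across failure domains, and where the stuck PG currently maps, something like the following helps (generic Ceph commands, not taken from the thread; the PG id 3.1f is the one queried elsewhere in the thread):

# ceph osd df tree     # per-OSD weight, utilisation and host/zone layout
# ceph pg map 3.1f     # current up/acting OSD set for the stuck PG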

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
I just had a look at the balancer docs and they say "No adjustments will be made to the PG distribution if the cluster is degraded (e.g., because an OSD has failed and the system has not yet healed itself)." That implies that the balancer won't run until the disruption caused by the removed OSD has
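
A quick way to confirm the cluster is still considered degraded (and hence that the balancer will stay idle) is the standard health output; a sketch, not from the thread:

# ceph -s              # shows degraded/undersized PG counts
# ceph health detail   # lists the specific stuck PG(s)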

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
Yes it is on:

# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.001867",
    "last_optimize_started": "Mon Nov 22 13:10:24 2021",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is
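
Beyond the status output, the balancer can also report the distribution score it is trying to improve; a generic example (not from the thread, pool name is a placeholder):

# ceph balancer eval           # score of the current PG distribution
# ceph balancer eval <pool>    # score for a single pool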

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-21 Thread David Tinker
I set osd.7 as "in", uncordened the node, scaled the OSD deployment back up and things are recovering with cluster status HEALTH_OK. I found this message from the archives: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47071.html "You have a large difference in the capacities of the

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-18 Thread David Tinker
Would it be worth setting the OSD I removed back to "in" (or whatever the opposite of "out" is) and seeing if things recovered? On Thu, Nov 18, 2021 at 3:44 PM David Tinker wrote: > Tx. # ceph version > ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus > (stable) > > > > On

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-18 Thread David Tinker
Tx.

# ceph version
ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)

On Thu, Nov 18, 2021 at 3:28 PM Stefan Kooman wrote: > On 11/18/21 13:20, David Tinker wrote: > > I just grepped all the OSD pod logs for error and warn and nothing comes > up: > > > > # k logs

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-18 Thread David Tinker
If I ignore the dire warnings about losing data and do ceph osd purge 7, will I lose data? There are still 2 copies of everything, right? I need to remove the node with the OSD from the k8s cluster, reinstall it and have it re-join the cluster. This will bring in some new OSDs and maybe Ceph
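
For reference (generic Ceph commands, not from the thread): purge removes the OSD's CRUSH entry, auth key and OSD id in one step, and typically requires an explicit confirmation flag. A safety check first is cheap:

# ceph osd safe-to-destroy osd.7            # reports whether data would stay fully replicated
# ceph osd purge 7 --yes-i-really-mean-it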

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-18 Thread David Tinker
I just grepped all the OSD pod logs for error and warn and nothing comes up: # k logs -n rook-ceph rook-ceph-osd-10-659549cd48-nfqgk | grep -i warn, etc. I am assuming that would bring back something if any of them were unhappy. On Thu, Nov 18, 2021 at 1:26 PM Stefan Kooman wrote: > On
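
The single-pod grep above can be looped over every OSD pod; a sketch assuming the usual Rook label app=rook-ceph-osd:

for p in $(kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o name); do
    kubectl -n rook-ceph logs "$p" | grep -iE 'error|warn'
done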

[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-18 Thread David Tinker
Sure. Tx.

# ceph pg 3.1f query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "active+undersized+degraded",
    "epoch": 2477,
    "up": [ 0, 2 ],
    "acting": [ 0, 2 ],
    "acting_recovery_backfill": [ "0", "2" ],
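
For context (general Ceph behaviour, not from the thread): "undersized" means the PG's acting set, [0, 2] above, holds fewer OSDs than the pool's replica count, so with a size-3 pool only two copies are currently placed. Two commands that make this visible:

# ceph pg dump_stuck undersized   # all PGs stuck in the undersized state
# ceph osd pool ls detail         # each pool's size / min_size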