Hello.

Today we experienced a complete Ceph cluster outage - a total loss of power
across the whole infrastructure.
6 OSD nodes and 3 monitors went down at the same time. Ceph version is 14.2.10.

This resulted in unfound objects, which were "reverted" in a hurry with
ceph pg <pg_id> mark_unfound_lost revert
In retrospect that was probably a mistake, as the "have" part was 0'0.
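
For completeness, the per-PG handling was along these lines (<pg_id> standing
in for each affected PG):

# show the unfound objects with their "need"/"have" versions (the "have" 0'0 I mentioned)
ceph pg <pg_id> list_unfound
# revert the unfound objects to their prior version (the hasty part)
ceph pg <pg_id> mark_unfound_lost revert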

But then deep-scrubs started and they found inconsistent PGs. We tried
repairing them, but they just switched to failed_repair.
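
The repairs were just the standard per-PG repair; if the per-shard detail
helps, I can post what the inconsistency listing shows, i.e.:

# trigger a repair of an inconsistent PG (this is what ends up in failed_repair)
ceph pg repair 3.c
# dump the per-shard errors recorded by the last deep-scrub
rados list-inconsistent-obj 3.c --format=json-pretty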

Here's a log example:
2021-06-25 00:08:07.693645 osd.0 [ERR] 3.c shard 6 3:3163e703:::rbd_data.be08c566ef438d.0000000000002445:head : missing
2021-06-25 00:08:07.693710 osd.0 [ERR] repair 3.c 3:3163e2ee:::rbd_data.efa86358d15f4a.000000000000004b:6ab1 : is an unexpected clone
2021-06-25 00:11:55.128951 osd.0 [ERR] 3.c repair 1 missing, 0 inconsistent objects
2021-06-25 00:11:55.128969 osd.0 [ERR] 3.c repair 2 errors, 1 fixed

I tried manually deleting the conflicting objects from the secondary OSDs
with ceph-objectstore-tool, like this:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c rbd_data.efa86358d15f4a.000000000000004b:6ab1 remove
It does remove the object, but without any positive impact. I'm pretty sure
I don't fully understand the concept.
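
In case it matters: the tool only runs with the OSD stopped (it refuses to
open the store otherwise); listing the PG first gives the exact object specs
to pass back to it, and an export gives a rollback point - roughly:

systemctl stop ceph-osd@22
# list objects in the PG as JSON specs the tool accepts verbatim
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c --op list
# take a full export of the PG before removing anything (path is just an example)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c --op export --file /root/3.c-osd22.export
systemctl start ceph-osd@22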

So currently I have the following thoughts:
 - is there any doc on the object placement specifics and what all of the
numbers in an object's name mean? I've seen objects with a similar
prefix/middle part but a different suffix, and I have no idea what that means;
 - I'm actually not sure what the production impact is at this point,
because everything seems to work so far. So I'm wondering whether it's
possible to kill the replicas on the secondary OSDs with
ceph-objectstore-tool and just let Ceph recreate them from the primary PG
(rough sketch of what I mean right after this list)?
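
To make that second point concrete, what I have in mind per affected PG and
secondary OSD is roughly the following, assuming the primary's copy is the
one to trust (the OSD id and export path are just examples):

systemctl stop ceph-osd@22
# keep an export of this OSD's copy of the PG as a safety net, then drop the copy
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c --op export-remove --file /root/3.c-osd22.export
systemctl start ceph-osd@22
# once the OSD rejoins without 3.c, the primary should backfill a fresh replica

Is that a sane approach, or am I about to make things worse?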

I have 8 scrub errors and 4 inconsistent+failed_repair PGs, and I'm afraid
that further deep scrubs will reveal more errors.
Any thoughts appreciated.