Hi Jerry,
I think this is one of those "there must be something else going on
here" situations; marking any OSD out should affect only that one
"slot" in the acting set, at least until backfill completes (and in my
experience that has always been the case). It might be worth
inspecting the cluster log.
After doing more experiments, the outcomes answered some of my
questions. The environment is somewhat different from the one
described in the previous mail.
1) The `ceph osd tree` output:
-2       2.06516 root perf_osd
-5       0.67868     host jceph-n2-perf_osd
 2   ssd 0.17331         osd.2
Hello Josh,
I simulated an osd.14 failure with the following steps:
1. hot unplug the disk
2. systemctl stop ceph-osd@14
3. ceph osd out 14
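For anyone reproducing this, a few commands that are useful for watching how the acting set changes after a step like this (the pool name and PG id below are placeholders, not from this cluster; these need a live cluster to run):

```shell
# Pick a PG from the EC pool and record its up/acting sets beforehand.
# PG id "9.1f" is a placeholder.
ceph pg map 9.1f

# After stopping/outing the OSD, compare: only the failed OSD's slot
# should show 2147483647 (NONE) until backfill repopulates it.
ceph pg 9.1f query

# Cluster-wide view of degraded / backfilling PGs:
ceph -s
ceph health detail
```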
The CRUSH rule used to create the EC 8+3 pool is shown below:
# ceph osd crush rule dump erasure_hdd_mhosts
{
"rule_id": 8,
"rule_name"
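The dump above is cut off; for comparison, a typical erasure rule that places each chunk on a distinct host looks roughly like this (the ids, names, sizes, and tunable steps here are illustrative, not Jerry's actual rule):

```
{
    "rule_id": 8,
    "rule_name": "erasure_hdd_mhosts",
    "type": 3,
    "min_size": 3,
    "max_size": 11,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 5 },
        { "op": "set_choose_tries", "num": 100 },
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "chooseleaf_indep", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
}
```

The `chooseleaf_indep ... host` step is what guarantees that no two chunks of a PG land on the same host.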
Hi Jerry,
In general, your CRUSH rules should define the behaviour you're
looking for. Based on what you've stated about your configuration,
after failing a single node or an OSD on a single node, you
should still be able to tolerate two more failures in the system
without losing data (or los
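The failure arithmetic here is worth spelling out: an EC pool with k data and m coding chunks survives any m chunk losses, so after one failure, m - 1 remain. A small sketch, assuming the 8+3 parameters from earlier in the thread:

```python
def remaining_failures(k: int, m: int, failed: int) -> int:
    """Additional OSD/host failures an EC k+m pool can absorb
    without data loss, after `failed` chunks are already gone."""
    if failed > m:
        raise ValueError("more chunks lost than the pool can tolerate")
    return m - failed

# EC 8+3: three total failures tolerated; after one, two remain.
print(remaining_failures(8, 3, 0))  # -> 3
print(remaining_failures(8, 3, 1))  # -> 2
```

Note that availability is a separate question: with the common `min_size = k + 1` setting (9 here), PGs go inactive once fewer than 9 of the 11 chunks are up, even though the data is still recoverable.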