[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-26 Thread Josh Baergen
Hi Jerry, I think this is one of those "there must be something else going on here" situations; marking any OSD out should affect only that one "slot" in the acting set, at least until backfill completes (and in my experience that has always been the case). It might be worth inspecting the cluster log
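
As a minimal sketch of that kind of inspection (the pool name and PG id below are placeholders, not taken from the thread):

  ceph pg ls-by-pool ec83_pool   # lists each PG in the pool with its up and acting sets
  ceph pg 2.1f query             # detailed peering state for one PG, including "acting" and "up"
  ceph log last 50               # recent cluster log entries around the time the OSD was marked out

Comparing an affected PG's acting set before and after marking the OSD out should show only the one slot changing.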

[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-26 Thread Jerry Lee
After doing more experiments, the outcome answers some of my questions. The environment is somewhat different from the one mentioned in the previous mail. 1) the output of `ceph osd tree`:
  ID  CLASS  WEIGHT   TYPE NAME
  -2         2.06516  root perf_osd
  -5         0.67868      host jceph-n2-perf_osd
   2  ssd    0.17331          osd.2
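
For readers following along, a short sketch of commands that help confirm how custom roots and device classes are laid out (the tree above comes from the first command):

  ceph osd tree                       # the hierarchy quoted above
  ceph osd crush tree --show-shadow   # also shows the per-device-class shadow roots
  ceph osd crush rule dump            # which root/class each rule draws from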

[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-25 Thread Jerry Lee
Hello Josh, I simulated the osd.14 failure by the following steps:
1. hot-unplug the disk
2. systemctl stop ceph-osd@14
3. ceph osd out 14
The CRUSH rule used to create the EC 8+3 pool is described below:
# ceph osd crush rule dump erasure_hdd_mhosts
{
    "rule_id": 8,
    "rule_name"
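
The rule dump is truncated in this archive view; to read the full rule, one option is to decompile the CRUSH map (the file names below are arbitrary):

  ceph osd getcrushmap -o crushmap.bin        # export the compiled CRUSH map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile it to readable text

In the decompiled text, a rule that spreads one EC shard per host typically contains a step of the form `step chooseleaf indep 0 type host`, which is what sets the failure domain being discussed in this thread.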

[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-23 Thread Josh Baergen
Hi Jerry, In general, your CRUSH rules should define the behaviour you're looking for. Based on what you've stated about your configuration, after failing a single node or an OSD on a single node, you should still be able to tolerate two more failures in the system without losing data (or los
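
A sketch of how that tolerance can be checked from the cluster itself, assuming an EC 8+3 profile (the profile and pool names are placeholders):

  ceph osd erasure-code-profile get ec83_profile   # expect k=8, m=3
  ceph osd pool get ec83_pool min_size             # for EC pools this defaults to k+1 = 9
  ceph osd pool get ec83_pool crush_rule           # confirms which rule, and hence which failure domain, applies

With k=8, m=3 and a host-level failure domain, each PG places 11 shards on 11 different hosts, so up to 3 host failures can be absorbed before data is lost; with min_size = 9, I/O on a PG pauses once fewer than 9 of its shards remain available.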