See https://tracker.ceph.com/issues/38724: "this results in the Production VMs becoming unresponsive as their disks are unavailable when we have multiple OSDs down on multiple hosts. (we are doing 2 copy) I've seen it where 3 OSDs are down at the same time on different hosts due to this bug. That's when we are seemingly really unlucky with the BUG. (3 copy would not have saved us from that)"
"That OSD failure seems to have caused a cascade. Several more OSDs have crashed. 12% of objects were degraded, and I had to create new 'ssd' class OSDs to get enough failure domains. I cancelled the cp to prioritize recovery. Is there any workaround to repair the OSDs and get them to restart properly? They just crash again every time I restart them. Can this bug please be set to a higher priority? This has caused an outage for myself and Edward above, and threatens data loss. That warrants at least Major." We also had our most important virtual machines (FreePBX phone, Postfix mail, Dovecot IMAP, order entry data, accounting, etc.) go offline. We have a great backup system and were able to restore everything except the last 40 minutes of data. Also check this thread: https://www.mail-archive.com/ceph-users@ceph.io/msg00488.html . -1, as Ceph versions greater than 12.2.11 are unstable.