Starting on Friday, as part of adding a new pod of 12 servers, we initiated a 
reweight on roughly 384 drives, from 0.1 to 0.25. Something about the resulting 
large backfill is causing librbd to hang, requiring server restarts. The 
volumes show buffer I/O errors when this happens. We are currently using 
hybrid OSDs with both SSD and traditional spinning disks. The current status of 
the cluster is:
ceph --version
ceph version 14.2.22 
Cluster Kernel 5.4.49-200
{
        "mon": {
        "ceph version 14.2.22 nautilus (stable)": 3
        },
        "mgr": {
        "ceph version 14.2.22 nautilus (stable)": 3
        },
        "osd": {
        "ceph version 14.2.21 nautilus (stable)": 368,
        "ceph version 14.2.22 (stable)": 2055
        },
        "mds": {},
        "rgw": {
        "ceph version 14.2.22 (stable)": 7
        },
        "overall": {
        "ceph version 14.2.21 (stable)": 368,
        "ceph version 14.2.22 (stable)": 2068
        }
}
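
To make the version split in that output easier to see, here is a minimal Python sketch that tallies daemons per release from JSON shaped like the above. The names `summarize_versions` and `mixed_daemons` are illustrative assumptions, not part of any Ceph tooling, and the embedded sample is abridged to the mon and osd sections.

```python
import json

# Abridged sample shaped like the output above (mon and osd sections only).
SAMPLE = """
{
    "mon": {"ceph version 14.2.22 nautilus (stable)": 3},
    "osd": {
        "ceph version 14.2.21 nautilus (stable)": 368,
        "ceph version 14.2.22 (stable)": 2055
    }
}
"""


def summarize_versions(raw):
    """Return {daemon_type: {release: count}}, skipping the 'overall' rollup."""
    data = json.loads(raw)
    summary = {}
    for daemon, versions in data.items():
        if daemon == "overall":
            continue
        per_release = {}
        for label, count in versions.items():
            # Labels look like 'ceph version 14.2.22 ...'; the third token is the release.
            release = label.split()[2]
            per_release[release] = per_release.get(release, 0) + count
        summary[daemon] = per_release
    return summary


def mixed_daemons(summary):
    """Return the daemon types that are running more than one release."""
    return sorted(d for d, releases in summary.items() if len(releases) > 1)


summary = summarize_versions(SAMPLE)
print(summary["osd"])
print(mixed_daemons(summary))
```

On this sample only the osd section reports more than one release, with 368 OSDs still on 14.2.21.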

HEALTH_WARN, noscrub,nodeep-scrub flag(s) set. 
pgs: 6815703/11016906121 objects degraded (0.062%), 
2814059622/11016906121 objects misplaced (25.543%). 
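
As a quick sanity check on those figures, the percentages can be recomputed from the raw object counts. This is only a sketch; the helper name `pct` is an assumption for illustration.

```python
TOTAL_OBJECTS = 11016906121
DEGRADED = 6815703
MISPLACED = 2814059622


def pct(part, total):
    """Percentage of total, rounded to three decimals like the status line."""
    return round(100.0 * part / total, 3)


degraded_pct = pct(DEGRADED, TOTAL_OBJECTS)
misplaced_pct = pct(MISPLACED, TOTAL_OBJECTS)
print(degraded_pct, misplaced_pct)
```

Both values agree with the status output, so roughly a quarter of all objects are still waiting to move.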

The client servers are on 3.10.0-1062.1.2.el7.x86_6

We have found a couple of issues that look relevant: 
https://tracker.ceph.com/issues/19385 
https://tracker.ceph.com/issues/18807 
Has anyone experienced anything like this before? Does anyone have any 
recommendations as to settings that can help alleviate this while the backfill 
completes? 
An example of the buffer I/O errors:

Jul 17 06:36:08 host8098 kernel: buffer_io_error: 22 callbacks suppressed
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 3, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-5, logical block 511984, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657728, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657729, async page read
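
For anyone wanting to see which devices are affected, here is a rough Python sketch that counts these errors per device. It assumes each log entry sits on a single line (mail wrapping undone); the function name `errors_per_device` and the shortened sample are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the kernel message format shown above and captures the device name.
PATTERN = re.compile(r"Buffer I/O error on dev ([^,\s]+), logical block (\d+)")

# Shortened sample in the same format as the log excerpt above.
SAMPLE_LOG = """\
Jul 17 06:36:08 host8098 kernel: buffer_io_error: 22 callbacks suppressed
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 0, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-4, logical block 3, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-5, logical block 511984, async page read
Jul 17 06:36:08 host8098 kernel: Buffer I/O error on dev dm-6, logical block 3487657728, async page read
"""


def errors_per_device(text):
    """Count 'Buffer I/O error' lines per device; other lines are ignored."""
    counts = Counter()
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = errors_per_device(SAMPLE_LOG)
for device, count in sorted(counts.items()):
    print(device, count)
```

The "callbacks suppressed" line is skipped by the pattern, so the counts reflect only the messages the kernel actually printed.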
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io