Dear All,

We are having problems with a critical osd crash on a Nautilus (14.2.8) cluster.

This is a critical failure, as the osd is part of a pg that is otherwise "down+remapped" because other osds have crashed. We were hoping the pg would repair itself, as there are plenty of free osds, but for some reason it never managed to get out of an undersized state.

The osd starts OK and runs for a few minutes, then crashes with an assert immediately after it starts trying to backfill the pg that is "down+remapped":

    -7> 2020-03-23 15:28:15.368 7f15aeea8700  5 osd.287 pg_epoch: 35531 pg[5.750s2( v 35398'3381328 (35288'3378238,35398'3381328] local-lis/les=35530/35531 n=190408 ec=1821/1818 lis/c 35530/22903 les/c/f 35531/22917/0 35486/35530/35530) [234,354,304,388,125,25,427,226,77,154]/[2147483647,2147483647,287,388,125,25,427,226,77,154]p287(2) backfill=[234(0),304(2),354(1)] r=2 lpr=35530 pi=[22903,35530)/9 rops=1 crt=35398'3381328 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={} trimq=112 ps=121] backfill_pos is 5:0ae00653:::1000e49a8c6.000000d3:head
    -6> 2020-03-23 15:28:15.381 7f15cc9ec700 10 monclient: get_auth_request con 0x555b2f229800 auth_method 0
    -5> 2020-03-23 15:28:15.381 7f15b86bb700  2 osd.287 35531 ms_handle_reset con 0x555b2fef7400 session 0x555b2f363600
    -4> 2020-03-23 15:28:15.391 7f15c04c5700  5 prioritycache tune_memory target: 4294967296 mapped: 805339136 unmapped: 1032192 heap: 806371328 old mem: 2845415832 new mem: 2845415832
    -3> 2020-03-23 15:28:15.420 7f15cc9ec700 10 monclient: get_auth_request con 0x555b2fef7800 auth_method 0
    -2> 2020-03-23 15:28:15.420 7f15b86bb700  2 osd.287 35531 ms_handle_reset con 0x555b2fef7c00 session 0x555b2f363c00
    -1> 2020-03-23 15:28:15.476 7f15aeea8700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f15aeea8700 time 2020-03-23 15:28:15.470166
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc: 5443: FAILED ceph_assert(clone_size.count(clone))
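
As far as I can tell, the assert fires in SnapSet::get_clone_bytes() when a clone id has no entry in the snapset's clone_size map. Below is a minimal, self-contained sketch of just that condition (not the actual Ceph source; the type names and map layout are my guesses from the assert text), in case it helps explain what we think is happening:

    // Stand-alone illustration of the failing check, NOT Ceph code.
    // snapid_t and the clone_size map are stand-ins based on the assert text.
    #include <cassert>
    #include <cstdint>
    #include <map>

    using snapid_t = uint64_t;                  // stand-in for Ceph's snapid_t

    struct SnapSetSketch {
      std::map<snapid_t, uint64_t> clone_size;  // bytes recorded per clone

      uint64_t get_clone_bytes(snapid_t clone) const {
        assert(clone_size.count(clone));        // the check that fails on osd.287
        return clone_size.at(clone);
      }
    };

    int main() {
      SnapSetSketch ss;
      ss.clone_size[4] = 4096;                  // a clone with a recorded size
      ss.get_clone_bytes(4);                    // fine
      ss.get_clone_bytes(5);                    // aborts: clone referenced, no size entry
      return 0;
    }

If that reading is right, the snapset for 1000e49a8c6.000000d3 (the object at backfill_pos in the log above) references a clone that has no size entry, i.e. the snapshot metadata for that object is inconsistent, and the osd hits it every time backfill reaches that object.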

The osd log (127KB) is here: <https://www.mrc-lmb.cam.ac.uk/scicomp/ceph-osd.287.log.gz>
(copied from /var/log/ceph/ceph-osd.287.log.gz)

While the osd was running, the pg state was as follows:

[root@ceph7 ~]# ceph pg dump | grep ^5.750
5.750    190408                  0      804    190119       0 569643615603           0          0 3090     3090 active+undersized+degraded+remapped+backfill_wait 2020-03-23 14:37:57.582509  35398'3381328  35491:3265627 [234,354,304,388,125,25,427,226,77,154]        234 [NONE,NONE,287,388,125,25,427,226,77,154]            287 24471'3200829 2020-01-28 15:48:35.574934   24471'3200829 2020-01-28 15:48:35.574934           112

With the osd down:

[root@ceph7 ~]#  ceph pg dump | grep ^5.750
dumped all
5.750    190408                  0        0         0       0 569643615603           0          0 3090 3090                                     down+remapped 2020-03-23 15:28:28.345176  35398'3381328  35532:3265613 [234,354,304,388,125,25,427,226,77,154]        234 [NONE,NONE,NONE,388,125,25,427,226,77,154]            388 24471'3200829 2020-01-28 15:48:35.574934   24471'3200829 2020-01-28 15:48:35.574934

This cluster is used to back up a live cephfs cluster and holds 1.8PB of data, including 30 days of snapshots. We are using 8+2 EC.

Any help appreciated,

Jake


Note: I am working from home until further notice.
For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
