Dear All,

We are having problems with a critical osd crash on a Nautilus (14.2.8) cluster.

This is a critical failure, as the osd is part of a pg that is otherwise "down+remapped" because other osds have crashed. We were hoping the pg would repair itself, as there are plenty of free osds, but for some reason it never managed to get out of an undersized state.

The osd starts OK and runs for a few minutes, then crashes with an assert immediately after it starts trying to backfill the pg that is "down+remapped":

    -7> 2020-03-23 15:28:15.368 7f15aeea8700  5 osd.287 pg_epoch: 35531 pg[5.750s2( v 35398'3381328 (35288'3378238,35398'3381328] local-lis/les=35530/35531 n=190408 ec=1821/1818 lis/c 35530/22903 les/c/f 35531/22917/0 35486/35530/35530) [234,354,304,388,125,25,427,226,77,154]/[2147483647,2147483647,287,388,125,25,427,226,77,154]p287(2) backfill=[234(0),304(2),354(1)] r=2 lpr=35530 pi=[22903,35530)/9 rops=1 crt=35398'3381328 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={} trimq=112 ps=121] backfill_pos is 5:0ae00653:::1000e49a8c6.000000d3:head
    -6> 2020-03-23 15:28:15.381 7f15cc9ec700 10 monclient: get_auth_request con 0x555b2f229800 auth_method 0
    -5> 2020-03-23 15:28:15.381 7f15b86bb700  2 osd.287 35531 ms_handle_reset con 0x555b2fef7400 session 0x555b2f363600
    -4> 2020-03-23 15:28:15.391 7f15c04c5700  5 prioritycache tune_memory target: 4294967296 mapped: 805339136 unmapped: 1032192 heap: 806371328 old mem: 2845415832 new mem: 2845415832
    -3> 2020-03-23 15:28:15.420 7f15cc9ec700 10 monclient: get_auth_request con 0x555b2fef7800 auth_method 0
    -2> 2020-03-23 15:28:15.420 7f15b86bb700  2 osd.287 35531 ms_handle_reset con 0x555b2fef7c00 session 0x555b2f363c00
    -1> 2020-03-23 15:28:15.476 7f15aeea8700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f15aeea8700 time 2020-03-23 15:28:15.470166
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/osd/osd_types.cc: 5443: FAILED ceph_assert(clone_size.count(clone))
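
As far as I can tell, the assert fires in SnapSet::get_clone_bytes() when a clone id has no entry in the snapset's clone_size map. Below is a minimal, self-contained sketch of just that condition (not the actual Ceph source; the type names and map layout are my guesses from the assert text), in case it helps explain what we think is happening:

    // Stand-alone illustration of the failing check, NOT Ceph code.
    // snapid_t and the clone_size map are stand-ins based on the assert text.
    #include <cassert>
    #include <cstdint>
    #include <map>

    using snapid_t = uint64_t;                  // stand-in for Ceph's snapid_t

    struct SnapSetSketch {
      std::map<snapid_t, uint64_t> clone_size;  // bytes recorded per clone

      uint64_t get_clone_bytes(snapid_t clone) const {
        assert(clone_size.count(clone));        // the check that fails on osd.287
        return clone_size.at(clone);
      }
    };

    int main() {
      SnapSetSketch ss;
      ss.clone_size[4] = 4096;                  // a clone with a recorded size
      ss.get_clone_bytes(4);                    // fine
      ss.get_clone_bytes(5);                    // aborts: clone referenced, no size entry
      return 0;
    }

If that reading is right, the snapset for 1000e49a8c6.000000d3 (the object at backfill_pos in the log above) references a clone that has no size entry, i.e. the snapshot metadata for that object is inconsistent, and the osd hits it every time backfill reaches that object.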

The osd log (127KB) is here: <https://www.mrc-lmb.cam.ac.uk/scicomp/ceph-osd.287.log.gz>
(copied from /var/log/ceph/ceph-osd.287.log.gz)

While the osd was running, the pg state was as follows:

[root@ceph7 ~]# ceph pg dump | grep ^5.750
5.750    190408                  0      804    190119       0 569643615603           0          0 3090     3090 active+undersized+degraded+remapped+backfill_wait 2020-03-23 14:37:57.582509  35398'3381328  35491:3265627 [234,354,304,388,125,25,427,226,77,154]        234 [NONE,NONE,287,388,125,25,427,226,77,154]            287 24471'3200829 2020-01-28 15:48:35.574934   24471'3200829 2020-01-28 15:48:35.574934           112

With the osd down:

[root@ceph7 ~]#  ceph pg dump | grep ^5.750
dumped all
5.750    190408                  0        0         0       0 569643615603           0          0 3090 3090                                     down+remapped 2020-03-23 15:28:28.345176  35398'3381328  35532:3265613 [234,354,304,388,125,25,427,226,77,154]        234 [NONE,NONE,NONE,388,125,25,427,226,77,154]            388 24471'3200829 2020-01-28 15:48:35.574934   24471'3200829 2020-01-28 15:48:35.574934

This cluster is used to back up a live cephfs cluster and holds 1.8PB of data, including 30 days of snapshots. We are using 8+2 EC.

Any help appreciated,

Jake


Note: I am working from home until further notice.
For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
