Hi all,

After updating to 10.2.9, some of our SSD-based OSDs get marked "down" and crash, as in [1].

After bringing these OSDs back up, they sit at 100% CPU utilization and never become up/in. In the log ([2]) I see repeated lines like:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15

before they ultimately crash.
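Incidentally, the 15 and 150 in the crash look like the osd_op_thread_timeout / osd_op_thread_suicide_timeout defaults. If it comes down to buying an OSD more headroom at startup, this is roughly what I'd try (osd.459 as the example, values are guesses):

# give the op worker threads more time before the heartbeat / suicide checks fire,
# for this one OSD only, then restart it (EL7 / systemd)
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.459]
osd op thread timeout = 60
osd op thread suicide timeout = 600
EOF
systemctl restart ceph-osd@459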

Stracing them, I see them chewing on omaps for a while; after that they appear to do nothing, yet CPU utilization stays high.
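For reference, that was nothing fancier than attaching strace to the stuck daemon, roughly:

# 2559209 is the ceph-osd pid from the log in [2]; substitute the stuck OSD's pid
strace -f -tt -e trace=file,read,write,futex -p 2559209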

I downgraded these OSDs to 10.2.7 (inadvisable, I know) and they come back happily. I tried setting debug_osd = 20, debug_filestore = 20, debug_ms = 20, debug_auth = 20 and debug_leveldb = 20, but that didn't seem to add any useful information to the logs.
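For completeness, I set those on the running OSDs with injectargs, roughly like this (the same keys can also go under [osd] in ceph.conf before a restart):

# raise debug levels on a running OSD; osd.459 is just the one from the log below
ceph tell osd.459 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 20 --debug_auth 20 --debug_leveldb 20'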

Does anyone have any clues on how to debug this further? I'm a bit worried about running a mix of 10.2.7 and 10.2.9 OSDs in my pool.

For what it's worth, the SSD OSDs in this CRUSH root are serving CephFS metadata. Other OSDs (spinners in EC and replicated pools) are completely OK as far as I can tell. All hosts are EL7.

Thanks,
Lincoln

[1]
-8> 2017-07-15 13:21:51.959502 7f9d23a2a700 1 -- 192.170.226.253:0/2474101 <== osd.456 192.170.226.250:6807/3547149 1293 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dd6a93000 con 0x7f9dcf4d2300
-7> 2017-07-15 13:21:51.959578 7f9d2b26b700 1 -- 192.170.226.253:0/2474101 <== osd.461 192.170.226.255:6814/4575940 1295 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c9200 con 0x7f9dc38fff80
-6> 2017-07-15 13:21:51.959597 7f9d2b46d700 1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6851/2545858 1290 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c7600 con 0x7f9dc3900a00
-5> 2017-07-15 13:21:51.959612 7f9d1e14f700 1 -- 192.170.226.253:0/2474101 <== osd.434 192.170.226.242:6803/3058582 1293 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dc78c0800 con 0x7f9d7aebae80
-4> 2017-07-15 13:21:51.959650 7f9d19792700 1 -- 192.170.226.253:0/2474101 <== osd.437 192.170.226.245:6818/2299326 1277 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dc78c0200 con 0x7f9dd0c0ba80
-3> 2017-07-15 13:21:51.959666 7f9d5d940700 1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6849/2545858 1290 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c8200 con 0x7f9dc38ff500
-2> 2017-07-15 13:21:52.085120 7f9d659a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had timed out after 15
-1> 2017-07-15 13:21:52.085130 7f9d659a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had suicide timed out after 150
0> 2017-07-15 13:21:52.108248 7f9d659a2700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f9d659a2700 time 2017-07-15 13:21:52.085137
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9d6bb0f4a5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f9d6ba4e541]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f9d6ba4ed9e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f9d6ba4f57c]
 5: (CephContextServiceThread::entry()+0x15b) [0x7f9d6bb2724b]
 6: (()+0x7dc5) [0x7f9d69a26dc5]
 7: (clone()+0x6d) [0x7f9d680b173d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.




[2]

2017-07-15 14:35:23.730434 7f1d98bde800 0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 2559209
2017-07-15 14:35:23.731923 7f1d98bde800 0 pidfile_write: ignore empty --pid-file
2017-07-15 14:35:23.772858 7f1d98bde800 0 filestore(/var/lib/ceph/osd/ceph-459) backend xfs (magic 0x58465342)
2017-07-15 14:35:23.773367 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-07-15 14:35:23.773374 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-07-15 14:35:23.773393 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: splice is supported
2017-07-15 14:35:24.148987 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-07-15 14:35:24.149090 7f1d98bde800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_feature: extsize is disabled by conf
2017-07-15 14:35:24.164498 7f1d98bde800  1 leveldb: Recovering log #87058
2017-07-15 14:35:24.192613 7f1d98bde800 1 leveldb: Level-0 table #87061: started
2017-07-15 14:35:24.215001 7f1d98bde800 1 leveldb: Level-0 table #87061: 1081662 bytes OK
2017-07-15 14:35:24.226041 7f1d98bde800  1 leveldb: Delete type=0 #87058
2017-07-15 14:35:24.227900 7f1d98bde800  1 leveldb: Delete type=3 #86877

2017-07-15 14:58:30.814169 7f1d98bde800 0 filestore(/var/lib/ceph/osd/ceph-459) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-07-15 14:58:30.820879 7f1d98bde800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2017-07-15 14:58:30.820885 7f1d98bde800 1 journal _open /var/lib/ceph/osd/ceph-459/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-07-15 14:58:30.822211 7f1d98bde800 1 journal _open /var/lib/ceph/osd/ceph-459/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-07-15 14:58:30.823657 7f1d98bde800 1 filestore(/var/lib/ceph/osd/ceph-459) upgrade
2017-07-15 14:58:30.827552 7f1d98bde800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-07-15 14:58:30.830365 7f1d98bde800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-07-15 14:58:30.832013 7f1d98bde800 0 osd.459 818854 crush map has features 283778186739712, adjusting msgr requires for clients
2017-07-15 14:58:30.832023 7f1d98bde800 0 osd.459 818854 crush map has features 284053064646656 was 8705, adjusting msgr requires for mons
2017-07-15 14:58:30.832032 7f1d98bde800 0 osd.459 818854 crush map has features 284053064646656, adjusting msgr requires for osds
2017-07-15 14:58:31.878565 7f1d98bde800  0 osd.459 818854 load_pgs
2017-07-15 14:58:53.123797 7f1d98bde800 0 osd.459 818854 load_pgs opened 520 pgs
2017-07-15 14:58:53.123985 7f1d98bde800 0 osd.459 818854 using 0 op queue with priority op cut off at 64.
2017-07-15 14:58:53.124833 7f1d98bde800 -1 osd.459 818854 log_to_monitors {default=true}
2017-07-15 14:58:53.178867 7f1d98bde800 0 osd.459 818854 done with init, starting boot process
2017-07-15 14:59:12.117367 7f1cf3afd700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.117382 7f1cf3cff700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.325987 7f1d0f5a7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.326000 7f1d10946700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.812750 7f1cf06c9700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.812757 7f1cf07ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.105328 7f1d4f34f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.105340 7f1d4b3c0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.242604 7f1d532c5700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.242676 7f1d505b0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15

