[ceph-users] ceph osd crash help needed
Hi all,

I have a production cluster on which I recently purged all snapshots. Now, on a set of OSDs, I'm getting an assert like the one below whenever they backfill:

    -4> 2019-08-13 00:25:14.577 7ff4637b1700 5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit Started/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.64
    -3> 2019-08-13 00:25:14.577 7ff4637b1700 5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter Started/Primary/Active/Backfilling
    -2> 2019-08-13 00:25:14.653 7ff4637b1700 5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.b4b8:head
    -1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 00:25:14.759270
    /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED ceph_assert(clone_overlap.count(clone))

    ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55989e4a6450]
    2: (()+0x517628) [0x55989e4a6628]
    3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
    4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, pg_stat_t*)+0x297) [0x55989e7b2197]
    5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
    6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
    7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
    8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
    9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
    10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
    11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
    12: (()+0x7fa3) [0x7ff47f718fa3]
    13: (clone()+0x3f) [0x7ff47f2c84cf]

    0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
    in thread 7ff4637b1700 thread_name:tp_osd_tp

    ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
    1: (()+0x12730) [0x7ff47f723730]
    2: (gsignal()+0x10b) [0x7ff47f2067bb]
    3: (abort()+0x121) [0x7ff47f1f1535]
    4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55989e4a64a1]
    5: (()+0x517628) [0x55989e4a6628]
    6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
    7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, pg_stat_t*)+0x297) [0x55989e7b2197]
    8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
    9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
    10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
    11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
    12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
    13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
    14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
    15: (()+0x7fa3) [0x7ff47f718fa3]
    16: (clone()+0x3f) [0x7ff47f2c84cf]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The failing assert is FAILED ceph_assert(clone_overlap.count(clone)). If possible I'd like to 'nuke' this from the OSD, since there are no snaps active any more, but I would love some advice on the best way to go about it.

Best regards,
Kevin Myers
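For anyone hitting the same assert: the object being backfilled is named in the backfill_pos line above, and its snapset/clone metadata can be inspected (and, as a last resort, removed) offline with ceph-objectstore-tool while the OSD is stopped. This is a rough sketch only - the data path, the object JSON and the clone id below are placeholders to fill in from your own cluster, and the exact syntax should be double-checked against the ceph-objectstore-tool man page for your release before touching a production OSD:

    # list objects in the affected PG to get the JSON object spec
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 --pgid 0.12ed --op list

    # dump the object's info, including its SnapSet, to see the stale clone
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 '<object-json-from-list>' dump

    # last resort: drop the clone metadata for a single clone id
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 '<object-json-from-list>' remove-clone-metadata <cloneid>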
Re: [ceph-users] Possibly a bug on rocksdb
Hi Samuel,

You can use https://tracker.ceph.com/issues/41211 to provide the information that Brad requested. Along with debug_osd=20, adding debug_rocksdb=20 and debug_bluestore=20 might be useful.

Thanks,
Neha

On Sun, Aug 11, 2019 at 4:18 PM Brad Hubbard wrote:
>
> Could you create a tracker for this?
>
> Also, if you can reproduce this could you gather a log with
> debug_osd=20? That should show us the superblock it was trying to
> decode as well as additional details.
>
> On Mon, Aug 12, 2019 at 6:29 AM huxia...@horebdata.cn wrote:
> >
> > Dear folks,
> >
> > I had an OSD go down, not because of a bad disk, but most likely due to a bug hit in RocksDB. Has anyone had a similar issue?
> >
> > I am using the Luminous 12.2.12 version. Log attached below.
> >
> > Thanks,
> > Samuel
> >
> > **
> > [root@horeb72 ceph]# head -400 ceph-osd.4.log
> > 2019-08-11 07:30:02.186519 7f69bd020700 0 -- 192.168.10.72:6805/5915 >> 192.168.10.73:6801/4096 conn(0x56549cfc0800 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 15 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.186871 7f69bd020700 0 -- 192.168.10.72:6805/5915 >> 192.168.10.73:6801/4096 conn(0x56549cfc0800 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 16 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.242291 7f69bc81f700 0 -- 192.168.10.72:6805/5915 >> 192.168.10.71:6805/5046 conn(0x5654b93ed000 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 15 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.242554 7f69bc81f700 0 -- 192.168.10.72:6805/5915 >> 192.168.10.71:6805/5046 conn(0x5654b93ed000 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 16 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.260295 7f69bc81f700 0 -- 192.168.10.72:6805/5915 >> 192.168.10.73:6806/4864 conn(0x56544de16800 :6805 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 15 vs existing csq=15 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
> > 2019-08-11 17:11:01.968247 7ff4822f1d80 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
> > 2019-08-11 17:11:01.968333 7ff4822f1d80 0 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process ceph-osd, pid 1048682
> > 2019-08-11 17:11:01.970611 7ff4822f1d80 0 pidfile_write: ignore empty --pid-file
> > 2019-08-11 17:11:01.991542 7ff4822f1d80 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
> > 2019-08-11 17:11:01.997597 7ff4822f1d80 0 load: jerasure load: lrc load: isa
> > 2019-08-11 17:11:01.997710 7ff4822f1d80 1 bdev create path /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:01.997723 7ff4822f1d80 1 bdev(0x564774656c00 /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:01.998127 7ff4822f1d80 1 bdev(0x564774656c00 /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:01.998231 7ff4822f1d80 1 bdev(0x564774656c00 /var/lib/ceph/osd/ceph-4/block) close
> > 2019-08-11 17:11:02.265144 7ff4822f1d80 1 bdev create path /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:02.265177 7ff4822f1d80 1 bdev(0x564774658a00 /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:02.265695 7ff4822f1d80 1 bdev(0x564774658a00 /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.266233 7ff4822f1d80 1 bdev create path /var/lib/ceph/osd/ceph-4/block.db type kernel
> > 2019-08-11 17:11:02.266256 7ff4822f1d80 1 bdev(0x564774589a00 /var/lib/ceph/osd/ceph-4/block.db) open path /var/lib/ceph/osd/ceph-4/block.db
> > 2019-08-11 17:11:02.266812 7ff4822f1d80 1 bdev(0x564774589a00 /var/lib/ceph/osd/ceph-4/block.db) open size 2759360 (0x6fc20, 27.9GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.266998 7ff4822f1d80 1 bdev create path /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:02.267015 7ff4822f1d80 1 bdev(0x564774659a00 /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:02.267412 7ff4822f1d80 1 bdev(0x564774659a00 /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.298355 7ff4822f1d80 0 set
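For reference, one straightforward way to capture the logging Brad and Neha asked for is to raise the debug levels just for that OSD before restarting it. A minimal sketch, assuming the daemon is osd.4 and that options are still managed via ceph.conf on that host (injectargs works too once the daemon is up):

    [osd.4]
        debug osd = 20
        debug rocksdb = 20
        debug bluestore = 20

Then restart the OSD, reproduce the crash, and attach the resulting /var/log/ceph/ceph-osd.4.log to the tracker ticket.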
Re: [ceph-users] optane + 4x SSDs for VM disk images?
>> Could performance of Optane + 4x SSDs per node ever exceed that of
>> pure Optane disks?
>
> No. With Ceph, the results for Optane and just for good server SSDs are
> almost the same. One thing is that you can run more OSDs per an Optane
> than per a usual SSD. However, the latency you get from both is almost
> the same as most of it comes from Ceph itself, not from the underlying
> storage. This also results in Optanes being useless for
> block.db/block.wal if your SSDs aren't shitty desktop ones.
>
> And as usual I'm posting the link to my article
> https://yourcmc.ru/wiki/Ceph_performance :)

You write that they are not reporting QD=1 single-threaded numbers, but Tables 10 and 11 report the average latencies, which is "close to the same", so from those you can get:

Read latency: 0.32 ms (thereby 3125 IOPS)
Write latency: 1.1 ms (thereby 909 IOPS)

(At QD=1, IOPS is just the reciprocal of the latency: 1/0.00032 s ≈ 3125 and 1/0.0011 s ≈ 909.)

Really nice writeup and very true - it should be a must-read for anyone starting out with Ceph.

--
Jesper
Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich wrote:
> ~ is the internal implementation of device classes. Internally it's
> still using separate roots, that's how it stays compatible with older
> clients that don't know about device classes.

That makes sense.

> And since it wasn't mentioned here yet: consider upgrading to Nautilus
> to benefit from the new and improved accounting for metadata space.
> You'll be able to see how much space is used for metadata and quotas
> should work properly for metadata usage.

I think I'm not explaining this well and it is confusing people. I don't want to limit the size of the metadata pool, and I also don't want to limit the size of the data pool, since the cluster's flexibility could leave such a quota out of date at any time and probably useless (we want to use as much space as possible for data).

I would like to reserve space for the metadata pool so that no other pool can touch it, much like when you thick-provision a VM disk file: the space is guaranteed for that entity and no one else can use it, even if it is mostly empty. So far people have only told me how to limit the space of a pool, which is not what I'm looking for.

Thank you,
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] optane + 4x SSDs for VM disk images?
> Could performance of Optane + 4x SSDs per node ever exceed that of
> pure Optane disks?

No. With Ceph, the results for Optane and just for good server SSDs are almost the same. One thing is that you can run more OSDs per an Optane than per a usual SSD. However, the latency you get from both is almost the same as most of it comes from Ceph itself, not from the underlying storage. This also results in Optanes being useless for block.db/block.wal if your SSDs aren't shitty desktop ones.

And as usual I'm posting the link to my article https://yourcmc.ru/wiki/Ceph_performance :)
Re: [ceph-users] optane + 4x SSDs for VM disk images?
The problem with caching is that if the performance delta between the two storage types isn't large enough, the cost of the caching algorithms and the complexity of managing everything outweigh the performance gains.

With Optanes vs. SSDs, the main thing to consider is how busy the devices are in the worst case. Optanes have incredibly low latency and therefore are great at being fast with smaller workloads (as measured by the effective queue depth in iostat) -- but the Optanes I've used typically max out at a queue depth of 15 or so. SSDs aren't as fast at single workloads, but the ones I typically use in my ivory tower work better when they are multitasking a lot -- and depending on the type can easily outperform Optanes when there are many things going on at once (in my case the preferred drive is the SN200/SN260, which is excellent at high queue depth workloads).

So the short suggestion is: don't waste time with caching, and analyze your actual workload a bit to be 100% sure where the bottleneck is. Unless you require stupidly low latency and the fastest possible performance for a single user, you will get more bang for your buck by adding more SSDs. If you're not using HDDs, Ceph is usually CPU bound due to limitations in the threading model -- so keep that in mind too (sounds like you already know this if you're doing 4 partitions per Optane :) ).

Mark
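If it helps, the effective queue depth mentioned above can be read straight from extended iostat output on the OSD hosts while the cluster is under its normal load - for example (the device names are just placeholders):

    iostat -x 1 /dev/nvme0n1 /dev/sdb

Watch the queue-size column (aqu-sz, or avgqu-sz on older sysstat versions) together with %util to see whether the Optane or the SSDs are the busier device.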
Re: [ceph-users] optane + 4x SSDs for VM disk images?
On 11/08/2019 19:46, Victor Hooi wrote:
> Hi,
>
> I am building a 3-node Ceph cluster to store VM disk images. We are running Ceph Nautilus with KVM.
>
> Each node has:
> Xeon 4116
> 512 GB RAM
> Optane 905p NVMe disk with 980 GB
>
> Previously, I was creating four OSDs per Optane disk, and using only Optane disks for all storage.
>
> However, if I could get say 4x 980 GB SSDs for each node, would that improve anything? Is there a good way of using the Optane disks as a cache? (WAL?) Or what would be a good way of making use of this hardware for VM disk images?
>
> Could performance of Optane + 4x SSDs per node ever exceed that of pure Optane disks?
>
> Thanks,
> Victor

Generally I would go with adding the SSDs: it gives you good capacity + overall performance per dollar, and it's a common deployment in Ceph. Ceph does have overhead, so trying to push extreme performance may be costly.

To answer your questions: latency and single-stream IOPS will always be better with pure Optane. So if you have a few client streams / low queue depth, then adding SSDs will make it slower. If you have a lot of client streams, you can get higher total IOPS if you add SSDs and use your Optane as WAL/DB. 4 SSDs could be in the ballpark, but you should stress test your cluster and measure the %busy of all your disks (Optane + SSDs) to make sure they are equally busy at that ratio; if your Optane is less busy, you can add further SSDs, increasing overall IOPS. You can run the test described below to compare the two cases.

So performance depends on what you want: if you want the highest IOPS and can sacrifice latency, then a hybrid solution is better. If you need absolute latency, then stay with all Optane. As stated, Ceph does have overhead, so the gain in latency as a ratio is costly.

For caching: I would not recommend bcache/dm-cache except for HDDs. Possibly dm-writecache can show slight write-latency improvements and may be a middle ground if you really want to squeeze latency.

/Maged
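To put rough numbers on the single-stream vs. many-streams question before buying anything, rados bench makes for a quick comparison. A sketch only, assuming a throwaway test pool named benchpool exists (it writes real objects, so don't point it at a production pool):

    # one client stream, 4K writes: latency-bound, favours pure Optane
    rados bench -p benchpool 60 write -b 4096 -t 1

    # sixteen parallel streams: closer to a busy VM workload, where extra SSDs can help
    rados bench -p benchpool 60 write -b 4096 -t 16

Run iostat on the OSD nodes at the same time to see how busy the Optane is compared to the SSDs.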
[ceph-users] Scrub start-time and end-time
Hi,

I have a few questions regarding the options for limiting scrubbing to a certain time frame: "osd scrub begin hour" and "osd scrub end hour".

Is it allowed to have the scrub period cross midnight? E.g. have the start time at 22:00 and the end time at 07:00 the next morning.

I assume that if you only configure one of them, it still behaves as if it were unconfigured?

/Torben
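In case it is useful, the two options can be set cluster-wide roughly like this - a sketch using the centralized config store of Mimic/Nautilus; on older releases the same values would go under [osd] in ceph.conf, and whether a begin hour later than the end hour really wraps across midnight is worth verifying in the documentation for your exact release:

    ceph config set osd osd_scrub_begin_hour 22
    ceph config set osd osd_scrub_end_hour 7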
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
I've been copying happily for days now (not very fast, but the MDSs were stable), but eventually the MDSs started flapping again due to large cache sizes (they are being killed after 11M inodes). I could solve the problem by temporarily increasing the cache size in order to allow them to rejoin, but it tells me that my settings do not fully solve the problem yet (unless perhaps I increase the trim threshold even further).

On 06.08.19 19:52, Janek Bevendorff wrote:
>> Your parallel rsync job is only getting 150 creates per second? What
>> was the previous throughput?
>
> I am actually not quite sure what the exact throughput was or is or what I can expect. It varies so much. I am copying from a 23GB file list that is split into 3000 chunks, which are then processed by 16-24 parallel rsync processes. I have copied 27 of 64TB so far (according to df -h) and to my taste it's taking a lot longer than it should. The main problem here is not that I'm trying to copy 64TB (a drop in the bucket), the problem is that it's 64TB in tiny, small, and medium-sized files. This whole MDS mess and several pauses and restarts in between have completely distorted my sense of how far into the process I actually am or how fast I would expect it to go. Right now it's starting again from the beginning, so I expect it'll be another day or so until it starts moving some real data again.
>
>> The cache size looks correct here.
>
> Yeah. The cache appears to be constant-size now. I am still getting occasional "client failing to respond to cache pressure" warnings, but they go away as fast as they came.
>
>> Try pinning if possible in each parallel rsync job.
>
> I was considering that, but couldn't come up with a feasible pinning strategy. We have all those files of very different sizes spread very unevenly across a handful of top-level directories. I get the impression that I couldn't do much (or any) better than the automatic balancer.
>
>> Here are tracker tickets to resolve the issues you encountered:
>>
>> https://tracker.ceph.com/issues/41140
>> https://tracker.ceph.com/issues/41141
>
> Thanks a lot!
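For reference, the two knobs discussed above look roughly like this - a sketch only: the 16 GiB value and the /mnt/cephfs path are made-up placeholders, and ceph config set assumes Mimic or newer (on older releases put mds_cache_memory_limit into ceph.conf instead):

    # raise the MDS cache memory limit to 16 GiB (the value is in bytes)
    ceph config set mds mds_cache_memory_limit 17179869184

    # pin a directory subtree to MDS rank 0; run on a CephFS client against the mounted path
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/some-top-level-dir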