[ceph-users] ceph osd crash help needed

2019-08-12 Thread response
Hi all,

I have a production cluster; I recently purged all snaps.

Now, on a set of OSDs, when they backfill, I'm getting an assert like the one below:

-4> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=
[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit 
Started
/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.64
-3> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=
[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter 
Started/Primary/Active/Backfilling
-2> 2019-08-13 00:25:14.653 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] 
local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] 
backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.b4b8:head
-1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t 
SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 
00:25:14.759270
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED 
ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x55989e4a6450]
 2: (()+0x517628) [0x55989e4a6628]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 4: 
(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, 
pg_stat_t*)+0x297) [0x55989e7b2197]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, 
bool*)+0xfdc) [0x55989e7e059c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, 
unsigned long*)+0x110b) [0x55989e7e468b]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) 
[0x55989e6544d7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
[0x55989ec2ba74]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 12: (()+0x7fa3) [0x7ff47f718fa3]
 13: (clone()+0x3f) [0x7ff47f2c84cf]

 0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
 in thread 7ff4637b1700 thread_name:tp_osd_tp

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus 
(stable)
 1: (()+0x12730) [0x7ff47f723730]
 2: (gsignal()+0x10b) [0x7ff47f2067bb]
 3: (abort()+0x121) [0x7ff47f1f1535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a3) [0x55989e4a64a1]
 5: (()+0x517628) [0x55989e4a6628]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 7: 
(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, 
pg_stat_t*)+0x297) [0x55989e7b2197]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, 
bool*)+0xfdc) [0x55989e7e059c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, 
unsigned long*)+0x110b) [0x55989e7e468b]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 12: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
[0x55989ec2ba74]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 15: (()+0x7fa3) [0x7ff47f718fa3]
 16: (clone()+0x3f) [0x7ff47f2c84cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



 FAILED ceph_assert(clone_overlap.count(clone))

If possible I'd like to 'nuke' this from the OSD, as there are no snaps active, 
but I would love some advice on the best way to go about this.
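
The direction I'm considering (happy to be told it's wrong) is offline surgery 
with ceph-objectstore-tool against the stopped OSD -- only a rough sketch, with 
the object spec and clone id left as placeholders, and I'd obviously confirm 
against the docs / a tracker ticket before touching a production OSD:

   # OSD must be stopped first (systemctl stop ceph-osd@99)
   # 1) locate the problematic object in the affected PG:
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 --pgid 0.12ed \
       --op list | grep rbd_data.dae7bc6b8b4567
   # 2) dump it to inspect its SnapSet / clone_overlap:
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
       '<json object spec from step 1>' dump
   # 3) only if the dump confirms a stale clone entry, drop its clone metadata:
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
       '<json object spec>' remove-clone-metadata <cloneid>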

best regards
Kevin Myers


Re: [ceph-users] Possibly a bug on rocksdb

2019-08-12 Thread Neha Ojha
Hi Samuel,

You can use https://tracker.ceph.com/issues/41211 to provide the
information that Brad requested, along with a log at debug_osd=20;
adding debug_rocksdb=20 and debug_bluestore=20 might also be useful.
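
For example (osd.4 below is just a placeholder id -- point it at the daemon
that hit the assert, or set the options in ceph.conf before restarting it):

   # raise debug levels at runtime:
   ceph tell osd.4 injectargs '--debug_osd 20 --debug_rocksdb 20 --debug_bluestore 20'

   # or persistently, in ceph.conf under [osd]:
   #   debug osd = 20
   #   debug rocksdb = 20
   #   debug bluestore = 20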

Thanks,
Neha



On Sun, Aug 11, 2019 at 4:18 PM Brad Hubbard  wrote:
>
> Could you create a tracker for this?
>
> Also, if you can reproduce this could you gather a log with
> debug_osd=20 ? That should show us the superblock it was trying to
> decode as well as additional details.
>
> On Mon, Aug 12, 2019 at 6:29 AM huxia...@horebdata.cn
>  wrote:
> >
> > Dear folks,
> >
> > I had an OSD go down, not because of a bad disk, but most likely because of 
> > a bug hit in RocksDB. Has anyone had a similar issue?
> >
> > I am using Luminous version 12.2.12. Log attached below.
> >
> > thanks,
> > Samuel
> >
> > **
> > [root@horeb72 ceph]# head -400 ceph-osd.4.log
> > 2019-08-11 07:30:02.186519 7f69bd020700  0 -- 192.168.10.72:6805/5915 >> 
> > 192.168.10.73:6801/4096 conn(0x56549cfc0800 :6805 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
> > accept connect_seq 15 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.186871 7f69bd020700  0 -- 192.168.10.72:6805/5915 >> 
> > 192.168.10.73:6801/4096 conn(0x56549cfc0800 :6805 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
> > accept connect_seq 16 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.242291 7f69bc81f700  0 -- 192.168.10.72:6805/5915 >> 
> > 192.168.10.71:6805/5046 conn(0x5654b93ed000 :6805 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
> > accept connect_seq 15 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.242554 7f69bc81f700  0 -- 192.168.10.72:6805/5915 >> 
> > 192.168.10.71:6805/5046 conn(0x5654b93ed000 :6805 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
> > accept connect_seq 16 vs existing csq=15 existing_state=STATE_STANDBY
> > 2019-08-11 07:30:02.260295 7f69bc81f700  0 -- 192.168.10.72:6805/5915 >> 
> > 192.168.10.73:6806/4864 conn(0x56544de16800 :6805 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
> > accept connect_seq 15 vs existing csq=15 
> > existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
> > 2019-08-11 17:11:01.968247 7ff4822f1d80 -1 WARNING: the following dangerous 
> > and experimental features are enabled: bluestore,rocksdb
> > 2019-08-11 17:11:01.968333 7ff4822f1d80  0 ceph version 12.2.12 
> > (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process 
> > ceph-osd, pid 1048682
> > 2019-08-11 17:11:01.970611 7ff4822f1d80  0 pidfile_write: ignore empty 
> > --pid-file
> > 2019-08-11 17:11:01.991542 7ff4822f1d80 -1 WARNING: the following dangerous 
> > and experimental features are enabled: bluestore,rocksdb
> > 2019-08-11 17:11:01.997597 7ff4822f1d80  0 load: jerasure load: lrc load: 
> > isa
> > 2019-08-11 17:11:01.997710 7ff4822f1d80  1 bdev create path 
> > /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:01.997723 7ff4822f1d80  1 bdev(0x564774656c00 
> > /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:01.998127 7ff4822f1d80  1 bdev(0x564774656c00 
> > /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 
> > 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:01.998231 7ff4822f1d80  1 bdev(0x564774656c00 
> > /var/lib/ceph/osd/ceph-4/block) close
> > 2019-08-11 17:11:02.265144 7ff4822f1d80  1 bdev create path 
> > /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:02.265177 7ff4822f1d80  1 bdev(0x564774658a00 
> > /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:02.265695 7ff4822f1d80  1 bdev(0x564774658a00 
> > /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 
> > 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.266233 7ff4822f1d80  1 bdev create path 
> > /var/lib/ceph/osd/ceph-4/block.db type kernel
> > 2019-08-11 17:11:02.266256 7ff4822f1d80  1 bdev(0x564774589a00 
> > /var/lib/ceph/osd/ceph-4/block.db) open path 
> > /var/lib/ceph/osd/ceph-4/block.db
> > 2019-08-11 17:11:02.266812 7ff4822f1d80  1 bdev(0x564774589a00 
> > /var/lib/ceph/osd/ceph-4/block.db) open size 2759360 (0x6fc20, 
> > 27.9GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.266998 7ff4822f1d80  1 bdev create path 
> > /var/lib/ceph/osd/ceph-4/block type kernel
> > 2019-08-11 17:11:02.267015 7ff4822f1d80  1 bdev(0x564774659a00 
> > /var/lib/ceph/osd/ceph-4/block) open path /var/lib/ceph/osd/ceph-4/block
> > 2019-08-11 17:11:02.267412 7ff4822f1d80  1 bdev(0x564774659a00 
> > /var/lib/ceph/osd/ceph-4/block) open size 858887553024 (0xc7f9b0, 
> > 800GiB) block_size 4096 (4KiB) non-rotational
> > 2019-08-11 17:11:02.298355 7ff4822f1d80  0  set 

Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread jesper
>> Could performance of Optane + 4x SSDs per node ever exceed that of
>> pure Optane disks?
>
> No. With Ceph, the results for Optane and just for good server SSDs are
> almost the same. One thing is that you can run more OSDs per an Optane
> than per a usual SSD. However, the latency you get from both is almost
> the same as most of it comes from Ceph itself, not from the underlying
> storage. This also results in Optanes being useless for
> block.db/block.wal if your SSDs aren't shitty desktop ones.
>
> And as usual I'm posting the link to my article
> https://yourcmc.ru/wiki/Ceph_performance :)

You write that they are not reporting QD=1 single-threaded numbers, but
Tables 10 and 11 do report the average latencies, which amount to nearly
the same thing, so one can derive:

Read latency: 0.32 ms (thus ~3125 IOPS)
Write latency: 1.1 ms (thus ~909 IOPS)
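
(Back-of-the-envelope: QD=1 IOPS is roughly the reciprocal of the per-op
latency, e.g.:)

   awk 'BEGIN { printf "read:  %.0f iops\n", 1/0.00032;
                printf "write: %.0f iops\n", 1/0.0011 }'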

Really nice write-up and very true -- it should be a must-read for anyone
starting out with Ceph.

-- 
Jesper



Re: [ceph-users] New CRUSH device class questions

2019-08-12 Thread Robert LeBlanc
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich  wrote:

> ~ is the internal implementation of device classes. Internally it's
> still using separate roots, that's how it stays compatible with older
> clients that don't know about device classes.
>

That makes sense.


> And since it wasn't mentioned here yet: consider upgrading to Nautilus
> to benefit from the new and improved accounting for metadata space.
> You'll be able to see how much space is used for metadata and quotas
> should work properly for metadata usage.
>

I think I'm not explaining this well and it is confusing people. I don't
want to limit the size of the metadata pool, and I don't want to limit the
size of the data pool either, since the cluster's flexibility could make
such a quota out of date at any time and probably useless (we want to use
as much space as possible for data). I would like to reserve space for the
metadata pool so that no other pool can touch it, much like thick
provisioning a VM disk file: the space is guaranteed for that entity and no
one else can use it, even if it is mostly empty. So far people have only
told me how to limit the space of a pool, which is not what I'm looking for.
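
The closest I can come up with myself is dedicating whole OSDs to the
metadata pool via a custom device class and CRUSH rule, which does give a
hard reservation at the cost of flexibility -- a rough sketch only (the
class, rule, OSD ids and pool name below are all made up):

   ceph osd crush rm-device-class osd.10 osd.11
   ceph osd crush set-device-class meta-ssd osd.10 osd.11
   ceph osd crush rule create-replicated meta-only default host meta-ssd
   ceph osd pool set cephfs_metadata crush_rule meta-only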

Thank you,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread vitalif

Could performance of Optane + 4x SSDs per node ever exceed that of
pure Optane disks?


No. With Ceph, the results for Optane and just for good server SSDs are 
almost the same. One thing is that you can run more OSDs per an Optane 
than per a usual SSD. However, the latency you get from both is almost 
the same as most of it comes from Ceph itself, not from the underlying 
storage. This also results in Optanes being useless for 
block.db/block.wal if your SSDs aren't shitty desktop ones.


And as usual I'm posting the link to my article 
https://yourcmc.ru/wiki/Ceph_performance :)
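
(If you want to measure the QD=1 numbers yourself, a single-threaded latency
test against an RBD image looks roughly like this -- the pool and image names
are placeholders:)

   fio --name=qd1-randwrite --ioengine=rbd --clientname=admin \
       --pool=rbd --rbdname=testimg \
       --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
       --time_based --runtime=60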



Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread Mark Lehrer
The problem with caching is that if the performance delta between the
two storage types isn't large enough, the cost of the caching
algorithms and the complexity of managing everything outweigh the
performance gains.

With Optanes vs. SSDs, the main thing to consider is how busy the
devices are in the worst case.  Optanes have incredibly low latency
and therefore are great at being fast with smaller workloads (as
measured by the effective queue depth in iostat) -- but the Optanes
I've used typically max out at a queue depth of 15 or so.  SSDs aren't
as fast at single workloads, but the ones I typically use in my ivory
tower work better when they are multitasking a lot -- and depending on
the type can easily outperform Optanes when there are many things
going on at once (in my case the preferred drive is the SN200/SN260
which is excellent at the high queue depth workloads).

So the short suggestion is: don't waste time with caching, and analyze
your actual workload a bit to be 100% sure where the bottleneck is.
Unless you require stupidly low latency and the fastest possible
performance for a single user, you will get more bang for your buck by
adding more SSDs.  If you're not using HDDs, Ceph is usually CPU bound
due to limitations in the threading model -- so keep that in mind too
(sounds like you already know this if you're doing 4 partitions per
Optane :) ).
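
A quick way to check is to watch the queue-size and utilization columns
under your real workload (the device names below are placeholders):

   # look at avgqu-sz/aqu-sz (effective queue depth) and %util per device
   iostat -x 1 /dev/nvme0n1 /dev/sdb /dev/sdc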

Mark


Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread Maged Mokhtar


On 11/08/2019 19:46, Victor Hooi wrote:

Hi

I am building a 3-node Ceph cluster to store VM disk images.

We are running Ceph Nautilus with KVM.

Each node has:

Xeon 4116
512GB ram per node
Optane 905p NVMe disk with 980 GB

Previously, I was creating four OSDs per Optane disk, and using only 
Optane disks for all storage.


However, if I could get say 4x980GB SSDs for each node, would that 
improve anything?


Is there a good way of using the Optane disks as a cache? (WAL?) Or 
what would be a good way of making use of this hardware for VM disk 
images?


Could performance of Optane + 4x SSDs per node ever exceed that of 
pure Optane disks?


Thanks,
Victor





Generally I would go with adding the SSDs: it gives you good capacity and 
overall performance per dollar, and it is a common deployment in Ceph. Ceph 
does have overhead, so trying to push extreme performance may be costly. 
To answer your questions:


Latency and single-stream IOPS will always be better with pure Optane. 
So if you have only a few client streams / a low queue depth, then adding 
SSDs will make it slower.


If you have a lot of client streams, you can get higher total IOPS if 
you add SSDs and use your Optane as WAL/DB. 4 SSDs could be in the 
ballpark, but you can stress test your cluster and measure the %busy of 
all your disks (Optane + SSDs) to make sure they are equally busy at 
that ratio; if your Optane is less busy, you can add further SSDs, 
increasing overall IOPS.
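
A hybrid OSD like that is created by carving a block.db partition out of the 
Optane, roughly like this (the device paths are placeholders; size the DB 
partition according to the BlueStore sizing guidance):

   ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1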


So it depends: if you want the highest IOPS and can sacrifice some 
latency, then a hybrid solution is better. If you need the absolute 
lowest latency, then stay with all Optane. As stated, Ceph does have 
overhead, so the relative gain in latency is costly.


For caching: I would not recommend bcache/dm-cache except for HDDs. 
dm-writecache can possibly show slight write-latency improvements and 
may be a middle ground if you really want to squeeze out latency.


/Maged












[ceph-users] Scrub start-time and end-time

2019-08-12 Thread Torben Hørup

Hi

I have a few questions regarding the options for limiting scrubbing 
to a certain time frame: "osd scrub begin hour" and "osd scrub end 
hour".


Is it allowed to have the scrub period cross midnight? E.g., start 
time at 22:00 and end time at 07:00 the next morning.


I assume that if you only configure one of them, it still behaves 
as if it were unconfigured?
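
For context, what I intend to set is roughly the following (shown with the 
centralized config of Mimic/Nautilus; on older releases the same options 
would go into ceph.conf under [osd]):

   ceph config set osd osd_scrub_begin_hour 22
   ceph config set osd osd_scrub_end_hour 7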



/Torben


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-12 Thread Janek Bevendorff
I had been copying happily for days (not very fast, but the MDSs were 
stable), but eventually the MDSs started flapping again due to large 
cache sizes (they are being killed after 11M inodes). I could solve the 
problem by temporarily increasing the cache size in order to allow them 
to rejoin, but this tells me that my settings do not fully solve the 
problem yet (unless perhaps I increase the trim threshold even further).
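
(For anyone following along, bumping the cache limit is just a config 
change -- the 16 GiB value below is an arbitrary example:)

   ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB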



On 06.08.19 19:52, Janek Bevendorff wrote:

Your parallel rsync job is only getting 150 creates per second? What
was the previous throughput?

I am actually not quite sure what the exact throughput was or is or what
I can expect. It varies so much. I am copying from a 23GB file list that
is split into 3000 chunks which are then processed by 16-24 parallel
rsync processes. I have copied 27 of 64TB so far (according to df -h)
and to my taste it's taking a lot longer than it should. The
main problem here is not that I'm trying to copy 64TB (drop in the
bucket), the problem is that it's 64TB in tiny, small, and medium-sized
files.

This whole MDS mess and several pauses and restarts in between have
completely distorted my sense of how far in the process I actually am or
how fast I would expect it to go. Right now it's starting again from the
beginning, so I expect it'll be another day or so until it starts moving
some real data again.


The cache size looks correct here.

Yeah. Cache appears to be constant-size now. I am still getting
occasional "client failing to respond to cache pressure", but that goes
away as fast as it came.



Try pinning if possible in each parallel rsync job.

I was considering that, but couldn't come up with a feasible pinning
strategy. We have all those files of very different sizes spread very
unevenly across a handful of top-level directories. I get the impression
that I couldn't do much (or any) better than the automatic balancer.
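
(For reference, the pinning itself would only be a per-directory extended 
attribute, e.g. pinning two top-level directories to different ranks -- the 
paths below are made up:)

   setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir-a
   setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir-b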



Here are tracker tickets to resolve the issues you encountered:

https://tracker.ceph.com/issues/41140
https://tracker.ceph.com/issues/41141

Thanks a lot!
