Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread Mark Nelson
The way I try to look at this is: 1) How much more do the enterprise-grade drives cost? 2) What are the benefits (faster performance, longer life, etc.)? 3) How much does it cost to deal with downtime, diagnose issues, and replace malfunctioning hardware? My personal take is that

Re: [ceph-users] very high ram usage by OSDs on Nautilus

2019-10-29 Thread Mark Nelson
Ok, assuming my math is right you've got ~14G of data in the mempools: ~6.5GB bluestore data, ~1.8GB bluestore onode, ~5GB bluestore other; the rest is other misc stuff.  That seems to be pretty in line with the numbers you posted in your screenshot, i.e. this doesn't appear to be a leak, but

Re: [ceph-users] very high ram usage by OSDs on Nautilus

2019-10-28 Thread Mark Nelson
Hi Philippe, Have you looked at the mempool stats yet? ceph daemon osd.NNN dump_mempools You may also want to look at the heap stats, and potentially enable debug 5 for bluestore to see what the priority cache manager is doing.  Typically in these cases we end up seeing a ton of memory
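For reference, the two diagnostics mentioned above can be pulled from the OSD's admin socket roughly like this (a sketch; osd.NNN is a placeholder for the affected OSD id):

```bash
# Per-pool memory accounting (bluestore_cache_data, bluestore_cache_onode, etc.)
ceph daemon osd.NNN dump_mempools

# tcmalloc heap statistics: compare "bytes in use" against "bytes mapped"
# to spot fragmentation or memory that hasn't been released back to the OS
ceph daemon osd.NNN heap stats
```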

Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Mark Nelson
On 8/26/19 7:39 AM, Wido den Hollander wrote: On 8/26/19 1:35 PM, Simon Oosthoek wrote: On 26-08-19 13:25, Simon Oosthoek wrote: On 26-08-19 13:11, Wido den Hollander wrote: The reweight might actually cause even more confusion for the balancer. The balancer uses upmap mode and that

[ceph-users] hsbench 0.2 released

2019-08-22 Thread Mark Nelson
Hi Folks, I've updated hsbench (new S3 benchmark) to 0.2 Notable changes since 0.1: - Can now output CSV results - Can now output JSON results - Fix for poor read performance with low thread counts - New bucket listing benchmark with a new "mk" flag that lets you control the number of
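A typical invocation looks something like the sketch below; the flag names follow my reading of the hsbench README and may differ between versions, so verify with `hsbench --help` (endpoint, keys, and sizes are placeholders):

```bash
# 10 threads against 10 buckets of 4K objects for 60s, writing CSV and JSON reports
hsbench -a ACCESS_KEY -s SECRET_KEY \
        -u http://rgw.example.com:7480 \
        -z 4K -t 10 -b 10 -d 60 \
        -o results.csv -j results.json
```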

Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Mark Nelson
Hi Vladimir, On 8/21/19 8:54 AM, Vladimir Brik wrote: Hello I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers

Re: [ceph-users] [RFC] New S3 Benchmark

2019-08-15 Thread Mark Nelson
On 8/15/19 6:41 PM, David Byte wrote: Mark, did the S3 engine for fio not work? Sent from my iPhone. Typos are Apple's fault. On Aug 15, 2019, at 6:37 PM, Mark Nelson wrote: Hi Guys, Earlier this week I was working on investigating the impact of OMAP performance on RGW and wanted to see

Re: [ceph-users] WAL/DB size

2019-08-15 Thread Mark Nelson
Hi Folks, The basic idea behind the WAL is that for every DB write transaction you first write it into an in-memory buffer and to a region on disk.  RocksDB is typically set up to have multiple WAL buffers, and when one or more fills up, it will start flushing the data to L0 while new writes

Re: [ceph-users] WAL/DB size

2019-08-14 Thread Mark Nelson
On 8/14/19 1:06 PM, solarflow99 wrote: Actually standalone WAL is required when you have either very small fast device (and don't want db to use it) or three devices (different in performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located at the fastest one.
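As a concrete sketch of that three-device layout (device paths below are examples only), the OSD would be created with the data on the HDD, the DB on the SSD, and the WAL on the fastest device:

```bash
# Data on HDD, RocksDB DB on SSD, WAL on NVMe
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/sdc1 \
    --block.wal /dev/nvme0n1p1
```

With only two device classes (HDD plus one fast device), a separate --block.wal is unnecessary; the WAL simply lives on the DB device.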

Re: [ceph-users] WAL/DB size

2019-08-13 Thread Mark Nelson
On 8/13/19 3:51 PM, Paul Emmerich wrote: On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander wrote: I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in use. No slow db in use. random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and 10GB omap for index and

Re: [ceph-users] out of memory bluestore osds

2019-08-07 Thread Mark Nelson
Hi Jaime, we only use the cache size parameters now if you've disabled autotuning.  With autotuning we adjust the cache size on the fly to try and keep the mapped process memory under the osd_memory_target.  You can set a lower memory target than default, though you will have far less cache
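A lower target can be set either cluster-wide or per OSD; a minimal sketch on releases with the centralized config store (2 GiB is just an illustrative value):

```bash
# Default for all OSDs, in bytes (2 GiB shown)
ceph config set osd osd_memory_target 2147483648

# Or change one running OSD on the fly
ceph tell osd.NNN config set osd_memory_target 2147483648
```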

Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Mark Nelson
You may be interested in using my wallclock profiler to look at lock contention: https://github.com/markhpc/gdbpmp It will greatly slow down the OSD but will show you where time is being spent and so far the results appear to at least be relatively informative.  I used it recently when
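For anyone who wants to try it, a rough sketch of collecting a profile (flags are as I recall them from the gdbpmp README, so check its --help; gdb must be installed, and the PID below is a placeholder):

```bash
git clone https://github.com/markhpc/gdbpmp
OSD_PID=12345   # PID of the ceph-osd process to profile

# Collect 1000 wallclock samples and write them to a trace file
./gdbpmp/gdbpmp.py -p "$OSD_PID" -n 1000 -o osd.gdbpmp

# Print the aggregated call tree from the trace
./gdbpmp/gdbpmp.py -i osd.gdbpmp
```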

Re: [ceph-users] Bluestore caching oddities, again

2019-08-05 Thread Mark Nelson
On 8/4/19 7:36 PM, Christian Balzer wrote: Hello, On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote: On 8/4/19 6:09 AM, Paul Emmerich wrote: On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer wrote: 2. Bluestore caching still broken When writing data with the fios below, it isn't

Re: [ceph-users] Bluestore caching oddities, again

2019-08-04 Thread Mark Nelson
On 8/4/19 6:09 AM, Paul Emmerich wrote: On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer wrote: 2. Bluestore caching still broken When writing data with the fios below, it isn't cached on the OSDs. Worse, existing cached data that gets overwritten is removed from the cache, which while of

Re: [ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread Mark Nelson
Hi Danny, Are your ARM binaries built using tcmalloc?  At least on x86 we saw significantly higher memory fragmentation and memory usage with glibc malloc. First, you can look at the mempool stats which may provide a hint: ceph daemon osd.NNN dump_mempools Assuming you are using

Re: [ceph-users] New best practices for osds???

2019-07-26 Thread Mark Nelson
On 7/25/19 9:27 PM, Anthony D'Atri wrote: We run few hundred HDD OSDs for our backup cluster, we set one RAID 0 per HDD in order to be able to use -battery protected- write cache from the RAID controller. It really improves performance, for both bluestore and filestore OSDs. Having run

Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Mark Nelson
FWIW, the DB and WAL don't really do the same thing that the cache tier does.  The WAL is similar to filestore's journal, and the DB is primarily for storing metadata (onodes, blobs, extents, and OMAP data).  Offloading these things to an SSD will definitely help, but you won't see the same

Re: [ceph-users] which tool to use for benchmarking rgw s3, yscb or cosbench

2019-07-21 Thread Mark Nelson
Hi Wei Zhao, I've used ycsb for mongodb on rbd testing before.  It worked fine and was pretty straightforward to run.  The only real concern I had was that many of the default workloads used a zipfian distribution for reads.  This basically meant reads were entirely coming from cache and

Re: [ceph-users] Bluestore Runaway Memory

2019-07-18 Thread Mark Nelson
Hi Brett, Can you enable debug_bluestore = 5 and debug_prioritycache = 5 on one of the OSDs that's showing the behavior?  You'll want to look in the logs for lines that look like this: 2019-07-18T19:34:42.587-0400 7f4048b8d700  5 prioritycache tune_memory target: 4294967296 mapped:
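Roughly, that looks like the following on the affected OSD (a sketch; the log path assumes a default install):

```bash
# Turn up bluestore and priority-cache logging on one OSD
ceph tell osd.NNN config set debug_bluestore 5
ceph tell osd.NNN config set debug_prioritycache 5

# Watch for the cache tuning lines in that OSD's log
tail -f /var/log/ceph/ceph-osd.NNN.log | grep "prioritycache tune_memory"
```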

Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Mark Nelson
Some of the first performance studies we did back at Inktank were looking at RAID-0 vs JBOD setups! :)  You are absolutely right that the controller cache (especially write-back with a battery or supercap) can help with HDD-only configurations.  Where we typically saw problems was when you

Re: [ceph-users] bluestore_allocated vs bluestore_stored

2019-06-17 Thread Mark Nelson
Earlier in bluestore's life, we couldn't handle a 4K min_alloc size on NVMe without incurring pretty significant slowdowns (and also generally higher amounts of metadata in the DB).  Lately I've been seeing some indications that we've improved the stack to the point where 4K min_alloc no

Re: [ceph-users] Verifying current configuration values

2019-06-12 Thread Mark Nelson
On 6/12/19 5:51 PM, Jorge Garcia wrote: I'm following the bluestore config reference guide and trying to change the value for osd_memory_target. I added the following entry in the /etc/ceph/ceph.conf file:   [osd]   osd_memory_target = 2147483648 and restarted the osd daemons doing
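To check what a running OSD actually picked up (the usual gotchas being a daemon that wasn't restarted or a value overridden elsewhere), something like:

```bash
# Ask the daemon directly over its admin socket (run on the OSD host)
ceph daemon osd.NNN config get osd_memory_target

# On Mimic and later, the same can be read through the mon
ceph config show osd.NNN osd_memory_target
```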

Re: [ceph-users] OSD RAM recommendations

2019-06-07 Thread Mark Nelson
The truth of the matter is that folks try to boil this down to some kind of hard and fast rule but it's often not that simple. With our current default settings for pglog, rocksdb WAL buffers, etc, the OSD basically needs about 1GB of RAM for bare-bones operation (not under recovery or extreme

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Mark Nelson
On 5/3/19 1:38 AM, Denny Fuchs wrote: hi, I never noticed the Debian /etc/default/ceph :-) = # Increase tcmalloc cache size TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 that is what is active now. Yep, if you profile the OSD under a small write workload you can see

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson
On 5/2/19 1:51 PM, Igor Podlesny wrote: On Fri, 3 May 2019 at 01:29, Mark Nelson wrote: On 5/2/19 11:46 AM, Igor Podlesny wrote: On Thu, 2 May 2019 at 05:02, Mark Nelson wrote: [...] FWIW, if you still have an OSD up with tcmalloc, it's probably worth looking at the heap stats to see how

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson
On 5/2/19 11:46 AM, Igor Podlesny wrote: On Thu, 2 May 2019 at 05:02, Mark Nelson wrote: [...] FWIW, if you still have an OSD up with tcmalloc, it's probably worth looking at the heap stats to see how much memory tcmalloc thinks it's allocated vs how much RSS memory is being used

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Mark Nelson
On 5/1/19 12:59 AM, Igor Podlesny wrote: On Tue, 30 Apr 2019 at 20:56, Igor Podlesny wrote: On Tue, 30 Apr 2019 at 19:10, Denny Fuchs wrote: [..] Any suggestions ? -- Try different allocator. Ah, BTW, except memory allocator there's another option: recently backported bitmap allocator.

Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson
:01 PM Mark Nelson <mailto:mnel...@redhat.com>> wrote: Hi Charles, Basically the goal is to reduce write-amplification as much as possible.  The deeper that the rocksdb hierarchy gets, the worse the write-amplification for compaction is going to be.  If you look at th

Re: [ceph-users] How to reduce HDD OSD flapping due to rocksdb compacting event?

2019-04-12 Thread Mark Nelson
Hi Charles, Basically the goal is to reduce write-amplification as much as possible.  The deeper that the rocksdb hierarchy gets, the worse the write-amplification for compaction is going to be.  If you look at the OSD logs you'll see the write-amp factors for compaction in the rocksdb

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-10 Thread Mark Nelson
false positive. Thanks, I continue to read your resources. On Tuesday, 09 April 2019 at 09:30 -0500, Mark Nelson wrote: My understanding is that basically the kernel is either unable or uninterested (maybe due to lack of memory pressure?) in reclaiming the memory.  It's possible you might have

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Mark Nelson
ease that? Thanks, Olivier On Monday, 08 April 2019 at 16:09 -0500, Mark Nelson wrote: One of the difficulties with the osd_memory_target work is that we can't tune based on the RSS memory usage of the process. Ultimately it's up to the kernel to decide to reclaim memory and especially with tr

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-08 Thread Mark Nelson
One of the difficulties with the osd_memory_target work is that we can't tune based on the RSS memory usage of the process. Ultimately it's up to the kernel to decide to reclaim memory and especially with transparent huge pages it's tough to judge what the kernel is going to do even if memory

Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Mark Nelson
On 3/20/19 3:12 AM, Vitaliy Filippov wrote: `cpupower idle-set -D 0` will help you a lot, yes. However it seems that it's not only bluestore that makes it slow. >= 50% of the latency is introduced by the OSD itself. I'm just trying to understand WHAT parts of it are doing so much work. For example
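For context, that command disables every idle state with an exit latency above 0 µs, i.e. all deep C-states; a quick sketch of applying and verifying it (needs root and the cpupower tool):

```bash
# Disable all C-states with exit latency > 0us on every core
sudo cpupower idle-set -D 0

# Confirm which idle states are now reported as DISABLED
cpupower idle-info
```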

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson
On 3/12/19 8:40 AM, vita...@yourcmc.ru wrote: One way or another we can only have a single thread sending writes to rocksdb.  A lot of the prior optimization work on the write side was to get as much processing out of the kv_sync_thread as possible. That's still a worthwhile goal as it's

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson
Our default of 4 256MB WAL buffers is arguably already too big. On one hand we are making these buffers large to hopefully avoid short lived data going into the DB (pglog writes).  IE if a pglog write comes in and later a tombstone invalidating it comes in, we really want those to land in the

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson
On 3/12/19 7:31 AM, vita...@yourcmc.ru wrote: Decreasing the min_alloc size isn't always a win, but it can be in some cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we increased it to 16384 because at the time our metadata path was slow and increasing it resulted in a pretty

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson
On 3/12/19 7:24 AM, Benjamin Zapiec wrote: Hello, I was wondering why my Ceph block.db is nearly empty, so I started to investigate. The recommendations from Ceph are that block.db should be at least 4% of the size of block. So my OSD configuration looks like this: wal.db - not explicit
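One way to see how empty (or not) a block.db really is, and whether anything has spilled over to the slow device, is the bluefs perf counters; a sketch assuming jq is installed:

```bash
# db_used_bytes vs db_total_bytes, plus any spillover onto the main device
ceph daemon osd.NNN perf dump | \
    jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
```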

Re: [ceph-users] 13.2.4 odd memory leak?

2019-03-08 Thread Mark Nelson
On 3/8/19 8:12 AM, Steffen Winther Sørensen wrote: On 8 Mar 2019, at 14.30, Mark Nelson <mailto:mnel...@redhat.com>> wrote: On 3/8/19 5:56 AM, Steffen Winther Sørensen wrote: On 5 Mar 2019, at 10.02, Paul Emmerich <mailto:paul.emmer...@croit.io>> wrote: Yeah, there

Re: [ceph-users] 13.2.4 odd memory leak?

2019-03-08 Thread Mark Nelson
On 3/8/19 5:56 AM, Steffen Winther Sørensen wrote: On 5 Mar 2019, at 10.02, Paul Emmerich wrote: Yeah, there's a bug in 13.2.4. You need to set it to at least ~1.2GB. Yeap thanks, setting it at 1G+256M worked :) Hope this won’t bloat memory during coming weekend VM backups through CephFS

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson
On 3/6/19 5:12 AM, Stefan Priebe - Profihost AG wrote: Hi Mark, On 05.03.19 at 23:12, Mark Nelson wrote: Hi Stefan, Could you try running your random write workload against bluestore and then take a wallclock profile of an OSD using gdbpmp? It's available here: https://github.com/markhpc

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson
On 3/5/19 4:23 PM, Vitaliy Filippov wrote: Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch IO, or just fio -ioengine=rbd from outside a VM) is rather pointless - you're benchmarking the RBD cache, not Ceph itself. RBD cache is coalescing your writes into big sequential
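For comparison, a latency-bound test along those lines run against the block device inside the guest might look like this sketch (/dev/vdb is a placeholder; forcing a sync after every write keeps the RBD cache out of the measurement):

```bash
# WARNING: this writes to and destroys data on the target device
# 4k random writes, queue depth 1, fsync after every write
fio --name=sync-randwrite --filename=/dev/vdb \
    --ioengine=libaio --direct=1 --fsync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based
```

Running fio with --ioengine=rbd from outside the VM, as suggested above, removes the guest and its cache from the picture entirely.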

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Mark Nelson
Hi Stefan, Could you try running your random write workload against bluestore and then take a wallclock profile of an OSD using gdbpmp? It's available here: https://github.com/markhpc/gdbpmp Thanks, Mark On 3/5/19 2:29 AM, Stefan Priebe - Profihost AG wrote: Hello list, while the

Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson
observed higher write-amplification on our test nodes.  I suspect that might be a worthwhile trade-off for nvdimms or optane, but I'm not sure it's a good idea for typical NVMe drives. Mark On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson wrote: Hi, I've got a ryzen7 1700 box that I

Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson
Hi, I've got a ryzen7 1700 box that I regularly run tests on along with the upstream community performance test nodes that have Intel Xeon E5-2650v3 processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the

Re: [ceph-users] RBD poor performance

2019-02-27 Thread Mark Nelson
FWIW, I've got recent tests of a fairly recent master build (14.0.1-3118-gd239c2a) showing a single OSD hitting ~33-38K 4k randwrite IOPS with 3 client nodes running fio (io_depth = 32) both with RBD and with CephFS.  The OSD node had older gen CPUs (Xeon E5-2650 v3) and NVMe drives (Intel

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Mark Nelson
On 1/30/19 7:45 AM, Alexandre DERUMIER wrote: I don't see any smoking gun here... :/ I need to test to compare when latency are going very high, but I need to wait more days/weeks. The main difference between a warm OSD and a cold one is that on startup the bluestore cache is empty. You

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-18 Thread Mark Nelson
On 1/18/19 9:22 AM, Nils Fahldieck - Profihost AG wrote: Hello Mark, I'm answering on behalf of Stefan. On 18.01.19 at 00:22, Mark Nelson wrote: On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote: Hello Mark, after reading http://docs.ceph.com/docs/master/rados/configuration/bluestore

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson
On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote: Hello Mark, after reading http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ again I'm really confused how the behaviour is exactly under 12.2.8 regarding memory and 12.2.10. Also I stumbled upon "When tcmalloc

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson
Hi Stefan, I'm taking a stab at reproducing this in-house.  Any details you can give me that might help would be much appreciated.  I'll let you know what I find. Thanks, Mark On 1/16/19 1:56 PM, Stefan Priebe - Profihost AG wrote: i reverted the whole cluster back to 12.2.8 - recovery

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-16 Thread Mark Nelson
Hi Stefan, 12.2.9 included the pg hard limit patches and the osd_memory_autotuning patches.  While at first I was wondering if this was autotuning, it sounds like it may be more related to the pg hard limit.  I'm not terribly familiar with those patches though so some of the other members

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Mark Nelson
On 1/15/19 9:02 AM, Stefan Priebe - Profihost AG wrote: On 15.01.19 at 12:45, Marc Roos wrote: I upgraded this weekend from 12.2.8 to 12.2.10 without such issues (osd's are idle) it turns out this was a kernel bug. Updating to a newer kernel has solved this issue. Greets, Stefan Hi

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-14 Thread Mark Nelson
Hi Stefan, Any idea if the reads are constant or bursty?  One cause of heavy reads is when rocksdb is compacting and has to read SST files from disk.  It's also possible you could see heavy read traffic during writes if data has to be read from SST files rather than cache. It's possible this

Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-12-13 Thread Mark Nelson
Hi Florian, On 12/13/18 7:52 AM, Florian Haas wrote: On 02/12/2018 19:48, Florian Haas wrote: Hi Mark, just taking the liberty to follow up on this one, as I'd really like to get to the bottom of this. On 28/11/2018 16:53, Florian Haas wrote: On 28/11/2018 15:52, Mark Nelson wrote: Option

Re: [ceph-users] SLOW SSD's after moving to Bluestore

2018-12-10 Thread Mark Nelson
Hi Tyler, I think we had a user a while back that reported they had background deletion work going on after upgrading their OSDs from filestore to bluestore due to PGs having been moved around.  Is it possible that your cluster is doing a bunch of work (deletion or otherwise) beyond the

Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-11-28 Thread Mark Nelson
On 11/28/18 8:36 AM, Florian Haas wrote: On 14/08/2018 15:57, Emmanuel Lacour wrote: Le 13/08/2018 à 16:58, Jason Dillaman a écrit : See [1] for ways to tweak the bluestore cache sizes. I believe that by default, bluestore will not cache any data but instead will only attempt to cache its

Re: [ceph-users] RGW performance with lots of objects

2018-11-27 Thread Mark Nelson
Hi Robert, Solved is probably a strong word.  I'd say that things have improved.  Bluestore in general tends to handle large numbers of objects better than filestore does for several reasons including that it doesn't suffer from pg directory splitting (though RocksDB compaction can become a

Re: [ceph-users] bucket indices: ssd-only or is a large fast block.db sufficient?

2018-11-20 Thread Mark Nelson
One consideration is that you may not be able to fit higher DB levels on the db partition and end up with a lot of waste (Nick Fisk recently saw this on his test cluster).  We've talked about potentially trying to pre-compute the hierarchy sizing so that we can align a level boundary to fit

Re: [ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Mark Nelson
On 11/14/18 1:45 PM, Vladimir Brik wrote: Hello I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400 PGs each (a lot more pools use SSDs than HDDs). Servers are fairly powerful: 48 HT cores, 192GB of

Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-05 Thread Mark Nelson
FWIW, here are values I measured directly from the RocksDB SST files under different small write workloads (ie the ones where you'd expect a larger DB footprint): https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing These tests were only with 256GB of data

Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Mark Nelson
On 09/10/2018 12:22 PM, Igor Fedotov wrote: Hi Nick. On 9/10/2018 1:30 PM, Nick Fisk wrote: If anybody has 5 minutes could they just clarify a couple of things for me 1. onode count, should this be equal to the number of objects stored on the OSD? Through reading several posts, there

Re: [ceph-users] Increase tcmalloc thread cache bytes - still recommended?

2018-07-19 Thread Mark Nelson
I believe that the standard mechanisms for launching OSDs already sets the thread cache higher than default.  It's possible we might be able to relax that now as async messenger doesn't thrash the cache as badly as simple messenger did.  I suspect there's probably still some value to

Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Mark Nelson
Hi Uwe, As luck would have it we were just looking at memory allocators again and ran some quick RBD and RGW tests that stress memory allocation: https://drive.google.com/uc?export=download&id=1VlWvEDSzaG7fE4tnYfxYtzeJ8mwx4DFg The gist of it is that tcmalloc looks like it's doing pretty well

Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Mark Nelson
On 04/01/2018 07:59 PM, Christian Balzer wrote: Hello, firstly, Jack pretty much correctly correlated my issues to Mark's points, more below. On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote: On 03/29/2018 08:59 PM, Christian Balzer wrote: Hello, my crappy test cluster was rendered

Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Mark Nelson
On 03/29/2018 08:59 PM, Christian Balzer wrote: Hello, my crappy test cluster was rendered inoperational by an IP renumbering that wasn't planned and forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted

Re: [ceph-users] What do you use to benchmark your rgw?

2018-03-28 Thread Mark Nelson
Personally I usually use a modified version of Mark Seger's getput tool here: https://github.com/markhpc/getput/tree/wip-fix-timing The difference between this version and upstream is primarily to make getput more accurate/useful when using something like CBT for orchestration instead of the

Re: [ceph-users] [Cbt] Poor libRBD write performance

2017-11-20 Thread Mark Nelson
On 11/20/2017 10:06 AM, Moreno, Orlando wrote: Hi all, I’ve been experiencing weird performance behavior when using FIO RBD engine directly to an RBD volume with numjobs > 1. For a 4KB random write test at 32 QD and 1 numjob, I can get about 40K IOPS, but when I increase the numjobs to 4, it

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-16 Thread Mark Nelson
, Radoslav Nikiforov wrote: No, What test parameters (iodepth/file size/numjobs) would make sense for 3 node/27OSD@4TB ? - Rado -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Thursday, November 16, 2017 10:56 AM To: Milanov, Radoslav Nikiforov <rad...@bu.edu>;

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-16 Thread Mark Nelson
PM *To:* Milanov, Radoslav Nikiforov <rad...@bu.edu> *Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore I'd probably say 50GB to leave some extra space over-provisioned. 50GB should definitely pr

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-14 Thread Mark Nelson
Of Mark Nelson Sent: Tuesday, November 14, 2017 4:04 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Bluestore performance 50% of filestore Hi Radoslav, Is RBD cache enabled and in writeback mode? Do you have client side readahead? Both are doing better for writes than you'd expect

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-14 Thread Mark Nelson
Hi Radoslav, Is RBD cache enabled and in writeback mode? Do you have client side readahead? Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350

Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Mark Nelson
per physical drive or multiple..any recommendations ? In those tests 1 OSD per NVMe. You can do better if you put multiple OSDs on the same drive, both for filestore and bluestore. Mark Cheers /Maged On 2017-11-10 18:51, Mark Nelson wrote: FWIW, on very fast drives you can achieve at least

Re: [ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-10 Thread Mark Nelson
FWIW, on very fast drives you can achieve at least 1.4GB/s and 30K+ write IOPS per OSD (before replication). It's quite possible to do better but those are recent numbers on a mostly default bluestore configuration that I'm fairly confident to share. It takes a lot of CPU, but it's possible.

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Mark Nelson
e same SSD? Original message From: Nick Fisk <n...@fisk.me.uk> Date: 8/11/17 10:16 p.m. (GMT+01:00) To: 'Mark Nelson' <mnel...@redhat.com>, 'Wolfgang Lendl' <wolfgang.le...@meduniwien.ac.at> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] bluestore - wal,db on fas

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson
On 11/08/2017 03:16 PM, Nick Fisk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 08 November 2017 19:46 To: Wolfgang Lendl <wolfgang.le...@meduniwien.ac.at> Cc: ceph-users@lists.ceph.com Subject: Re: [ceph

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson
nse in rbd environments - correct? br wolfgang On 11/08/2017 02:21 PM, Mark Nelson wrote: Hi Wolfgang, In bluestore the WAL serves sort of a similar purpose to filestore's journal, but bluestore isn't dependent on it for guaranteeing durability of large writes. With bluestore you can often get hig

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson
Hi Wolfgang, In bluestore the WAL serves sort of a similar purpose to filestore's journal, but bluestore isn't dependent on it for guaranteeing durability of large writes. With bluestore you can often get higher large-write throughput than with filestore when using HDD-only or flash-only

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson
On 11/03/2017 08:25 AM, Wido den Hollander wrote: On 3 November 2017 at 13:33, Mark Nelson <mnel...@redhat.com> wrote: On 11/03/2017 02:44 AM, Wido den Hollander wrote: On 3 November 2017 at 0:09, Nigel Williams <nigel.willi...@tpac.org.au> wrote: On 3 November 2

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson
On 11/03/2017 04:08 AM, Jorge Pinilla López wrote: well I haven't found any recomendation either but I think that sometimes the SSD space is being wasted. If someone wanted to write it, you could have bluefs share some of the space on the drive for hot object data and release space as

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson
On 11/03/2017 02:44 AM, Wido den Hollander wrote: On 3 November 2017 at 0:09, Nigel Williams wrote: On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote: I want to bring this subject back into the light and hope someone can provide

Re: [ceph-users] Bluestore with SSD-backed DBs; what if the SSD fails?

2017-10-25 Thread Mark Nelson
On 10/25/2017 03:51 AM, Caspar Smit wrote: Hi, I've asked the exact same question a few days ago, same answer: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021708.html I guess we'll have to bite the bullet on this one and take this into account when designing. This is

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Mark Nelson
On 10/17/2017 01:54 AM, Wido den Hollander wrote: On 16 October 2017 at 18:14, Richard Hesketh <richard.hesk...@rd.bbc.co.uk> wrote: On 16/10/17 13:45, Wido den Hollander wrote: On 26 September 2017 at 16:39, Mark Nelson <mnel...@redhat.com> wrote: On 09/26/2017 01:10

Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Mark Nelson
Hi Jorge, I was sort of responsible for all of this. :) So basically there are different caches in different places: - rocksdb bloom filter and index cache - rocksdb block cache (which can be configured to include filters and indexes) - rocksdb compressed block cache - bluestore onode cache

Re: [ceph-users] BlueStore questions about workflow and performance

2017-10-03 Thread Mark Nelson
On 10/03/2017 07:59 AM, Alex Gorbachev wrote: Hi Sam, On Mon, Oct 2, 2017 at 6:01 PM Sam Huracan > wrote: Anyone can help me? On Oct 2, 2017 17:56, "Sam Huracan"

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-26 Thread Mark Nelson
, 2017 at 10:53 AM Dietmar Rieder <dietmar.rie...@i-med.ac.at <mailto:dietmar.rie...@i-med.ac.at>> wrote: On 09/25/2017 02:59 PM, Mark Nelson wrote: > On 09/25/2017 03:31 AM, TYLin wrote: >> Hi, >> >> To my understanding, the bluestore write workflo

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson
On 09/25/2017 05:02 PM, Nigel Williams wrote: On 26 September 2017 at 01:10, David Turner wrote: If they are on separate devices, then you need to make it as big as you need to to ensure that it won't spill over (or if it does that you're ok with the degraded

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson
Specifying one large DB partition per OSD will cover both uses. thanks, Ben On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder <dietmar.rie...@i-med.ac.at> wrote: On 09/21/2017 05:03 PM, Mark Nelson wrote: On 09/21/2017 03:17 AM, Dietmar Rieder wrote: On 09/21/2017 09:45 AM, Maged Mokh

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Mark Nelson
On 09/21/2017 03:17 AM, Dietmar Rieder wrote: On 09/21/2017 09:45 AM, Maged Mokhtar wrote: On 2017-09-21 07:56, Lazuardi Nasution wrote: Hi, I'm still looking for the answer of these questions. Maybe someone can share their thought on these. Any comment will be helpful too. Best regards,

Re: [ceph-users] luminous vs jewel rbd performance

2017-09-21 Thread Mark Nelson
Hi Rafael, In the original email you mentioned 4M block size, seq read, but here it looks like you are doing 4k writes? Can you clarify? If you are doing 4k direct sequential writes with iodepth=1 and are also using librbd cache, please make sure that librbd is set to writeback mode in both
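One way to check what librbd is actually doing on the client is to query a running client's admin socket, if one is enabled (the socket path below is only an example):

```bash
# Writeback caching is active when rbd_cache is true and rbd_cache_max_dirty > 0
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config get rbd_cache
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config get rbd_cache_max_dirty
```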

Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-21 Thread Mark Nelson
On 09/21/2017 03:19 AM, Maged Mokhtar wrote: On 2017-09-21 10:01, Dietmar Rieder wrote: Hi, I'm in the same situation (NVMEs, SSDs, SAS HDDs). I asked the same questions to myself. For now I decided to use the NVMEs as wal and db devices for the SAS HDDs and on the SSDs I colocate wal and

Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Mark Nelson
on the cluster is very low and I don't have to honour any guarantees about client performance - getting back into HEALTH_OK asap is preferable). Rich On 13/09/17 21:14, Mark Nelson wrote: Hi Richard, Regarding recovery speed, have you looked through any of Neha's results on recovery sleep testing

Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-13 Thread Mark Nelson
Hi Richard, Regarding recovery speed, have you looked through any of Neha's results on recovery sleep testing earlier this summer? https://www.spinics.net/lists/ceph-devel/msg37665.html She tested bluestore and filestore under a couple of different scenarios. The gist of it is that time to
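The knobs in question are the per-device-class recovery sleep options; as a sketch (option names as in Luminous and later), they can be inspected and temporarily relaxed like this:

```bash
# Current throttle values on one OSD
ceph daemon osd.NNN config get osd_recovery_sleep_hdd
ceph daemon osd.NNN config get osd_recovery_sleep_ssd

# Remove the throttle cluster-wide while client load is low (revert afterwards)
ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_ssd 0'
```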

Re: [ceph-users] Help with down OSD with Ceph 12.1.4 on Bluestore back

2017-08-29 Thread Mark Nelson
Hi Bryan, Check out your SCSI device failures, but if that doesn't pan out, Sage and I have been tracking this: http://tracker.ceph.com/issues/21171 There's a fix in place being tested now! Mark On 08/29/2017 05:41 PM, Bryan Banister wrote: Found some bad stuff in the messages file about

Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...

2017-08-23 Thread Mark Nelson
On 08/23/2017 07:17 PM, Mark Nelson wrote: On 08/23/2017 06:18 PM, Xavier Trilla wrote: Oh man, what do you know!... I'm quite amazed. I've been reviewing more documentation about min_replica_size and seems like it doesn't work as I thought (Although I remember specifically reading

Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...

2017-08-23 Thread Mark Nelson
On 08/23/2017 06:18 PM, Xavier Trilla wrote: Oh man, what do you know!... I'm quite amazed. I've been reviewing more documentation about min_replica_size and seems like it doesn't work as I thought (Although I remember specifically reading it somewhere some years ago :/ ). And, as all

Re: [ceph-users] Optimise Setup with Bluestore

2017-08-16 Thread Mark Nelson
Hi Mehmet! On 08/16/2017 11:12 AM, Mehmet wrote: :( no suggestions or recommendations on this? On 14 August 2017 16:50:15 MESZ, Mehmet wrote: Hi friends, my actual hardware setup per OSD-node is as follows: # 3 OSD-Nodes with - 2x Intel(R) Xeon(R) CPU

Re: [ceph-users] luminous/bluetsore osd memory requirements

2017-08-14 Thread Mark Nelson
On 08/14/2017 02:42 PM, Nick Fisk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ronny Aasen Sent: 14 August 2017 18:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] luminous/bluetsore osd memory requirements On 10.08.2017

Re: [ceph-users] BlueStore SSD

2017-08-14 Thread Mark Nelson
On 08/14/2017 12:52 PM, Ashley Merrick wrote: Hello, Hi Ashley! Currently run 10x4TB , 2xSSD for Journal, planning to move fully to BS, looking at adding extra servers. With the removal of the double write on BS and from the testing so far of BS (having WAL & DB on SSD Seeing very minimal

Re: [ceph-users] Squeezing Performance of CEPH

2017-06-22 Thread Mark Nelson
Hello Massimiliano, Based on the configuration below, it appears you have 8 SSDs total (2 nodes with 4 SSDs each)? I'm going to assume you have 3x replication and are using filestore, so in reality you are writing 3 copies and doing full data journaling for each copy, for 6x writes per

Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Mark Nelson
Hi Tyler, I wanted to make sure you got a reply to this, but unfortunately I don't have much to give you. It sounds like you already took a look at the disk metrics and ceph is probably not waiting on disk IO based on your description. If you can easily invoke the problem, you could attach

Re: [ceph-users] Lumionous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread Mark Nelson
> > Jake > > On 06/06/17 15:46, Mark Nelson wrote: >> Hi Jake, >> >> I just happened to notice this was on 12.0.3. Would it be possible to >> test this out with current master and see if it still is a problem? >
