Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

I bet you'd see better memstore results with my vector based object
implementation instead of bufferlists.


Where can I find it?


Nick Fisk noticed the same
thing you did.  One interesting observation he made was that disabling
CPU C/P states helped bluestore immensely in the iodepth=1 case.


This is exactly what I've done with `cpupower idle-set -D 0`. It basically 
increases iops 2-3 times.


Pipelined writes were added in rocksdb 5.5.1 back in the summer of 
2017.  That wasn't available when bluestore was being written.


In fact ... it slightly confuses me, because even now bluestore IS 
writing to rocksdb from multiple threads sometimes. It happens when 
bluestore_sync_submit_transaction is on, the write doesn't require 
aio (= when it's a deferred write) and several other conditions hold. 
In that case it calls db->submit_transaction directly from the 
tp_osd_tp thread.



Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 8:40 AM, vita...@yourcmc.ru wrote:

One way or another we can only have a single thread sending writes to
rocksdb.  A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible.
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concurrency.  What I think would be very interesting
though is if we moved more toward a model where we had lots of shards
(OSDs or shards of an OSD) with independent rocksdb instances and less
threading overhead per shard.  That's the way the seastar work is
going, and also sort of the model I've been thinking about for a very
simple single-threaded OSD.


Doesn't rocksdb have pipelined writes? Isn't it better to just use 
that built-in concurrency instead of building your own?



Pipelined writes were added in rocksdb 5.5.1 back in the summer of 
2017.  That wasn't available when bluestore was being written. We may be 
able to make use of it now but I don't think anyone has taken the time 
to figure out how much work it would take or what kind of benefit we 
would get.
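
For reference, on the RocksDB side it is just a DB option. Below is a 
minimal standalone sketch against plain librocksdb, not how BlueStore 
would actually wire it in; the path and key are made up for illustration:

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Added in the RocksDB 5.5 line: a later writer may start appending to
      // the WAL while an earlier writer is still inserting into the memtable.
      options.enable_pipelined_write = true;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/pipelined-test", &db);
      if (!s.ok()) return 1;

      rocksdb::WriteOptions wo;
      wo.sync = true;  // sync the WAL on every write, like a kv commit
      s = db->Put(wo, "key", "value");
      delete db;
      return s.ok() ? 0 : 1;
    }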



Mark



Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

One way or another we can only have a single thread sending writes to
rocksdb.  A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible. 
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concurrency.  What I think would be very interesting
though is if we moved more toward a model where we had lots of shards
(OSDs or shards of an OSD) with independent rocksdb instances and less
threading overhead per shard.  That's the way the seastar work is
going, and also sort of the model I've been thinking about for a very
simple single-threaded OSD.


Doesn't rocksdb have pipelined writes? Isn't it better to just use that 
built-in concurrency instead of building your own?



Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 7:31 AM, vita...@yourcmc.ru wrote:

Decreasing the min_alloc size isn't always a win, but it can be in some
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification).  Since then we've improved the metadata path to the
point where at least on our test nodes performance is pretty close
between min_alloc size = 16k and min_alloc size = 4k the last
time I looked.  It might be a good idea to drop it down to 4k now but
I think we need to be careful because there are tradeoffs.


I think it's all about your disks' latency. Deferred write is 1 
IO+sync and redirect-write is 2 IOs+syncs. So if your IO or sync is 
slow (like it is on HDDs and bad SSDs) then the deferred write is 
better in terms of latency. If your IO is fast then you're only 
bottlenecked by the OSD code itself eating a lot of CPU and then 
direct write may be better. By the way, I think the OSD itself is way 
TOO slow currently (see below).



Don't disagree, bluestore's write path has gotten *really* complicated.




The idea I was talking about turned out to be only viable for HDDs/slow 
SSDs and only for low iodepths. But the gain is huge - somewhere 
between +50% and +100% iops (2x lower latency). There is a stupid 
problem in the current bluestore implementation that makes it do two 
journal writes and fsyncs instead of one for every incoming 
transaction. The details are here: https://tracker.ceph.com/issues/38559


The unnecessary commit is the BlueFS WAL. All it's doing is 
recording the increased size of a RocksDB WAL file, which obviously 
shouldn't be required with RocksDB, as its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever 
logic sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted 
when you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.


I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - 
as I understand it, there is currently no benefit in doing this. It could 
be a benefit if RocksDB were writing the journal in small parts and then 
doing a single sync - but it always flushes the newly written part 
of the journal to disk as a whole.


The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. 
My pull request is here: https://github.com/ceph/ceph/pull/26909 - 
I've tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, 
it does increase single-thread iops on HDDs two times (!). After this 
change BlueStore actually becomes better than FileStore, at least on HDDs.


Another way of fixing it would be to add an explicit bdev->flush at 
the end of the kv_sync_thread, after db->submit_transaction_sync(), 
and possibly remove the redundant sync_file_range entirely. But then you 
must do the same in another place, in _txc_state_proc, because it also 
sometimes does submit_transaction_sync(). In the end I personally think 
that adding flags to sync_file_range is better, because a function named 
"submit_transaction_sync" should in fact be SYNC! It shouldn't require 
additional steps from the caller to make the data durable.



I'm glad you are peeking under the covers here. :)  There's a lot going 
on here, and it's not immediately obvious what the intent is and what the 
failure conditions are.  I suspect the intent here was to err on the 
side of caution, but we really need to document this better.  To be fair 
it's not just us, there's confusion and terribleness all the way up to 
the kernel and beyond.





Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.


The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.


If we think of memstore as a "minimal possible /dev/null", then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that 
doesn't matter).

- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms

The conclusion is that bluestore is actually almost TWO TIMES slower 
than filestore in terms of pure latency, and the throughput is only 
slightly better.

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

Decreasing the min_alloc size isn't always a win, but it can be in some
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification).  Since then we've improved the metadata path to the
point where at least on our test nodes performance is pretty close
between min_alloc size = 16k and min_alloc size = 4k the last
time I looked.  It might be a good idea to drop it down to 4k now but
I think we need to be careful because there are tradeoffs.


I think it's all about your disks' latency. Deferred write is 1 IO+sync 
and redirect-write is 2 IOs+syncs. So if your IO or sync is slow (like 
it is on HDDs and bad SSDs) then the deferred write is better in terms 
of latency. If your IO is fast then you're only bottlenecked by the OSD 
code itself eating a lot of CPU and then direct write may be better. By 
the way, I think the OSD itself is way TOO slow currently (see below).


The idea I was talking about turned out to be only viable for HDDs/slow 
SSDs and only for low iodepths. But the gain is huge - somewhere between 
+50% and +100% iops (2x lower latency). There is a stupid problem in 
the current bluestore implementation that makes it do two journal writes and 
fsyncs instead of one for every incoming transaction. The details are 
here: https://tracker.ceph.com/issues/38559


The unnecessary commit is the BlueFS WAL. All it's doing is recording 
the increased size of a RocksDB WAL file, which obviously shouldn't be 
required with RocksDB, as its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever logic 
sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted when 
you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.
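
For reference, the setting I mean is the RocksDB WAL recovery mode. A 
minimal sketch of where it lives, against plain librocksdb options (not 
the actual BlueFS/BlueStore wiring; you normally don't set it at all 
because it is already the default):

    #include <rocksdb/options.h>

    rocksdb::Options make_options_sketch() {
      rocksdb::Options options;
      // Already the RocksDB default - shown only to make explicit which
      // setting is meant above: a torn record at the tail of the WAL is
      // tolerated (dropped) during recovery instead of being treated as
      // unrecoverable corruption.
      options.wal_recovery_mode =
          rocksdb::WALRecoveryMode::kTolerateCorruptedTailRecords;
      return options;
    }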


I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - as 
I understand it, there is currently no benefit in doing this. It could be a 
benefit if RocksDB were writing the journal in small parts and then doing a 
single sync - but it always flushes the newly written part of the 
journal to disk as a whole.


The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. My 
pull request is here: https://github.com/ceph/ceph/pull/26909 - I've 
tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, it 
does increase single-thread iops on HDDs two times (!). After this 
change BlueStore actually becomes better than FileStore, at least on 
HDDs.
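
The change itself boils down to the flag combination below (a sketch of 
the idea, not the literal KernelDevice.cc diff; the helper name 
flush_range_and_wait is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <errno.h>

    // Start writeback of the byte range AND wait for it to complete (plus
    // any writeback of that range already in flight), instead of only
    // initiating it with SYNC_FILE_RANGE_WRITE as before.
    static int flush_range_and_wait(int fd, off_t offset, off_t nbytes)
    {
        int r = sync_file_range(fd, offset, nbytes,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
        return r < 0 ? -errno : 0;
    }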


Another way of fixing it would be to add an explicit bdev->flush at the 
end of the kv_sync_thread, after db->submit_transaction_sync(), and 
possibly remove the redundant sync_file_range entirely. But then you must 
do the same in another place, in _txc_state_proc, because it also 
sometimes does submit_transaction_sync(). In the end I personally think 
that adding flags to sync_file_range is better, because a function named 
"submit_transaction_sync" should in fact be SYNC! It shouldn't require 
additional steps from the caller to make the data durable.
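
In sketch form the alternative would be roughly this (KeyValueDB, 
BlockDevice, submit_transaction_sync and flush are the names mentioned 
above; the include paths and the simplified signature are my assumption 
for illustration, not real BlueStore code):

    #include "kv/KeyValueDB.h"              // paths as in the Ceph source tree
    #include "os/bluestore/BlockDevice.h"

    // Rough sketch of the alternative: flush the block device explicitly
    // right after the synchronous RocksDB commit in the kv sync path, so the
    // sync_file_range() flags inside KernelDevice would no longer matter.
    void kv_commit_and_flush_sketch(KeyValueDB *db, BlockDevice *bdev,
                                    KeyValueDB::Transaction t)
    {
      db->submit_transaction_sync(t);  // commit + sync the RocksDB WAL
      bdev->flush();                   // explicit device flush, so a caller of
                                       // "submit_transaction_sync" really gets
                                       // durable data without extra steps
    }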


Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.


The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.


If we think of memstore as a "minimal possible /dev/null", then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that doesn't 
matter).

- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms

The conclusion is that bluestore is actually almost TWO TIMES slower 
than filestore in terms of pure latency, and the throughput is only 
slightly better. How could that happen? How could a newly written store 
become two times slower than the old one? That's pretty annoying...


Could it be because bluestore is doing a lot of threading? I mean could 
it be because each write operation goes through 5 threads during its 
execution? (tp_osd_tp -> aio -> kv_sync_thread -> kv_finalize_thread -> 
finisher)? Maybe just remove the aio and kv threads and process all 
operations directly in tp_osd_tp then?


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Stefan Priebe - Profihost AG

On 06.03.19 at 14:08, Mark Nelson wrote:
> 
> On 3/6/19 5:12 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Mark,
>> On 05.03.19 at 23:12, Mark Nelson wrote:
>>> Hi Stefan,
>>>
>>>
>>> Could you try running your random write workload against bluestore and
>>> then take a wallclock profile of an OSD using gdbpmp? It's available
>>> here:
>>>
>>>
>>> https://github.com/markhpc/gdbpmp
>> sure but it does not work:
>>
>>
>> # ./gdbpmp.py -p 3760442 -n 100 -o gdbpmp.data
>> Attaching to process 3760442...0x7f917b6a615f in
>> pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>>
>> Thread 1 "ceph-osd" received signal SIGCONT, Continued.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> Done.
>> Gathering Samples
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 2 "log" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 2 "log" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 2 "log" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 2 "log" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0
>> .
>> Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
>> 0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
>> target:/lib/x86_64-linux-gnu/libpthread.so.0

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson


On 3/6/19 5:12 AM, Stefan Priebe - Profihost AG wrote:

Hi Mark,
On 05.03.19 at 23:12, Mark Nelson wrote:

Hi Stefan,


Could you try running your random write workload against bluestore and
then take a wallclock profile of an OSD using gdbpmp? It's available here:


https://github.com/markhpc/gdbpmp

sure but it does not work:


# ./gdbpmp.py -p 3760442 -n 100 -o gdbpmp.data
Attaching to process 3760442...0x7f917b6a615f in
pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0

Thread 1 "ceph-osd" received signal SIGCONT, Continued.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
Done.
Gathering Samples
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f 

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Mark Nelson


On 3/5/19 4:23 PM, Vitaliy Filippov wrote:
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch 
IO, or just fio -ioengine=rbd from outside a VM) is rather pointless - 
you're benchmarking the RBD cache, not Ceph itself. RBD cache is 
coalescing your writes into big sequential writes. Of course bluestore 
is faster in this case - it has no double write for big writes.


I'll probably try to test these settings - I'm also interested in 
random write iops in an all-flash bluestore cluster :) but I don't 
think any rocksdb options will help. I found bluestore pretty 
untunable in terms of performance :)



For random writes, you often end up bottlenecked in the kv sync thread 
so long as you aren't generally CPU bound.  Anything you can do to 
reduce the work being done in the kv sync thread usually helps.  A big 
one is making sure you are hitting onodes in the bluestore cache rather 
than the rocksdb cache or disk, i.e. having enough onode cache available for 
the dataset being benchmarked.





The best thing to do for me was to disable CPU powersaving (set the 
governor to performance + cpupower idle-set -D 1). Your CPUs become 
frying pans, but write IOPS - especially single-thread write IOPS, which 
are the worst-case scenario AND at the same time the thing 
applications usually need - increase 2-3 times. Test it with fio 
-ioengine=rbd -bs=4k -iodepth=1.



Yep, this is a big one.  I've asked vendors for clarification on whether we 
can actually recommend doing this, but haven't gotten a clear answer yet. :/





Another thing that I've done on my cluster was to set 
`bluestore_min_alloc_size_ssd` to 4096. The reason to do that is that 
it's 16kb by default which means all writes below 16kb use the same 
deferred write path as with HDDs. Deferred writes only increase WA 
factor for SSDs and lower the performance. You have to recreate OSDs 
after changing this variable - it's only applied at the time of OSD 
creation.



Decreasing the min_alloc size isn't always a win, but it can be in some 
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we 
increased it to 16384 because at the time our metadata path was slow and 
increasing it resulted in a pretty significant performance win (along 
with increasing the WAL buffers in rocksdb to reduce write 
amplification).  Since then we've improved the metadata path to the 
point where at least on our test nodes performance is pretty close 
between min_alloc size = 16k and min_alloc size = 4k the last time 
I looked.  It might be a good idea to drop it down to 4k now but I think 
we need to be careful because there are tradeoffs.



You can see some of the original work we did in 2016 looking at this on 
our performance test cluster here:



https://docs.google.com/spreadsheets/d/1YPiiDu0IxQdB4DcVVz8WON9CpWX9QOy5r-XmYJL0Sys/edit?usp=sharing


And follow-up work in 2017 here:


https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing


It might be time to revisit again.




I'm also currently trying another performance fix, kind of... but it 
involves patching ceph's code, so I'll share it later if I succeed.



Would you consider sharing what your idea is?  There are absolutely 
areas where performance can be improved, but oftentimes they involve 
tradeoffs in some respect.






Hello list,

While the performance of 4k sequential writes on bluestore is very high
and even higher than filestore, I was wondering what I can do to optimize
random patterns as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.





Re: [ceph-users] optimize bluestore for random write i/o

2019-03-06 Thread Stefan Priebe - Profihost AG
Hi Mark,
On 05.03.19 at 23:12, Mark Nelson wrote:
> Hi Stefan,
> 
> 
> Could you try running your random write workload against bluestore and
> then take a wallclock profile of an OSD using gdbpmp? It's available here:
> 
> 
> https://github.com/markhpc/gdbpmp

sure but it does not work:


# ./gdbpmp.py -p 3760442 -n 100 -o gdbpmp.data
Attaching to process 3760442...0x7f917b6a615f in
pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0

Thread 1 "ceph-osd" received signal SIGCONT, Continued.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
Done.
Gathering Samples
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 2 "log" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from
target:/lib/x86_64-linux-gnu/libpthread.so.0
.
Thread 1 "ceph-osd" received signal SIGINT, Interrupt.
0x7f917b6a615f in pthread_cond_wait@@GLIBC_2.3.2 () from

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Vitaliy Filippov
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch IO,  
or just fio -ioengine=rbd from outside a VM) is rather pointless - you're  
benchmarking the RBD cache, not Ceph itself. RBD cache is coalescing your  
writes into big sequential writes. Of course bluestore is faster in this  
case - it has no double write for big writes.


I'll probably try to test these settings - I'm also interested in random  
write iops in an all-flash bluestore cluster :) but I don't think any  
rocksdb options will help. I found bluestore pretty untunable in terms of  
performance :)


The best thing to do for me was to disable CPU powersaving (set the governor 
to performance + cpupower idle-set -D 1). Your CPUs become frying pans, but 
write IOPS - especially single-thread write IOPS, which are the worst-case 
scenario AND at the same time the thing applications usually need - increase 
2-3 times. Test it with fio -ioengine=rbd -bs=4k -iodepth=1.


Another thing that I've done on my cluster was to set  
`bluestore_min_alloc_size_ssd` to 4096. The reason to do that is that it's  
16kb by default which means all writes below 16kb use the same deferred  
write path as with HDDs. Deferred writes only increase WA factor for SSDs  
and lower the performance. You have to recreate OSDs after changing this  
variable - it's only applied at the time of OSD creation.


I'm also currently trying another performance fix, kind of... but it  
involves patching ceph's code, so I'll share it later if I succeed.



Hello list,

While the performance of 4k sequential writes on bluestore is very high
and even higher than filestore, I was wondering what I can do to optimize
random patterns as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Mark Nelson

Hi Stefan,


Could you try running your random write workload against bluestore and 
then take a wallclock profile of an OSD using gdbpmp? It's available here:



https://github.com/markhpc/gdbpmp


Thanks,

Mark


On 3/5/19 2:29 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

While the performance of 4k sequential writes on bluestore is very high
and even higher than filestore, I was wondering what I can do to optimize
random patterns as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.

Greets,
Stefan


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Stefan Priebe - Profihost AG
On 05.03.19 at 10:05, Paul Emmerich wrote:
> This workload is probably bottlenecked by rocksdb (since the small
> writes are buffered there), so that's probably what needs tuning here.

While reading:
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2018/20180807_INVT-101A-1_Meredith.pdf

there is also a mention of:
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10

which sounds like the backported feature of ceph 12.2.11 to limit pg log
entries. Can anybody confirm?

They also changed:
bluestore_rocksdb_options



> Paul
> 


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Paul Emmerich
This workload is probably bottlenecked by rocksdb (since the small
writes are buffered there), so that's probably what needs tuning here.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Mar 5, 2019 at 9:29 AM Stefan Priebe - Profihost AG wrote:
>
> Hello list,
>
> While the performance of 4k sequential writes on bluestore is very high
> and even higher than filestore, I was wondering what I can do to optimize
> random patterns as well.
>
> While using:
> fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
> --filename=/tmp/test --size=10G --runtime=60 --group_reporting
> --name=test --direct=1
>
> I get 36000 iop/s on bluestore while having 11500 on filestore.
>
> Using randwrite gives me 17000 on filestore and only 9500 on bluestore.
>
> This is on all flash / ssd running luminous 12.2.10.
>
> Greets,
> Stefan