Re: [ceph-users] Consumer-grade SSD in Ceph

2020-01-03 Thread Vitaliy Filippov

We didn’t find a measurable difference doing this on 5100s, ymmv.


It depends on the controller...

With chipset SATA and LSI 9200 HBA the difference is huge. I have some  
evidence here: https://yourcmc.ru/wiki/Ceph_performance#Server_SSDs


With some controllers this may not be the case.

--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Benchmark diffrence between rados bench and rbd bench

2019-12-29 Thread Vitaliy Filippov

rados bench -p scbench 60 seq --io-size 8192 --io-threads 256
Read size:4194304


rados bench doesn't have an --io-size option

testing sequential read with 8K I/O size is a strange idea anyway though
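
If you really need to test 8K reads, fio's rbd engine is probably the easier tool - a sketch, with pool and image names as placeholders (for write tests, rados bench does accept a block size via -b, as far as I remember):

fio -ioengine=rbd -direct=1 -name=test -bs=8k -iodepth=16 -rw=read -pool=<pool> -runtime=60 -rbdname=testimg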

--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] ssd requirements for wal/db

2019-10-04 Thread Vitaliy Filippov
WAL/DB isn't "read intensive". It's more "write intensive" :) use server  
SSDs with capacitors to get adequate write performance.
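
A quick way to check whether a given SSD is good enough for WAL/DB is to measure its single-threaded sync write iops directly - a sketch (destructive if pointed at a raw device, so use a spare one; the device name is an example):

fio -ioengine=libaio -sync=1 -direct=1 -name=journal-test -bs=4k -iodepth=1 -rw=write -runtime=60 -filename=/dev/sdX

Server SSDs with capacitors typically show tens of thousands of iops here, while consumer drives often drop to a few hundred or a couple of thousand.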



Hi all,

We are thinking about putting the WAL/DB of our HDDs on SSDs. If we
put the WAL&DB of 4 HDDs on 1 SSD as recommended, what type of SSD would
suffice?
We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.

Does someone have experience with this configuration? Would we need
SAS SSDs instead of SATA? And Mixed Use 3 DWPD instead of Read Intensive?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] latency on OSD

2019-08-19 Thread Vitaliy Filippov

We recommend SSDs.


Hi all,
I have installed Ceph Luminous, with 5 nodes (45 OSDs)

* 5 ceph-osd
  network: bond lacp 10GB
  RAM: 96GB
  HD: 9 disk SATA-3TB (bluestore)

I wanted to ask for help fixing the OSD latency shown by "ceph osd perf".

What would you recommend?


My config is:

/etc/ceph/ceph.conf

[global]
fsid = 414507dd-8a16-4548-86b7-906b0c9905e1
mon_initial_members = controller01,controller02,controller03
mon_host = 192.168.13.11,192.168.13.12,192.168.13.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public network = 192.168.13.0/24
cluster network = 192.168.10.0/24

osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_flag_hashpspool = true

[osd]
osd_scrub_begin_hour = 22
osd_scrub_end_hour = 6


---
ceph osd perf

osd  commit_latency(ms)  apply_latency(ms)
  0                  49                 49
  1                 120                120
  2                  36                 36
  3                  65                 65
  4                  19                 19
  5                  57                 57
  6                 112                112
  7                  53                 53
  8                 159                159
  9                 226                226
 10                  21                 21
 11                  79                 79
 12                  50                 50
 13                 133                133
 14                 105                105
 15                  65                 65
 16                  32                 32
 17                  64                 64
 18                  62                 62
 19                  78                 78
 20                  71                 71
 21                  97                 97
 22                 168                168
 23                 108                108
 24                 119                119
 25                 219                219
 26                 144                144
 27                  26                 26
 28                  76                 76
 29                 176                176
 30                  23                 23
 31                  91                 91
 32                  30                 30
 33                  64                 64
 34                  21                 21
 35                  73                 73
 36                 124                124
 37                  85                 85
 38                  39                 39
 39                  36                 36
 40                  27                 27
 41                  33                 33
 42                  49                 49
 43                  22                 22
 44                  44                 44





--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Vitaliy Filippov

/dev/vdb:
 Timing cached reads:   2556 MB in  1.99 seconds = 1281.50 MB/sec
 Timing buffered disk reads:  62 MB in  3.03 seconds =  20.48 MB/sec


That is without any special tuning, just migrating back to FileStore…
journal is on the HDD (it wouldn't let me put it on the SSD like it did
last time).

As I say, not going to set the world on fire, but 20MB/sec is quite
usable for my needs.  The 4× speed increase is very welcome!


I get 60 MB/s inside a VM in my home nano-Ceph consisting of 5 HDDs, 4 of
which are inside one PC and the 5th is plugged into a ROCK64 :)) I use
Bluestore...


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] New best practices for osds???

2019-07-24 Thread Vitaliy Filippov

One RAID0 array per drive :)


I can't understand how using RAID0 is better than JBOD, considering jbod
would be many individual disks, each used as OSDs, instead of a single  
big one used as a single OSD.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] New best practices for osds???

2019-07-22 Thread Vitaliy Filippov
OK, I meant "it may help performance" :) the main point is that we had at  
least one case of data loss due to some Adaptec controller in RAID0 mode  
discussed recently in our ceph chat...


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] New best practices for osds???

2019-07-22 Thread Vitaliy Filippov
It helps performance, but it can also lead to data loss if the raid  
controller is crap (not flushing data correctly)


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Future of Filestore?

2019-07-22 Thread Vitaliy Filippov

Linear reads, `hdparm -t /dev/vda`.


Check if you have `cache=writeback` enabled in your VM options.
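
For a libvirt-managed VM that's the disk's <driver> element - a minimal sketch, assuming a raw RBD-backed disk:

<driver name='qemu' type='raw' cache='writeback'/>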

If it's enabled but you still get 5mb/s then try to benchmark your cluster  
with fio -ioengine=rbd from outside a VM.


Like

fio -ioengine=rbd -name=test -bs=4M -iodepth=16 -rw=read -pool=rpool  
-runtime=60 -rbdname=testimg


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Future of Filestore?

2019-07-22 Thread Vitaliy Filippov

5MB/s in what mode?

For linear writes, that definitely means some kind of misconfiguration.  
For random writes... there's a handbrake in Bluestore which makes random  
writes run at half speed in HDD-only setups :)  
https://github.com/ceph/ceph/pull/26909


And if you push that handbrake down you actually get better random writes  
on HDDs with bluestore, too.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] BlueFS spillover detected - 14.2.1

2019-06-19 Thread Vitaliy Filippov
All values except 4, 30 and 286 GB are currently useless in ceph with  
default rocksdb settings :)


That's what you are seeing - all devices just use ~28 GB and everything  
else goes to HDDs.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Recommended fs to use with rbd

2019-03-31 Thread Vitaliy Filippov
...which only works when mapped with `virtio-scsi` (not with the regular  
virtio driver) :)
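
A hedged sketch of what the relevant libvirt disk definition might look like (names here are made up; the important bits are bus='scsi' on the target, a virtio-scsi controller and discard='unmap' on the driver):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
  <source protocol='rbd' name='rbdpool/vm-disk'/>
  <target dev='sda' bus='scsi'/>
</disk>
<controller type='scsi' model='virtio-scsi'/>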



The only important thing is to enable discard/trim on the file system.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Vitaliy Filippov

`cpupower idle-set -D 0` will help you a lot, yes.

However it seems that not only the bluestore makes it slow. >= 50% of the  
latency is introduced by the OSD itself. I'm just trying to understand  
WHAT parts of it are doing so much work. For example in my current case  
(with cpupower idle-set -D 0 of course) when I was testing a single OSD on  
a very good drive (an Intel NVMe capable of 40000+ single-thread sync write
iops) it was delivering me only 950-1000 iops. That's roughly 1 ms latency,
and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x
perf dump`)! I've even tuned bluestore a little, so that now I'm getting
~1200 iops from it. That means bluestore's share of the latency dropped by about 33%
(it was around half of 1/1000 s = ~500 us, now it's ~330 us out of 1/1200 s). But still the
overall improvement is only 20% - everything else is eaten by the OSD  
itself.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-11 Thread Vitaliy Filippov
These options aren't needed, numjobs is 1 by default and RBD has no "sync"  
concept at all. Operations are always "sync" by default.


In fact even --direct=1 may be redundant because there's no page cache  
involved. However I keep it just in case - there is the RBD cache, what if  
one day fio gets it enabled? :)



how about adding:  --sync=1 --numjobs=1  to the command as well?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

There are 2:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite  
-pool=bench -rbdname=testimg


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite  
-pool=bench -rbdname=testimg


The first measures your min possible latency - it does not scale with the  
number of OSDs at all, but it's usually what real applications like DBMSes  
need.


The second measures your max possible random write throughput which you  
probably won't be able to utilize if you don't have enough VMs all writing  
in parallel.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

Welcome to our "slow ceph" party :)))

However I have to note that:

1) The ~500000 iops figure is for 4 KB blocks. You're testing it with 4 MB ones.
That's kind of an unfair comparison.


2) fio -ioengine=rbd is better than rados bench for testing.

3) You can't "compensate" for Ceph's overhead even by having infinitely  
fast disks.


At its simplest, imagine that disk I/O takes X microseconds and Ceph's  
overhead is Y for a single operation.


Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph
IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800
microseconds or so. So the best IOPS number you can squeeze out of a
single client thread (a DBMS, for example) is 1000000/400 = only ~2500
iops.


Parallel iops are of course better, but still you won't get anything close
to the rated ~500000 iops from a single OSD. The expected number is around 15000.
Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you  
want better results.
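
Recent ceph-volume versions can carve up a device in one go - a sketch, assuming 4 OSDs per NVMe (flag availability depends on your Ceph release):

ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1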


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Mounting image from erasure-coded pool without tiering in KVM

2019-03-06 Thread Vitaliy Filippov

Check if you have a recent enough librbd installed on your VM hosts.


Hello, all!
I have a problem with adding image volumes to my KVM VM.
I prepared erasure coded pool (named data01) on full-bluestore OSDs and
allowed ec_overwrites on it. Also i created replicated pool for image
volumes metadata named ssd-repl.

Pools were prepared by:
ceph osd pool create data01 1024 1024 erasure 2-1-isa-v
ceph osd pool set data01 allow_ec_overwrites true
rbd pool init data01

Image was created using:
rbd create --size 25G --data-pool data01 ssd-repl/vm-5

Image info:
[ceph@alfa-csn-01 ~]$ rbd info ssd-repl/vm-5
rbd image 'vm-5':
   size 25 GiB in 6400 objects
   order 22 (4 MiB objects)
   id: a20c46b8b4567
   data_pool: data01
   block_name_prefix: rbd_data.21.a20c46b8b4567
   format: 2
   features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten, data-pool
   op_features:
   flags:
   create_timestamp: Tue Mar  5 16:51:59 2019

So it seems all should work.
But when I try to run the VM with this disk attached I'm getting the following
error:
root@alfa-cpu-02:~# virsh start vm-5
error: Failed to start domain vm-5
error: internal error: process exited while connecting to monitor:
2019-03-05T13:53:30.020525Z qemu-system-x86_64: -drive
file=rbd:ssd-repl/vm-5:id=libvirt:key=AQBD5GJc40bjN
hAA7qV6hZYumI7FUDkhElxMYw==:auth_supported=cephx\;none:mon_host=10.212.3.161\:6789,format=raw,if=none,id=drive-virtio-disk1:
error reading header from vm-5

XML config for this volume from my VM:

(disk XML elided - the markup was stripped by the list archive)

If I create the whole image in a replicated pool then all works as
expected:

I can connect to and work with this disk inside the VM.
What could be the reason for such behavior?
What did I miss in the configuration?

Thanks in advance!


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-05 Thread Vitaliy Filippov
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch IO,  
or just fio -ioengine=rbd from outside a VM) is rather pointless - you're  
benchmarking the RBD cache, not Ceph itself. RBD cache is coalescing your  
writes into big sequential writes. Of course bluestore is faster in this  
case - it has no double write for big writes.
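
A more representative in-VM test forces the writes to actually reach the cluster - for example something like this (size and filename here are just an example):

fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=32 -rw=randwrite -size=10G -runtime=60 -filename=/tmp/test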


I'll probably try to test these settings - I'm also interested in random  
write iops in an all-flash bluestore cluster :) but I don't think any  
rocksdb options will help. I found bluestore pretty untunable in terms of  
performance :)


The best thing to do for me was to disable CPU powersaving (set the governor
to performance + cpupower idle-set -D 1). Your CPUs become frying pans, but
write IOPS (especially single-thread write IOPS, which are the worst-case
scenario AND at the same time the thing applications usually need) increase
2-3 times. Test it with fio -ioengine=rbd -bs=4k -iodepth=1.
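
On most distros disabling powersave boils down to something like the following (exact tooling varies, this is only a sketch):

cpupower frequency-set -g performance
cpupower idle-set -D 1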


Another thing that I've done on my cluster was to set  
`bluestore_min_alloc_size_ssd` to 4096. The reason to do that is that it's  
16kb by default which means all writes below 16kb use the same deferred  
write path as with HDDs. Deferred writes only increase WA factor for SSDs  
and lower the performance. You have to recreate OSDs after changing this  
variable - it's only applied at the time of OSD creation.
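
In ceph.conf that would be something like the following (again, it only takes effect for OSDs created after the change):

[osd]
bluestore_min_alloc_size_ssd = 4096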


I'm also currently trying another performance fix, kind of... but it  
involves patching ceph's code, so I'll share it later if I succeed.



Hello list,

while the performance of sequential writes 4k on bluestore is very high
and even higher than filestore i was wondering what i can do to optimize
random pattern as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 iop/s on bluestore while having 11500 on filestore.

Using randwrite gives me 17000 on filestore and only 9500 on bluestore.

This is on all flash / ssd running luminous 12.2.10.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] RBD poor performance

2019-03-05 Thread Vitaliy Filippov
Yes (I mean yes, it's real). Ceph's tiering works by moving whole (4MB)  
objects to the cache pool, updating them there (with 4K random writes?)  
and evicting them back when cache is full. I.e. the bad part here is that  
it can't do "write-through".


Also there are some configuration options regarding the eviction process,  
you can try to tune them. But don't expect the basis to change: when the  
cache pool is full, Ceph will still need to evict something from there.


Why do you want cache tiering at all? Just use `allow_ec_overwrites=true`  
if you're using EC and mount your RBD/CephFS directly without a cache pool.
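
Roughly like this (pool names are placeholders; you still need a small replicated pool to hold the RBD metadata):

ceph osd pool set <ecpool> allow_ec_overwrites true
rbd create --size 100G --data-pool <ecpool> <replicated-pool>/<image>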



During one of my tests I found that fio inside my VM generates 1 MiB/s
(about 150 IOPS), but `ceph -s` shows me 500 MiB/s of flushing and 280
MiB/s of evicting data. How can that be? Is it real? Do you have any
optimization policies inside Ceph to eliminate such behaviour?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Right way to delete OSD from cluster?

2019-03-01 Thread Vitaliy Filippov
+1, I also think it's strange that deleting an OSD by "osd out -> osd
purge" causes two rebalances instead of one.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] RBD poor performance

2019-02-27 Thread Vitaliy Filippov
By "maximum write iops of an OSD" I mean total iops divided by the number
of OSDs. For example, an expensive setup from Micron
(https://www.micron.com/about/blog/2018/april/micron-9200-max-red-hat-ceph-storage-30-reference-architecture-block-performance)
has got only 8750 peak write iops per NVMe. These exact NVMes they used
are rated for 260000+ iops when connected directly :). CPU is a real
bottleneck. The need for a Seastar-based rewrite is not a joke! :)


Total iops is the number coming from a test like:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite  
-pool=<pool> -runtime=60 -rbdname=testimg


...or from several such jobs run in parallel each over a separate RBD  
image.


This is a "random write bandwidth" test, and, in fact, it's not the most  
useful one - the single-thread latency usually does matter more than just  
total bandwidth. To test for it, run:


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite  
-pool=<pool> -runtime=60 -rbdname=testimg


You'll get a pretty low number (< 100 for HDD clusters, 500-1000 for SSD  
clusters). It's as expected that it's low. Everything above 1000 iops (<  
1ms latency, single-thread iops = 1 / avg latency) is hard to achieve with  
Ceph no matter what disks you're using. Also single-thread latency does  
not depend on the number of OSDs in the cluster, because the workload is  
not parallel.


However you can also test iops of single OSDs by creating a pool with  
size=1 and using a custom benchmark tool we've made with our colleagues  
from a russian Ceph chat... we can publish it here a short time later if  
you want :).
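
Creating such a pool is simple enough (the name and PG count are arbitrary; newer releases may demand an extra confirmation flag for size=1):

ceph osd pool create bench-single 32 32 replicated
ceph osd pool set bench-single size 1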



At some point I would expect the CPU to be the bottleneck. They have
always been saying here that for better latency you should get fast CPUs.
Would be nice to know what GHz you are testing, and how that scales. Rep
1-3, erasure probably also takes a hit.
How do you test the maximum iops of the OSD? (Just curious, so I can test
mine)

I posted here a while ago a CephFS test on SSD rep 1 that was
performing nowhere near native, asking if this was normal. But I never got
a response to it. I can remember that they sent everyone a questionnaire
and asked if they should focus on performance more; now I wish I had
checked that box ;)


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] RBD poor performance

2019-02-27 Thread Vitaliy Filippov
To me it seems Ceph's iops limit is ~10000 (maybe 15000 with BIS hardware)
per OSD. After that number it starts to get stuck on CPU.


I've tried to create a pool from 3 OSDs in loop devices over tmpfs and  
I've only got ~15000 iops :) good disks aren't the bottleneck, CPU is.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-27 Thread Vitaliy Filippov
I think this should not lead to blocked ops in any case, even if the  
performance is low...


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread Vitaliy Filippov
I've tried 4x OSD on fast SAS SSDs in a test setup with only 2 such drives
in the cluster - it increased CPU consumption a lot, but total 4Kb random
write iops (RBD) only went from ~11000 to ~22000. So it was a 2x increase,
but at a huge cost.



One thing that's worked for me to get more out of nvmes with Ceph is to
create multiple partitions on the nvme with an osd on each partition.  
That

way you get more osd processes and CPU per nvme device. I've heard of
people using up to 4 partitions like this.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread Vitaliy Filippov

We can get 513558 IOPS in 4K read per NVMe with fio, but only 45146 IOPS
per OSD with rados.


Don't expect Ceph to fully utilize NVMes, it's software and it's slow :)
some colleagues say that SPDK works out of the box but almost doesn't
increase performance, because the userland-kernel interaction isn't the
bottleneck currently, it's Ceph code itself. I also tried it once, but I
couldn't make it work. When I have some spare NVMes I'll make another
attempt.


So... try it and share your results here :) we're all interested.

--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Bluestore HDD Cluster Advice

2019-02-23 Thread Vitaliy Filippov

Hello,

What IO size are you testing? Bluestore will only defer writes under
32kb in size by default. Unless you are writing sequentially,
only a limited amount of buffering via SSD is going to help; you will
eventually hit the limits of the disk. Could you share some more
details, as I'm interested in this topic as well.


I'm testing 4kb random writes, mostly with iodepth=1 (single-thread  
latency test). This is the main case which is expected to be sped up by  
the SSD journal and also the worst case for SDS's :).


Interesting, will have to investigate this further!!! I wish there were  
more details around this technology from HGST


It's simple to test yourself - similar thing is currently common in SMR  
drives. Pick a random cheap 2.5" 1TB Seagate SMR HDD and test it with fio  
with one of `sync` or `fsync` options and iodepth=32 - you'll see it  
handles more than 1000 random 4Kb write iops. It only handles so much  
until its buffer is full of course. When I tested one of these I found  
that the buffer was 8 GB. After writing 8 GB the performance drops to  
~30-50 iops, and when the drive is idle it starts to flush the buffer.  
This process takes a lot of time if the buffer is full (several hours).
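
Something like this shows the effect (assuming /dev/sdX is a disposable test drive - this writes to it directly):

fio -ioengine=libaio -fsync=1 -direct=1 -name=test -bs=4k -iodepth=32 -rw=randwrite -runtime=60 -filename=/dev/sdX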


The difference between 2.5 SMR seagates and HGSTs is that HGSTs only  
enable "media cache" when the volatile cache is disabled (which was a real  
surprise to me), and SMRs keep it enabled all the time.


But the thing that really confused me was that Bluestore random write  
performance - even single-threaded write performance (latency test) -  
changed when I altered the parameter of the DATA device (not journal)! WHY  
was it affected? Based on common sense and bluestore's documentation  
random deferred write commit time when the system is not under load (and  
with iodepth=1 it isn't) should only depend on the WAL device performance!  
But it's also affected by the data device which tells us there is some  
problem in the bluestore's implementation.



At the same time, deferred writes slightly help performance when you
don't have an SSD. But the difference we're talking about is like tens of iops (30
vs 40), so it's not noticeable in the SSD era :).


What size IO's are these you are testing with? I see a difference going  
from around 50IOPs up to over a thousand for a single

threaded 4kb sequential test.


4Kb random writes. The numbers of 30-40 iops are from small HDD-only  
clusters (one 12x on 3 hosts, one 4x on ONE host - "scrap-ceph", home  
version :)). I've tried to play with prefer_deferred_size_hdd there and  
discovered that it had very little impact on random 4kb iodepth=128 iops.  
Which I think is slightly counter-intuitive because the expectation is  
that the deferred writes should increase random iops.


Careful here, Bluestore will only migrate the next level of its DB if it  
can fit the entire DB on the flash device. These cutoffs
are around 3GB, 30GB, 300GB by default, so anything in-between will not be
used. In your example a 20GB flash partition will mean that
a large amount of RocksDB will end up on the spinning disk  
(slowusedBytes)


Thanks, I didn't know that... I rechecked - all my 8TB osds with 20GB  
partitions migrated their DBs to slow devices again. Previously I moved  
them to SSDs with rebased Igor Fedotov's ceph-bluestool ... oops :)  
ceph-bluestore-tool. Although I still don't understand where the number 3  
comes from? Ceph's default bluestore_rocksdb_options states there are  
4*256MB memtables, it's 1GB, not 3...
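
For reference, a hedged sketch of such a migration with ceph-bluestore-tool (run with the OSD stopped; paths are examples, and bluefs-bdev-migrate only exists in patched or newer builds, as mentioned above):

ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-N --devs-source /var/lib/ceph/osd/ceph-N/block --dev-target /var/lib/ceph/osd/ceph-N/block.db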


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-23 Thread Vitaliy Filippov
Numbers are easy to calculate from RocksDB parameters, however I also  
don't understand why it's 3 -> 30 -> 300...


Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,  
L1 should be 10 GB, and L2 should be 100 GB?


These sizes are roughly 3GB,30GB,300GB. Anything in-between those  
sizes are pointless. Only ~3GB of SSD will ever be used out of a

28GB partition. Likewise a 240GB partition is also pointless as only
~30GB will be used.

Where did you get those numbers? I would like to read more if you can
point to a link.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread Vitaliy Filippov
In our Russian-speaking Ceph chat we swear at "ceph inside kuber" people all
the time because they often do not understand what state their cluster
is in at all


// Sorry to intervene :))

--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Vitaliy Filippov

Ah, yes, good question. I don't know if there is a true upper limit, but
leaving old snapshot around could hurt you when replaying journals and  
such.


Is it still so in Mimic?

Should I live in fear if I keep old snapshots all the time (because I'm  
using them as "checkpoints")? :)


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Bluestore HDD Cluster Advice

2019-02-13 Thread Vitaliy Filippov

Hello,

We'll soon be building out four new luminous clusters with Bluestore.
Our current clusters are running filestore so we're not very familiar
with Bluestore yet and I'd like to have an idea of what to expect.

Here are the OSD hardware specs (5x per cluster):
2x 3.0GHz 18c/36t
22x 1.8TB 10K SAS (RAID1 OS + 20 OSD's)
5x 480GB Intel S4610 SSD's (WAL and DB)
192 GB RAM
4X Mellanox 25GB NIC
PERC H730p

With filestore we've found that we can achieve sub-millisecond write
latency by running very fast journals (currently Intel S4610's). My
main concern is that Bluestore doesn't use journals and instead writes
directly to the higher latency HDD; in theory resulting in slower acks
and higher write latency. How does Bluestore handle this? Can we
expect similar or better performance than our current filestore
clusters?

I've heard it repeated that Bluestore performs better than Filestore
but I've also heard some people claiming this is not always the case
with HDD's. Is there any truth to that and if so is there a
configuration we can use to achieve this same type of performance with
Bluestore?


Bluestore does use a journal for small writes and doesn't for big ones. You
can try to force big writes through the journal too by increasing
bluestore_prefer_deferred_size, but it's generally pointless, because in
Bluestore the "journal" is RocksDB's journal (WAL), which creates way too
much extra write amplification when big data chunks are put into it. This
creates extra load for the SSDs, and write performance does not increase
compared to the default.


Bluestore is always better in terms of linear write throughput because it  
has no double-write for big data chunks. But it's roughly on par, and  
sometimes may even be slightly worse than filestore, in terms of 4K random  
writes.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2019-02-09 Thread Vitaliy Filippov

Hi Roman,

We recently discussed your tests and a simple idea came to my mind - can  
you repeat your tests targeting latency instead of max throughput? I mean  
just use iodepth=1. What is the latency, and on what hardware?



Well, I have been playing with the Ceph RDMA implementation for quite a while
and it has unsolved problems, thus I would say the status is
"not completely broken", but "you can run it at your own risk
and smile":

1. On disconnect of previously active (high write load) connection
there is a race that can lead to osd (or any receiver) crash:

https://github.com/ceph/ceph/pull/25447

2. Recent qlogic hardware (qedr drivers) does not support
IBV_EVENT_QP_LAST_WQE_REACHED, which is used in ceph rdma
implementation, pull request from 1. also targets this
incompatibility.

3. On high write load and many connections there is a chance,
that osd can run out of receive WRs and rdma connection (QP)
on sender side will get IBV_WC_RETRY_EXC_ERR, thus disconnected.
This is fundamental design problem, which has to be fixed on
protocol level (e.g. propagate backpressure to senders).

4. Unfortunately neither RDMA nor any other zero-latency network can
bring significant value, because the bottleneck is not the
network; please consider this for further reading regarding
transport performance in Ceph:

https://www.spinics.net/lists/ceph-devel/msg43555.html

Problems described above have quite a big impact on overall
transport performance.

--
Roman


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Vitaliy Filippov
Ceph has massive overhead, so it seems it maxes out at ~10000 (at most
15000) write iops per SSD with a queue depth of 128, and ~1000 iops with a
queue depth of 1 (1ms latency). Or maybe 2000-2500 write iops (0.4-0.5ms)
with the best possible hardware. Micron has only squeezed ~8750 iops from each
of their NVMes in their reference setup... the same NVMes reached ~290000
iops in their setup when connected directly.



Hi Maged

Thanks for your reply.


6k is low as a max write iops value... even for a single client. For a cluster
of 3 nodes, we see from 10k to 60k write iops depending on hardware.

Can you increase your threads to 64 or 128 via the -t parameter?


I can absolutely get it higher by increasing the parallelism. But I
may have failed to explain my purpose - I'm interested in how close I can
get with RBD to putting local SSD/NVMe in the servers. Thus putting
parallel scenarios that I would never see in production into the
tests does not really help my understanding. I think a concurrency level
of 16 is at the top of what I would expect our PostgreSQL databases to do
in real life.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore device’s device selector for Samsung NVMe

2019-01-15 Thread Vitaliy Filippov

Try `lspci -vs <pci-address>` and look for

`Capabilities: [148] Device Serial Number 00-02-c9-03-00-4f-68-7e`

in the output

--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Poor ceph cluster performance

2018-11-27 Thread Vitaliy Filippov

CPU: 2 x E5-2603 @1.8GHz
RAM: 16GB
Network: 1G port shared for Ceph public and cluster traffics
Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)


0.84 MB/s sequential write is impossibly bad; it's not normal with any
kind of device, even with a 1G network. You probably have some kind of
problem in your setup - maybe the network RTT is very high, or maybe the OSD or
MON nodes are shared with other running tasks and overloaded, or maybe your
disks are already dead... :))



As I moved on to test block devices, I got a following error message:

# rbd map image01 --pool testbench --name client.admin


You don't need to map it to run benchmarks, use `fio --ioengine=rbd`  
(however you'll still need /etc/ceph/ceph.client.admin.keyring)


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Vitaliy Filippov
At least when I run a simple O_SYNC random 4k write test with a random
Intel 545s SSD plugged in through a USB3-SATA adapter (UASP), pull the USB
cable out and then recheck the written data, everything is good and nothing
is lost (however iops are of course low, 1100-1200).


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Vitaliy Filippov

Ceph issues fsync's all the time


...and, of course, it has journaling :) (only fsync is of course not  
sufficient)


with enterprise SSDs which have capacitors fsync just becomes a no-op and  
thus transactional write performance becomes the same as non-transactional  
(i.e. 10+ times faster for 4k random writes)


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Vitaliy Filippov

the real risk is the lack of power loss protection. Data can be
corrupted on unclean shutdowns


it's not! lack of "advanced power loss protection" only means lower iops
with fsync, but not the possibility of data corruption

"advanced power loss protection" is basically the synonym for
"non-volatile cache"


A few years ago it was pretty common knowledge that if it didn't have
capacitors - and thus power-loss protection - then an unexpected power-off
could lead to data-loss situations. Perhaps I'm not up to date with recent
developments. Is it a solved problem today in consumer-grade SSDs?
.. any links to insight/testing/etc would be welcome.

https://arstechnica.com/civis/viewtopic.php?f=11&t=1383499
- does at least not support the viewpoint.


All disks (HDDs and SSDs) have cache and may lose non-transactional writes  
that are in-flight. However, any adequate disk handles fsync's (i.e SATA  
FLUSH CACHE commands). So transactional writes should never be lost, and  
in Ceph ALL writes are transactional - Ceph issues fsync's all the time.  
Another example is DBMS-es - they also issue an fsync when you COMMIT.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-25 Thread Vitaliy Filippov

On 24 Nov 2018, at 18.09, Anton Aleksandrov  wrote
We plan to have data on a dedicated disk in each node and my question is
about the WAL/DB for Bluestore. How bad would it be to place it on the
consumer-grade system SSD? How big is the risk that everything will get
"slower than using a spinning HDD for the same purpose"? And how big is the
risk that our nodes will die because of SSD lifespan?


just try and tell us :) I can't imagine it may be slower than colocated  
db+wal+data.


also it depends on exact SSD models, but a lot of SSDs (even consumer  
ones) in fact survive 10-20 times more writes than claimed by the  
manufacturer. only some really cheap chinese ones don't...


there's an article on 3dnews about it: https://3dnews.ru/938764/

the real risk is the lack of power loss protection. Data can be  
corrupted on unclean shutdowns


it's not! lack of "advanced power loss protection" only means lower iops  
with fsync, but not the possibility of data corruption


"advanced power loss protection" is basically the synonym for  
"non-volatile cache"



Disabling cache may help


it won't help on consumer ssds, because (write+fsync) performance is  
roughly the same as (write with cache disabled) for them


Ceph is always issuing at least as many fsync's as writes, so it's  
basically always operating in "disk cache disabled" mode


at the same time, disabling disk write cache on enterprise SSDs (hdparm -W  
0) often increases random write iops by an order of magnitude. not sure  
why. maybe because kernel flushes disk queue on every sync if it thinks  
disk cache is enabled...


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times

2018-11-11 Thread Vitaliy Filippov

Even more weird then - what drives are in the other cluster?


Desktop Toshiba and Seagate Constellation 7200rpm

As I understand by now the main impact is for SSD+HDD clusters. Enabled  
HDD write cache causes kernel to send flush requests for it (when write  
cache is disabled it doesn't bother about that) and probably it affects  
something else and causes some extra waits for SSD journal (although it's  
strange and looks like a bug to me). I tried to check latencies in `ceph  
daemon osd.xx perf dump` and both kv_commit_lat and commit_lat decreased  
~10 times when I disabled HDD write cache (although both are SSD-related  
as I understand).


Maybe your HDDs are connected via some RAID controller, and when you disable
the cache it doesn't really get disabled, but the kernel just stops issuing
flush requests and makes some writes unsafe?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times

2018-11-11 Thread Vitaliy Filippov
It seems no, I've just tested it on another small cluster with HDDs only -  
no change



Does it make sense to test disabling this on hdd cluster only?


--
With best regards,
  Vitaliy Filippov


[ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times

2018-11-10 Thread Vitaliy Filippov

Hi

A weird thing happens in my test cluster made from desktop hardware.

The command `for i in /dev/sd?; do hdparm -W 0 $i; done` increases  
single-thread write iops (reduces latency) 7 times!


It is a 3-node cluster with Ryzen 2700 CPUs, 3x SATA 7200rpm HDDs + 1x  
SATA desktop SSD for system and ceph-mon + 1x SATA server SSD for  
block.db/wal in each host. Hosts are linked by 10gbit ethernet (not the  
fastest one though, average RTT according to flood-ping is 0.098ms). Ceph  
and OpenNebula are installed on the same hosts, OSDs are prepared with  
ceph-volume and bluestore with default options. SSDs have capacitors  
('power-loss protection'), write cache is turned off for them since the  
very beginning (hdparm -W 0 /dev/sdb). They're quite old, but each of them  
is capable of delivering ~22000 iops in journal mode (fio -sync=1  
-direct=1 -iodepth=1 -bs=4k -rw=write).


However, RBD single-threaded random-write benchmark originally gave awful  
results - when testing with `fio -ioengine=libaio -size=10G -sync=1  
-direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60  
-filename=./testfile` from inside a VM, the result was only 58 iops  
average (17ms latency). This was not what I expected from the HDD+SSD  
setup.


But today I tried to play with cache settings for data disks. And I was  
really surprised to discover that just disabling HDD write cache (hdparm  
-W 0 /dev/sdX for all HDD devices) increases single-threaded performance  
~7 times! The result from the same VM (without even rebooting it) is  
iops=405, avg lat=2.47ms. That's a magnitude faster and in fact 2.5ms  
seems sort of an expected number.


As I understand it, 4k writes are always deferred at the default setting of
prefer_deferred_size_hdd=32768; this means they should only get written to
the journal device before the OSD acks the write operation.


So my question is WHY? Why does HDD write cache affect commit latency with  
WAL on an SSD?


I would also appreciate if anybody with similar setup (HDD+SSD with  
desktop SATA controllers or HBA) could test the same thing...


--
With best regards,
  Vitaliy Filippov