Re: [ceph-users] Consumer-grade SSD in Ceph
> We didn’t find a measurable difference doing this on 5100s, ymmv.

It depends on the controller... With chipset SATA and an LSI 9200 HBA the difference is huge. I have some evidence here: https://yourcmc.ru/wiki/Ceph_performance#Server_SSDs

With some controllers it may not be the case.

--
With best regards,
Vitaliy Filippov
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Benchmark diffrence between rados bench and rbd bench
> rados bench -p scbench 60 seq --io-size 8192 --io-threads 256
> Read size: 4194304

rados bench doesn't have an --io-size option. Testing sequential reads with an 8K I/O size is a strange idea anyway, though.
Re: [ceph-users] ssd requirements for wal/db
WAL/DB isn't "read intensive". It's more "write intensive" :) use server SSDs with capacitors to get adequate write performance.

> Hi all,
> We are thinking about putting the wal/db of our HDDs on SSDs. If we would put the wal&db of 4 HDDs on 1 SSD as recommended, what type of SSD would suffice? We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs. Does someone have some experience with this configuration? Would we need SAS SSDs instead of SATA? And Mixed Use 3DWPD instead of Read Intensive?
Re: [ceph-users] latency on OSD
I'd recommend SSDs.

> Hi all,
> I have installed ceph luminous, with 5 nodes (45 OSDs), ceph-osd on each node
> network: bond LACP 10GB
> RAM: 96GB
> HD: 9 disks SATA 3TB (bluestore)
> I wanted to ask for help to fix the latency of the OSDs shown by "ceph osd perf". What do you recommend?
> My config is: /etc/ceph/ceph.conf
> [global]
> fsid = 414507dd-8a16-4548-86b7-906b0c9905e1
> mon_initial_members = controller01,controller02,controller03
> mon_host = 192.168.13.11,192.168.13.12,192.168.13.13
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 192.168.13.0/24
> cluster network = 192.168.10.0/24
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_pool_default_flag_hashpspool = true
> [osd]
> osd_scrub_begin_hour = 22
> osd_scrub_end_hour = 6
> ---
> ceph osd perf
> osd  commit_latency(ms)  apply_latency(ms)
>  0    49    49
>  1   120   120
>  2    36    36
>  3    65    65
>  4    19    19
>  5    57    57
>  6   112   112
>  7    53    53
>  8   159   159
>  9   226   226
> 10    21    21
> 11    79    79
> 12    50    50
> 13   133   133
> 14   105   105
> 15    65    65
> 16    32    32
> 17    64    64
> 18    62    62
> 19    78    78
> 20    71    71
> 21    97    97
> 22   168   168
> 23   108   108
> 24   119   119
> 25   219   219
> 26   144   144
> 27    26    26
> 28    76    76
> 29   176   176
> 30    23    23
> 31    91    91
> 32    30    30
> 33    64    64
> 34    21    21
> 35    73    73
> 36   124   124
> 37    85    85
> 38    39    39
> 39    36    36
> 40    27    27
> 41    33    33
> 42    49    49
> 43    22    22
> 44    44    44
Re: [ceph-users] Future of Filestore?
> /dev/vdb:
>  Timing cached reads: 2556 MB in 1.99 seconds = 1281.50 MB/sec
>  Timing buffered disk reads: 62 MB in 3.03 seconds = 20.48 MB/sec
>
> That is without any special tuning, just migrating back to FileStore… journal is on the HDD (it wouldn't let me put it on the SSD like it did last time). As I say, not going to set the world on fire, but 20MB/sec is quite usable for my needs. The 4× speed increase is very welcome!

I get 60 MB/s inside a VM in my home nano-ceph consisting of 5 HDDs, 4 of which are inside one PC and the 5th is plugged into a ROCK64 :)) I use Bluestore...
Re: [ceph-users] New best practices for osds???
One RAID0 array per drive :)

> I can't understand how using RAID0 is better than JBOD, considering JBOD would be many individual disks, each used as an OSD, instead of a single big one used as a single OSD.
Re: [ceph-users] New best practices for osds???
OK, I meant "it may help performance" :) The main point is that we had at least one case of data loss due to some Adaptec controller in RAID0 mode, discussed recently in our ceph chat...
Re: [ceph-users] New best practices for osds???
It helps performance, but it can also lead to data loss if the raid controller is crap (not flushing data correctly)
Re: [ceph-users] Future of Filestore?
Linear reads, `hdparm -t /dev/vda`. Check if you have `cache=writeback` enabled in your VM options. If it's enabled but you still get 5 MB/s, then try to benchmark your cluster with fio -ioengine=rbd from outside a VM. Like:

fio -ioengine=rbd -name=test -bs=4M -iodepth=16 -rw=read -pool=rpool -runtime=60 -rbdname=testimg
Re: [ceph-users] Future of Filestore?
5 MB/s in what mode? For linear writes, that definitely means some kind of misconfiguration. For random writes... there's a handbrake in Bluestore which makes random writes run at half speed in HDD-only setups :) https://github.com/ceph/ceph/pull/26909 - and if you release that handbrake you actually get better random writes on HDDs with bluestore, too.
Re: [ceph-users] BlueFS spillover detected - 14.2.1
All values except 4, 30 and 286 GB are currently useless in ceph with default rocksdb settings :) That's what you are seeing - all devices just use ~28 GB and everything else goes to HDDs.
Re: [ceph-users] Recommended fs to use with rbd
...which only works when mapped with `virtio-scsi` (not with the regular virtio driver) :)

> The only important thing is to enable discard/trim on the file system.
Re: [ceph-users] fio test rbd - single thread - qd1
`cpupower idle-set -D 0` will help you a lot, yes. However it seems that it's not only bluestore that makes it slow - >= 50% of the latency is introduced by the OSD itself. I'm just trying to understand WHAT parts of it are doing so much work.

For example in my current case (with cpupower idle-set -D 0, of course), when I was testing a single OSD on a very good drive (an Intel NVMe capable of 40000+ single-thread sync write iops), it was delivering me only 950-1000 iops. That's roughly 1 ms latency, and only 50% of it comes from bluestore (you can see it in `ceph daemon osd.x perf dump`)! I've even tuned bluestore a little, so that now I'm getting ~1200 iops from it. It means that bluestore's latency dropped by 33% (total latency was 1/1000 = 1 ms with ~500 us of it in bluestore; now it's 1/1200 ≈ 830 us with ~330 us in bluestore). But still the overall improvement is only 20% - everything else is eaten by the OSD itself.
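A quick back-of-the-envelope check of the numbers above (only the 1000 and 1200 iops figures and the ~50% bluestore share come from the post; the rest is derived, assuming the non-bluestore OSD overhead stayed constant):

```python
# At queue depth 1, average latency (us) is just the reciprocal of iops.
def latency_us(iops):
    return 1_000_000 / iops

before_total = latency_us(1000)                  # ~1000 us per write
bluestore_before = before_total * 0.5            # ~500 us attributed to bluestore
osd_overhead = before_total - bluestore_before   # ~500 us, assumed unchanged by tuning

after_total = latency_us(1200)                   # ~833 us after tuning
bluestore_after = after_total - osd_overhead     # ~333 us left in bluestore

drop = 1 - bluestore_after / bluestore_before    # ~33% drop in bluestore latency
overall = 1200 / 1000 - 1                        # but only a 20% overall iops gain
```

The asymmetry is the whole point of the post: a 33% improvement inside bluestore only buys 20% overall, because the other half of the latency lives elsewhere in the OSD.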
Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?
These options aren't needed: numjobs is 1 by default, and RBD has no "sync" concept at all - operations are always "sync" by default. In fact even --direct=1 may be redundant because there's no page cache involved. However I keep it just in case - there is the RBD cache, what if one day fio gets it enabled? :)

> how about adding: --sync=1 --numjobs=1 to the command as well?
Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?
There are 2:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=bench -rbdname=testimg

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=bench -rbdname=testimg

The first measures your minimum possible latency - it does not scale with the number of OSDs at all, but it's usually what real applications like DBMSes need. The second measures your maximum possible random write throughput, which you probably won't be able to utilize if you don't have enough VMs all writing in parallel.
Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?
Welcome to our "slow ceph" party :))) However I have to note that:

1) 500000 iops is for 4 KB blocks. You're testing it with 4 MB ones. That's a kind of unfair comparison.

2) fio -ioengine=rbd is better than rados bench for testing.

3) You can't "compensate" for Ceph's overhead even by having infinitely fast disks. At its simplest, imagine that disk I/O takes X microseconds and Ceph's overhead is Y for a single operation. Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800 microseconds or so. So the best IOPS number you can squeeze out of a single client thread (a DBMS, for example) is 1000000/400 = only ~2500 iops. Parallel iops are of course better, but still you won't get anything close to 500000 iops from a single OSD. The expected number is around 15000. Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you want better results.
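The serial model in point 3 can be written out directly (X = 10 us is an assumed example disk latency; Y = 400 us is the post's lower bound for Ceph's per-op overhead):

```python
# IOPS at queue depth 1 is the reciprocal of per-operation latency.
def iops(latency_us):
    return 1_000_000 / latency_us

X = 10    # assumed raw disk latency, us (e.g. a fast NVMe/Optane)
Y = 400   # Ceph per-op overhead, us (lower bound quoted in the post)

raw_disk = iops(X)          # 100000 iops from the disk alone
through_ceph = iops(X + Y)  # ~2439 iops - dominated by Y, not by the disk
upper_bound = iops(Y)       # even a zero-latency disk caps at 2500 iops per thread
```

Making the disk 10x faster barely moves `through_ceph`, which is why "compensating" with faster disks doesn't work.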
Re: [ceph-users] Mounting image from erasure-coded pool without tiering in KVM
Check if you have a recent enough librbd installed on your VM hosts.

> Hello, all!
> I have a problem with adding image volumes to my KVM VM. I prepared an erasure coded pool (named data01) on full-bluestore OSDs and allowed ec_overwrites on it. Also I created a replicated pool for image volume metadata named ssd-repl.
> Pools were prepared by:
> ceph osd pool create data01 1024 1024 erasure 2-1-isa-v
> ceph osd pool set data01 allow_ec_overwrites true
> rbd pool init data01
> Image was created using:
> rbd create --size 25G --data-pool data01 ssd-repl/vm-5
> Image info:
> [ceph@alfa-csn-01 ~]$ rbd info ssd-repl/vm-5
> rbd image 'vm-5':
>   size 25 GiB in 6400 objects
>   order 22 (4 MiB objects)
>   id: a20c46b8b4567
>   data_pool: data01
>   block_name_prefix: rbd_data.21.a20c46b8b4567
>   format: 2
>   features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
>   op_features:
>   flags:
>   create_timestamp: Tue Mar 5 16:51:59 2019
> So it seems all should work. But when I try to run the VM with this disk attached I'm getting the following error:
> root@alfa-cpu-02:~# virsh start vm-5
> error: Failed to start domain vm-5
> error: internal error: process exited while connecting to monitor: 2019-03-05T13:53:30.020525Z qemu-system-x86_64: -drive file=rbd:ssd-repl/vm-5:id=libvirt:key=AQBD5GJc40bjNhAA7qV6hZYumI7FUDkhElxMYw==:auth_supported=cephx\;none:mon_host=10.212.3.161\:6789,format=raw,if=none,id=drive-virtio-disk1: error reading header from vm-5
> XML config for this volume from my VM:
> If I create the whole image in a replicated pool then all works as expected: I can connect and work with this disk inside the VM. What could be the reason for such behavior? What did I miss in the configuration? Thanks in advance!
Re: [ceph-users] optimize bluestore for random write i/o
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batched I/O, or just fio -ioengine=rbd from outside a VM) is rather pointless - you're benchmarking the RBD cache, not Ceph itself. The RBD cache is coalescing your writes into big sequential writes. Of course bluestore is faster in this case - it has no double write for big writes.

I'll probably try to test these settings - I'm also interested in random write iops in an all-flash bluestore cluster :) but I don't think any rocksdb options will help. I found bluestore pretty untunable in terms of performance :)

The best thing to do for me was to disable CPU powersaving (set the governor to performance + cpupower idle-set -D 1). Your CPUs become frying pans, but write IOPS - especially single-thread write IOPS, which are the worst-case scenario AND at the same time the thing applications usually need - increase 2-3 times. Test it with fio -ioengine=rbd -bs=4k -iodepth=1.

Another thing that I've done on my cluster was to set `bluestore_min_alloc_size_ssd` to 4096. The reason is that it's 16kb by default, which means all writes below 16kb use the same deferred write path as with HDDs. Deferred writes only increase the WA factor for SSDs and lower the performance. You have to recreate OSDs after changing this variable - it's only applied at the time of OSD creation.

I'm also currently trying another performance fix, kind of... but it involves patching ceph's code, so I'll share it later if I succeed.

> Hello list,
> while the performance of sequential 4k writes on bluestore is very high and even higher than filestore, I was wondering what I can do to optimize random patterns as well. While using:
> fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4 --filename=/tmp/test --size=10G --runtime=60 --group_reporting --name=test --direct=1
> I get 36000 iop/s on bluestore while having 11500 on filestore. Using randwrite gives me 17000 on filestore and only 9500 on bluestore. This is on all flash / ssd running luminous 12.2.10.
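For reference, the min_alloc_size change described above would look like this in ceph.conf (a sketch - the option name is real, but as noted it only takes effect for newly created OSDs):

```ini
# Applied only at OSD creation time (mkfs); existing OSDs must be redeployed
# for this to take effect.
[osd]
bluestore_min_alloc_size_ssd = 4096
```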
Re: [ceph-users] RBD poor performance
Yes (I mean yes, it's real). Ceph's tiering works by moving whole (4MB) objects to the cache pool, updating them there (even for 4K random writes) and evicting them back when the cache is full. I.e. the bad part here is that it can't do "write-through". Also there are some configuration options regarding the eviction process - you can try to tune them. But don't expect the fundamentals to change: when the cache pool is full, Ceph will still need to evict something from there.

Why do you want cache tiering at all? Just use `allow_ec_overwrites=true` if you're using EC and mount your RBD/CephFS directly without a cache pool.

> During one of my tests I found that fio inside my VM generates 1 MiB/s (about 150 IOPS), but `ceph -s' shows me 500 MiB/s of flushing and 280 MiB/s of evicting data. How could it be? Is it real? Do you have any optimization policies inside Ceph to eliminate such behaviour?
Re: [ceph-users] Right way to delete OSD from cluster?
+1, I also think it's strange that deleting an OSD by "osd out -> osd purge" causes two rebalances instead of one.
Re: [ceph-users] RBD poor performance
By "maximum write iops of an osd" I mean total iops divided by the number of OSDs. For example, an expensive setup from Micron (https://www.micron.com/about/blog/2018/april/micron-9200-max-red-hat-ceph-storage-30-reference-architecture-block-performance) has got only 8750 peak write iops per NVMe. These exact NVMes are rated for 260000+ iops when connected directly :). CPU is a real bottleneck. The need for a Seastar-based rewrite is not a joke! :)

Total iops is the number coming from a test like:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool= -runtime=60 -rbdname=testimg

...or from several such jobs run in parallel, each over a separate RBD image. This is a "random write bandwidth" test and, in fact, it's not the most useful one - the single-thread latency usually matters more than total bandwidth. To test for it, run:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool= -runtime=60 -rbdname=testimg

You'll get a pretty low number (< 100 for HDD clusters, 500-1000 for SSD clusters), and it's expected to be low. Everything above 1000 iops (< 1 ms latency; single-thread iops = 1 / avg latency) is hard to achieve with Ceph no matter what disks you're using. Also, single-thread latency does not depend on the number of OSDs in the cluster, because the workload is not parallel.

However you can also test iops of single OSDs by creating a pool with size=1 and using a custom benchmark tool we've made with our colleagues from a russian Ceph chat... we can publish it here a short time later if you want :).

> At some point I would expect the cpu to be the bottleneck. They have always been saying this here: for better latency get fast cpu's. Would be nice to know what GHz you are testing, and how that scales. Rep 1-3, erasure probably also takes a hit. How do you test maximum iops of the osd? (Just curious, so I can test mine)
> I have posted here a while ago a cephfs test on ssd rep 1 that was performing nowhere near native, asking if this was normal. But never got a response to it. I can remember that they sent everyone a questionnaire and asked if they should focus on performance more; now I wish I'd checked that box ;)
Re: [ceph-users] RBD poor performance
To me it seems Ceph's iops limit is ~10000 (maybe 15000 with best-in-slot hardware) per OSD. After that number it starts to get stuck on CPU. I've tried to create a pool from 3 OSDs in loop devices over tmpfs and I've only got ~15000 iops :) good disks aren't the bottleneck, CPU is.
Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD
I think this should not lead to blocked ops in any case, even if the performance is low...
Re: [ceph-users] Configuration about using nvme SSD
I've tried 4x OSD on fast SAS SSDs in a test setup with only 2 such drives in the cluster - it increased CPU consumption a lot, but total 4Kb random write iops (RBD) only went from ~11000 to ~22000. So it was a 2x increase, but at a huge cost.

> One thing that's worked for me to get more out of nvmes with Ceph is to create multiple partitions on the nvme with an osd on each partition. That way you get more osd processes and CPU per nvme device. I've heard of people using up to 4 partitions like this.
Re: [ceph-users] Configuration about using nvme SSD
> We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS per OSD by rados.

Don't expect Ceph to fully utilize NVMes, it's software and it's slow :) some colleagues tell me that SPDK works out of the box, but almost doesn't increase performance, because the userland-kernel interaction isn't the bottleneck currently - it's Ceph code itself. I also tried it once, but I couldn't make it work. When I have some spare NVMes I'll make another attempt. So... try it and share your results here :) we're all interested.
Re: [ceph-users] Bluestore HDD Cluster Advice
> Hello, what IO size are you testing? Bluestore will only defer writes under 32kb in size by default. Unless you are writing sequentially, only a limited amount of buffering via SSD is going to help; you will eventually hit the limits of the disk. Could you share some more details, as I'm interested in this topic as well.

I'm testing 4kb random writes, mostly with iodepth=1 (single-thread latency test). This is the main case which is expected to be sped up by the SSD journal, and also the worst case for SDS's :).

> Interesting, will have to investigate this further!!! I wish there were more details around this technology from HGST

It's simple to test yourself - a similar thing is currently common in SMR drives. Pick a random cheap 2.5" 1TB Seagate SMR HDD and test it with fio with one of the `sync` or `fsync` options and iodepth=32 - you'll see it handles more than 1000 random 4Kb write iops. It only handles so much until its buffer is full, of course. When I tested one of these I found that the buffer was 8 GB. After writing 8 GB the performance drops to ~30-50 iops, and when the drive is idle it starts to flush the buffer. This process takes a lot of time if the buffer is full (several hours). The difference between 2.5" SMR Seagates and HGSTs is that HGSTs only enable "media cache" when the volatile cache is disabled (which was a real surprise to me), and SMRs keep it enabled all the time.

But the thing that really confused me was that Bluestore random write performance - even single-threaded write performance (latency test) - changed when I altered the parameter of the DATA device (not the journal)! WHY was it affected? Based on common sense and bluestore's documentation, random deferred write commit time when the system is not under load (and with iodepth=1 it isn't) should only depend on the WAL device performance! But it's also affected by the data device, which tells us there is some problem in bluestore's implementation.

At the same time, deferred writes slightly help performance when you don't have an SSD. But the difference we're talking about is like tens of iops (30 vs 40), so it's not noticeable in the SSD era :).

> What size IO's are these you are testing with? I see a difference going from around 50 IOPs up to over a thousand for a single-threaded 4kb sequential test.

4Kb random writes. The numbers of 30-40 iops are from small HDD-only clusters (one 12x on 3 hosts, one 4x on ONE host - "scrap-ceph", home version :)). I've tried to play with prefer_deferred_size_hdd there and discovered that it had very little impact on random 4kb iodepth=128 iops. Which I think is slightly counter-intuitive, because the expectation is that deferred writes should increase random iops.

> Careful here, Bluestore will only migrate the next level of its DB if it can fit the entire DB on the flash device. These cutoffs are around 3GB, 30GB, 300GB by default, so anything in-between will not be used. In your example a 20GB flash partition will mean that a large amount of RocksDB will end up on the spinning disk (slowUsedBytes)

Thanks, I didn't know that... I rechecked - all my 8TB osds with 20GB partitions migrated their DBs to slow devices again. Previously I moved them to SSDs with a rebased Igor Fedotov's ceph-bluestool... oops :) ceph-bluestore-tool. Although I still don't understand where the number 3 comes from. Ceph's default bluestore_rocksdb_options states there are 4*256MB memtables - that's 1GB, not 3...
Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?
Numbers are easy to calculate from RocksDB parameters; however, I also don't understand why it's 3 -> 30 -> 300... Default memtables are 256 MB and there are 4 of them, so L0 should be 1 GB, L1 should be 10 GB, and L2 should be 100 GB?

>> These sizes are roughly 3GB, 30GB, 300GB. Anything in-between those sizes is pointless. Only ~3GB of SSD will ever be used out of a 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB will be used.
>
> Where did you get those numbers? I would like to read more if you can point to a link.
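The arithmetic behind those cutoffs can be sketched in a few lines (the base size and multiplier here are assumed, RocksDB-style values chosen to reproduce the quoted figures - not read from Ceph's actual bluestore_rocksdb_options):

```python
# A level can live on the fast device only if the device holds that level
# plus everything above it, hence the stepwise "useful size" cutoffs.
base_gb = 0.25   # assumed first-level target size (~256 MB)
multiplier = 10  # assumed size ratio between consecutive levels

levels = [base_gb * multiplier ** i for i in range(4)]  # 0.25, 2.5, 25, 250 GB
cumulative = [sum(levels[: i + 1]) for i in range(4)]   # space needed for levels 0..i
print([round(c, 2) for c in cumulative])  # [0.25, 2.75, 27.75, 277.75]
```

Under these assumptions the often-quoted cutoffs fall out as the cumulative sums (~3, ~30, ~300 GB): a DB partition sized between two cutoffs only ever uses up to the lower one.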
Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook
In our russian-speaking Ceph chat we swear at "ceph inside kuber" people all the time, because they often do not understand what state their cluster is in at all // Sorry to intervene :))
Re: [ceph-users] Replicating CephFS between clusters
> Ah, yes, good question. I don't know if there is a true upper limit, but leaving old snapshots around could hurt you when replaying journals and such.

Is it still so in mimic? Should I live in fear if I keep old snapshots all the time (because I'm using them as "checkpoints")? :)
Re: [ceph-users] Bluestore HDD Cluster Advice
> Hello,
> We'll soon be building out four new luminous clusters with Bluestore. Our current clusters are running filestore, so we're not very familiar with Bluestore yet and I'd like to have an idea of what to expect. Here are the OSD hardware specs (5x per cluster):
> 2x 3.0GHz 18c/36t
> 22x 1.8TB 10K SAS (RAID1 OS + 20 OSD's)
> 5x 480GB Intel S4610 SSD's (WAL and DB)
> 192 GB RAM
> 4x Mellanox 25GB NIC
> PERC H730p
> With filestore we've found that we can achieve sub-millisecond write latency by running very fast journals (currently Intel S4610's). My main concern is that Bluestore doesn't use journals and instead writes directly to the higher-latency HDD, in theory resulting in slower acks and higher write latency. How does Bluestore handle this? Can we expect similar or better performance than our current filestore clusters? I've heard it repeated that Bluestore performs better than Filestore, but I've also heard some people claiming this is not always the case with HDD's. Is there any truth to that, and if so, is there a configuration we can use to achieve this same type of performance with Bluestore?

Bluestore does use journals for small writes and doesn't for big ones. You can try to disable "small writes" by increasing bluestore_prefer_deferred_size, but it's generally pointless, because in Bluestore the "journal" is RocksDB's journal (WAL), which creates way too much extra write amplification when big data chunks are put into it. This creates extra load for the SSDs, and write performance does not increase compared to the default.

Bluestore is always better in terms of linear write throughput because it has no double-write for big data chunks. But it's roughly on par with, and sometimes may even be slightly worse than, filestore in terms of 4K random writes.
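A sketch of what the deferred-size experiment mentioned above might look like (the option name exists; the 1 MB value is an arbitrary illustration, and the post argues this is generally not worth doing):

```ini
[osd]
# Route writes below 1 MB through the RocksDB WAL ("journal") path on HDDs.
# Expect extra write amplification on the DB device rather than a speedup.
bluestore_prefer_deferred_size_hdd = 1048576
```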
Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host
Hi Roman,

We recently discussed your tests, and a simple idea came to my mind - can you repeat your tests targeting latency instead of max throughput? I mean, just use iodepth=1. What is the latency, and on what hardware?

> Well, I am playing with the ceph rdma implementation quite a while and it has unsolved problems, thus I would say the status is "not completely broken", but "you can run it on your own risk and smile":
>
> 1. On disconnect of a previously active (high write load) connection there is a race that can lead to an osd (or any receiver) crash: https://github.com/ceph/ceph/pull/25447
>
> 2. Recent qlogic hardware (qedr drivers) does not support IBV_EVENT_QP_LAST_WQE_REACHED, which is used in the ceph rdma implementation; the pull request from 1. also targets this incompatibility.
>
> 3. On high write load and many connections there is a chance that an osd can run out of receive WRs, and the rdma connection (QP) on the sender side will get IBV_WC_RETRY_EXC_ERR, thus disconnected. This is a fundamental design problem, which has to be fixed on the protocol level (e.g. propagate backpressure to senders).
>
> 4. Unfortunately neither rdma nor any other 0-latency network can bring significant value, because the bottleneck is not the network; please consider this for further reading regarding transport performance in ceph: https://www.spinics.net/lists/ceph-devel/msg43555.html
>
> Problems described above have quite a big impact on overall transport performance.
>
> --
> Roman
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Ceph is a massive overhead, so it seems it maxes out at ~10000 (at most 15000) write iops per SSD with a queue depth of 128, and ~1000 iops with a queue depth of 1 (1 ms latency). Or maybe 2000-2500 write iops (0.4-0.5 ms) with the best possible hardware. Micron has only squeezed ~8750 iops from each of their NVMes in their reference setup... the same NVMes reached 290000 iops when connected directly.

> Hi Maged,
> Thanks for your reply.
>
>> 6k is low as a max write iops value.. even for a single client. For a cluster of 3 nodes, we see from 10k to 60k write iops depending on hardware. Can you increase your threads to 64 or 128 via the -t parameter?
>
> I can absolutely get it higher by increasing the parallelism. But I may have missed explaining my purpose - I'm interested in how close I can get with RBD to putting a local SSD/NVMe in the servers. Thus putting parallel scenarios that I would never see in production into the tests does not really help my understanding. I think a concurrency level of 16 is at the top of what I would expect our PostgreSQL databases to do in real life.
Re: [ceph-users] Bluestore device’s device selector for Samsung NVMe
Try `lspci -vs` on the device and look for `Capabilities: [148] Device Serial Number 00-02-c9-03-00-4f-68-7e` in the output
Re: [ceph-users] Poor ceph cluster performance
> CPU: 2 x E5-2603 @1.8GHz
> RAM: 16GB
> Network: 1G port shared for Ceph public and cluster traffic
> Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
> OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)

0.84 MB/s sequential write is impossibly bad - it's not normal with any kind of devices, even with a 1G network. You probably have some kind of problem in your setup: maybe the network RTT is very high, or the osd/mon nodes are shared with other running tasks and overloaded, or maybe your disks are already dead... :))

> As I moved on to test block devices, I got the following error message:
> # rbd map image01 --pool testbench --name client.admin

You don't need to map it to run benchmarks - use `fio --ioengine=rbd` (however you'll still need /etc/ceph/ceph.client.admin.keyring)
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
At least when I run a simple O_SYNC random 4k write test with a random Intel 545s SSD plugged in through a USB3-SATA adapter (UASP), pull the USB cable out and then recheck the written data, everything is good and nothing is lost (however iops are of course low, 1100-1200)
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
Ceph issues fsync's all the time... and, of course, it has journaling :) (fsync alone is of course not sufficient). With enterprise SSDs which have capacitors, fsync effectively becomes a no-op, and thus transactional write performance becomes the same as non-transactional (i.e. 10+ times faster for 4k random writes) -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
the real risk is the lack of power loss protection. Data can be corrupted on unclean shutdowns it's not! the lack of "advanced power loss protection" only means lower iops with fsync, not the possibility of data corruption - "advanced power loss protection" is basically a synonym for "non-volatile cache" A few years ago it was pretty common knowledge that if a drive didn't have capacitors - and thus power-loss protection - then an unexpected power-off could lead to data-loss situations. Perhaps I'm not up to date with recent developments. Is it a solved problem today in consumer-grade SSDs? .. any links to insight/testing/etc would be welcome. https://arstechnica.com/civis/viewtopic.php?f=11&t=1383499 - at least does not support that viewpoint. All disks (HDDs and SSDs) have cache and may lose non-transactional writes that are in flight. However, any adequate disk handles fsync's (i.e. SATA FLUSH CACHE commands). So transactional writes should never be lost, and in Ceph ALL writes are transactional - Ceph issues fsync's all the time. Another example is DBMSes - they also issue an fsync when you COMMIT. -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
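The transactional-vs-non-transactional gap described above can be measured with fio by toggling `fsync=1` - a sketch only; the device path is a placeholder, and the job destroys data on the target:

```ini
; fsync-compare.fio - 4k random writes, plain vs. an fsync after every
; write. On consumer SSDs the 'transactional' job is typically many
; times slower; on SSDs with capacitors the two results are close.
; WARNING: overwrites the target device.
[global]
filename=/dev/sdX
ioengine=libaio
rw=randwrite
bs=4k
iodepth=1
direct=1
runtime=30
time_based

[plain]
stonewall

[transactional]
stonewall
fsync=1
```

`stonewall` makes the two jobs run one after the other, so the iops numbers in fio's per-job summary can be compared directly.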
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
On 24 Nov 2018, at 18.09, Anton Aleksandrov wrote We plan to have data on a dedicated disk in each node and my question is about WAL/DB for Bluestore. How bad would it be to place it on a consumer-grade system SSD? How big is the risk that everything will get "slower than using a spinning HDD for the same purpose"? And how big is the risk that our nodes will die because of SSD lifespan? just try and tell us :) I can't imagine it being slower than colocated db+wal+data. also it depends on the exact SSD models, but a lot of SSDs (even consumer ones) in fact survive 10-20 times more writes than claimed by the manufacturer. only some really cheap chinese ones don't... there's an article on 3dnews about it: https://3dnews.ru/938764/ the real risk is the lack of power loss protection. Data can be corrupted on unclean shutdowns it's not! the lack of "advanced power loss protection" only means lower iops with fsync, not the possibility of data corruption - "advanced power loss protection" is basically a synonym for "non-volatile cache" Disabling cache may help it won't help on consumer ssds, because (write+fsync) performance is roughly the same as (write with cache disabled) for them. Ceph always issues at least as many fsync's as writes, so it's basically always operating in "disk cache disabled" mode. at the same time, disabling the disk write cache on enterprise SSDs (hdparm -W 0) often increases random write iops by an order of magnitude. not sure why. maybe because the kernel flushes the disk queue on every sync if it thinks the disk cache is enabled... -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
Even more weird then - what drives are in the other cluster? Desktop Toshiba and Seagate Constellation 7200rpm As I understand it by now, the main impact is on SSD+HDD clusters. An enabled HDD write cache causes the kernel to send flush requests to the drive (when the write cache is disabled it doesn't bother with that), and this probably affects something else and causes some extra waits for the SSD journal (although it's strange and looks like a bug to me). I tried to check latencies in `ceph daemon osd.xx perf dump` and both kv_commit_lat and commit_lat decreased ~10 times when I disabled the HDD write cache (although both are SSD-related as I understand it). Maybe your HDDs are connected via some RAID controller, and when you disable the cache it doesn't really get disabled - the kernel just stops issuing flush requests and makes some writes unsafe? -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
It seems not - I've just tested it on another small cluster with HDDs only, and there was no change Does it make sense to test disabling this on an HDD-only cluster? -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times
Hi

A weird thing happens in my test cluster made from desktop hardware. The command `for i in /dev/sd?; do hdparm -W 0 $i; done` increases single-thread write iops (reduces latency) 7 times!

It is a 3-node cluster with Ryzen 2700 CPUs, 3x SATA 7200rpm HDDs + 1x SATA desktop SSD for the system and ceph-mon + 1x SATA server SSD for block.db/wal in each host. Hosts are linked by 10gbit ethernet (not the fastest one though, average RTT according to flood-ping is 0.098ms). Ceph and OpenNebula are installed on the same hosts; OSDs are prepared with ceph-volume and bluestore with default options.

The SSDs have capacitors ('power-loss protection'), and their write cache has been turned off since the very beginning (hdparm -W 0 /dev/sdb). They're quite old, but each of them is capable of delivering ~22000 iops in journal mode (fio -sync=1 -direct=1 -iodepth=1 -bs=4k -rw=write).

However, the RBD single-threaded random-write benchmark originally gave awful results: when testing with `fio -ioengine=libaio -size=10G -sync=1 -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=./testfile` from inside a VM, the result was only 58 iops average (17ms latency). This was not what I expected from an HDD+SSD setup.

But today I tried to play with the cache settings for the data disks. And I was really surprised to discover that just disabling the HDD write cache (hdparm -W 0 /dev/sdX for all HDD devices) increases single-threaded performance ~7 times! The result from the same VM (without even rebooting it) is iops=405, avg lat=2.47ms. That's an order of magnitude faster, and in fact 2.5ms seems sort of an expected number.

As I understand it, 4k writes are always deferred at the default setting of prefer_deferred_size_hdd=32768, which means they should only get written to the journal device before the OSD acks the write operation. So my question is WHY? Why does the HDD write cache affect commit latency with the WAL on an SSD?
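The `/dev/sd?` glob above also hits the SSDs; a slightly safer variant is to target only rotational devices. A sketch, assuming lsblk from util-linux - it prints the hdparm commands rather than running them, so the list can be reviewed first:

```shell
# Emit an 'hdparm -W 0' command for every rotational (ROTA=1) block
# device, i.e. disable the volatile write cache on HDDs only.
# Review the output, then pipe it to 'sh' as root to actually apply it.
lsblk -dno NAME,ROTA | awk '$2 == 1 { print "hdparm -W 0 /dev/" $1 }'
```

Note that `hdparm -W` is not persistent across power cycles on many drives, so this needs to be reapplied at boot (e.g. from a udev rule or unit file).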
I would also appreciate it if anybody with a similar setup (HDD+SSD with desktop SATA controllers or an HBA) could test the same thing... -- With best regards, Vitaliy Filippov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com