Re: [ceph-users] Optane 900P device class automatically set to SSD not NVME

2018-08-13 Thread vitalif
Hi, Can you benchmark your Optane 900P with `fio -fsync=1 -direct=1 -bs=4k -rw=randwrite -runtime=60`? It's really interesting to see how many iops it will provide for Ceph :) -- With best regards, Vitaliy Filippov
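The command in the post omits a target; a complete invocation might look like the following (the device path is a placeholder and any data on it will be destroyed, queue depth stays at the default of 1):

  # 4k sync random-write test, single job, queue depth 1
  fio -name=test -ioengine=libaio -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/nvme0n1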

Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif
If you simply multiply the number of objects by the RBD object size you get 7611672*4M ~= 29T, which is what you should see in the USED field, and 29/2*3 = 43.5T of raw space. Unfortunately I have no idea why they consume less; probably because not all objects are fully written. It seems some objects
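A quick back-of-the-envelope check of those numbers (assuming 4 MiB RBD objects and the 2+1 profile implied by the 29/2*3 arithmetic):

  echo $(( 7611672 * 4 / 1024 / 1024 ))   # ~29 TiB of logical data
  echo "29 / 2 * 3" | bc -l               # ~43.5 TiB of raw space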

Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif
Hi again. It seems I've found the problem, although I don't understand the root cause. I looked into the OSD datastore using ceph-objectstore-tool and I see that for almost every object there are two copies, like: 2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:28#
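For reference, a listing like that can be produced roughly as follows (OSD id and data path are placeholders; the OSD has to be stopped first):

  systemctl stop ceph-osd@13
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13 --op list | grep rbd_data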

Re: [ceph-users] Luminous or Mimic client on Debian Testing (Buster)

2018-11-13 Thread vitalif
Use the Ubuntu bionic repository; Mimic installs without problems from there. You can also build it yourself: all you need is to install gcc-7 and the other build dependencies, git clone, check out 13.2.2 and run `dpkg-buildpackage -j4`. It takes some time, but overall it builds without issues, except
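A rough sketch of those steps (adjust -j to your core count):

  git clone --recursive https://github.com/ceph/ceph.git
  cd ceph
  git checkout v13.2.2
  git submodule update --init --recursive
  ./install-deps.sh        # installs gcc-7 and the other build dependencies
  dpkg-buildpackage -j4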

Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread vitalif
Hi Dave, The main line in the SSD specs you should look at is "Enhanced Power Loss Data Protection: Yes". This makes the SSD cache non-volatile and lets the SSD ignore fsync()s, so transactional performance becomes equal to non-transactional. So your SSDs should be OK for the journal. rados bench is a bad

Re: [ceph-users] Configuration about using nvme SSD

2019-02-25 Thread vitalif
I create 2-4 RBD images sized 10GB or more with --thick-provision, then run fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rpool -runtime=60 -rbdname=testimg for each of them at the same time. How do you test what total 4Kb random write iops
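A sketch of that setup (pool and image names are placeholders; --thick-provision needs a release that supports it):

  rbd create rpool/testimg1 --size 10G --thick-provision
  fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rpool -runtime=60 -rbdname=testimg1 &
  # repeat for testimg2..testimg4, then wait for all jobs to finish
  wait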

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread vitalif
block.db is very unlikely to ever grow to 250GB with a 6TB data device. However, there seems to be a funny "issue" that makes all block.db sizes except 4, 30, and 286 GB useless, because RocksDB only puts data on the fast storage if it thinks the whole LSM level will fit there. Ceph's
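Where those numbers come from, assuming RocksDB's default level sizing (max_bytes_for_level_base = 256 MB, multiplier 10) and a couple of GB of WAL/slack on top:

  echo $(( 256 + 2560 )) MB                    # L1+L2     -> ~4 GB with WAL
  echo $(( 256 + 2560 + 25600 )) MB            # L1+L2+L3  -> ~30 GB with WAL
  echo $(( 256 + 2560 + 25600 + 256000 )) MB   # L1..L4    -> ~286 GB with WAL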

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread vitalif
The amount of metadata depends on the amount of data. But RocksDB only puts metadata on the fast storage when it thinks all metadata on the same level of the DB is going to fit there. So all sizes except 4, 30, 286 GB are useless.

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif
Decreasing the min_alloc size isn't always a win, but it can be in some cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we increased it to 16384 because at the time our metadata path was slow and increasing it resulted in a pretty significant performance win (along with
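If you want to experiment with it, a ceph.conf sketch (note that min_alloc_size is baked in at OSD creation time, so it only affects OSDs deployed after the change):

  [osd]
  bluestore_min_alloc_size_ssd = 4096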

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif
One way or another we can only have a single thread sending writes to rocksdb.  A lot of the prior optimization work on the write side was to get as much processing out of the kv_sync_thread as possible.  That's still a worthwhile goal as it's typically what bottlenecks with high amounts of

Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif
I bet you'd see better memstore results with my vector based object implementation instead of bufferlists. Where can I find it? Nick Fisk noticed the same thing you did.  One interesting observation he made was that disabling CPU C/P states helped bluestore immensely in the iodepth=1 case.
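One way to disable deep C-states and force the performance governor for such a test (exact flags and availability may differ per distribution; treat this as a sketch):

  cpupower frequency-set -g performance
  cpupower idle-set -D 0    # disable all idle states with non-zero exit latency
  # or via kernel boot parameters: intel_idle.max_cstate=0 processor.max_cstate=1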

Re: [ceph-users] Bluestore HDD Cluster Advice

2019-02-14 Thread vitalif
Yes and no... Bluestore seems to not work really optimally. For example, it has no filestore-like journal watermarking and flushes the deferred write queue every 32 writes (deferred_batch_ops). And when it does that it's basically waiting for the HDD to commit, slowing down all further

Re: [ceph-users] performance in a small cluster

2019-06-04 Thread vitalif
Basically they max out at around 1000 IOPS, report 100% utilization and feel slow. Haven't seen the 5200 yet. The Micron 5100 performs wonderfully! You just have to turn its write cache off: hdparm -W 0 /dev/sdX. 1000 IOPS means you haven't done it. Although even with the write cache enabled I
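hdparm -W 0 does not survive a reboot or hotplug, so something like the following udev rule (hypothetical file name and match pattern; adjust to your devices) is needed to keep the cache disabled:

  hdparm -W 0 /dev/sdX
  # /etc/udev/rules.d/99-ssd-wcache.rules:
  # ACTION=="add", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"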

Re: [ceph-users] Single threaded IOPS on SSD pool.

2019-06-05 Thread vitalif
OK, average network latency from VM to OSDs is ~0.4ms. That's rather bad; you can improve the latency by 0.3ms just by upgrading the network. Single-threaded performance ~500-600 IOPS, or an average latency of 1.6ms. Is that comparable to what others are seeing? Good "reference" numbers are 0.5ms

Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-12 Thread vitalif
Could performance of Optane + 4x SSDs per node ever exceed that of pure Optane disks? No. With Ceph, the results for Optane and for just good server SSDs are almost the same. One thing is that you can run more OSDs per Optane drive than per usual SSD. However, the latency you get from both is

Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread vitalif
where small means 32kb or smaller going to BlueStore, so <= 128kb writes from the client. Also: please don't do 4+1 erasure coding, see older discussions for details. Can you point me to the discussion about the problems of 4+1? It's not easy to google :) -- Vitaliy Filippov

Re: [ceph-users] optane + 4x SSDs for VM disk images?

2019-08-13 Thread vitalif
Could performance of Optane + 4x SSDs per node ever exceed that of pure Optane disks? No. With Ceph, the results for Optane and for just good server SSDs are almost the same. One thing is that you can run more OSDs per Optane drive than per usual SSD. However, the latency you get from both is

Re: [ceph-users] Ceph performance paper

2019-08-20 Thread vitalif
Hi Marc, Hi Vitaliy, just saw you recommend SSDs to someone, and wanted to use the opportunity to thank you for composing this text[0], enjoyed reading it. - What do you mean with: bad-SSD-only? A cluster consisting only of bad SSDs, like desktop ones :) their latency with fsync is

Re: [ceph-users] bluestore write iops calculation

2019-08-07 Thread vitalif
I can add RAM and is there a way to increase RocksDB caching? Can I increase bluestore_cache_size_hdd to a higher value to cache RocksDB? In recent releases it's governed by the osd_memory_target parameter; in previous releases it's bluestore_cache_size_hdd. Check the release notes to know for
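A ceph.conf sketch for the autotuning case (the value is a placeholder; osd_memory_target is specified in bytes and defaults to 4 GiB):

  [osd]
  osd_memory_target = 8589934592        # ~8 GiB per OSD
  # on releases without autotuning, use bluestore_cache_size_hdd instead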

Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread vitalif
1. For 750 object write requests, data is written directly into the data partition, and since we use EC 4+1 there will be 5 iops across the cluster for each object write. This makes 750 * 5 = 3750 iops. Don't forget about the metadata and the deferring of small writes: deferred write queue + metadata,
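The raw shard count is easy to reproduce; the extra iops from deferral and metadata are harder to pin down, so treat the comment below as a rough note rather than an exact formula:

  echo $(( 750 * 5 ))   # 3750 shard writes for 750 client writes with EC 4+1
  # small writes additionally go through the deferred queue (WAL write + later flush) and every op
  # commits metadata to RocksDB, so device-level iops end up noticeably higher than 3750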

Re: [ceph-users] Returning to the performance in a small cluster topic

2019-07-29 Thread vitalif
Your results are okay-ish. The general rule is that it's hard to achieve read latencies below 0.5ms and write latencies below 1ms with Ceph, **no matter what drives or network you use**. 10000 iops with one thread means 0.1 ms per operation; that's just impossible with Ceph currently. I've heard that some people
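For a single-threaded (iodepth=1) workload, iops and per-op latency are simply reciprocals:

  awk 'BEGIN { printf "%.0f iops at 0.1 ms/op, %.0f iops at 1 ms/op\n", 1/0.0001, 1/0.001 }'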

Re: [ceph-users] bluestore write iops calculation

2019-08-05 Thread vitalif
Hi Team, @vita...@yourcmc.ru, thank you for the information; could you please clarify the queries below as well: 1. The average object size we use will be 256KB to 512KB; will there be a deferred write queue? With the default settings, no (bluestore_prefer_deferred_size_hdd = 32KB). Are you
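If you did want larger writes to take the deferred path, the knob would be raised like this (placeholder value; it trades commit latency for double writes, so it is not automatically a win):

  [osd]
  bluestore_prefer_deferred_size_hdd = 65536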

Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-24 Thread vitalif
We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a pool that could do 3 GiB/s with a 4M blocksize. So, yeah, well, that is rather harsh, even for EC. 4kb IO is slow in Ceph even without EC; your 3 GB/s of linear writes don't mean anything. Ceph adds a significant overhead to

Re: [ceph-users] Future of Filestore?

2019-07-24 Thread vitalif
60 millibits per second? 60 bits every 1000 seconds? Are you serious? Or did we get the capitalisation wrong? Assuming 60MB/sec (as 60 Mb/sec would still be slower than the 5MB/sec I was getting), maybe there's some characteristic that Bluestore is particularly dependent on regarding the

Re: [ceph-users] RocksDB device selection (performance requirements)

2019-11-06 Thread vitalif
Hi, sorry to everyone that I'm posting my link again, but https://yourcmc.ru/wiki/Ceph_performance Hello Cephers, The only recommendation I can find about db device selection in the documents is about the capacity (4% of the data disk). Are there any suggestions about technical specs like

Re: [ceph-users] NVMe disk - size

2019-11-15 Thread vitalif
Use 30 GB for all OSDs; other values are pointless, see https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing. You can use the rest of the free NVMe space for bcache - it's much better than just allocating it to block.db.
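A hypothetical layout sketch (device names are placeholders and will be wiped; one 30 GB block.db partition per OSD plus the leftover NVMe space as a bcache cache):

  make-bcache -C /dev/nvme0n1p5 -B /dev/sdb    # creates /dev/bcache0 backed by the HDD, cached on NVMe
  ceph-volume lvm create --data /dev/bcache0 --block.db /dev/nvme0n1p1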

Re: [ceph-users] How does IOPS/latency scale for additional OSDs? (Intel S3610 SATA SSD, for block storage use case)

2019-10-23 Thread vitalif
Hi! Latency doesn't scale with the number of OSDs at all; IOPS scale almost linearly, but are bounded by CPU usage. Also, a single RBD client usually doesn't deliver more than 20-30k read iops and 10-15k write iops. You can run more than 1 OSD per drive if you think you have

Re: [ceph-users] Looking for the best way to utilize 1TB NVMe added to the host with 8x3TB HDD OSDs

2019-09-20 Thread vitalif
1 NVMe really only works as a read-only / writethrough cache (which should of course be possible with bcache). Nobody wants to lose all data after 1 disk failure... Another option is the use of bcache / flashcache. I have experimented with bcache, it is quite easy to set up, but once you

Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread vitalif
Usually it doesn't, it only harms performance and probably SSD lifetime too. I would not be running Ceph on SSDs without power-loss protection. It delivers a potential data loss scenario

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread vitalif
Hi, we're playing around with Ceph but are not quite happy with the IOs: on average 5000 iops / write, on average 13000 iops / read. We're expecting more. :( Any ideas, or is that all we can expect? With server SSDs you can expect up to ~10-15k write / ~25000 read iops per single client.

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread vitalif
Yes, that's it, see the end of the article. You'll have to disable signature checks, too: cephx_require_signatures = false, cephx_cluster_require_signatures = false, cephx_sign_messages = false. Hi Vitaliy, thank you for your time. Do you mean cephx sign messages = false with "diable
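The same settings as a ceph.conf fragment (placing them in [global] is an assumption; they need to apply cluster-wide):

  [global]
  cephx_require_signatures = false
  cephx_cluster_require_signatures = false
  cephx_sign_messages = false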

Re: [ceph-users] where does 100% RBD utilization come from?

2020-01-14 Thread vitalif
Hi Philip, I'm not sure if we're talking about the same thing, but I was also confused when I didn't see 100% OSD drive utilization during my first RBD write benchmark. Since then I've been collecting all my confusion here: https://yourcmc.ru/wiki/Ceph_performance :) 100% RBD utilization means that

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-20 Thread vitalif
Hi Eric, You say you don't have access to raw drives. What does that mean? Do you run Ceph OSDs inside VMs? In that case you should probably disable the Micron caches on the hosts, not just in the VMs. Yes, disabling the write cache only takes effect upon a power cycle... or upon the next hotplug of

Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-27 Thread vitalif
SATA: Micron 5100-5200-5300, Seagate Nytro 1351/1551 (don't forget to disable their cache with hdparm -W 0). NVMe: Intel P4500, Micron 9300. Thanks for all the replies. In summary: consumer-grade SSD is a no-go. What is an alternative to SM863a? Since it is quite hard to get these due non