Hi Adam,

Thanks for the great round-up there - your points are excellent.

What I ended up doing a few days ago (apologies, I have been too busy to
respond) was to set rbd cache = true under each client in the ceph.conf -
this got me from 15 MB/s up to about 70 MB/s. I then set the disk holding
the ZFS dataset to writeback cache in Proxmox (as you note below), and
that has bumped it up to about 130 MB/s - which I am happy with for this
setup.
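For reference, the two changes look roughly like this (a sketch - the
section name, VM ID and disk name are examples, adjust for your setup):

    # /etc/ceph/ceph.conf on the client side - enable RBD caching
    [client]
        rbd cache = true

    # Proxmox host - the CLI equivalent of setting the disk's cache
    # mode to writeback in the GUI (VM 100 / virtio0 are examples)
    qm set 100 --virtio0 rbd:vm-100-disk-1,cache=writeback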
Regards,
Mark

On 27 July 2018 at 14:46, Adam Thompson <athom...@athompso.net> wrote:

> On 2018-07-27 07:05, ronny+pve-u...@aasen.cx wrote:
>
>> rbd striping is a per image setting. you may need to make a new rbd
>> image and migrate the data.
>>
>> On 07/26/18 12:25, Mark Adams wrote:
>>
>>> Thanks for your suggestions. Do you know if it is possible to change
>>> an existing rbd pool to striping? Or does this have to be done on
>>> first setup?
>
> Please be aware that striping will not result in any increased
> performance if you are using "safe" I/O modes, i.e. your VM waits for
> a successful flush-to-disk after every sector. In that scenario, CEPH
> will never give you write performance equal to a local disk, because
> you are limited to the bandwidth of a single remote disk [subsystem]
> *plus* the network round-trip latency, which, even if measured in
> microseconds, still adds up.
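To make Ronny's point concrete: striping is fixed when an image is
created, so the usual route is to create a new image with the stripe
layout you want and copy the data across. A sketch, with example names
and sizes (exact flag spellings vary a little between Ceph releases):

    # new image with explicit striping (the stripe unit must divide
    # the object size)
    rbd create rbd/vm-100-disk-2 --size 102400 \
        --object-size 4M --stripe-unit 64K --stripe-count 8

    # copy the data onto the pre-created image; -n tells qemu-img not
    # to (re)create the target
    qemu-img convert -n -f raw -O raw \
        rbd:rbd/vm-100-disk-1 rbd:rbd/vm-100-disk-2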
> Based on my experience with this and other distributed storage
> systems, I believe you will likely find large write-performance gains
> from the following:
>
> 1. Use the largest possible block size during writes. 512-byte sectors
> are the worst-case scenario for any remote storage. Try to write in
> chunks of *at least* 1 MByte; it is not unreasonable nowadays to write
> in chunks of 64 MB or larger. The rationale is that you spend more
> time sending data and less time waiting for ACKs; the more you can
> tilt that balance towards data, the better off you are. (There are
> downsides to huge sector/block/chunk sizes, though - this isn't a
> "free lunch" scenario. See #5.)
>
> 2. Relax your write-consistency requirements. If you can tolerate the
> small risk that comes with "Write Back", you should see better
> performance, especially during burst writes. During large sequential
> writes there are not many ways to violate the laws of physics, and
> CEPH automatically amplifies your writes by (in your case) a factor of
> 2x due to replication.
>
> 3. Switch to storage devices with the best possible local write speed
> for the OSDs. OSDs are limited by the performance of the underlying
> device or virtual device. (It is, for example, entirely possible to
> run OSDs on a hardware RAID6 controller.)
>
> 4. Avoid CoW-on-CoW. Write amplification means you lose around 50% of
> your IOPS and/or I/O bandwidth for each level of CoW nesting,
> depending on workload, so don't put CEPH OSDs on, say, BTRFS or ZFS
> filesystems. A worst-case scenario would be something like running a
> VM using ZFS on top of CEPH, where the OSDs are located on BTRFS
> filesystems, which are in turn virtual devices hosted on ZFS
> filesystems. Welcome to 1980s storage performance, in that case! (I
> did it once without realizing... seriously, 5 MB/s sequential writes
> was a good day!) FWIW, CoW filesystems are generally awesome - just
> not when stacked. A sufficiently fast external NAS running ZFS with
> VMs stored over NFS can provide decent performance, *if* tuned
> correctly. iX Systems, for example, spends a lot of time and effort
> making this work well, including some lovely HA NAS appliances.
>
> 5. Remember the triangle. You can optimize a distributed storage
> system for any TWO of: a) cost, b) resiliency/reliability/HA, or
> c) speed. (This is a specific case of the traditional good/fast/cheap:
> pick any two.)
>
> I'm not sure I'm saying anything new here; I may have just summarized
> the discussion, but the points remain valid.
>
> Good luck with your performance problems.
> -Adam
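A quick way to see the effect Adam describes in point 1, from inside
the VM (a sketch - run it on the RBD-backed filesystem you care about;
oflag=direct bypasses the page cache so the write size actually reaches
the storage):

    # the same 100 MiB written in 4 KiB chunks, then in 4 MiB chunks -
    # expect a large difference on networked storage
    dd if=/dev/zero of=./ddtest bs=4k count=25600 oflag=direct
    dd if=/dev/zero of=./ddtest bs=4M count=25 oflag=direct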
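And for point 3, the number that matters for an OSD device is its
synchronous direct write speed, not its burst rate. One common way to
check a candidate disk (a sketch - this writes to /dev/sdX, so only
run it against a blank disk):

    fio --name=osd-sync-test --filename=/dev/sdX --direct=1 --fsync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=30 \
        --time_based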