Re: qemu-img cache modes with Linux cgroup v1
Hey, just FYI about tmpfs: during some development on Fedora 39 I noticed O_DIRECT is now supported on tmpfs (as opposed to our CI, which runs CentOS 9 Stream).

`qemu-img convert -t none -O raw tests/images/cirros-qcow2.img /tmp/cirros.raw`

where /tmp is indeed a tmpfs.

I might be missing something, so feel free to call that out.

On Tue, Aug 1, 2023 at 6:38 PM Stefan Hajnoczi wrote:
> Hi Daniel,
> I agree with your points.
>
> Stefan
Re: qemu-img cache modes with Linux cgroup v1
On Mon, May 06, 2024 at 08:10:25PM +0300, Alex Kalenyuk wrote:
> Hey, just FYI about tmpfs: during some development on Fedora 39 I noticed
> O_DIRECT is now supported on tmpfs (as opposed to our CI, which runs CentOS
> 9 Stream).
> `qemu-img convert -t none -O raw tests/images/cirros-qcow2.img /tmp/cirros.raw`
> where /tmp is indeed a tmpfs.
>
> I might be missing something, so feel free to call that out.

Yes, it was added by:

commit e88e0d366f9cfbb810b0c8509dc5d130d5a53e02
Author: Hugh Dickins
Date:   Thu Aug 10 23:27:07 2023 -0700

    tmpfs: trivial support for direct IO

It's fairly new but great to have.

Stefan
Re: qemu-img cache modes with Linux cgroup v1
Hi Daniel,
I agree with your points.

Stefan
Re: qemu-img cache modes with Linux cgroup v1
On Mon, Jul 31, 2023 at 11:40:36AM -0400, Stefan Hajnoczi wrote:
> Hi,
> qemu-img -t writeback -T writeback is not designed to run with the Linux
> cgroup v1 memory controller because dirtying too much page cache leads
> to process termination instead of usual non-cgroup and cgroup v2
> throttling behavior:
> https://bugzilla.redhat.com/show_bug.cgi?id=2196072

Ewww, a horrible behavioural change v1 is imposing on apps :-(

QEMU happens to hit it because we do lots of I/O, but plenty of other apps do major I/O and can fall into the same trap :-( I can imagine that simply running a big "tar zxvf" would have much the same effect in terms of masses of I/O in a short time.

> I wanted to share my thoughts on this issue.
>
> cache=none bypasses the host page cache and will not hit the cgroup
> memory limit. It's an easy solution to avoid exceeding the cgroup v1
> memory limit.

I go further and say that is a good recommendation even without this bug in cgroups v1. Writeback caching helps if you have lots of free memory, but on virtualization hosts memory is usually the biggest VM density constraint, so apps shouldn't generally expect there to be lots of free host memory to burn as I/O cache.

If you're using qemu-img in preparation for running qemu-system-XXX and the latter will use cache=none anyway, then it is even less desirable for qemu-img to fill the host cache with pages that won't be accessed again when the VM starts in qemu-system-XXX.

> However, not all Linux file systems support O_DIRECT and qemu-img's I/O
> pattern may perform worse under cache=none than cache=writeback.
>
> 1. Which file systems support O_DIRECT in Linux 6.5?
>
> I searched the Linux source code for file systems that implement
> .direct_IO or set FMODE_CAN_ODIRECT. This is not exhaustive and may not
> be 100% accurate.
>
> The big name file systems (ext4, XFS, btrfs, nfs, smb, ceph) support
> O_DIRECT. The most obvious omission is tmpfs.

Rather than trying to figure out a list of FS types, in OpenStack a bit of code was added to simply attempt to open a test file with O_DIRECT on the target filesystem. If that works, then run qemu-img / qemu-system-XXX with cache=none, otherwise use cache=writeback. IOW, a "best effort" to avoid host cache where supported (a sketch of such a probe follows this message).

Could there be justification for QEMU to support a "best effort" host cache bypass mode natively, to avoid every app needing to re-implement this logic to check for support of O_DIRECT? E.g. a QEMU 'cache=trynone' option instead of 'cache=none'?

> 2. Is qemu-img performance with O_DIRECT acceptable?
>
> The I/O pattern matters more with O_DIRECT because every I/O request is
> sent to the storage device. This means buffer sizes matter more (more
> small I/Os have higher overhead than fewer large I/Os). Concurrency can
> also help saturate the storage device.

"qemu-img convert" supports the '--parallel' flag to use many coroutines for I/O.

> If you switch to O_DIRECT and encounter performance problems then
> qemu-img can be optimized to send I/O patterns with less overhead. This
> requires performance analysis.

Since we're in pretty direct control of the I/O pattern qemu-img imposes, it feels very sensible to optimize it such that cache=none achieves ideal performance.

> 3. Using buffered I/O because O_DIRECT is not universally supported?
>
> If you can't use O_DIRECT, then qemu-img could be extended to manage its
> dirty page cache set carefully. This consists of picking a budget and
> writing back to disk when the budget is exhausted.
IOW, re-implementing what the kernel should already be doing for us :-(

This feels like the least desirable thing for QEMU to take on, especially since cgroups v1 is an evolutionary dead-end, with v2 increasingly taking over the world.

With regards,
Daniel
--
|: https://berrange.com       -o-  https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o-          https://fstop138.berrange.com :|
|: https://entangle-photo.org -o-  https://www.instagram.com/dberrange :|
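[Editor's note: the "try O_DIRECT, fall back to writeback" probe Daniel describes above can be illustrated with a small standalone program. The following is a minimal sketch, not the actual OpenStack code; the scratch file name and the EINVAL interpretation are assumptions made for illustration.]

/* Best-effort probe: does the filesystem backing a directory accept
 * O_DIRECT?  Sketch only; the scratch file name and error handling are
 * illustrative assumptions, not code taken from OpenStack or QEMU. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool dir_supports_o_direct(const char *dir)
{
    char path[4096];

    /* Hypothetical scratch file used only for the probe. */
    snprintf(path, sizeof(path), "%s/.odirect-probe-%d", dir, (int)getpid());

    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0600);
    if (fd < 0) {
        /* EINVAL is the usual "O_DIRECT not supported here" answer;
         * other errors (permissions, read-only fs) are reported too. */
        if (errno != EINVAL) {
            perror("open");
        }
        return false;
    }

    close(fd);
    unlink(path);
    return true;
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";

    printf("%s: %s\n", dir,
           dir_supports_o_direct(dir)
               ? "O_DIRECT supported (cache=none should work)"
               : "no O_DIRECT (fall back to cache=writeback)");
    return 0;
}

A more thorough probe would also write one logical-block-aligned buffer to the scratch file, since some filesystems only reject O_DIRECT at I/O time rather than at open().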
Re: qemu-img cache modes with Linux cgroup v1
On Mon, Jul 31, 2023 at 11:40:36AM -0400, Stefan Hajnoczi wrote:
> 3. Using buffered I/O because O_DIRECT is not universally supported?
>
> If you can't use O_DIRECT, then qemu-img could be extended to manage its
> dirty page cache set carefully. This consists of picking a budget and
> writing back to disk when the budget is exhausted. Richard Jones has
> shared links covering posix_fadvise(2) and sync_file_range(2):
> https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
> https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
>
> We can discuss qemu-img code changes and performance analysis more if
> you decide to take that direction.

There's a bit more detail in these two commits:

https://gitlab.com/nbdkit/libnbd/-/commit/64d50d994dd7062d5cce21f26f0e8eba0e88c87e
https://gitlab.com/nbdkit/nbdkit/-/commit/a956e2e75d6c88eeefecd967505667c9f176e3af

In my experience this method is much better than using O_DIRECT; it has far fewer sharp edges.

By the way, this is a super-useful tool for measuring how much of the page cache is being used to cache a file:

https://github.com/Feh/nocache

Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
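[Editor's note: for reference, here is a minimal sketch of the technique described in the posix_fadvise(2)/sync_file_range(2) links and the commits above: write through the page cache, but periodically force writeback and drop the now-clean pages so the dirty set never grows past a fixed budget. The 16 MiB budget and the copy-loop structure are illustrative assumptions, not the constants or code used by libnbd or nbdkit.]

/* Sketch: buffered copy with a bounded dirty-page budget.
 * The budget size and loop structure are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define DIRTY_BUDGET (16 * 1024 * 1024)   /* flush after this much new data */

static void flush_window(int fd, off_t start, off_t len)
{
    /* Start and wait for writeback of just this window... */
    sync_file_range(fd, start, len,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
    /* ...then tell the kernel the now-clean pages can be dropped. */
    posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
}

static int copy_buffered(int in_fd, int out_fd)
{
    char buf[256 * 1024];
    off_t window_start = 0, written = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        /* Handle short writes so the whole chunk always lands on disk. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) {
                return -1;
            }
            off += w;
        }
        written += n;
        if (written - window_start >= DIRTY_BUDGET) {
            flush_window(out_fd, window_start, written - window_start);
            window_start = written;
        }
    }
    if (written > window_start) {
        flush_window(out_fd, window_start, written - window_start);
    }
    return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 1;
    }
    int in_fd = open(argv[1], O_RDONLY);
    int out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in_fd < 0 || out_fd < 0) {
        perror("open");
        return 1;
    }
    if (copy_buffered(in_fd, out_fd) < 0) {
        perror("copy");
        return 1;
    }
    close(in_fd);
    close(out_fd);
    return 0;
}

Because the dirty set on the output file never exceeds the budget (plus whatever readahead holds on the input side), this keeps memory usage bounded even under a cgroup v1 memory limit, while still going through the page cache on filesystems without O_DIRECT support.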
qemu-img cache modes with Linux cgroup v1
Hi,

qemu-img -t writeback -T writeback is not designed to run with the Linux cgroup v1 memory controller because dirtying too much page cache leads to process termination instead of usual non-cgroup and cgroup v2 throttling behavior:
https://bugzilla.redhat.com/show_bug.cgi?id=2196072

I wanted to share my thoughts on this issue.

cache=none bypasses the host page cache and will not hit the cgroup memory limit. It's an easy solution to avoid exceeding the cgroup v1 memory limit.

However, not all Linux file systems support O_DIRECT and qemu-img's I/O pattern may perform worse under cache=none than cache=writeback.

1. Which file systems support O_DIRECT in Linux 6.5?

I searched the Linux source code for file systems that implement .direct_IO or set FMODE_CAN_ODIRECT. This is not exhaustive and may not be 100% accurate.

The big name file systems (ext4, XFS, btrfs, nfs, smb, ceph) support O_DIRECT. The most obvious omission is tmpfs.

If your users are running file systems that support O_DIRECT, then qemu-img -t none -T none is an easy solution to the cgroup v1 memory limit issue.

Supported: 9p affs btrfs ceph erofs exfat ext2 ext4 f2fs fat fuse gfs2 hfs hfsplus jfs minix nfs nilfs2 ntfs3 ocfs2 orangefs overlayfs reiserfs smb udf xfs zonefs

Unsupported: adfs befs bfs cramfs ecryptfs efs freevxfs hpfs hugetlbfs isofs jffs2 ntfs omfs qnx4 qnx6 ramfs romfs squashfs sysv tmpfs ubifs ufs vboxsf

2. Is qemu-img performance with O_DIRECT acceptable?

The I/O pattern matters more with O_DIRECT because every I/O request is sent to the storage device. This means buffer sizes matter more (more small I/Os have higher overhead than fewer large I/Os). Concurrency can also help saturate the storage device.

If you switch to O_DIRECT and encounter performance problems then qemu-img can be optimized to send I/O patterns with less overhead. This requires performance analysis.

3. Using buffered I/O because O_DIRECT is not universally supported?

If you can't use O_DIRECT, then qemu-img could be extended to manage its dirty page cache set carefully. This consists of picking a budget and writing back to disk when the budget is exhausted. Richard Jones has shared links covering posix_fadvise(2) and sync_file_range(2):
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html

We can discuss qemu-img code changes and performance analysis more if you decide to take that direction.

Hope this helps!

Stefan
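[Editor's note: to make point 2 above concrete, under O_DIRECT the application must supply block-aligned buffers and pays per-request overhead, so a few large aligned requests beat many small ones. The following is a minimal sketch; the 2 MiB request size and the 4096-byte alignment are assumptions for illustration, not values taken from qemu-img.]

/* Sketch: one large, aligned write under O_DIRECT.
 * The 2 MiB request size and 4096-byte alignment are illustrative
 * assumptions; real code would tune these per device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REQUEST_SIZE (2 * 1024 * 1024)

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    /* O_DIRECT requires the buffer, file offset and length to be aligned
     * to the logical block size; 4096 bytes is a common safe value. */
    size_t align = 4096;
    void *buf;
    if (posix_memalign(&buf, align, REQUEST_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0xab, REQUEST_SIZE);

    /* One large request instead of many small ones keeps per-I/O
     * overhead down when every request goes straight to the device. */
    if (pwrite(fd, buf, REQUEST_SIZE, 0) != REQUEST_SIZE) {
        perror("pwrite");
        return 1;
    }

    free(buf);
    close(fd);
    return 0;
}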