Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal wrote:
>
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> > workload.
>
> Not sure why. One cause could be that readahead is not perfect at
> detecting the random pattern. Could we compare total I/O on the
> server vs. total I/O by fio?

Ran tests with auto_inval_data disabled and compared with other results.

NAME                   WORKLOAD        Bandwidth        IOPS
vtfs-auto-ex-randrw    randrw-psync    27.8mb/9547kb    7136/2386
vtfs-auto-sh-randrw    randrw-psync    43.3mb/14.4mb    10.8k/3709
vtfs-auto-sh-noinval   randrw-psync    50.5mb/16.9mb    12.6k/4330
vtfs-none-sh-randrw    randrw-psync    54.1mb/18.1mb    13.5k/4649

With auto_inval_data disabled, this time I saw around a 20% performance jump
in READ, and it is now much closer to cache=none performance.

Thanks
Vivek
Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Tue, Sep 29, 2020 at 4:01 PM Vivek Goyal wrote:
>
> On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal wrote:
> >
> > > - virtiofs cache=none mode is faster than cache=auto mode for this
> > > workload.
> >
> > Not sure why. One cause could be that readahead is not perfect at
> > detecting the random pattern. Could we compare total I/O on the
> > server vs. total I/O by fio?
>
> Hi Miklos,
>
> I will instrument virtiofsd code to figure out total I/O.
>
> One more potential issue I am staring at is refreshing the attrs on
> READ if fc->auto_inval_data is set.
>
> fuse_cache_read_iter() {
>         /*
>          * In auto invalidate mode, always update attributes on read.
>          * Otherwise, only update if we attempt to read past EOF (to ensure
>          * i_size is up to date).
>          */
>         if (fc->auto_inval_data ||
>             (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
>                 int err;
>                 err = fuse_update_attributes(inode, iocb->ki_filp);
>                 if (err)
>                         return err;
>         }
> }
>
> Given this is a mixed READ/WRITE workload, every WRITE will invalidate
> attrs. And next READ will first do GETATTR() from server (and potentially
> invalidate page cache) before doing READ.
>
> This sounds suboptimal especially from the point of view of WRITEs
> done by this client itself. I mean if another client has modified
> the file, then doing GETATTR after a second makes sense. But there
> should be some optimization to make sure our own WRITEs don't end
> up doing GETATTR and invalidate page cache (because cache contents
> are still valid).

Yeah, that sucks.

> I disabled ->auto_inval_data and that seemed to result in 8-10%
> gain in performance for this workload.

Need to wrap my head around these caching issues.

Thanks,
Miklos
Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal wrote:
>
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> > workload.
>
> Not sure why. One cause could be that readahead is not perfect at
> detecting the random pattern. Could we compare total I/O on the
> server vs. total I/O by fio?

Hi Miklos,

I will instrument virtiofsd code to figure out total I/O.

One more potential issue I am staring at is refreshing the attrs on
READ if fc->auto_inval_data is set.

fuse_cache_read_iter() {
        /*
         * In auto invalidate mode, always update attributes on read.
         * Otherwise, only update if we attempt to read past EOF (to ensure
         * i_size is up to date).
         */
        if (fc->auto_inval_data ||
            (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
                int err;
                err = fuse_update_attributes(inode, iocb->ki_filp);
                if (err)
                        return err;
        }
}

Given this is a mixed READ/WRITE workload, every WRITE will invalidate
attrs. And next READ will first do GETATTR() from server (and potentially
invalidate page cache) before doing READ.

This sounds suboptimal especially from the point of view of WRITEs
done by this client itself. I mean if another client has modified
the file, then doing GETATTR after a second makes sense. But there
should be some optimization to make sure our own WRITEs don't end
up doing GETATTR and invalidate page cache (because cache contents
are still valid).

I disabled ->auto_inval_data and that seemed to result in 8-10%
gain in performance for this workload.

Thanks
Vivek
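As a rough, non-authoritative way to approximate the server-side total that
Miklos asks about, without patching virtiofsd, one could snapshot the daemon's
/proc I/O accounting around a guest-side fio run (this is only a sketch and
not the instrumentation Vivek refers to; rchar/wchar include host page-cache
hits, while read_bytes/write_bytes count what actually reached storage):

    PID=$(pidof virtiofsd)
    cat /proc/$PID/io > io.before
    # ... run the fio job inside the guest ...
    cat /proc/$PID/io > io.after
    diff io.before io.after    # compare deltas against fio's reported totals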
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Dienstag, 29. September 2020 15:49:42 CEST Vivek Goyal wrote:
> > Depends on what's randomized. If read chunk size is randomized, then yes,
> > you would probably see less performance increase compared to a simple
> > 'cat foo.dat'.
>
> We are using "fio" for testing and read chunk size is not being
> randomized. chunk size (block size) is fixed at 4K size for these tests.

Good to know, thanks!

> > If only the read position is randomized, but the read chunk size honors
> > iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient
> > block size advertised by 9P), then I would assume still seeing a
> > performance increase.
>
> Yes, we are randomizing read position. But there is no notion of looking
> at st_blksize. Its fixed at 4K. (notice option --bs=4k in fio
> commandline).

Ah ok, then the results make sense. With these block sizes you will indeed
suffer a performance issue with 9p, due to several thread hops in Tread
handling, which is due to be fixed.

Best regards,
Christian Schoenebeck
Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal wrote:
> - virtiofs cache=none mode is faster than cache=auto mode for this
> workload.

Not sure why. One cause could be that readahead is not perfect at
detecting the random pattern. Could we compare total I/O on the
server vs. total I/O by fio?

Thanks,
Miklos
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Tue, Sep 29, 2020 at 03:28:06PM +0200, Christian Schoenebeck wrote: > On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote: > > On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote: > > > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote: > > > > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > > > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert > wrote: > > > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt > > > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): > > > > > > > > rw=randrw, > > > > > > > > > > > > > > Bottleneck --^ > > > > > > > > > > > > > > By increasing 'msize' you would encounter better 9P I/O results. > > > > > > > > > > > > OK, I thought that was bigger than the default; what number should > > > > > > I > > > > > > use? > > > > > > > > > > It depends on the underlying storage hardware. In other words: you > > > > > have to > > > > > try increasing the 'msize' value to a point where you no longer notice > > > > > a > > > > > negative performance impact (or almost). Which is fortunately quite > > > > > easy > > > > > to test on> > > > > > > > > > > guest like: > > > > > dd if=/dev/zero of=test.dat bs=1G count=12 > > > > > time cat test.dat > /dev/null > > > > > > > > > > I would start with an absolute minimum msize of 10MB. I would > > > > > recommend > > > > > something around 100MB maybe for a mechanical hard drive. With a PCIe > > > > > flash > > > > > you probably would rather pick several hundred MB or even more. > > > > > > > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol: > > > > > client > > > > > (guest) must suggest the value of msize on connection to server > > > > > (host). > > > > > Server can only lower, but not raise it. And the client in turn > > > > > obviously > > > > > cannot see host's storage device(s), so client is unable to pick a > > > > > good > > > > > value by itself. So it's a suboptimal handshake issue right now. > > > > > > > > It doesn't seem to be making a vast difference here: > > > > > > > > > > > > > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > > > -oversion=9p2000.L,cache=mmap,msize=104857600 > > > > > > > > Run status group 0 (all jobs): > > > >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s > > > >(65.6MB/s-65.6MB/s), > > > > > > > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s), > > > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), > > > > run=49099-49099msec > > > > > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > > > -oversion=9p2000.L,cache=mmap,msize=1048576000 > > > > > > > > Run status group 0 (all jobs): > > > >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s > > > >(68.3MB/s-68.3MB/s), > > > > > > > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s), > > > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), > > > > run=47104-47104msec > > > > > > > > > > > > Dave > > > > > > Is that benchmark tool honoring 'iounit' to automatically run with max. > > > I/O > > > chunk sizes? What's that benchmark tool actually? And do you also see no > > > improvement with a simple > > > > > > time cat largefile.dat > /dev/null > > > > I am assuming that msize only helps with sequential I/O and not random > > I/O. > > > > Dave is running random read and random write mix and probably that's why > > he is not seeing any improvement with msize increase. 
> > > > If we run sequential workload (as "cat largefile.dat"), that should > > see an improvement with msize increase. > > > > Thanks > > Vivek > > Depends on what's randomized. If read chunk size is randomized, then yes, you > would probably see less performance increase compared to a simple > 'cat foo.dat'. We are using "fio" for testing and read chunk size is not being randomized. chunk size (block size) is fixed at 4K size for these tests. > > If only the read position is randomized, but the read chunk size honors > iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block > size advertised by 9P), then I would assume still seeing a performance > increase. Yes, we are randomizing read position. But there is no notion of looking at st_blksize. Its fixed at 4K. (notice option --bs=4k in fio commandline). > Because seeking is a no/low cost factor in this case. The guest OS > seeking does not transmit a 9p message. The offset is rather passed with any > Tread message instead: > https://github.com/chaos/diod/blob/master/protocol.md > > I mean, yes, random seeks reduce I/O performance in general of course, but in > direct performance comparison, the difference in overhead of the 9p vs. > virtiofs network controller layer is most probably the most relevant aspect > if > large I/O chunk sizes are used. > Agreed that large I/O chunk size will help with the
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote: > On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote: > > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote: > > > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote: > > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt > > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): > > > > > > > rw=randrw, > > > > > > > > > > > > Bottleneck --^ > > > > > > > > > > > > By increasing 'msize' you would encounter better 9P I/O results. > > > > > > > > > > OK, I thought that was bigger than the default; what number should > > > > > I > > > > > use? > > > > > > > > It depends on the underlying storage hardware. In other words: you > > > > have to > > > > try increasing the 'msize' value to a point where you no longer notice > > > > a > > > > negative performance impact (or almost). Which is fortunately quite > > > > easy > > > > to test on> > > > > > > > > guest like: > > > > dd if=/dev/zero of=test.dat bs=1G count=12 > > > > time cat test.dat > /dev/null > > > > > > > > I would start with an absolute minimum msize of 10MB. I would > > > > recommend > > > > something around 100MB maybe for a mechanical hard drive. With a PCIe > > > > flash > > > > you probably would rather pick several hundred MB or even more. > > > > > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol: > > > > client > > > > (guest) must suggest the value of msize on connection to server > > > > (host). > > > > Server can only lower, but not raise it. And the client in turn > > > > obviously > > > > cannot see host's storage device(s), so client is unable to pick a > > > > good > > > > value by itself. So it's a suboptimal handshake issue right now. > > > > > > It doesn't seem to be making a vast difference here: > > > > > > > > > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > > -oversion=9p2000.L,cache=mmap,msize=104857600 > > > > > > Run status group 0 (all jobs): > > >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s > > >(65.6MB/s-65.6MB/s), > > > > > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s), > > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), > > > run=49099-49099msec > > > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > > -oversion=9p2000.L,cache=mmap,msize=1048576000 > > > > > > Run status group 0 (all jobs): > > >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s > > >(68.3MB/s-68.3MB/s), > > > > > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s), > > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), > > > run=47104-47104msec > > > > > > > > > Dave > > > > Is that benchmark tool honoring 'iounit' to automatically run with max. > > I/O > > chunk sizes? What's that benchmark tool actually? And do you also see no > > improvement with a simple > > > > time cat largefile.dat > /dev/null > > I am assuming that msize only helps with sequential I/O and not random > I/O. > > Dave is running random read and random write mix and probably that's why > he is not seeing any improvement with msize increase. > > If we run sequential workload (as "cat largefile.dat"), that should > see an improvement with msize increase. > > Thanks > Vivek Depends on what's randomized. If read chunk size is randomized, then yes, you would probably see less performance increase compared to a simple 'cat foo.dat'. 
If only the read position is randomized, but the read chunk size honors
iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block
size advertised by 9P), then I would assume still seeing a performance
increase. Because seeking is a no/low cost factor in this case. The guest OS
seeking does not transmit a 9p message. The offset is rather passed with any
Tread message instead:
https://github.com/chaos/diod/blob/master/protocol.md

I mean, yes, random seeks reduce I/O performance in general of course, but in
direct performance comparison, the difference in overhead of the 9p vs.
virtiofs network controller layer is most probably the most relevant aspect
if large I/O chunk sizes are used.

But OTOH: I haven't optimized anything in Tread handling in 9p (yet).

Best regards,
Christian Schoenebeck
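For reference, the request layout from the linked protocol description shows
why guest-side seeks are free on the wire: the 64-bit offset is simply a field
of every read/write request.

    Tread  : size[4] Tread  tag[2] fid[4] offset[8] count[4]
    Rread  : size[4] Rread  tag[2] count[4] data[count]
    Twrite : size[4] Twrite tag[2] fid[4] offset[8] count[4] data[count]
    Rwrite : size[4] Rwrite tag[2] count[4]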
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Fri, Sep 25, 2020 at 01:41:39PM +0100, Dr. David Alan Gilbert wrote:

[..]

> So I'm sitll beating 9p; the thread-pool-size=1 seems to be great for
> read performance here.

Hi Dave,

I spent some time making changes to virtiofs-tests so that I can test a
mix of random read and random write workload. That test suite runs a
workload 3 times and reports the average, so I like to use it to reduce
the effect of run-to-run variation.

So I ran the following to mimic Carlos's workload.

$ ./run-fio-test.sh test -direct=1 -c fio-jobs/randrw-psync.job > testresults.txt
$ ./parse-fio-results.sh testresults.txt

I am using an SSD at the host to back these files. Option "-c" always
creates new files for testing.

Following are my results in various configurations. I used cache=mmap mode
for 9p and cache=auto (and cache=none) modes for virtiofs. Also tested 9p
default as well as msize=16m. Tested virtiofs both with exclusive as well
as shared thread pool.

NAME                  WORKLOAD        Bandwidth        IOPS
9p-mmap-randrw        randrw-psync    42.8mb/14.3mb    10.7k/3666
9p-mmap-msize16m      randrw-psync    42.8mb/14.3mb    10.7k/3674
vtfs-auto-ex-randrw   randrw-psync    27.8mb/9547kb    7136/2386
vtfs-auto-sh-randrw   randrw-psync    43.3mb/14.4mb    10.8k/3709
vtfs-none-sh-randrw   randrw-psync    54.1mb/18.1mb    13.5k/4649

- Increasing msize to 16m did not help with performance for this workload.
- virtiofs exclusive thread pool ("ex") is slower than 9p.
- virtiofs shared thread pool ("sh") matches the performance of 9p.
- virtiofs cache=none mode is faster than cache=auto mode for this
  workload.

Carlos, I am looking at more ways to optimize it further for virtiofs. In
the meantime, I think switching to the "shared" thread pool should bring
you very close to 9p in your setup.

Thanks
Vivek
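For readers without the virtiofs-tests repository at hand, the randrw-psync
job referenced above is roughly of this shape (a hypothetical sketch, not the
exact job file from the test suite; block size, read/write mix and direct I/O
follow the fio command line quoted elsewhere in this thread):

    [global]
    directory=/mnt/test      ; hypothetical guest-side mount point
    direct=1
    bs=4k
    size=4G

    [randrw-psync]
    ioengine=psync
    rw=randrw
    rwmixread=75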
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote: > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote: > > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote: > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): > > > > > > rw=randrw, > > > > > > > > > > Bottleneck --^ > > > > > > > > > > By increasing 'msize' you would encounter better 9P I/O results. > > > > > > > > OK, I thought that was bigger than the default; what number should I > > > > use? > > > > > > It depends on the underlying storage hardware. In other words: you have to > > > try increasing the 'msize' value to a point where you no longer notice a > > > negative performance impact (or almost). Which is fortunately quite easy > > > to test on> > > > guest like: > > > dd if=/dev/zero of=test.dat bs=1G count=12 > > > time cat test.dat > /dev/null > > > > > > I would start with an absolute minimum msize of 10MB. I would recommend > > > something around 100MB maybe for a mechanical hard drive. With a PCIe > > > flash > > > you probably would rather pick several hundred MB or even more. > > > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol: client > > > (guest) must suggest the value of msize on connection to server (host). > > > Server can only lower, but not raise it. And the client in turn obviously > > > cannot see host's storage device(s), so client is unable to pick a good > > > value by itself. So it's a suboptimal handshake issue right now. > > > > It doesn't seem to be making a vast difference here: > > > > > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > -oversion=9p2000.L,cache=mmap,msize=104857600 > > > > Run status group 0 (all jobs): > >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s), > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s), > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), > > run=49099-49099msec > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > > -oversion=9p2000.L,cache=mmap,msize=1048576000 > > > > Run status group 0 (all jobs): > >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s), > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s), > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), > > run=47104-47104msec > > > > > > Dave > > Is that benchmark tool honoring 'iounit' to automatically run with max. I/O > chunk sizes? What's that benchmark tool actually? And do you also see no > improvement with a simple > > time cat largefile.dat > /dev/null I am assuming that msize only helps with sequential I/O and not random I/O. Dave is running random read and random write mix and probably that's why he is not seeing any improvement with msize increase. If we run sequential workload (as "cat largefile.dat"), that should see an improvement with msize increase. Thanks Vivek
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote: > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote: > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): > > > > > rw=randrw, > > > > > > > > Bottleneck --^ > > > > > > > > By increasing 'msize' you would encounter better 9P I/O results. > > > > > > OK, I thought that was bigger than the default; what number should I > > > use? > > > > It depends on the underlying storage hardware. In other words: you have to > > try increasing the 'msize' value to a point where you no longer notice a > > negative performance impact (or almost). Which is fortunately quite easy > > to test on> > > guest like: > > dd if=/dev/zero of=test.dat bs=1G count=12 > > time cat test.dat > /dev/null > > > > I would start with an absolute minimum msize of 10MB. I would recommend > > something around 100MB maybe for a mechanical hard drive. With a PCIe > > flash > > you probably would rather pick several hundred MB or even more. > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol: client > > (guest) must suggest the value of msize on connection to server (host). > > Server can only lower, but not raise it. And the client in turn obviously > > cannot see host's storage device(s), so client is unable to pick a good > > value by itself. So it's a suboptimal handshake issue right now. > > It doesn't seem to be making a vast difference here: > > > > 9p mount -t 9p -o trans=virtio kernel /mnt > -oversion=9p2000.L,cache=mmap,msize=104857600 > > Run status group 0 (all jobs): >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s), > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s), > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), > run=49099-49099msec > > 9p mount -t 9p -o trans=virtio kernel /mnt > -oversion=9p2000.L,cache=mmap,msize=1048576000 > > Run status group 0 (all jobs): >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s), > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s), > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), > run=47104-47104msec > > > Dave Is that benchmark tool honoring 'iounit' to automatically run with max. I/O chunk sizes? What's that benchmark tool actually? And do you also see no improvement with a simple time cat largefile.dat > /dev/null ? Best regards, Christian Schoenebeck
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
* Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote: > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw, > > > > > > Bottleneck --^ > > > > > > By increasing 'msize' you would encounter better 9P I/O results. > > > > OK, I thought that was bigger than the default; what number should I > > use? > > It depends on the underlying storage hardware. In other words: you have to > try > increasing the 'msize' value to a point where you no longer notice a negative > performance impact (or almost). Which is fortunately quite easy to test on > guest like: > > dd if=/dev/zero of=test.dat bs=1G count=12 > time cat test.dat > /dev/null > > I would start with an absolute minimum msize of 10MB. I would recommend > something around 100MB maybe for a mechanical hard drive. With a PCIe flash > you probably would rather pick several hundred MB or even more. > > That unpleasant 'msize' issue is a limitation of the 9p protocol: client > (guest) must suggest the value of msize on connection to server (host). > Server > can only lower, but not raise it. And the client in turn obviously cannot see > host's storage device(s), so client is unable to pick a good value by itself. > So it's a suboptimal handshake issue right now. It doesn't seem to be making a vast difference here: 9p mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=104857600 Run status group 0 (all jobs): READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s), io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s), 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), run=49099-49099msec 9p mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576000 Run status group 0 (all jobs): READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s), io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s), 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), run=47104-47104msec Dave > Many users don't even know this 'msize' parameter exists and hence run with > the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this > by > logging a performance warning on host side for making users at least aware > about this issue. The long-term plan is to pass a good msize value from host > to guest via virtio (like it's already done for the available export tags) > and > the Linux kernel would default to that instead. > > Best regards, > Christian Schoenebeck > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Freitag, 25. September 2020 18:05:17 CEST Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > >
> > > Bottleneck --^
> > >
> > > By increasing 'msize' you would encounter better 9P I/O results.
> >
> > OK, I thought that was bigger than the default; what number should I
> > use?
>
> It depends on the underlying storage hardware. In other words: you have to
> try increasing the 'msize' value to a point where you no longer notice a
> negative performance impact (or almost). Which is fortunately quite easy to
> test on guest like:
>
> dd if=/dev/zero of=test.dat bs=1G count=12
> time cat test.dat > /dev/null

I forgot: you should execute that 'dd' command on host side, and the 'cat'
command on guest side, to avoid any caching making the benchmark result look
better than it actually is. Because for finding a good 'msize' value you only
care about actual 9p data really being transmitted between host and guest.

Best regards,
Christian Schoenebeck
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> >
> > Bottleneck --^
> >
> > By increasing 'msize' you would encounter better 9P I/O results.
>
> OK, I thought that was bigger than the default; what number should I
> use?

It depends on the underlying storage hardware. In other words: you have to
try increasing the 'msize' value to a point where you no longer notice a
negative performance impact (or almost). Which is fortunately quite easy to
test on guest like:

dd if=/dev/zero of=test.dat bs=1G count=12
time cat test.dat > /dev/null

I would start with an absolute minimum msize of 10MB. I would recommend
something around 100MB maybe for a mechanical hard drive. With a PCIe flash
you probably would rather pick several hundred MB or even more.

That unpleasant 'msize' issue is a limitation of the 9p protocol: client
(guest) must suggest the value of msize on connection to server (host).
Server can only lower, but not raise it. And the client in turn obviously
cannot see host's storage device(s), so client is unable to pick a good
value by itself. So it's a suboptimal handshake issue right now.

Many users don't even know this 'msize' parameter exists and hence run with
the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this
by logging a performance warning on host side for making users at least
aware about this issue. The long-term plan is to pass a good msize value
from host to guest via virtio (like it's already done for the available
export tags) and the Linux kernel would default to that instead.

Best regards,
Christian Schoenebeck
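A rough way to script the sweep Christian describes (the mount tag 'kernel'
and options follow the commands used elsewhere in this thread; the export
path and the msize values are placeholders):

    # on the host, inside the exported directory:
    dd if=/dev/zero of=/srv/9p-export/test.dat bs=1G count=12

    # in the guest, remount with growing msize and time a cold sequential read:
    for msize in 1048576 10485760 104857600; do
        mount -t 9p -o trans=virtio,version=9p2000.L,cache=mmap,msize=$msize kernel /mnt
        echo 3 > /proc/sys/vm/drop_caches   # drop guest page cache between runs
        time cat /mnt/test.dat > /dev/null
        umount /mnt
    done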
Re: tools/virtiofs: Multi threading seems to hurt performance
On Fri, Sep 25, 2020 at 01:11:27PM +0100, Dr. David Alan Gilbert wrote: > * Vivek Goyal (vgo...@redhat.com) wrote: > > On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote: > > > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote: > > > > Hi, > > > > I've been doing some of my own perf tests and I think I agree > > > > about the thread pool size; my test is a kernel build > > > > and I've tried a bunch of different options. > > > > > > > > My config: > > > > Host: 16 core AMD EPYC (32 thread), 128G RAM, > > > > 5.9.0-rc4 kernel, rhel 8.2ish userspace. > > > > 5.1.0 qemu/virtiofsd built from git. > > > > Guest: Fedora 32 from cloud image with just enough extra installed for > > > > a kernel build. > > > > > > > > git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host > > > > fresh before each test. Then log into the guest, make defconfig, > > > > time make -j 16 bzImage, make clean; time make -j 16 bzImage > > > > The numbers below are the 'real' time in the guest from the initial make > > > > (the subsequent makes dont vary much) > > > > > > > > Below are the detauls of what each of these means, but here are the > > > > numbers first > > > > > > > > virtiofsdefault4m0.978s > > > > 9pdefault 9m41.660s > > > > virtiofscache=none10m29.700s > > > > 9pmmappass 9m30.047s > > > > 9pmbigmsize 12m4.208s > > > > 9pmsecnone 9m21.363s > > > > virtiofscache=noneT1 7m17.494s > > > > virtiofsdefaultT1 3m43.326s > > > > > > > > So the winner there by far is the 'virtiofsdefaultT1' - that's > > > > the default virtiofs settings, but with --thread-pool-size=1 - so > > > > yes it gives a small benefit. > > > > But interestingly the cache=none virtiofs performance is pretty bad, > > > > but thread-pool-size=1 on that makes a BIG improvement. > > > > > > Here are fio runs that Vivek asked me to run in my same environment > > > (there are some 0's in some of the mmap cases, and I've not investigated > > > why yet). > > > > cache=none does not allow mmap in case of virtiofs. That's when you > > are seeing 0. > > > > >virtiofs is looking good here in I think all of the cases; > > > there's some division over which cinfig; cache=none > > > seems faster in some cases which surprises me. > > > > I know cache=none is faster in case of write workloads. It forces > > direct write where we don't call file_remove_privs(). While cache=auto > > goes through file_remove_privs() and that adds a GETXATTR request to > > every WRITE request. > > Can you point me to how cache=auto causes the file_remove_privs? fs/fuse/file.c fuse_cache_write_iter() { err = file_remove_privs(file); } Above path is taken when cache=auto/cache=always is used. If virtiofsd is running with noxattr, then it does not impose any cost. But if xattr are enabled, then every WRITE first results in a getxattr(security.capability) and that slows down WRITES tremendously. When cache=none is used, we go through following path instead. fuse_direct_write_iter() and it does not have file_remove_privs(). We set a flag in WRITE request to tell server to kill suid/sgid/security.capability, instead. fuse_direct_io() { ia->write.in.write_flags |= FUSE_WRITE_KILL_PRIV } Vivek
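One quick, hedged way to observe this effect from the host side is to count
xattr lookups made by the daemon while fio writes from the guest; with
cache=auto and xattr enabled you would expect roughly one security.capability
lookup per WRITE, and none with cache=none (this is only a suggested check,
not part of the discussion above):

    strace -f -c -e trace=getxattr,lgetxattr,fgetxattr -p $(pidof virtiofsd)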
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
* Christian Schoenebeck (qemu_...@crudebyte.com) wrote: > On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote: > > > Hi Carlos, > > > > > > So you are running following test. > > > > > > fio --direct=1 --gtod_reduce=1 --name=test > > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G > > > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt > > > > > > And following are your results. > > > > > > 9p > > > -- > > > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), > > > io=3070MiB (3219MB), run=14532-14532msec > > > > > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), > > > io=1026MiB (1076MB), run=14532-14532msec > > > > > > virtiofs > > > > > > > > > Run status group 0 (all jobs): > > >READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), > > >io=3070MiB (3219MB), run=19321-19321msec> > > > WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), > > > io=1026MiB (1076MB), run=19321-19321msec> > > > So looks like you are getting better performance with 9p in this case. > > > > That's interesting, because I've just tried similar again with my > > ramdisk setup: > > > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio > > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 > > --output=aname.txt > > > > > > virtiofs default options > > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > > Starting 1 process > > test: Laying out IO file (1 file / 4096MiB) > > > > test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020 > > read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec) > >bw ( KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71, > > stdev=1603.47, samples=85 iops: min=17688, max=19320, avg=18268.92, > > stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s > > (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw ( KiB/s): min=23128, > > max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops > > : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu > > : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths: > > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: > > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : > > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: > > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, > > window=0, percentile=100.00%, depth=64 > > > > Run status group 0 (all jobs): > >READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s), > > io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s), > > 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB), > > run=43042-43042msec > > > > virtiofs cache=none > > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > > Starting 1 process > > > > test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020 > > read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec) > >bw ( KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06, > > stdev=967.87, samples=68 iops: min=22262, max=23560, avg=22967.76, > > stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s > > (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw ( KiB/s): min=29264, > > max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops > > : min= 7316, max= 8062, avg=7675.21, 
stdev=135.49, samples=68 cpu > > : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths: > > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: > > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : > > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: > > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, > > window=0, percentile=100.00%, depth=64 > > > > Run status group 0 (all jobs): > >READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s), > > io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s), > > 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB), > > run=34256-34256msec > > > > virtiofs cache=none thread-pool-size=1 > > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > > Starting 1 process > > > > test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020 > > read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec) > >bw ( KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30, > > stdev=4507.43, samples=66 iops: min=22452, max=27988, avg=23690.58, > > stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s > >
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote: > > Hi Carlos, > > > > So you are running following test. > > > > fio --direct=1 --gtod_reduce=1 --name=test > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G > > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt > > > > And following are your results. > > > > 9p > > -- > > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), > > io=3070MiB (3219MB), run=14532-14532msec > > > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), > > io=1026MiB (1076MB), run=14532-14532msec > > > > virtiofs > > > > > > Run status group 0 (all jobs): > >READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), > >io=3070MiB (3219MB), run=19321-19321msec> > > WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), > > io=1026MiB (1076MB), run=19321-19321msec> > > So looks like you are getting better performance with 9p in this case. > > That's interesting, because I've just tried similar again with my > ramdisk setup: > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 > --output=aname.txt > > > virtiofs default options > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > Starting 1 process > test: Laying out IO file (1 file / 4096MiB) > > test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020 > read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec) >bw ( KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71, > stdev=1603.47, samples=85 iops: min=17688, max=19320, avg=18268.92, > stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s > (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw ( KiB/s): min=23128, > max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops > : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu > : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths: > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, > window=0, percentile=100.00%, depth=64 > > Run status group 0 (all jobs): >READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s), > io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s), > 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB), > run=43042-43042msec > > virtiofs cache=none > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > Starting 1 process > > test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020 > read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec) >bw ( KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06, > stdev=967.87, samples=68 iops: min=22262, max=23560, avg=22967.76, > stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s > (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw ( KiB/s): min=29264, > max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops > : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu > : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths: > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0% complete : > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, > window=0, percentile=100.00%, depth=64 > > Run status group 0 (all jobs): >READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s), > io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s), > 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB), > run=34256-34256msec > > virtiofs cache=none thread-pool-size=1 > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 > Starting 1 process > > test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020 > read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec) >bw ( KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30, > stdev=4507.43, samples=66 iops: min=22452, max=27988, avg=23690.58, > stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s > (32.4MB/s)(1026MiB/33215msec); 0 zone resets bw ( KiB/s): min=29424, > max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66 iops > : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66 cpu > : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7 IO
Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
* Vivek Goyal (vgo...@redhat.com) wrote: > On Thu, Sep 24, 2020 at 09:33:01PM +, Venegas Munoz, Jose Carlos wrote: > > Hi Folks, > > > > Sorry for the delay about how to reproduce `fio` data. > > > > I have some code to automate testing for multiple kata configs and collect > > info like: > > - Kata-env, kata configuration.toml, qemu command, virtiofsd command. > > > > See: > > https://github.com/jcvenegas/mrunner/ > > > > > > Last time we agreed to narrow the cases and configs to compare virtiofs and > > 9pfs > > > > The configs where the following: > > > > - qemu + virtiofs(cache=auto, dax=0) a.ka. `kata-qemu-virtiofs` WITOUT xattr > > - qemu + 9pfs a.k.a `kata-qemu` > > > > Please take a look to the html and raw results I attach in this mail. > > Hi Carlos, > > So you are running following test. > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 > --output=/output/fio.txt > > And following are your results. > > 9p > -- > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB > (3219MB), run=14532-14532msec > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), > io=1026MiB (1076MB), run=14532-14532msec > > virtiofs > > Run status group 0 (all jobs): >READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), > io=3070MiB (3219MB), run=19321-19321msec > WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), > io=1026MiB (1076MB), run=19321-19321msec > > So looks like you are getting better performance with 9p in this case. That's interesting, because I've just tried similar again with my ramdisk setup: fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=aname.txt virtiofs default options test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 Starting 1 process test: Laying out IO file (1 file / 4096MiB) test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020 read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec) bw ( KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71, stdev=1603.47, samples=85 iops: min=17688, max=19320, avg=18268.92, stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw ( KiB/s): min=23128, max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops: min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s), io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s), 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB), run=43042-43042msec virtiofs cache=none test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 Starting 1 process test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020 read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec) bw ( KiB/s): min=89048, max=94240, 
per=100.00%, avg=91871.06, stdev=967.87, samples=68 iops: min=22262, max=23560, avg=22967.76, stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw ( KiB/s): min=29264, max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops: min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s), io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s), 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB), run=34256-34256msec virtiofs cache=none thread-pool-size=1 test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21 Starting 1 process test: (groupid=0, jobs=1): err= 0:
Re: tools/virtiofs: Multi threading seems to hurt performance
* Vivek Goyal (vgo...@redhat.com) wrote: > On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote: > > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote: > > > Hi, > > > I've been doing some of my own perf tests and I think I agree > > > about the thread pool size; my test is a kernel build > > > and I've tried a bunch of different options. > > > > > > My config: > > > Host: 16 core AMD EPYC (32 thread), 128G RAM, > > > 5.9.0-rc4 kernel, rhel 8.2ish userspace. > > > 5.1.0 qemu/virtiofsd built from git. > > > Guest: Fedora 32 from cloud image with just enough extra installed for > > > a kernel build. > > > > > > git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host > > > fresh before each test. Then log into the guest, make defconfig, > > > time make -j 16 bzImage, make clean; time make -j 16 bzImage > > > The numbers below are the 'real' time in the guest from the initial make > > > (the subsequent makes dont vary much) > > > > > > Below are the detauls of what each of these means, but here are the > > > numbers first > > > > > > virtiofsdefault4m0.978s > > > 9pdefault 9m41.660s > > > virtiofscache=none10m29.700s > > > 9pmmappass 9m30.047s > > > 9pmbigmsize 12m4.208s > > > 9pmsecnone 9m21.363s > > > virtiofscache=noneT1 7m17.494s > > > virtiofsdefaultT1 3m43.326s > > > > > > So the winner there by far is the 'virtiofsdefaultT1' - that's > > > the default virtiofs settings, but with --thread-pool-size=1 - so > > > yes it gives a small benefit. > > > But interestingly the cache=none virtiofs performance is pretty bad, > > > but thread-pool-size=1 on that makes a BIG improvement. > > > > Here are fio runs that Vivek asked me to run in my same environment > > (there are some 0's in some of the mmap cases, and I've not investigated > > why yet). > > cache=none does not allow mmap in case of virtiofs. That's when you > are seeing 0. > > >virtiofs is looking good here in I think all of the cases; > > there's some division over which cinfig; cache=none > > seems faster in some cases which surprises me. > > I know cache=none is faster in case of write workloads. It forces > direct write where we don't call file_remove_privs(). While cache=auto > goes through file_remove_privs() and that adds a GETXATTR request to > every WRITE request. Can you point me to how cache=auto causes the file_remove_privs? Dave > Vivek -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: tools/virtiofs: Multi threading seems to hurt performance
Hi Folks, Sorry for the delay about how to reproduce `fio` data. I have some code to automate testing for multiple kata configs and collect info like: - Kata-env, kata configuration.toml, qemu command, virtiofsd command. See: https://github.com/jcvenegas/mrunner/ Last time we agreed to narrow the cases and configs to compare virtiofs and 9pfs The configs where the following: - qemu + virtiofs(cache=auto, dax=0) a.ka. `kata-qemu-virtiofs` WITOUT xattr - qemu + 9pfs a.k.a `kata-qemu` Please take a look to the html and raw results I attach in this mail. ## Can I say that the current status is: - As David tests and Vivek points, for the fio workload you are using, seems that the best candidate should be cache=none - In the comparison I took cache=auto as Vivek suggested, this make sense as it seems that will be the default for kata. - Even if for this case cache=none works better, Can I assume that cache=auto dax=0 will be better than any 9pfs config? (once we find the root cause) - Vivek is taking a look to mmap mode from 9pfs, to see how different is with current virtiofs implementations. In 9pfs for kata, this is what we use by default. ## I'd like to identify what should be next on the debug/testing? - Should I try to narrow by only trying to with qemu? - Should I try first with a new patch you already have? - Probably try with qemu without static build? - Do the same test with thread-pool-size=1? Please let me know how can I help. Cheers. On 22/09/20 12:47, "Vivek Goyal" wrote: On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote: > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote: > > Hi, > > I've been doing some of my own perf tests and I think I agree > > about the thread pool size; my test is a kernel build > > and I've tried a bunch of different options. > > > > My config: > > Host: 16 core AMD EPYC (32 thread), 128G RAM, > > 5.9.0-rc4 kernel, rhel 8.2ish userspace. > > 5.1.0 qemu/virtiofsd built from git. > > Guest: Fedora 32 from cloud image with just enough extra installed for > > a kernel build. > > > > git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host > > fresh before each test. Then log into the guest, make defconfig, > > time make -j 16 bzImage, make clean; time make -j 16 bzImage > > The numbers below are the 'real' time in the guest from the initial make > > (the subsequent makes dont vary much) > > > > Below are the detauls of what each of these means, but here are the > > numbers first > > > > virtiofsdefault4m0.978s > > 9pdefault 9m41.660s > > virtiofscache=none10m29.700s > > 9pmmappass 9m30.047s > > 9pmbigmsize 12m4.208s > > 9pmsecnone 9m21.363s > > virtiofscache=noneT1 7m17.494s > > virtiofsdefaultT1 3m43.326s > > > > So the winner there by far is the 'virtiofsdefaultT1' - that's > > the default virtiofs settings, but with --thread-pool-size=1 - so > > yes it gives a small benefit. > > But interestingly the cache=none virtiofs performance is pretty bad, > > but thread-pool-size=1 on that makes a BIG improvement. > > Here are fio runs that Vivek asked me to run in my same environment > (there are some 0's in some of the mmap cases, and I've not investigated > why yet). cache=none does not allow mmap in case of virtiofs. That's when you are seeing 0. >virtiofs is looking good here in I think all of the cases; > there's some division over which cinfig; cache=none > seems faster in some cases which surprises me. I know cache=none is faster in case of write workloads. It forces direct write where we don't call file_remove_privs(). 
While cache=auto goes through file_remove_privs() and that adds a GETXATTR
request to every WRITE request.

Vivek

Attachment: results.tar.gz

virtiofs vs 9pfs: fio comparison

- qemu + virtiofs (cache=auto, dax=0), a.k.a. kata-qemu-virtiofs
- qemu + 9pfs, a.k.a. kata-qemu

Platform: Packet c1.small.x86-01
  PROC:  1 x Intel E3-1240 v3
  RAM:   32GB
  DISK:  2 x 120GB SSD
  NIC:   2 x 1Gbps Bonded Port
  Nproc: 8

Env Name         kata-qemu-virtiofs                       kata-qemu
Kata version     1.12.0-alpha1                            1.12.0-alpha1
Qemu version     version 5.0.0 (kata-static)              5.0.0 (kata-static)
Qemu code repo   https://gitlab.com/virtio-fs/qemu.git    https://github.com/qemu/qemu
Qemu tag         qemu5.0-virtiofs-with51bits-dax          v5.0.0
Kernel code      https://gitlab.com/virtio-fs/linux.git   https://cdn.kernel.org/pub/linux/kernel/v4.x/
Kernel tag       kata-v5.6-april-09-2020                  v5.4.60
OS               18.04.2 LTS (Bionic Beaver)
Host kernel      4.15.0-50-generic #54-Ubuntu

fio workload:
fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=/output/fio.txt

Results:
kata-qemu (9pfs): READ: bw=211MiB/s
virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
On Thu, Sep 24, 2020 at 09:33:01PM +, Venegas Munoz, Jose Carlos wrote: > Hi Folks, > > Sorry for the delay about how to reproduce `fio` data. > > I have some code to automate testing for multiple kata configs and collect > info like: > - Kata-env, kata configuration.toml, qemu command, virtiofsd command. > > See: > https://github.com/jcvenegas/mrunner/ > > > Last time we agreed to narrow the cases and configs to compare virtiofs and > 9pfs > > The configs where the following: > > - qemu + virtiofs(cache=auto, dax=0) a.ka. `kata-qemu-virtiofs` WITOUT xattr > - qemu + 9pfs a.k.a `kata-qemu` > > Please take a look to the html and raw results I attach in this mail. Hi Carlos, So you are running following test. fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=/output/fio.txt And following are your results. 9p -- READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB (3219MB), run=14532-14532msec WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1026MiB (1076MB), run=14532-14532msec virtiofs Run status group 0 (all jobs): READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=3070MiB (3219MB), run=19321-19321msec WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), io=1026MiB (1076MB), run=19321-19321msec So looks like you are getting better performance with 9p in this case. Can you apply "shared pool" patch to qemu for virtiofsd and re-run this test and see if you see any better results. In my testing, with cache=none, virtiofs performed better than 9p in all the fio jobs I was running. For the case of cache=auto for virtiofs (with xattr enabled), 9p performed better in certain write workloads. I have identified root cause of that problem and working on HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs with cache=auto and xattr enabled. I will post my 9p and virtiofs comparison numbers next week. In the mean time will be great if you could apply following qemu patch, rebuild qemu and re-run above test. https://www.redhat.com/archives/virtio-fs/2020-September/msg00081.html Also what's the status of file cache on host in both the cases. Are you booting host fresh for these tests so that cache is cold on host or cache is warm? Thanks Vivek
Re: tools/virtiofs: Multi threading seems to hurt performance
On Tue, Sep 22, 2020 at 12:09:46PM +0100, Dr. David Alan Gilbert wrote:
>
> Do you have the numbers for:
>    epool
>    epool thread-pool-size=1
>    spool

Hi David,

Ok, I re-ran my numbers again after upgrading to latest qemu and also
upgraded host kernel to latest upstream.

Apart from comparing epool, spool and 1Thread, I also ran their numa
variants. That is, I launched qemu and virtiofsd on node 0 of the machine
(numactl --cpunodebind=0). Results are kind of mixed. Here are my takeaways.

- Running on the same numa node improves performance overall for exclusive,
  shared and exclusive-1T mode.

- In general both shared pool and exclusive-1T mode seem to perform better
  than exclusive mode, except for the case of randwrite-libaio. In some
  cases (seqread-libaio, seqwrite-libaio, seqwrite-libaio-multi) exclusive
  pool performs better than exclusive-1T.

- Looks like in some cases exclusive-1T performs better than shared pool
  (randwrite-libaio, randwrite-psync-multi, seqwrite-psync-multi,
  seqwrite-psync, seqread-libaio-multi, seqread-psync-multi).

Overall, I feel that both exclusive-1T and shared perform better than the
exclusive pool. Results between exclusive-1T and shared pool are mixed; it
seems like in many cases exclusive-1T performs better. I would say that
moving to the "shared" pool seems like a reasonable option.

Thanks
Vivek

NAME                      WORKLOAD                Bandwidth       IOPS
vtfs-none-epool           seqread-psync           38(MiB/s)       9967
vtfs-none-epool-1T        seqread-psync           66(MiB/s)       16k
vtfs-none-spool           seqread-psync           67(MiB/s)       16k
vtfs-none-epool-numa      seqread-psync           48(MiB/s)       12k
vtfs-none-epool-1T-numa   seqread-psync           74(MiB/s)       18k
vtfs-none-spool-numa      seqread-psync           74(MiB/s)       18k

vtfs-none-epool           seqread-psync-multi     204(MiB/s)      51k
vtfs-none-epool-1T        seqread-psync-multi     325(MiB/s)      81k
vtfs-none-spool           seqread-psync-multi     271(MiB/s)      67k
vtfs-none-epool-numa      seqread-psync-multi     253(MiB/s)      63k
vtfs-none-epool-1T-numa   seqread-psync-multi     349(MiB/s)      87k
vtfs-none-spool-numa      seqread-psync-multi     301(MiB/s)      75k

vtfs-none-epool           seqread-libaio          301(MiB/s)      75k
vtfs-none-epool-1T        seqread-libaio          273(MiB/s)      68k
vtfs-none-spool           seqread-libaio          334(MiB/s)      83k
vtfs-none-epool-numa      seqread-libaio          315(MiB/s)      78k
vtfs-none-epool-1T-numa   seqread-libaio          326(MiB/s)      81k
vtfs-none-spool-numa      seqread-libaio          335(MiB/s)      83k

vtfs-none-epool           seqread-libaio-multi    202(MiB/s)      50k
vtfs-none-epool-1T        seqread-libaio-multi    308(MiB/s)      77k
vtfs-none-spool           seqread-libaio-multi    247(MiB/s)      61k
vtfs-none-epool-numa      seqread-libaio-multi    238(MiB/s)      59k
vtfs-none-epool-1T-numa   seqread-libaio-multi    307(MiB/s)      76k
vtfs-none-spool-numa      seqread-libaio-multi    269(MiB/s)      67k

vtfs-none-epool           randread-psync          41(MiB/s)       10k
vtfs-none-epool-1T        randread-psync          67(MiB/s)       16k
vtfs-none-spool           randread-psync          64(MiB/s)       16k
vtfs-none-epool-numa      randread-psync          48(MiB/s)       12k
vtfs-none-epool-1T-numa   randread-psync          73(MiB/s)       18k
vtfs-none-spool-numa      randread-psync          72(MiB/s)       18k

vtfs-none-epool           randread-psync-multi    207(MiB/s)      51k
vtfs-none-epool-1T        randread-psync-multi    313(MiB/s)      78k
vtfs-none-spool           randread-psync-multi    265(MiB/s)      66k
vtfs-none-epool-numa      randread-psync-multi    253(MiB/s)      63k
vtfs-none-epool-1T-numa   randread-psync-multi    340(MiB/s)      85k
vtfs-none-spool-numa      randread-psync-multi    305(MiB/s)      76k

vtfs-none-epool           randread-libaio         305(MiB/s)      76k
vtfs-none-epool-1T        randread-libaio         308(MiB/s)      77k
vtfs-none-spool           randread-libaio         329(MiB/s)      82k
vtfs-none-epool-numa      randread-libaio         310(MiB/s)      77k
vtfs-none-epool-1T-numa   randread-libaio         328(MiB/s)      82k
vtfs-none-spool-numa      randread-libaio         339(MiB/s)      84k

vtfs-none-epool           randread-libaio-multi   265(MiB/s)      66k
vtfs-none-epool-1T        randread-libaio-multi   267(MiB/s)      66k
Re: tools/virtiofs: Multi threading seems to hurt performance
On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > Hi,
> >   I've been doing some of my own perf tests and I think I agree
> > about the thread pool size; my test is a kernel build
> > and I've tried a bunch of different options.
> >
> > My config:
> >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> >         5.9.0-rc4 kernel, rhel 8.2ish userspace.
> >         5.1.0 qemu/virtiofsd built from git.
> >   Guest: Fedora 32 from cloud image with just enough extra installed
> >          for a kernel build.
> >
> > git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> > fresh before each test. Then log into the guest, make defconfig,
> > time make -j 16 bzImage, make clean; time make -j 16 bzImage
> > The numbers below are the 'real' time in the guest from the initial make
> > (the subsequent makes don't vary much).
> >
> > Below are the details of what each of these means, but here are the
> > numbers first:
> >
> > virtiofsdefault         4m0.978s
> > 9pdefault               9m41.660s
> > virtiofscache=none      10m29.700s
> > 9pmmappass              9m30.047s
> > 9pmbigmsize             12m4.208s
> > 9pmsecnone              9m21.363s
> > virtiofscache=noneT1    7m17.494s
> > virtiofsdefaultT1       3m43.326s
> >
> > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > the default virtiofs settings, but with --thread-pool-size=1 - so
> > yes it gives a small benefit.
> > But interestingly the cache=none virtiofs performance is pretty bad,
> > but thread-pool-size=1 on that makes a BIG improvement.
>
> Here are fio runs that Vivek asked me to run in my same environment
> (there are some 0's in some of the mmap cases, and I've not investigated
> why yet).

cache=none does not allow mmap in case of virtiofs. That's why you are
seeing 0.

> virtiofs is looking good here in I think all of the cases;
> there's some division over which config; cache=none
> seems faster in some cases which surprises me.

I know cache=none is faster in the case of write workloads. It forces
direct writes, where we don't call file_remove_privs(), while cache=auto
goes through file_remove_privs(), and that adds a GETXATTR request to
every WRITE request.

Vivek
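To make the write-path difference Vivek describes above concrete, here is a
small self-contained sketch. It is deliberately stub code, not the real
fs/fuse code; the *_stub names are invented for illustration. The only claim
carried over from the discussion is that the cached write path calls
file_remove_privs(), which for FUSE can turn into a GETXATTR round trip to
the server, while the direct-write path used by cache=none skips that step.

/* Simplified stub sketch; NOT the actual kernel FUSE write path. */
#include <stdio.h>
#include <sys/types.h>

/* In the cached (cache=auto-like) path the client drops privileges
 * before writing; checking/clearing security.capability can become a
 * GETXATTR request to the virtiofsd server. */
static int file_remove_privs_stub(void)
{
    printf("GETXATTR(security.capability) round trip to server\n");
    return 0;
}

static ssize_t cached_write_stub(size_t len)
{
    int err = file_remove_privs_stub();   /* extra request per WRITE */
    if (err)
        return err;
    printf("WRITE of %zu bytes via the page cache\n", len);
    return (ssize_t)len;
}

static ssize_t direct_write_stub(size_t len)
{
    /* cache=none / direct I/O: no client-side privilege dropping,
     * so no GETXATTR is issued before the WRITE. */
    printf("WRITE of %zu bytes sent directly\n", len);
    return (ssize_t)len;
}

int main(void)
{
    cached_write_stub(4096);   /* cache=auto-like write */
    direct_write_stub(4096);   /* cache=none-like write */
    return 0;
}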
Re: tools/virtiofs: Multi threading seems to hurt performance
* Vivek Goyal (vgo...@redhat.com) wrote: > On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote: > > Hi All, > > > > virtiofsd default thread pool size is 64. To me it feels that in most of > > the cases thread pool size 1 performs better than thread pool size 64. > > > > I ran virtiofs-tests. > > > > https://github.com/rhvgoyal/virtiofs-tests > > I spent more time debugging this. First thing I noticed is that we > are using "exclusive" glib thread pool. > > https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new > > This seems to run pre-determined number of threads dedicated to that > thread pool. Little instrumentation of code revealed that every new > request gets assiged to new thread (despite the fact that previous > thread finished its job). So internally there might be some kind of > round robin policy to choose next thread for running the job. > > I decided to switch to "shared" pool instead where it seemed to spin > up new threads only if there is enough work. Also threads can be shared > between pools. > > And looks like testing results are way better with "shared" pools. So > may be we should switch to shared pool by default. (Till somebody shows > in what cases exclusive pools are better). > > Second thought which came to mind was what's the impact of NUMA. What > if qemu and virtiofsd process/threads are running on separate NUMA > node. That should increase memory access latency and increased overhead. > So I used "numactl --cpubind=0" to bind both qemu and virtiofsd to node > 0. My machine seems to have two numa nodes. (Each node is having 32 > logical processors). Keeping both qemu and virtiofsd on same node > improves throughput further. > > So here are the results. > > vtfs-none-epool --> cache=none, exclusive thread pool. > vtfs-none-spool --> cache=none, shared thread pool. > vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node Do you have the numbers for: epool epool thread-pool-size=1 spool ? 
Dave > > NAMEWORKLOADBandwidth IOPS > > vtfs-none-epool seqread-psync 36(MiB/s) 9392 > > vtfs-none-spool seqread-psync 68(MiB/s) 17k > > vtfs-none-spool-numaseqread-psync 73(MiB/s) 18k > > > vtfs-none-epool seqread-psync-multi 210(MiB/s) 52k > > vtfs-none-spool seqread-psync-multi 260(MiB/s) 65k > > vtfs-none-spool-numaseqread-psync-multi 309(MiB/s) 77k > > > vtfs-none-epool seqread-libaio 286(MiB/s) 71k > > vtfs-none-spool seqread-libaio 328(MiB/s) 82k > > vtfs-none-spool-numaseqread-libaio 332(MiB/s) 83k > > > vtfs-none-epool seqread-libaio-multi201(MiB/s) 50k > > vtfs-none-spool seqread-libaio-multi254(MiB/s) 63k > > vtfs-none-spool-numaseqread-libaio-multi276(MiB/s) 69k > > > vtfs-none-epool randread-psync 40(MiB/s) 10k > > vtfs-none-spool randread-psync 64(MiB/s) 16k > > vtfs-none-spool-numarandread-psync 72(MiB/s) 18k > > > vtfs-none-epool randread-psync-multi211(MiB/s) 52k > > vtfs-none-spool randread-psync-multi252(MiB/s) 63k > > vtfs-none-spool-numarandread-psync-multi297(MiB/s) 74k > > > vtfs-none-epool randread-libaio 313(MiB/s) 78k > > vtfs-none-spool randread-libaio 320(MiB/s) 80k > > vtfs-none-spool-numarandread-libaio 330(MiB/s) 82k > > > vtfs-none-epool randread-libaio-multi 257(MiB/s) 64k > > vtfs-none-spool randread-libaio-multi 274(MiB/s) 68k > > vtfs-none-spool-numarandread-libaio-multi 319(MiB/s) 79k > > > vtfs-none-epool seqwrite-psync 34(MiB/s) 8926 > > vtfs-none-spool seqwrite-psync 55(MiB/s) 13k > > vtfs-none-spool-numaseqwrite-psync 66(MiB/s) 16k > > > vtfs-none-epool seqwrite-psync-multi196(MiB/s) 49k > > vtfs-none-spool seqwrite-psync-multi225(MiB/s) 56k > > vtfs-none-spool-numaseqwrite-psync-multi270(MiB/s) 67k > > > vtfs-none-epool seqwrite-libaio 257(MiB/s) 64k > > vtfs-none-spool seqwrite-libaio 304(MiB/s) 76k > > vtfs-none-spool-numaseqwrite-libaio 267(MiB/s) 66k > > >
Re: tools/virtiofs: Multi threading seems to hurt performance
* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> Hi,
>   I've been doing some of my own perf tests and I think I agree
> about the thread pool size; my test is a kernel build
> and I've tried a bunch of different options.
>
> My config:
>   Host: 16 core AMD EPYC (32 thread), 128G RAM,
>         5.9.0-rc4 kernel, rhel 8.2ish userspace.
>         5.1.0 qemu/virtiofsd built from git.
>   Guest: Fedora 32 from cloud image with just enough extra installed for
>          a kernel build.
>
> git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> fresh before each test. Then log into the guest, make defconfig,
> time make -j 16 bzImage, make clean; time make -j 16 bzImage
> The numbers below are the 'real' time in the guest from the initial make
> (the subsequent makes don't vary much).
>
> Below are the details of what each of these means, but here are the
> numbers first:
>
> virtiofsdefault         4m0.978s
> 9pdefault               9m41.660s
> virtiofscache=none      10m29.700s
> 9pmmappass              9m30.047s
> 9pmbigmsize             12m4.208s
> 9pmsecnone              9m21.363s
> virtiofscache=noneT1    7m17.494s
> virtiofsdefaultT1       3m43.326s
>
> So the winner there by far is the 'virtiofsdefaultT1' - that's
> the default virtiofs settings, but with --thread-pool-size=1 - so
> yes it gives a small benefit.
> But interestingly the cache=none virtiofs performance is pretty bad,
> but thread-pool-size=1 on that makes a BIG improvement.

Here are fio runs that Vivek asked me to run in my same environment
(there are some 0's in some of the mmap cases, and I've not investigated
why yet). virtiofs is looking good here in I think all of the cases;
there's some division over which config; cache=none seems faster in some
cases, which surprises me.

Dave

NAME                 WORKLOAD             Bandwidth    IOPS
9pbigmsize           seqread-psync        108(MiB/s)   27k
9pdefault            seqread-psync        105(MiB/s)   26k
9pmmappass           seqread-psync        107(MiB/s)   26k
9pmsecnone           seqread-psync        107(MiB/s)   26k
virtiofscachenoneT1  seqread-psync        135(MiB/s)   33k
virtiofscachenone    seqread-psync        115(MiB/s)   28k
virtiofsdefaultT1    seqread-psync        2465(MiB/s)  616k
virtiofsdefault      seqread-psync        2468(MiB/s)  617k

9pbigmsize           seqread-psync-multi  357(MiB/s)   89k
9pdefault            seqread-psync-multi  358(MiB/s)   89k
9pmmappass           seqread-psync-multi  347(MiB/s)   86k
9pmsecnone           seqread-psync-multi  364(MiB/s)   91k
virtiofscachenoneT1  seqread-psync-multi  479(MiB/s)   119k
virtiofscachenone    seqread-psync-multi  385(MiB/s)   96k
virtiofsdefaultT1    seqread-psync-multi  5916(MiB/s)  1479k
virtiofsdefault      seqread-psync-multi  8771(MiB/s)  2192k

9pbigmsize           seqread-mmap         111(MiB/s)   27k
9pdefault            seqread-mmap         101(MiB/s)   25k
9pmmappass           seqread-mmap         114(MiB/s)   28k
9pmsecnone           seqread-mmap         107(MiB/s)   26k
virtiofscachenoneT1  seqread-mmap         0(KiB/s)     0
virtiofscachenone    seqread-mmap         0(KiB/s)     0
virtiofsdefaultT1    seqread-mmap         2896(MiB/s)  724k
virtiofsdefault      seqread-mmap         2856(MiB/s)  714k

9pbigmsize           seqread-mmap-multi   364(MiB/s)   91k
9pdefault            seqread-mmap-multi   348(MiB/s)   87k
9pmmappass           seqread-mmap-multi   354(MiB/s)   88k
9pmsecnone           seqread-mmap-multi   340(MiB/s)   85k
virtiofscachenoneT1  seqread-mmap-multi   0(KiB/s)     0
virtiofscachenone    seqread-mmap-multi   0(KiB/s)     0
virtiofsdefaultT1    seqread-mmap-multi   6057(MiB/s)  1514k
virtiofsdefault      seqread-mmap-multi   9585(MiB/s)  2396k

9pbigmsize           seqread-libaio       109(MiB/s)   27k
9pdefault            seqread-libaio       103(MiB/s)   25k
9pmmappass           seqread-libaio       107(MiB/s)   26k
9pmsecnone           seqread-libaio       107(MiB/s)   26k
virtiofscachenoneT1  seqread-libaio       671(MiB/s)   167k
virtiofscachenone    seqread-libaio       538(MiB/s)   134k
virtiofsdefaultT1    seqread-libaio
Re: tools/virtiofs: Multi threading seems to hurt performance
On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> Hi All,
>
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
>
> I ran virtiofs-tests.
>
> https://github.com/rhvgoyal/virtiofs-tests

I spent more time debugging this. The first thing I noticed is that we
are using the "exclusive" glib thread pool.

https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new

This seems to run a pre-determined number of threads dedicated to that
thread pool. A little instrumentation of the code revealed that every new
request gets assigned to a new thread (despite the fact that the previous
thread has finished its job). So internally there might be some kind of
round robin policy to choose the next thread for running the job.

I decided to switch to the "shared" pool instead, where it seemed to spin
up new threads only if there is enough work. Also threads can be shared
between pools.

And it looks like testing results are way better with "shared" pools. So
maybe we should switch to the shared pool by default (till somebody shows
in what cases exclusive pools are better).

The second thought which came to mind was: what's the impact of NUMA? What
if the qemu and virtiofsd processes/threads are running on separate NUMA
nodes? That should increase memory access latency and increase overhead.
So I used "numactl --cpubind=0" to bind both qemu and virtiofsd to node 0.
My machine seems to have two numa nodes. (Each node has 32 logical
processors.) Keeping both qemu and virtiofsd on the same node improves
throughput further.

So here are the results.

vtfs-none-epool      --> cache=none, exclusive thread pool.
vtfs-none-spool      --> cache=none, shared thread pool.
vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node

NAME                  WORKLOAD               Bandwidth    IOPS
vtfs-none-epool       seqread-psync          36(MiB/s)    9392
vtfs-none-spool       seqread-psync          68(MiB/s)    17k
vtfs-none-spool-numa  seqread-psync          73(MiB/s)    18k

vtfs-none-epool       seqread-psync-multi    210(MiB/s)   52k
vtfs-none-spool       seqread-psync-multi    260(MiB/s)   65k
vtfs-none-spool-numa  seqread-psync-multi    309(MiB/s)   77k

vtfs-none-epool       seqread-libaio         286(MiB/s)   71k
vtfs-none-spool       seqread-libaio         328(MiB/s)   82k
vtfs-none-spool-numa  seqread-libaio         332(MiB/s)   83k

vtfs-none-epool       seqread-libaio-multi   201(MiB/s)   50k
vtfs-none-spool       seqread-libaio-multi   254(MiB/s)   63k
vtfs-none-spool-numa  seqread-libaio-multi   276(MiB/s)   69k

vtfs-none-epool       randread-psync         40(MiB/s)    10k
vtfs-none-spool       randread-psync         64(MiB/s)    16k
vtfs-none-spool-numa  randread-psync         72(MiB/s)    18k

vtfs-none-epool       randread-psync-multi   211(MiB/s)   52k
vtfs-none-spool       randread-psync-multi   252(MiB/s)   63k
vtfs-none-spool-numa  randread-psync-multi   297(MiB/s)   74k

vtfs-none-epool       randread-libaio        313(MiB/s)   78k
vtfs-none-spool       randread-libaio        320(MiB/s)   80k
vtfs-none-spool-numa  randread-libaio        330(MiB/s)   82k

vtfs-none-epool       randread-libaio-multi  257(MiB/s)   64k
vtfs-none-spool       randread-libaio-multi  274(MiB/s)   68k
vtfs-none-spool-numa  randread-libaio-multi  319(MiB/s)   79k

vtfs-none-epool       seqwrite-psync         34(MiB/s)    8926
vtfs-none-spool       seqwrite-psync         55(MiB/s)    13k
vtfs-none-spool-numa  seqwrite-psync         66(MiB/s)    16k

vtfs-none-epool       seqwrite-psync-multi   196(MiB/s)   49k
vtfs-none-spool       seqwrite-psync-multi   225(MiB/s)   56k
vtfs-none-spool-numa  seqwrite-psync-multi   270(MiB/s)   67k

vtfs-none-epool       seqwrite-libaio        257(MiB/s)   64k
vtfs-none-spool       seqwrite-libaio        304(MiB/s)   76k
vtfs-none-spool-numa  seqwrite-libaio        267(MiB/s)   66k

vtfs-none-epool       seqwrite-libaio-multi  312(MiB/s)   78k
vtfs-none-spool       seqwrite-libaio-multi  366(MiB/s)   91k
vtfs-none-spool-numa  seqwrite-libaio-multi  381(MiB/s)   95k

vtfs-none-epool       randwrite-psync        38(MiB/s)    9745
vtfs-none-spool       randwrite-psync        55(MiB/s)    13k
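For readers who want to see the glib knob discussed above, below is a minimal
sketch (not the virtiofsd source) built around g_thread_pool_new(), whose
fourth argument selects an exclusive pool (a fixed set of dedicated threads
started up front) versus a shared pool (threads created on demand and
shareable between pools). The handle_request() function, the pool sizes and
the request counts are all made up for illustration; build with
pkg-config --cflags --libs glib-2.0.

/* Minimal exclusive-vs-shared GThreadPool sketch (illustration only). */
#include <glib.h>
#include <stdio.h>

static void handle_request(gpointer data, gpointer user_data)
{
    /* Stand-in for per-request processing in a daemon's worker. */
    (void)user_data;
    g_usleep(1000);
    printf("handled request %d on thread %p\n",
           GPOINTER_TO_INT(data), (void *)g_thread_self());
}

int main(void)
{
    /* exclusive = TRUE: dedicated threads created immediately. */
    GThreadPool *epool = g_thread_pool_new(handle_request, NULL, 64, TRUE, NULL);
    /* exclusive = FALSE: shared pool, threads spun up only as needed. */
    GThreadPool *spool = g_thread_pool_new(handle_request, NULL, 64, FALSE, NULL);

    for (int i = 1; i <= 16; i++) {
        g_thread_pool_push(epool, GINT_TO_POINTER(i), NULL);
        g_thread_pool_push(spool, GINT_TO_POINTER(i), NULL);
    }

    /* immediate = FALSE, wait = TRUE: drain queued work before freeing. */
    g_thread_pool_free(epool, FALSE, TRUE);
    g_thread_pool_free(spool, FALSE, TRUE);
    return 0;
}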
Re: tools/virtiofs: Multi threading seems to hurt performance
On Mon, Sep 21, 2020 at 09:39:44AM -0400, Vivek Goyal wrote:
> On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> > On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > > And here are the comparison results. To me it seems that by default
> > > we should switch to 1 thread (till we can figure out how to make
> > > multi thread performance better even when a single process is doing
> > > I/O in the client).
> >
> > Let's understand the reason before making changes.
> >
> > Questions:
> >  * Is "1-thread" --thread-pool-size=1?
>
> Yes.

Okay, I wanted to make sure 1-thread is still going through the glib
thread pool. So it's the same code path regardless of the
--thread-pool-size= value. This suggests the performance issue is related
to timing side-effects like lock contention, thread scheduling, etc.

> >  * How do the kvm_stat vmexit counters compare?
>
> This should be the same, shouldn't it? Changing the number of threads
> serving should not change the number of vmexits?

There is batching at the virtio and eventfd levels. I'm not sure if it's
coming into play here, but you would see it by comparing vmexits and
eventfd reads. Having more threads can increase the number of
notifications and completion interrupts, which can make overall
performance worse in some cases.

> >  * How does host mpstat -P ALL compare?
>
> Never used mpstat. Will try running it and see if I can get something
> meaningful.

Tools like top, vmstat, etc. can give similar information. I'm wondering
what the host CPU utilization (guest/sys/user) looks like.

> But I suspect it has to do with the thread pool implementation and
> possibly extra cacheline bouncing.

I think perf can record cacheline bounces if you want to check.

Stefan
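The batching Stefan mentions at the eventfd level follows from plain
eventfd(2) semantics: multiple signals accumulate in the counter and a single
read drains them all, so eventfd reads (and the associated wakeups) need not
grow one-for-one with the number of notifications. The tiny self-contained
illustration below is generic code, not QEMU or virtiofsd code.

/* eventfd coalescing demo: four "kicks" are consumed by one read. */
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);
    if (efd < 0) {
        perror("eventfd");
        return 1;
    }

    /* Simulate four notifications arriving before the handler runs. */
    for (int i = 0; i < 4; i++) {
        uint64_t one = 1;
        if (write(efd, &one, sizeof(one)) != sizeof(one))
            perror("write");
    }

    /* One read drains all of them: the value is 4, not 1. */
    uint64_t val = 0;
    if (read(efd, &val, sizeof(val)) != sizeof(val))
        perror("read");
    printf("one read returned a count of %llu\n", (unsigned long long)val);

    close(efd);
    return 0;
}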
Re: tools/virtiofs: Multi threading seems to hurt performance
Hi,
  I've been doing some of my own perf tests and I think I agree
about the thread pool size; my test is a kernel build
and I've tried a bunch of different options.

My config:
  Host: 16 core AMD EPYC (32 thread), 128G RAM,
        5.9.0-rc4 kernel, rhel 8.2ish userspace.
        5.1.0 qemu/virtiofsd built from git.
  Guest: Fedora 32 from cloud image with just enough extra installed for
         a kernel build.

git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
fresh before each test. Then log into the guest, make defconfig,
time make -j 16 bzImage, make clean; time make -j 16 bzImage
The numbers below are the 'real' time in the guest from the initial make
(the subsequent makes don't vary much).

Below are the details of what each of these means, but here are the
numbers first:

virtiofsdefault         4m0.978s
9pdefault               9m41.660s
virtiofscache=none      10m29.700s
9pmmappass              9m30.047s
9pmbigmsize             12m4.208s
9pmsecnone              9m21.363s
virtiofscache=noneT1    7m17.494s
virtiofsdefaultT1       3m43.326s

So the winner there by far is the 'virtiofsdefaultT1' - that's
the default virtiofs settings, but with --thread-pool-size=1 - so
yes it gives a small benefit.
But interestingly the cache=none virtiofs performance is pretty bad,
but thread-pool-size=1 on that makes a BIG improvement.

virtiofsdefault:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pdefault:
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=none:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pmmappass:
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap

9pmbigmsize:
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576

9pmsecnone:
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=none
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=noneT1:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none --thread-pool-size=1
  mount -t virtiofs kernel /mnt

virtiofsdefaultT1:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux --thread-pool-size=1
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: tools/virtiofs: Multi threading seems to hurt performance
On Mon, Sep 21, 2020 at 09:35:16AM -0400, Vivek Goyal wrote: > On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgo...@redhat.com) wrote: > > > Hi All, > > > > > > virtiofsd default thread pool size is 64. To me it feels that in most of > > > the cases thread pool size 1 performs better than thread pool size 64. > > > > > > I ran virtiofs-tests. > > > > > > https://github.com/rhvgoyal/virtiofs-tests > > > > > > And here are the comparision results. To me it seems that by default > > > we should switch to 1 thread (Till we can figure out how to make > > > multi thread performance better even when single process is doing > > > I/O in client). > > > > > > I am especially more interested in getting performance better for > > > single process in client. If that suffers, then it is pretty bad. > > > > > > Especially look at randread, randwrite, seqwrite performance. seqread > > > seems pretty good anyway. > > > > > > If I don't run who test suite and just ran randread-psync job, > > > my throughput jumps from around 40MB/s to 60MB/s. That's a huge > > > jump I would say. > > > > > > Thoughts? > > > > What's your host setup; how many cores has the host got and how many did > > you give the guest? > > Got 2 processors on host with 16 cores in each processor. With > hyperthreading enabled, it makes 32 logical cores on each processor and > that makes 64 logical cores on host. > > I have given 32 to guest. FWIW, I'd be inclined to disable hyperthreading in the BIOS for one test to validate whether it is impacting performance results seen. Hyperthreads are weak compared to a real CPU, and could result in misleading data even if you are limiting your guest to 1/2 the host logical CPUs. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: tools/virtiofs: Multi threading seems to hurt performance
On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (till we can figure out how to make
> > multi thread performance better even when a single process is doing
> > I/O in the client).
>
> Let's understand the reason before making changes.
>
> Questions:
>  * Is "1-thread" --thread-pool-size=1?

Yes.

>  * Was DAX enabled?

No.

>  * How does cache=none perform?

I just ran a random read workload with cache=none.

cache-none           randread-psync  45(MiB/s)  11k
cache-none-1-thread  randread-psync  63(MiB/s)  15k

With 1 thread it offers more IOPS.

>  * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s:
>    Queue %d gave evalue: %zx available: in: %u out: %u\n") in
>    fv_queue_thread help?

Will try that.

>  * How do the kvm_stat vmexit counters compare?

This should be the same, shouldn't it? Changing the number of threads
serving should not change the number of vmexits?

>  * How does host mpstat -P ALL compare?

Never used mpstat. Will try running it and see if I can get something
meaningful.

>  * How does host perf record -a compare?

Will try it. I feel this might be too big and too verbose to get
something meaningful.

>  * Does the Rust virtiofsd show the same pattern (it doesn't use glib
>    thread pools)?

No idea. Never tried the Rust implementation of virtiofsd. But I suspect
it has to do with the thread pool implementation and possibly extra
cacheline bouncing.

Thanks
Vivek
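The "cacheline bouncing" Vivek suspects is the classic false-sharing effect;
whether it actually applies to the virtiofsd thread pool would need profiling
(for example with perf c2c on recent kernels, along the lines Stefan
suggests). The generic demo below has nothing to do with the virtiofsd code:
it uses invented names, assumes 64-byte cache lines, and only shows what the
effect looks like in isolation, with two threads hammering counters that
either share a cache line or sit on padded, separate lines. Build with
gcc -O2 -pthread.

/* Generic false-sharing demo (illustration only, not virtiofsd code). */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* Two counters in the same cache line vs. separated by 64 bytes of padding. */
static struct { volatile unsigned long a, b; } shared_line;
static struct { volatile unsigned long a; char pad[64]; volatile unsigned long b; } padded;

static void *bump_a_shared(void *arg) { for (unsigned long i = 0; i < ITERS; i++) shared_line.a++; return arg; }
static void *bump_b_shared(void *arg) { for (unsigned long i = 0; i < ITERS; i++) shared_line.b++; return arg; }
static void *bump_a_padded(void *arg) { for (unsigned long i = 0; i < ITERS; i++) padded.a++; return arg; }
static void *bump_b_padded(void *arg) { for (unsigned long i = 0; i < ITERS; i++) padded.b++; return arg; }

/* Run two bump functions concurrently and return the elapsed wall time. */
static double run(void *(*f1)(void *), void *(*f2)(void *))
{
    struct timespec t0, t1;
    pthread_t th1, th2;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&th1, NULL, f1, NULL);
    pthread_create(&th2, NULL, f2, NULL);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line: %.2fs\n", run(bump_a_shared, bump_b_shared));
    printf("padded lines:    %.2fs\n", run(bump_a_padded, bump_b_padded));
    return 0;
}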
Re: tools/virtiofs: Multi threading seems to hurt performance
On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote: > * Vivek Goyal (vgo...@redhat.com) wrote: > > Hi All, > > > > virtiofsd default thread pool size is 64. To me it feels that in most of > > the cases thread pool size 1 performs better than thread pool size 64. > > > > I ran virtiofs-tests. > > > > https://github.com/rhvgoyal/virtiofs-tests > > > > And here are the comparision results. To me it seems that by default > > we should switch to 1 thread (Till we can figure out how to make > > multi thread performance better even when single process is doing > > I/O in client). > > > > I am especially more interested in getting performance better for > > single process in client. If that suffers, then it is pretty bad. > > > > Especially look at randread, randwrite, seqwrite performance. seqread > > seems pretty good anyway. > > > > If I don't run who test suite and just ran randread-psync job, > > my throughput jumps from around 40MB/s to 60MB/s. That's a huge > > jump I would say. > > > > Thoughts? > > What's your host setup; how many cores has the host got and how many did > you give the guest? Got 2 processors on host with 16 cores in each processor. With hyperthreading enabled, it makes 32 logical cores on each processor and that makes 64 logical cores on host. I have given 32 to guest. Vivek
Re: tools/virtiofs: Multi threading seems to hurt performance
* Vivek Goyal (vgo...@redhat.com) wrote: > Hi All, > > virtiofsd default thread pool size is 64. To me it feels that in most of > the cases thread pool size 1 performs better than thread pool size 64. > > I ran virtiofs-tests. > > https://github.com/rhvgoyal/virtiofs-tests > > And here are the comparision results. To me it seems that by default > we should switch to 1 thread (Till we can figure out how to make > multi thread performance better even when single process is doing > I/O in client). > > I am especially more interested in getting performance better for > single process in client. If that suffers, then it is pretty bad. > > Especially look at randread, randwrite, seqwrite performance. seqread > seems pretty good anyway. > > If I don't run who test suite and just ran randread-psync job, > my throughput jumps from around 40MB/s to 60MB/s. That's a huge > jump I would say. > > Thoughts? What's your host setup; how many cores has the host got and how many did you give the guest? Dave > Thanks > Vivek > > > NAMEWORKLOADBandwidth IOPS > > cache-auto seqread-psync 690(MiB/s) 172k > > cache-auto-1-thread seqread-psync 729(MiB/s) 182k > > > cache-auto seqread-psync-multi 2578(MiB/s) 644k > > cache-auto-1-thread seqread-psync-multi 2597(MiB/s) 649k > > > cache-auto seqread-mmap660(MiB/s) 165k > > cache-auto-1-thread seqread-mmap672(MiB/s) 168k > > > cache-auto seqread-mmap-multi 2499(MiB/s) 624k > > cache-auto-1-thread seqread-mmap-multi 2618(MiB/s) 654k > > > cache-auto seqread-libaio 286(MiB/s) 71k > > cache-auto-1-thread seqread-libaio 260(MiB/s) 65k > > > cache-auto seqread-libaio-multi1508(MiB/s) 377k > > cache-auto-1-thread seqread-libaio-multi986(MiB/s) 246k > > > cache-auto randread-psync 35(MiB/s) 9191 > > cache-auto-1-thread randread-psync 55(MiB/s) 13k > > > cache-auto randread-psync-multi179(MiB/s) 44k > > cache-auto-1-thread randread-psync-multi209(MiB/s) 52k > > > cache-auto randread-mmap 32(MiB/s) 8273 > > cache-auto-1-thread randread-mmap 50(MiB/s) 12k > > > cache-auto randread-mmap-multi 161(MiB/s) 40k > > cache-auto-1-thread randread-mmap-multi 185(MiB/s) 46k > > > cache-auto randread-libaio 268(MiB/s) 67k > > cache-auto-1-thread randread-libaio 254(MiB/s) 63k > > > cache-auto randread-libaio-multi 256(MiB/s) 64k > > cache-auto-1-thread randread-libaio-multi 155(MiB/s) 38k > > > cache-auto seqwrite-psync 23(MiB/s) 6026 > > cache-auto-1-thread seqwrite-psync 30(MiB/s) 7925 > > > cache-auto seqwrite-psync-multi100(MiB/s) 25k > > cache-auto-1-thread seqwrite-psync-multi154(MiB/s) 38k > > > cache-auto seqwrite-mmap 343(MiB/s) 85k > > cache-auto-1-thread seqwrite-mmap 355(MiB/s) 88k > > > cache-auto seqwrite-mmap-multi 408(MiB/s) 102k > > cache-auto-1-thread seqwrite-mmap-multi 438(MiB/s) 109k > > > cache-auto seqwrite-libaio 41(MiB/s) 10k > > cache-auto-1-thread seqwrite-libaio 65(MiB/s) 16k > > > cache-auto seqwrite-libaio-multi 137(MiB/s) 34k > > cache-auto-1-thread seqwrite-libaio-multi 214(MiB/s) 53k > > > cache-auto randwrite-psync 22(MiB/s) 5801 > > cache-auto-1-thread randwrite-psync 30(MiB/s) 7927 > > > cache-auto randwrite-psync-multi 100(MiB/s) 25k > > cache-auto-1-thread randwrite-psync-multi 151(MiB/s) 37k > > > cache-auto randwrite-mmap 31(MiB/s) 7984 > > cache-auto-1-thread randwrite-mmap 55(MiB/s) 13k > > > cache-auto randwrite-mmap-multi124(MiB/s) 31k > > cache-auto-1-thread randwrite-mmap-multi213(MiB/s) 53k > > > cache-auto
Re: tools/virtiofs: Multi threading seems to hurt performance
On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote: > And here are the comparision results. To me it seems that by default > we should switch to 1 thread (Till we can figure out how to make > multi thread performance better even when single process is doing > I/O in client). Let's understand the reason before making changes. Questions: * Is "1-thread" --thread-pool-size=1? * Was DAX enabled? * How does cache=none perform? * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s: Queue %d gave evalue: %zx available: in: %u out: %u\n") in fv_queue_thread help? * How do the kvm_stat vmexit counters compare? * How does host mpstat -P ALL compare? * How does host perf record -a compare? * Does the Rust virtiofsd show the same pattern (it doesn't use glib thread pools)? Stefan > NAMEWORKLOADBandwidth IOPS > > cache-auto seqread-psync 690(MiB/s) 172k > > cache-auto-1-thread seqread-psync 729(MiB/s) 182k > > > cache-auto seqread-psync-multi 2578(MiB/s) 644k > > cache-auto-1-thread seqread-psync-multi 2597(MiB/s) 649k > > > cache-auto seqread-mmap660(MiB/s) 165k > > cache-auto-1-thread seqread-mmap672(MiB/s) 168k > > > cache-auto seqread-mmap-multi 2499(MiB/s) 624k > > cache-auto-1-thread seqread-mmap-multi 2618(MiB/s) 654k > > > cache-auto seqread-libaio 286(MiB/s) 71k > > cache-auto-1-thread seqread-libaio 260(MiB/s) 65k > > > cache-auto seqread-libaio-multi1508(MiB/s) 377k > > cache-auto-1-thread seqread-libaio-multi986(MiB/s) 246k > > > cache-auto randread-psync 35(MiB/s) 9191 > > cache-auto-1-thread randread-psync 55(MiB/s) 13k > > > cache-auto randread-psync-multi179(MiB/s) 44k > > cache-auto-1-thread randread-psync-multi209(MiB/s) 52k > > > cache-auto randread-mmap 32(MiB/s) 8273 > > cache-auto-1-thread randread-mmap 50(MiB/s) 12k > > > cache-auto randread-mmap-multi 161(MiB/s) 40k > > cache-auto-1-thread randread-mmap-multi 185(MiB/s) 46k > > > cache-auto randread-libaio 268(MiB/s) 67k > > cache-auto-1-thread randread-libaio 254(MiB/s) 63k > > > cache-auto randread-libaio-multi 256(MiB/s) 64k > > cache-auto-1-thread randread-libaio-multi 155(MiB/s) 38k > > > cache-auto seqwrite-psync 23(MiB/s) 6026 > > cache-auto-1-thread seqwrite-psync 30(MiB/s) 7925 > > > cache-auto seqwrite-psync-multi100(MiB/s) 25k > > cache-auto-1-thread seqwrite-psync-multi154(MiB/s) 38k > > > cache-auto seqwrite-mmap 343(MiB/s) 85k > > cache-auto-1-thread seqwrite-mmap 355(MiB/s) 88k > > > cache-auto seqwrite-mmap-multi 408(MiB/s) 102k > > cache-auto-1-thread seqwrite-mmap-multi 438(MiB/s) 109k > > > cache-auto seqwrite-libaio 41(MiB/s) 10k > > cache-auto-1-thread seqwrite-libaio 65(MiB/s) 16k > > > cache-auto seqwrite-libaio-multi 137(MiB/s) 34k > > cache-auto-1-thread seqwrite-libaio-multi 214(MiB/s) 53k > > > cache-auto randwrite-psync 22(MiB/s) 5801 > > cache-auto-1-thread randwrite-psync 30(MiB/s) 7927 > > > cache-auto randwrite-psync-multi 100(MiB/s) 25k > > cache-auto-1-thread randwrite-psync-multi 151(MiB/s) 37k > > > cache-auto randwrite-mmap 31(MiB/s) 7984 > > cache-auto-1-thread randwrite-mmap 55(MiB/s) 13k > > > cache-auto randwrite-mmap-multi124(MiB/s) 31k > > cache-auto-1-thread randwrite-mmap-multi213(MiB/s) 53k > > > cache-auto randwrite-libaio40(MiB/s) 10k > > cache-auto-1-thread randwrite-libaio64(MiB/s) 16k > > > cache-auto randwrite-libaio-multi 139(MiB/s)