Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Vivek Goyal
On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Ran tests with auto_inval_data disabled and compared with other results.

NAME                    WORKLOAD        Bandwidth       IOPS
vtfs-auto-ex-randrw     randrw-psync    27.8mb/9547kb   7136/2386
vtfs-auto-sh-randrw     randrw-psync    43.3mb/14.4mb   10.8k/3709
vtfs-auto-sh-noinval    randrw-psync    50.5mb/16.9mb   12.6k/4330
vtfs-none-sh-randrw     randrw-psync    54.1mb/18.1mb   13.5k/4649

With auto_inval_data disabled, this time I saw around a 20% jump in READ
performance, and it is now much closer to cache=none performance.

Thanks
Vivek




Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Miklos Szeredi
On Tue, Sep 29, 2020 at 4:01 PM Vivek Goyal  wrote:
>
> On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:
> >
> > > - virtiofs cache=none mode is faster than cache=auto mode for this
> > >   workload.
> >
> > Not sure why.  One cause could be that readahead is not perfect at
> > detecting the random pattern.  Could we compare total I/O on the
> > server vs. total I/O by fio?
>
> Hi Miklos,
>
> I will instrument virtiofsd code to figure out total I/O.
>
> One more potential issue I am staring at is refreshing the attrs on
> READ if fc->auto_inval_data is set.
>
> fuse_cache_read_iter() {
>         /*
>          * In auto invalidate mode, always update attributes on read.
>          * Otherwise, only update if we attempt to read past EOF (to ensure
>          * i_size is up to date).
>          */
>         if (fc->auto_inval_data ||
>             (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
>                 int err;
>                 err = fuse_update_attributes(inode, iocb->ki_filp);
>                 if (err)
>                         return err;
>         }
> }
>
> Given this is a mixed READ/WRITE workload, every WRITE will invalidate
> attrs. And next READ will first do GETATTR() from server (and potentially
> invalidate page cache) before doing READ.
>
> This sounds suboptimal especially from the point of view of WRITEs
> done by this client itself. I mean if another client has modified
> the file, then doing GETATTR after a second makes sense. But there
> should be some optimization to make sure our own WRITEs don't end
> up doing GETATTR and invalidate page cache (because cache contents
> are still valid).

Yeah, that sucks.

> I disabled ->auto_inval_data and that seemed to result in an 8-10%
> gain in performance for this workload.

Need to wrap my head around these caching issues.

Thanks,
Miklos




Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Vivek Goyal
On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Hi Miklos,

I will instrument virtiofsd code to figure out total I/O.
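
(As a rough way to approximate this without any code changes, one could also
compare the host block device's I/O counters before and after the guest fio
run; the device name below is just a placeholder.)

# Host side, before and after the fio run in the guest:
awk '$3 == "sdb" { print "sectors read:", $6, "sectors written:", $10 }' /proc/diskstats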

One more potential issue I am staring at is refreshing the attrs on 
READ if fc->auto_inval_data is set.

fuse_cache_read_iter() {
        /*
         * In auto invalidate mode, always update attributes on read.
         * Otherwise, only update if we attempt to read past EOF (to ensure
         * i_size is up to date).
         */
        if (fc->auto_inval_data ||
            (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
                int err;
                err = fuse_update_attributes(inode, iocb->ki_filp);
                if (err)
                        return err;
        }
}

Given this is a mixed READ/WRITE workload, every WRITE will invalidate
attrs, and the next READ will first do a GETATTR() to the server (and
potentially invalidate the page cache) before doing the READ.

This sounds suboptimal, especially from the point of view of WRITEs done
by this client itself. If another client has modified the file, then doing
a GETATTR after a second makes sense. But there should be some optimization
to make sure our own WRITEs don't end up triggering a GETATTR and
invalidating the page cache (because the cache contents are still valid).
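
(A rough way to observe this from outside the kernel, assuming virtiofsd's
request debug logging is enabled; the exact log format may differ between
versions, so treat this only as a sketch.)

# Start the daemon with request debugging, run the fio job in the guest,
# then count attribute refreshes; a GETATTR per READ shows up as a large
# count here.
./virtiofsd -d --socket-path=/tmp/vhostqemu -o source=/srv/share 2> vfsd.log &
# ... after the fio run:
grep -c GETATTR vfsd.log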

I disabled ->auto_inval_data and that seemed to result in an 8-10%
gain in performance for this workload.

Thanks
Vivek




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Christian Schoenebeck
On Dienstag, 29. September 2020 15:49:42 CEST Vivek Goyal wrote:
> > Depends on what's randomized. If read chunk size is randomized, then yes,
> > you would probably see less performance increase compared to a simple
> > 'cat foo.dat'.
> 
> We are using "fio" for testing and read chunk size is not being
> randomized. chunk size (block size) is fixed at 4K size for these tests.

Good to know, thanks!

> > If only the read position is randomized, but the read chunk size honors
> > iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient
> > block size advertised by 9P), then I would assume still seeing a
> > performance increase.
> 
> Yes, we are randomizing read position. But there is no notion of looking
> at st_blksize. Its fixed at 4K. (notice option --bs=4k in fio
> commandline).

Ah ok, then the results make sense.

With these block sizes you will indeed suffer a performance issue with 9p,
because of several thread hops in Tread handling, which is due to be fixed.

Best regards,
Christian Schoenebeck





Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Miklos Szeredi
On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:

> - virtiofs cache=none mode is faster than cache=auto mode for this
>   workload.

Not sure why.  One cause could be that readahead is not perfect at
detecting the random pattern.  Could we compare total I/O on the
server vs. total I/O by fio?

Thanks,
Miklos




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Vivek Goyal
On Tue, Sep 29, 2020 at 03:28:06PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote:
> > On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > > > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> > > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert 
> wrote:
> > > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > > > rw=randrw,
> > > > > > > 
> > > > > > > Bottleneck --^
> > > > > > > 
> > > > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > > > 
> > > > > > OK, I thought that was bigger than the default;  what number should
> > > > > > I
> > > > > > use?
> > > > > 
> > > > > It depends on the underlying storage hardware. In other words: you
> > > > > have to
> > > > > try increasing the 'msize' value to a point where you no longer notice
> > > > > a
> > > > > negative performance impact (or almost). Which is fortunately quite
> > > > > easy to test on guest like:
> > > > >   dd if=/dev/zero of=test.dat bs=1G count=12
> > > > >   time cat test.dat > /dev/null
> > > > > 
> > > > > I would start with an absolute minimum msize of 10MB. I would
> > > > > recommend
> > > > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > > > flash
> > > > > you probably would rather pick several hundred MB or even more.
> > > > > 
> > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol:
> > > > > client
> > > > > (guest) must suggest the value of msize on connection to server
> > > > > (host).
> > > > > Server can only lower, but not raise it. And the client in turn
> > > > > obviously
> > > > > cannot see host's storage device(s), so client is unable to pick a
> > > > > good
> > > > > value by itself. So it's a suboptimal handshake issue right now.
> > > > 
> > > > It doesn't seem to be making a vast difference here:
> > > > 
> > > > 
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=104857600
> > > > 
> > > > Run status group 0 (all jobs):
> > > >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s
> > > >(65.6MB/s-65.6MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > > > run=49099-49099msec
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > > > 
> > > > Run status group 0 (all jobs):
> > > >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s
> > > >(68.3MB/s-68.3MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > > > run=47104-47104msec
> > > > 
> > > > 
> > > > Dave
> > > 
> > > Is that benchmark tool honoring 'iounit' to automatically run with max.
> > > I/O
> > > chunk sizes? What's that benchmark tool actually? And do you also see no
> > > improvement with a simple
> > > 
> > >   time cat largefile.dat > /dev/null
> > 
> > I am assuming that msize only helps with sequential I/O and not random
> > I/O.
> > 
> > Dave is running random read and random write mix and probably that's why
> > he is not seeing any improvement with msize increase.
> > 
> > If we run sequential workload (as "cat largefile.dat"), that should
> > see an improvement with msize increase.
> > 
> > Thanks
> > Vivek
> 
> Depends on what's randomized. If read chunk size is randomized, then yes, you 
> would probably see less performance increase compared to a simple
> 'cat foo.dat'.

We are using "fio" for testing and read chunk size is not being
randomized. chunk size (block size) is fixed at 4K size for these tests.

> 
> If only the read position is randomized, but the read chunk size honors 
> iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block 
> size advertised by 9P), then I would assume still seeing a performance 
> increase.

Yes, we are randomizing the read position. But there is no notion of looking
at st_blksize; it's fixed at 4K (notice the option --bs=4k on the fio
command line).

> Because seeking is a no/low cost factor in this case. The guest OS 
> seeking does not transmit a 9p message. The offset is rather passed with any 
> Tread message instead:
> https://github.com/chaos/diod/blob/master/protocol.md
> 
> I mean, yes, random seeks reduce I/O performance in general of course, but in 
> direct performance comparison, the difference in overhead of the 9p vs. 
> virtiofs network controller layer is most probably the most relevant aspect 
> if 
> large I/O chunk sizes are used.
> 

Agreed that large I/O chunk size will help with the 

Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Christian Schoenebeck
On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote:
> On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert 
wrote:
> > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > > rw=randrw,
> > > > > > 
> > > > > > Bottleneck --^
> > > > > > 
> > > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > > 
> > > > > OK, I thought that was bigger than the default;  what number should
> > > > > I
> > > > > use?
> > > > 
> > > > It depends on the underlying storage hardware. In other words: you
> > > > have to
> > > > try increasing the 'msize' value to a point where you no longer notice
> > > > a
> > > > negative performance impact (or almost). Which is fortunately quite
> > > > easy to test on guest like:
> > > > dd if=/dev/zero of=test.dat bs=1G count=12
> > > > time cat test.dat > /dev/null
> > > > 
> > > > I would start with an absolute minimum msize of 10MB. I would
> > > > recommend
> > > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > > flash
> > > > you probably would rather pick several hundred MB or even more.
> > > > 
> > > > That unpleasant 'msize' issue is a limitation of the 9p protocol:
> > > > client
> > > > (guest) must suggest the value of msize on connection to server
> > > > (host).
> > > > Server can only lower, but not raise it. And the client in turn
> > > > obviously
> > > > cannot see host's storage device(s), so client is unable to pick a
> > > > good
> > > > value by itself. So it's a suboptimal handshake issue right now.
> > > 
> > > It doesn't seem to be making a vast difference here:
> > > 
> > > 
> > > 
> > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=104857600
> > > 
> > > Run status group 0 (all jobs):
> > >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s
> > >(65.6MB/s-65.6MB/s),
> > > 
> > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > > run=49099-49099msec
> > > 
> > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > > 
> > > Run status group 0 (all jobs):
> > >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s
> > >(68.3MB/s-68.3MB/s),
> > > 
> > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > > run=47104-47104msec
> > > 
> > > 
> > > Dave
> > 
> > Is that benchmark tool honoring 'iounit' to automatically run with max.
> > I/O
> > chunk sizes? What's that benchmark tool actually? And do you also see no
> > improvement with a simple
> > 
> > time cat largefile.dat > /dev/null
> 
> I am assuming that msize only helps with sequential I/O and not random
> I/O.
> 
> Dave is running random read and random write mix and probably that's why
> he is not seeing any improvement with msize increase.
> 
> If we run sequential workload (as "cat largefile.dat"), that should
> see an improvement with msize increase.
> 
> Thanks
> Vivek

Depends on what's randomized. If read chunk size is randomized, then yes, you 
would probably see less performance increase compared to a simple
'cat foo.dat'.

If only the read position is randomized, but the read chunk size honors 
iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block 
size advertised by 9P), then I would assume still seeing a performance 
increase. Because seeking is a no/low cost factor in this case. The guest OS 
seeking does not transmit a 9p message. The offset is rather passed with any 
Tread message instead:
https://github.com/chaos/diod/blob/master/protocol.md
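
(For illustration, a minimal sketch of what "honoring st_blksize" would look
like on the fio side; the path is a placeholder, not something from this
thread.)

# Ask the guest kernel what I/O size the 9p mount advertises for the file,
# then feed that to fio as the block size instead of a fixed 4k.
BS=$(stat -c %o /mnt/foo.dat)      # %o prints st_blksize (optimal I/O size)
fio --name=test --filename=/mnt/foo.dat --direct=1 --bs="$BS" \
    --size=4G --readwrite=randread --ioengine=psync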

I mean, yes, random seeks reduce I/O performance in general of course, but in 
direct performance comparison, the difference in overhead of the 9p vs. 
virtiofs network controller layer is most probably the most relevant aspect if 
large I/O chunk sizes are used.

But OTOH: I haven't optimized anything in Tread handling in 9p (yet).

Best regards,
Christian Schoenebeck





Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Vivek Goyal
On Fri, Sep 25, 2020 at 01:41:39PM +0100, Dr. David Alan Gilbert wrote:

[..]
> So I'm sitll beating 9p; the thread-pool-size=1 seems to be great for
> read performance here.
> 

Hi Dave,

I spent some time making changes to virtiofs-tests so that I can test
a mix of random read and random write workload. That testsuite runs
a workload 3 times and reports the average, so I like to use it to
reduce the effect of run-to-run variation.

So I ran the following to mimic Carlos's workload.

$ ./run-fio-test.sh test -direct=1 -c  fio-jobs/randrw-psync.job >
testresults.txt

$ ./parse-fio-results.sh testresults.txt

I am using an SSD on the host to back these files. The "-c" option always
creates new files for testing.
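
(For context, the randrw-psync job is roughly equivalent to a fio job file
along these lines; this is an assumption based on the fio command line quoted
elsewhere in this thread, not the actual file shipped with virtiofs-tests.)

cat > fio-jobs/randrw-psync.job << 'EOF'
[randrw-psync]
ioengine=psync
rw=randrw
rwmixread=75
bs=4k
size=4G
direct=1
EOF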

Following are my results in various configurations. Used cache=mmap mode
for 9p and cache=auto (and cache=none) modes for virtiofs. Also tested
9p default as well as msize=16m. Tested virtiofs both with exclusive
as well as shared thread pool.

NAME                    WORKLOAD        Bandwidth       IOPS
9p-mmap-randrw          randrw-psync    42.8mb/14.3mb   10.7k/3666
9p-mmap-msize16m        randrw-psync    42.8mb/14.3mb   10.7k/3674
vtfs-auto-ex-randrw     randrw-psync    27.8mb/9547kb   7136/2386
vtfs-auto-sh-randrw     randrw-psync    43.3mb/14.4mb   10.8k/3709
vtfs-none-sh-randrw     randrw-psync    54.1mb/18.1mb   13.5k/4649


- Increasing msize to 16m did not help with performance for this workload.
- virtiofs exclusive thread pool ("ex"), is slower than 9p.
- virtiofs shared thread pool ("sh"), matches the performance of 9p.
- virtiofs cache=none mode is faster than cache=auto mode for this
  workload.

Carlos, I am looking at more ways to optimize virtiofs further. In the
meantime, I think switching to the "shared" thread pool should bring you
very close to 9p in your setup.

Thanks
Vivek




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Vivek Goyal
On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > * Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > rw=randrw,
> > > > > 
> > > > > Bottleneck --^
> > > > > 
> > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > 
> > > > OK, I thought that was bigger than the default;  what number should I
> > > > use?
> > > 
> > > It depends on the underlying storage hardware. In other words: you have to
> > > try increasing the 'msize' value to a point where you no longer notice a
> > > negative performance impact (or almost). Which is fortunately quite easy
> > > to test on guest like:
> > >   dd if=/dev/zero of=test.dat bs=1G count=12
> > >   time cat test.dat > /dev/null
> > > 
> > > I would start with an absolute minimum msize of 10MB. I would recommend
> > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > flash
> > > you probably would rather pick several hundred MB or even more.
> > > 
> > > That unpleasant 'msize' issue is a limitation of the 9p protocol: client
> > > (guest) must suggest the value of msize on connection to server (host).
> > > Server can only lower, but not raise it. And the client in turn obviously
> > > cannot see host's storage device(s), so client is unable to pick a good
> > > value by itself. So it's a suboptimal handshake issue right now.
> > 
> > It doesn't seem to be making a vast difference here:
> > 
> > 
> > 
> > 9p mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=104857600
> > 
> > Run status group 0 (all jobs):
> >READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s),
> > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > run=49099-49099msec
> > 
> > 9p mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > 
> > Run status group 0 (all jobs):
> >READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s),
> > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > run=47104-47104msec
> > 
> > 
> > Dave
> 
> Is that benchmark tool honoring 'iounit' to automatically run with max. I/O 
> chunk sizes? What's that benchmark tool actually? And do you also see no 
> improvement with a simple
> 
>   time cat largefile.dat > /dev/null

I am assuming that msize only helps with sequential I/O and not random
I/O.

Dave is running random read and random write mix and probably that's why
he is not seeing any improvement with msize increase.

If we run sequential workload (as "cat largefile.dat"), that should
see an improvement with msize increase.

Thanks
Vivek




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-27 Thread Christian Schoenebeck
On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> * Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > rw=randrw,
> > > > 
> > > > Bottleneck --^
> > > > 
> > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > 
> > > OK, I thought that was bigger than the default;  what number should I
> > > use?
> > 
> > It depends on the underlying storage hardware. In other words: you have to
> > try increasing the 'msize' value to a point where you no longer notice a
> > negative performance impact (or almost). Which is fortunately quite easy
> > to test on guest like:
> > dd if=/dev/zero of=test.dat bs=1G count=12
> > time cat test.dat > /dev/null
> > 
> > I would start with an absolute minimum msize of 10MB. I would recommend
> > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > flash
> > you probably would rather pick several hundred MB or even more.
> > 
> > That unpleasant 'msize' issue is a limitation of the 9p protocol: client
> > (guest) must suggest the value of msize on connection to server (host).
> > Server can only lower, but not raise it. And the client in turn obviously
> > cannot see host's storage device(s), so client is unable to pick a good
> > value by itself. So it's a suboptimal handshake issue right now.
> 
> It doesn't seem to be making a vast difference here:
> 
> 
> 
> 9p mount -t 9p -o trans=virtio kernel /mnt
> -oversion=9p2000.L,cache=mmap,msize=104857600
> 
> Run status group 0 (all jobs):
>READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s),
> io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> run=49099-49099msec
> 
> 9p mount -t 9p -o trans=virtio kernel /mnt
> -oversion=9p2000.L,cache=mmap,msize=1048576000
> 
> Run status group 0 (all jobs):
>READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s),
> io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> run=47104-47104msec
> 
> 
> Dave

Is that benchmark tool honoring 'iounit' to automatically run with max. I/O 
chunk sizes? What's that benchmark tool actually? And do you also see no 
improvement with a simple

time cat largefile.dat > /dev/null

?

Best regards,
Christian Schoenebeck





Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Dr. David Alan Gilbert
* Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > > 
> > > Bottleneck --^
> > > 
> > > By increasing 'msize' you would encounter better 9P I/O results.
> > 
> > OK, I thought that was bigger than the default;  what number should I
> > use?
> 
> It depends on the underlying storage hardware. In other words: you have to 
> try 
> increasing the 'msize' value to a point where you no longer notice a negative 
> performance impact (or almost). Which is fortunately quite easy to test on 
> guest like:
> 
>   dd if=/dev/zero of=test.dat bs=1G count=12
>   time cat test.dat > /dev/null
> 
> I would start with an absolute minimum msize of 10MB. I would recommend 
> something around 100MB maybe for a mechanical hard drive. With a PCIe flash 
> you probably would rather pick several hundred MB or even more.
> 
> That unpleasant 'msize' issue is a limitation of the 9p protocol: client 
> (guest) must suggest the value of msize on connection to server (host). 
> Server 
> can only lower, but not raise it. And the client in turn obviously cannot see 
> host's storage device(s), so client is unable to pick a good value by itself. 
> So it's a suboptimal handshake issue right now.

It doesn't seem to be making a vast difference here:



9p mount -t 9p -o trans=virtio kernel /mnt 
-oversion=9p2000.L,cache=mmap,msize=104857600

Run status group 0 (all jobs):
   READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s), 
io=3070MiB (3219MB), run=49099-49099msec
  WRITE: bw=20.9MiB/s (21.9MB/s), 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), 
io=1026MiB (1076MB), run=49099-49099msec

9p mount -t 9p -o trans=virtio kernel /mnt 
-oversion=9p2000.L,cache=mmap,msize=1048576000

Run status group 0 (all jobs):
   READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s), 
io=3070MiB (3219MB), run=47104-47104msec
  WRITE: bw=21.8MiB/s (22.8MB/s), 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), 
io=1026MiB (1076MB), run=47104-47104msec


Dave

> Many users don't even know this 'msize' parameter exists and hence run with 
> the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this 
> by 
> logging a performance warning on host side for making users at least aware 
> about this issue. The long-term plan is to pass a good msize value from host 
> to guest via virtio (like it's already done for the available export tags) 
> and 
> the Linux kernel would default to that instead.
> 
> Best regards,
> Christian Schoenebeck
> 
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Christian Schoenebeck
On Freitag, 25. September 2020 18:05:17 CEST Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > > 
> > > Bottleneck --^
> > > 
> > > By increasing 'msize' you would encounter better 9P I/O results.
> > 
> > OK, I thought that was bigger than the default;  what number should I
> > use?
> 
> It depends on the underlying storage hardware. In other words: you have to
> try increasing the 'msize' value to a point where you no longer notice a
> negative performance impact (or almost). Which is fortunately quite easy to
> test on guest like:
> 
>   dd if=/dev/zero of=test.dat bs=1G count=12
>   time cat test.dat > /dev/null

I forgot: you should execute that 'dd' command on host side, and the 'cat'
command on guest side, to avoid any caching making the benchmark result look
better than it actually is. Because for finding a good 'msize' value you only
care about actual 9p data really being transmitted between host and guest.
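
(Putting the whole procedure together, a rough sketch; the export path and
mount options below are only examples.)

# Host: create the test file inside the exported directory.
dd if=/dev/zero of=/srv/9p-export/test.dat bs=1G count=12

# Guest: drop the page cache so the read really goes over 9p, then time
# the sequential read once per msize value being tried.
echo 3 > /proc/sys/vm/drop_caches
mount -t 9p -o trans=virtio,version=9p2000.L,cache=mmap,msize=104857600 kernel /mnt
time cat /mnt/test.dat > /dev/null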

Best regards,
Christian Schoenebeck





Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Christian Schoenebeck
On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > 
> > Bottleneck --^
> > 
> > By increasing 'msize' you would encounter better 9P I/O results.
> 
> OK, I thought that was bigger than the default;  what number should I
> use?

It depends on the underlying storage hardware. In other words: you have to try 
increasing the 'msize' value to a point where you no longer notice a negative 
performance impact (or almost). Which is fortunately quite easy to test on 
guest like:

dd if=/dev/zero of=test.dat bs=1G count=12
time cat test.dat > /dev/null

I would start with an absolute minimum msize of 10MB. I would recommend 
something around 100MB maybe for a mechanical hard drive. With a PCIe flash 
you probably would rather pick several hundred MB or even more.

That unpleasant 'msize' issue is a limitation of the 9p protocol: client 
(guest) must suggest the value of msize on connection to server (host). Server 
can only lower, but not raise it. And the client in turn obviously cannot see 
host's storage device(s), so client is unable to pick a good value by itself. 
So it's a suboptimal handshake issue right now.
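
(Side note: on a reasonably recent guest kernel, the msize the client actually
ended up with after this handshake should show up in the mount options, so a
quick sanity check is possible; the output line below is only illustrative.)

grep 9p /proc/mounts
# kernel /mnt 9p rw,...,trans=virtio,msize=1048576,version=9p2000.L 0 0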

Many users don't even know this 'msize' parameter exists and hence run with 
the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this by 
logging a performance warning on host side for making users at least aware 
about this issue. The long-term plan is to pass a good msize value from host 
to guest via virtio (like it's already done for the available export tags) and 
the Linux kernel would default to that instead.

Best regards,
Christian Schoenebeck





Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-25 Thread Vivek Goyal
On Fri, Sep 25, 2020 at 01:11:27PM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > > > Hi,
> > > >   I've been doing some of my own perf tests and I think I agree
> > > > about the thread pool size;  my test is a kernel build
> > > > and I've tried a bunch of different options.
> > > > 
> > > > My config:
> > > >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> > > >  5.9.0-rc4 kernel, rhel 8.2ish userspace.
> > > >   5.1.0 qemu/virtiofsd built from git.
> > > >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > > > a kernel build.
> > > > 
> > > >   git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host
> > > > fresh before each test.  Then log into the guest, make defconfig,
> > > > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > > > The numbers below are the 'real' time in the guest from the initial make
> > > > (the subsequent makes dont vary much)
> > > > 
> > > > Below are the details of what each of these means, but here are the
> > > > numbers first
> > > > 
> > > > virtiofsdefault        4m0.978s
> > > > 9pdefault              9m41.660s
> > > > virtiofscache=none     10m29.700s
> > > > 9pmmappass             9m30.047s
> > > > 9pmbigmsize            12m4.208s
> > > > 9pmsecnone             9m21.363s
> > > > virtiofscache=noneT1   7m17.494s
> > > > virtiofsdefaultT1      3m43.326s
> > > > 
> > > > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > > > the default virtiofs settings, but with --thread-pool-size=1 - so
> > > > yes it gives a small benefit.
> > > > But interestingly the cache=none virtiofs performance is pretty bad,
> > > > but thread-pool-size=1 on that makes a BIG improvement.
> > > 
> > > Here are fio runs that Vivek asked me to run in my same environment
> > > (there are some 0's in some of the mmap cases, and I've not investigated
> > > why yet).
> > 
> > cache=none does not allow mmap in case of virtiofs. That's when you
> > are seeing 0.
> > 
> > >virtiofs is looking good here in I think all of the cases;
> > > there's some division over which cinfig; cache=none
> > > seems faster in some cases which surprises me.
> > 
> > I know cache=none is faster in case of write workloads. It forces
> > direct write where we don't call file_remove_privs(). While cache=auto
> > goes through file_remove_privs() and that adds a GETXATTR request to
> > every WRITE request.
> 
> Can you point me to how cache=auto causes the file_remove_privs?

fs/fuse/file.c

fuse_cache_write_iter() {
        /* may generate a GETXATTR(security.capability) request to the server */
        err = file_remove_privs(file);
}

The above path is taken when cache=auto/cache=always is used. If virtiofsd
is running with noxattr, then it does not impose any cost. But if xattrs
are enabled, then every WRITE first results in a
getxattr(security.capability), and that slows down WRITEs tremendously.

When cache=none is used, we go through the following path instead.

fuse_direct_write_iter() does not call file_remove_privs(). We set a flag in
the WRITE request to tell the server to kill suid/sgid/security.capability
instead.

fuse_direct_io() {
        /* ask the server to drop suid/sgid/security.capability itself */
        ia->write.in.write_flags |= FUSE_WRITE_KILL_PRIV;
}
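
(For reference, a sketch of how the daemon-side knobs discussed here are
usually selected; the paths are placeholders and the flags should be checked
against your virtiofsd build.)

# cache mode and xattr support are chosen on the virtiofsd command line
./virtiofsd --socket-path=/tmp/vhostqemu -o source=/srv/share -o cache=auto -o xattr
./virtiofsd --socket-path=/tmp/vhostqemu -o source=/srv/share -o cache=none -o no_xattr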

Vivek




Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Dr. David Alan Gilbert
* Christian Schoenebeck (qemu_...@crudebyte.com) wrote:
> On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote:
> > > Hi Carlos,
> > > 
> > > So you are running following test.
> > > 
> > > fio --direct=1 --gtod_reduce=1 --name=test
> > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> > > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> > > 
> > > And following are your results.
> > > 
> > > 9p
> > > --
> > > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s),
> > > io=3070MiB (3219MB), run=14532-14532msec
> > > 
> > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s),
> > > io=1026MiB (1076MB), run=14532-14532msec
> > > 
> > > virtiofs
> > > 
> > > 
> > > Run status group 0 (all jobs):
> > >READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s),
> > >io=3070MiB (3219MB), run=19321-19321msec>   
> > >   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s),
> > >   io=1026MiB (1076MB), run=19321-19321msec> 
> > > So looks like you are getting better performance with 9p in this case.
> > 
> > That's interesting, because I've just tried similar again with my
> > ramdisk setup:
> > 
> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> > --output=aname.txt
> > 
> > 
> > virtiofs default options
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > test: Laying out IO file (1 file / 4096MiB)
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
> >   read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
> >bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71,
> > stdev=1603.47, samples=85 iops: min=17688, max=19320, avg=18268.92,
> > stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s
> > (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw (  KiB/s): min=23128,
> > max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops   
> > : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu 
> > : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths:
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit:
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s),
> > io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s),
> > 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB),
> > run=43042-43042msec
> > 
> > virtiofs cache=none
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
> >   read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
> >bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06,
> > stdev=967.87, samples=68 iops: min=22262, max=23560, avg=22967.76,
> > stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s
> > (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw (  KiB/s): min=29264,
> > max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops   
> > : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu 
> > : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths:
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit:
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s),
> > io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s),
> > 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB),
> > run=34256-34256msec
> > 
> > virtiofs cache=none thread-pool-size=1
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
> >   read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
> >bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30,
> > stdev=4507.43, samples=66 iops: min=22452, max=27988, avg=23690.58,
> > stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s
> > 

Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Christian Schoenebeck
On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote:
> > Hi Carlos,
> > 
> > So you are running following test.
> > 
> > fio --direct=1 --gtod_reduce=1 --name=test
> > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> > 
> > And following are your results.
> > 
> > 9p
> > --
> > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s),
> > io=3070MiB (3219MB), run=14532-14532msec
> > 
> > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s),
> > io=1026MiB (1076MB), run=14532-14532msec
> > 
> > virtiofs
> > 
> > 
> > Run status group 0 (all jobs):
> >READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s),
> >io=3070MiB (3219MB), run=19321-19321msec>   
> >   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s),
> >   io=1026MiB (1076MB), run=19321-19321msec> 
> > So looks like you are getting better performance with 9p in this case.
> 
> That's interesting, because I've just tried similar again with my
> ramdisk setup:
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> --output=aname.txt
> 
> 
> virtiofs default options
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> test: Laying out IO file (1 file / 4096MiB)
> 
> test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
>   read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
>bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71,
> stdev=1603.47, samples=85 iops: min=17688, max=19320, avg=18268.92,
> stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s
> (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw (  KiB/s): min=23128,
> max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops   
> : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu 
> : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths:
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit:
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s),
> io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s),
> 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB),
> run=43042-43042msec
> 
> virtiofs cache=none
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> 
> test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
>   read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
>bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06,
> stdev=967.87, samples=68 iops: min=22262, max=23560, avg=22967.76,
> stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s
> (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw (  KiB/s): min=29264,
> max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops   
> : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu 
> : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths:
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit:
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s),
> io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s),
> 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB),
> run=34256-34256msec
> 
> virtiofs cache=none thread-pool-size=1
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> 
> test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
>   read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
>bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30,
> stdev=4507.43, samples=66 iops: min=22452, max=27988, avg=23690.58,
> stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s
> (32.4MB/s)(1026MiB/33215msec); 0 zone resets bw (  KiB/s): min=29424,
> max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66 iops   
> : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66 cpu 
> : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7 IO 

Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-25 Thread Dr. David Alan Gilbert
* Vivek Goyal (vgo...@redhat.com) wrote:
> On Thu, Sep 24, 2020 at 09:33:01PM +, Venegas Munoz, Jose Carlos wrote:
> > Hi Folks,
> > 
> > Sorry for the delay about how to reproduce `fio` data.
> > 
> > I have some code to automate testing for multiple kata configs and collect 
> > info like:
> > - Kata-env, kata configuration.toml, qemu command, virtiofsd command.
> > 
> > See: 
> > https://github.com/jcvenegas/mrunner/
> > 
> > 
> > Last time we agreed to narrow the cases and configs to compare virtiofs and 
> > 9pfs
> > 
> > The configs where the following:
> > 
> > - qemu + virtiofs(cache=auto, dax=0) a.ka. `kata-qemu-virtiofs` WITOUT xattr
> > - qemu + 9pfs a.k.a `kata-qemu`
> > 
> > Please take a look to the html and raw results I attach in this mail.
> 
> Hi Carlos,
> 
> So you are running following test.
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio 
> --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 
> --output=/output/fio.txt
> 
> And following are your results.
> 
> 9p
> --
> READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB 
> (3219MB), run=14532-14532msec
> 
> WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), 
> io=1026MiB (1076MB), run=14532-14532msec
> 
> virtiofs
> 
> Run status group 0 (all jobs):
>READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), 
> io=3070MiB (3219MB), run=19321-19321msec
>   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), 
> io=1026MiB (1076MB), run=19321-19321msec
> 
> So looks like you are getting better performance with 9p in this case.

That's interesting, because I've just tried similar again with my
ramdisk setup:

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio 
--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 
--output=aname.txt


virtiofs default options
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=psync, iodepth=64
fio-3.21
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)

test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
  read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
   bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71, 
stdev=1603.47, samples=85
   iops: min=17688, max=19320, avg=18268.92, stdev=400.86, samples=85
  write: IOPS=6102, BW=23.8MiB/s (24.0MB/s)(1026MiB/43042msec); 0 zone resets
   bw (  KiB/s): min=23128, max=25696, per=100.00%, avg=24420.40, stdev=583.08, 
samples=85
   iops: min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85
  cpu  : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s), 
io=3070MiB (3219MB), run=43042-43042msec
  WRITE: bw=23.8MiB/s (24.0MB/s), 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), 
io=1026MiB (1076MB), run=43042-43042msec

virtiofs cache=none
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=psync, iodepth=64
fio-3.21
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
  read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
   bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06, stdev=967.87, 
samples=68
   iops: min=22262, max=23560, avg=22967.76, stdev=241.97, samples=68
  write: IOPS=7667, BW=29.0MiB/s (31.4MB/s)(1026MiB/34256msec); 0 zone resets
   bw (  KiB/s): min=29264, max=32248, per=100.00%, avg=30700.82, stdev=541.97, 
samples=68
   iops: min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68
  cpu  : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s), 
io=3070MiB (3219MB), run=34256-34256msec
  WRITE: bw=29.0MiB/s (31.4MB/s), 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), 
io=1026MiB (1076MB), run=34256-34256msec

virtiofs cache=none thread-pool-size=1
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=psync, iodepth=64
fio-3.21
Starting 1 process

test: (groupid=0, jobs=1): err= 0: 

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-25 Thread Dr. David Alan Gilbert
* Vivek Goyal (vgo...@redhat.com) wrote:
> On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > > Hi,
> > >   I've been doing some of my own perf tests and I think I agree
> > > about the thread pool size;  my test is a kernel build
> > > and I've tried a bunch of different options.
> > > 
> > > My config:
> > >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> > >  5.9.0-rc4 kernel, rhel 8.2ish userspace.
> > >   5.1.0 qemu/virtiofsd built from git.
> > >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > > a kernel build.
> > > 
> > >   git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host
> > > fresh before each test.  Then log into the guest, make defconfig,
> > > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > > The numbers below are the 'real' time in the guest from the initial make
> > > (the subsequent makes dont vary much)
> > > 
> > > Below are the details of what each of these means, but here are the
> > > numbers first
> > > 
> > > virtiofsdefault        4m0.978s
> > > 9pdefault              9m41.660s
> > > virtiofscache=none     10m29.700s
> > > 9pmmappass             9m30.047s
> > > 9pmbigmsize            12m4.208s
> > > 9pmsecnone             9m21.363s
> > > virtiofscache=noneT1   7m17.494s
> > > virtiofsdefaultT1      3m43.326s
> > > 
> > > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > > the default virtiofs settings, but with --thread-pool-size=1 - so
> > > yes it gives a small benefit.
> > > But interestingly the cache=none virtiofs performance is pretty bad,
> > > but thread-pool-size=1 on that makes a BIG improvement.
> > 
> > Here are fio runs that Vivek asked me to run in my same environment
> > (there are some 0's in some of the mmap cases, and I've not investigated
> > why yet).
> 
> cache=none does not allow mmap in case of virtiofs. That's when you
> are seeing 0.
> 
> >virtiofs is looking good here in I think all of the cases;
> > there's some division over which config; cache=none
> > seems faster in some cases which surprises me.
> 
> I know cache=none is faster in case of write workloads. It forces
> direct write where we don't call file_remove_privs(). While cache=auto
> goes through file_remove_privs() and that adds a GETXATTR request to
> every WRITE request.

Can you point me to how cache=auto causes the file_remove_privs?

Dave

> Vivek
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-24 Thread Venegas Munoz, Jose Carlos
Hi Folks,

Sorry for the delay in explaining how to reproduce the `fio` data.

I have some code to automate testing for multiple kata configs and collect info 
like:
- Kata-env, kata configuration.toml, qemu command, virtiofsd command.

See: 
https://github.com/jcvenegas/mrunner/


Last time we agreed to narrow the cases and configs to compare virtiofs and 9pfs

The configs were the following:

- qemu + virtiofs (cache=auto, dax=0), a.k.a. `kata-qemu-virtiofs`, WITHOUT xattr
- qemu + 9pfs, a.k.a. `kata-qemu`

Please take a look to the html and raw results I attach in this mail.

## Can I say that the current status is:
- As David's tests and Vivek point out, for the fio workload you are using,
it seems that the best candidate should be cache=none.
   - In the comparison I took cache=auto as Vivek suggested; this makes sense
as it seems that will be the default for kata.
   - Even if for this case cache=none works better, can I assume that
cache=auto dax=0 will be better than any 9pfs config (once we find the root
cause)?

- Vivek is taking a look at mmap mode for 9pfs, to see how different it is
from the current virtiofs implementations. In 9pfs for kata, this is what we
use by default.

## I'd like to identify what should be next on the debug/testing front:

- Should I try to narrow it down by testing only with qemu?
- Should I try first with a new patch you already have?
- Probably try with qemu without the static build?
- Do the same test with thread-pool-size=1?
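
(For the last point, the relevant knob is virtiofsd's --thread-pool-size; a
sketch with illustrative paths, since kata normally builds this command line
itself from configuration.toml.)

./virtiofsd --socket-path=/run/vhostqemu.sock \
    -o source=/run/kata-containers/shared \
    -o cache=auto -o no_xattr \
    --thread-pool-size=1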

Please let me know how can I help.

Cheers.

On 22/09/20 12:47, "Vivek Goyal"  wrote:

On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > Hi,
> >   I've been doing some of my own perf tests and I think I agree
> > about the thread pool size;  my test is a kernel build
> > and I've tried a bunch of different options.
> > 
> > My config:
> >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> >  5.9.0-rc4 kernel, rhel 8.2ish userspace.
> >   5.1.0 qemu/virtiofsd built from git.
> >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > a kernel build.
> > 
> >   git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host
> > fresh before each test.  Then log into the guest, make defconfig,
> > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > The numbers below are the 'real' time in the guest from the initial make
> > (the subsequent makes dont vary much)
> > 
> > Below are the details of what each of these means, but here are the
> > numbers first
> > 
> > virtiofsdefault        4m0.978s
> > 9pdefault              9m41.660s
> > virtiofscache=none     10m29.700s
> > 9pmmappass             9m30.047s
> > 9pmbigmsize            12m4.208s
> > 9pmsecnone             9m21.363s
> > virtiofscache=noneT1   7m17.494s
> > virtiofsdefaultT1      3m43.326s
> > 
> > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > the default virtiofs settings, but with --thread-pool-size=1 - so
> > yes it gives a small benefit.
> > But interestingly the cache=none virtiofs performance is pretty bad,
> > but thread-pool-size=1 on that makes a BIG improvement.
> 
> Here are fio runs that Vivek asked me to run in my same environment
> (there are some 0's in some of the mmap cases, and I've not investigated
> why yet).

cache=none does not allow mmap in case of virtiofs. That's when you
are seeing 0.

>virtiofs is looking good here in I think all of the cases;
> there's some division over which config; cache=none
> seems faster in some cases which surprises me.

I know cache=none is faster in case of write workloads. It forces
direct write where we don't call file_remove_privs(). While cache=auto
goes through file_remove_privs() and that adds a GETXATTR request to
every WRITE request.

Vivek




results.tar.gz
Description: results.tar.gz
Title: virtiofs vs 9pfs: fio comparison

virtiofs vs 9pfs: fio comparison
- qemu + virtiofs (cache=auto, dax=0), a.k.a. kata-qemu-virtiofs
- qemu + 9pfs, a.k.a. kata-qemu

Platform: Packet c1.small.x86-01
  PROC: 1 x Intel E3-1240 v3    RAM: 32GB
  DISK: 2 x 120GB SSD           NIC: 2 x 1Gbps Bonded Port
  Nproc: 8

Env name         kata-qemu-virtiofs                      kata-qemu
Kata version     1.12.0-alpha1                           1.12.0-alpha1
Qemu version     5.0.0 (kata-static)                     5.0.0 (kata-static)
Qemu code repo   https://gitlab.com/virtio-fs/qemu.git   https://github.com/qemu/qemu
Qemu tag         qemu5.0-virtiofs-with51bits-dax         v5.0.0
Kernel code      https://gitlab.com/virtio-fs/linux.git  https://cdn.kernel.org/pub/linux/kernel/v4.x/
Kernel tag       kata-v5.6-april-09-2020                 v5.4.60
OS               18.04.2 LTS (Bionic Beaver)
Host kernel      4.15.0-50-generic #54-Ubuntu

fio workload: fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=/output/fio.txt

Results:
kata-qemu (9pfs): READ: bw=211MiB/s

virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-24 Thread Vivek Goyal
On Thu, Sep 24, 2020 at 09:33:01PM +, Venegas Munoz, Jose Carlos wrote:
> Hi Folks,
> 
> Sorry for the delay about how to reproduce `fio` data.
> 
> I have some code to automate testing for multiple kata configs and collect 
> info like:
> - Kata-env, kata configuration.toml, qemu command, virtiofsd command.
> 
> See: 
> https://github.com/jcvenegas/mrunner/
> 
> 
> Last time we agreed to narrow the cases and configs to compare virtiofs and 
> 9pfs
> 
> The configs where the following:
> 
> - qemu + virtiofs(cache=auto, dax=0) a.ka. `kata-qemu-virtiofs` WITOUT xattr
> - qemu + 9pfs a.k.a `kata-qemu`
> 
> Please take a look to the html and raw results I attach in this mail.

Hi Carlos,

So you are running following test.

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio 
--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 
--output=/output/fio.txt

And following are your results.

9p
--
READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB 
(3219MB), run=14532-14532msec

WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), 
io=1026MiB (1076MB), run=14532-14532msec

virtiofs

Run status group 0 (all jobs):
   READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=3070MiB 
(3219MB), run=19321-19321msec
  WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), 
io=1026MiB (1076MB), run=19321-19321msec

So it looks like you are getting better performance with 9p in this case.

Can you apply the "shared pool" patch to qemu's virtiofsd and re-run this
test to see if you get any better results?

In my testing, with cache=none, virtiofs performed better than 9p in 
all the fio jobs I was running. For the case of cache=auto  for virtiofs
(with xattr enabled), 9p performed better in certain write workloads. I
have identified the root cause of that problem and am working on
HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
with cache=auto and xattr enabled.

I will post my 9p and virtiofs comparison numbers next week. In the
meantime it will be great if you could apply the following qemu patch,
rebuild qemu and re-run the above test.

https://www.redhat.com/archives/virtio-fs/2020-September/msg00081.html

Also, what's the state of the file cache on the host in both cases? Are
you booting the host fresh for these tests, so that the cache is cold on
the host, or is the cache warm?
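
(If rebooting is not practical, one standard way to get a cold host cache
between runs is to drop the page cache on the host before each run; this is
a generic Linux knob, nothing virtiofs specific:

   sync
   echo 3 > /proc/sys/vm/drop_caches
)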

Thanks
Vivek




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-22 Thread Vivek Goyal
On Tue, Sep 22, 2020 at 12:09:46PM +0100, Dr. David Alan Gilbert wrote:
> 
> Do you have the numbers for:
>   epool
>   epool thread-pool-size=1
>   spool

Hi David,

Ok, I re-ran my numbers again after upgrading to latest qemu and also
upgraded host kernel to latest upstream. Apart from comparing epool,
spool and 1Thread, I also ran their numa variants. That is I launched
qemu and virtiofsd on node 0 of machine (numactl --cpunodebind=0).

Results are kind of mixed. Here are my takeaways.

- Running on same numa node improves performance overall for exclusive,
  shared and exclusive-1T mode.

- In general both shared pool and exclusive-1T mode seem to perform
  better than exclusive mode, except for the case of randwrite-libaio.
  In some cases (seqread-libaio, seqwrite-libaio, seqwrite-libaio-multi)
  exclusive pool performs better than exclusive-1T.

- Looks like in some cases exclusive-1T performs better than shared
  pool. (randwrite-libaio, randwrite-psync-multi, seqwrite-psync-multi,
  seqwrite-psync, seqread-libaio-multi, seqread-psync-multi)


Overall, I feel that both exclusive-1T and shared perform better than
the exclusive pool. Results between exclusive-1T and shared pool are mixed.
It seems like in many cases exclusive-1T performs better. I would say
that moving to "shared" pool seems like a reasonable option.

Thanks
Vivek

NAME                    WORKLOAD                Bandwidth   IOPS
vtfs-none-epool seqread-psync   38(MiB/s)   9967
vtfs-none-epool-1T  seqread-psync   66(MiB/s)   16k 
vtfs-none-spool seqread-psync   67(MiB/s)   16k 
vtfs-none-epool-numa seqread-psync   48(MiB/s)   12k 
vtfs-none-epool-1T-numa seqread-psync   74(MiB/s)   18k 
vtfs-none-spool-numa seqread-psync   74(MiB/s)   18k 

vtfs-none-epool seqread-psync-multi 204(MiB/s)  51k 
vtfs-none-epool-1T  seqread-psync-multi 325(MiB/s)  81k 
vtfs-none-spool seqread-psync-multi 271(MiB/s)  67k 
vtfs-none-epool-numa seqread-psync-multi 253(MiB/s)  63k 
vtfs-none-epool-1T-numa seqread-psync-multi 349(MiB/s)  87k 
vtfs-none-spool-numa seqread-psync-multi 301(MiB/s)  75k 

vtfs-none-epool seqread-libaio  301(MiB/s)  75k 
vtfs-none-epool-1T  seqread-libaio  273(MiB/s)  68k 
vtfs-none-spool seqread-libaio  334(MiB/s)  83k 
vtfs-none-epool-numa seqread-libaio  315(MiB/s)  78k 
vtfs-none-epool-1T-numa seqread-libaio  326(MiB/s)  81k 
vtfs-none-spool-numa seqread-libaio  335(MiB/s)  83k 

vtfs-none-epool seqread-libaio-multi 202(MiB/s)  50k 
vtfs-none-epool-1T  seqread-libaio-multi 308(MiB/s)  77k 
vtfs-none-spool seqread-libaio-multi 247(MiB/s)  61k 
vtfs-none-epool-numa seqread-libaio-multi 238(MiB/s)  59k 
vtfs-none-epool-1T-numa seqread-libaio-multi 307(MiB/s)  76k 
vtfs-none-spool-numa seqread-libaio-multi 269(MiB/s)  67k 

vtfs-none-epool randread-psync  41(MiB/s)   10k 
vtfs-none-epool-1T  randread-psync  67(MiB/s)   16k 
vtfs-none-spool randread-psync  64(MiB/s)   16k 
vtfs-none-epool-numa randread-psync  48(MiB/s)   12k 
vtfs-none-epool-1T-numa randread-psync  73(MiB/s)   18k 
vtfs-none-spool-numa randread-psync  72(MiB/s)   18k 

vtfs-none-epool randread-psync-multi 207(MiB/s)  51k 
vtfs-none-epool-1T  randread-psync-multi 313(MiB/s)  78k 
vtfs-none-spool randread-psync-multi 265(MiB/s)  66k 
vtfs-none-epool-numa randread-psync-multi 253(MiB/s)  63k 
vtfs-none-epool-1T-numa randread-psync-multi 340(MiB/s)  85k 
vtfs-none-spool-numa randread-psync-multi 305(MiB/s)  76k 

vtfs-none-epool randread-libaio 305(MiB/s)  76k 
vtfs-none-epool-1T  randread-libaio 308(MiB/s)  77k 
vtfs-none-spool randread-libaio 329(MiB/s)  82k 
vtfs-none-epool-numa randread-libaio 310(MiB/s)  77k 
vtfs-none-epool-1T-numa randread-libaio 328(MiB/s)  82k 
vtfs-none-spool-numa randread-libaio 339(MiB/s)  84k 

vtfs-none-epool randread-libaio-multi   265(MiB/s)  66k 
vtfs-none-epool-1T  randread-libaio-multi   267(MiB/s)  66k  

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-22 Thread Vivek Goyal
On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > Hi,
> >   I've been doing some of my own perf tests and I think I agree
> > about the thread pool size;  my test is a kernel build
> > and I've tried a bunch of different options.
> > 
> > My config:
> >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> >  5.9.0-rc4 kernel, rhel 8.2ish userspace.
> >   5.1.0 qemu/virtiofsd built from git.
> >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > a kernel build.
> > 
> >   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> > fresh before each test.  Then log into the guest, make defconfig,
> > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > The numbers below are the 'real' time in the guest from the initial make
> > (the subsequent makes don't vary much)
> > 
> > Below are the details of what each of these means, but here are the
> > numbers first
> > 
> > virtiofsdefault    4m0.978s
> > 9pdefault  9m41.660s
> > virtiofscache=none    10m29.700s
> > 9pmmappass 9m30.047s
> > 9pmbigmsize   12m4.208s
> > 9pmsecnone 9m21.363s
> > virtiofscache=noneT1   7m17.494s
> > virtiofsdefaultT1  3m43.326s
> > 
> > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > the default virtiofs settings, but with --thread-pool-size=1 - so
> > yes it gives a small benefit.
> > But interestingly the cache=none virtiofs performance is pretty bad,
> > but thread-pool-size=1 on that makes a BIG improvement.
> 
> Here are fio runs that Vivek asked me to run in my same environment
> (there are some 0's in some of the mmap cases, and I've not investigated
> why yet).

cache=none does not allow mmap in case of virtiofs. That's when you
are seeing 0.
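
(A quick way to confirm this, assuming the cache=none virtiofs mount from
the configs elsewhere in this thread is at /mnt: an fio job forced onto the
mmap engine should fail at mmap() time, which is why those rows report 0.

   fio --name=mmap-probe --directory=/mnt --ioengine=mmap --rw=read --bs=4k --size=16M
)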

> virtiofs is looking good here in I think all of the cases;
> there's some division over which config; cache=none
> seems faster in some cases which surprises me.

I know cache=none is faster for write workloads. It forces direct
writes, where we don't call file_remove_privs(), while cache=auto goes
through file_remove_privs(), and that adds a GETXATTR request to every
WRITE request.
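
(One way to see the extra request, assuming virtiofsd's debug output is
enabled; treat the exact debug flag as an assumption on my part: under
cache=auto each guest WRITE should be preceded by a GETXATTR for
security.capability, e.g.

   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -d 2>&1 | grep -E 'GETXATTR|WRITE'
)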

Vivek




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-22 Thread Dr. David Alan Gilbert
* Vivek Goyal (vgo...@redhat.com) wrote:
> On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > Hi All,
> > 
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> > 
> > I ran virtiofs-tests.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests
> 
> I spent more time debugging this. First thing I noticed is that we
> are using "exclusive" glib thread pool.
> 
> https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new
> 
> This seems to run pre-determined number of threads dedicated to that
> thread pool. Little instrumentation of code revealed that every new
> request gets assigned to a new thread (despite the fact that the previous
> thread finished its job). So internally there might be some kind of
> round robin policy to choose next thread for running the job.
> 
> I decided to switch to "shared" pool instead where it seemed to spin
> up new threads only if there is enough work. Also threads can be shared
> between pools.
> 
> And looks like testing results are way better with "shared" pools. So
> maybe we should switch to the shared pool by default. (Till somebody shows
> in what cases exclusive pools are better).
> 
> Second thought which came to mind was what's the impact of NUMA. What
> if qemu and virtiofsd process/threads are running on separate NUMA
> node. That should increase memory access latency and add overhead.
> So I used "numactl --cpubind=0" to bind both qemu and virtiofsd to node
> 0. My machine seems to have two numa nodes. (Each node is having 32
> logical processors). Keeping both qemu and virtiofsd on same node
> improves throughput further.
> 
> So here are the results.
> 
> vtfs-none-epool --> cache=none, exclusive thread pool.
> vtfs-none-spool --> cache=none, shared thread pool.
> vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node

Do you have the numbers for:
   epool
   epool thread-pool-size=1
   spool

?

Dave

> 
> NAME                    WORKLOAD                Bandwidth   IOPS
>   
> vtfs-none-epool seqread-psync   36(MiB/s)   9392  
>   
> vtfs-none-spool seqread-psync   68(MiB/s)   17k   
>   
> vtfs-none-spool-numa seqread-psync   73(MiB/s)   18k 
>   
> 
> vtfs-none-epool seqread-psync-multi 210(MiB/s)  52k   
>   
> vtfs-none-spool seqread-psync-multi 260(MiB/s)  65k   
>   
> vtfs-none-spool-numa seqread-psync-multi 309(MiB/s)  77k 
>   
> 
> vtfs-none-epool seqread-libaio  286(MiB/s)  71k   
>   
> vtfs-none-spool seqread-libaio  328(MiB/s)  82k   
>   
> vtfs-none-spool-numa seqread-libaio  332(MiB/s)  83k 
>   
> 
> vtfs-none-epool seqread-libaio-multi 201(MiB/s)  50k 
> 
> vtfs-none-spool seqread-libaio-multi 254(MiB/s)  63k 
> 
> vtfs-none-spool-numa seqread-libaio-multi 276(MiB/s)  69k 
>   
> 
> vtfs-none-epool randread-psync  40(MiB/s)   10k   
>   
> vtfs-none-spool randread-psync  64(MiB/s)   16k   
>   
> vtfs-none-spool-numa randread-psync  72(MiB/s)   18k 
>   
> 
> vtfs-none-epool randread-psync-multi 211(MiB/s)  52k 
> 
> vtfs-none-spool randread-psync-multi 252(MiB/s)  63k 
> 
> vtfs-none-spool-numa randread-psync-multi 297(MiB/s)  74k 
>   
> 
> vtfs-none-epool randread-libaio 313(MiB/s)  78k   
>   
> vtfs-none-spool randread-libaio 320(MiB/s)  80k   
>   
> vtfs-none-spool-numa randread-libaio 330(MiB/s)  82k 
>   
> 
> vtfs-none-epool randread-libaio-multi   257(MiB/s)  64k   
>   
> vtfs-none-spool randread-libaio-multi   274(MiB/s)  68k   
>   
> vtfs-none-spool-numa randread-libaio-multi   319(MiB/s)  79k 
>   
> 
> vtfs-none-epool seqwrite-psync  34(MiB/s)   8926  
>   
> vtfs-none-spool seqwrite-psync  55(MiB/s)   13k   
>   
> vtfs-none-spool-numa seqwrite-psync  66(MiB/s)   16k 
>   
> 
> vtfs-none-epool seqwrite-psync-multi 196(MiB/s)  49k 
> 
> vtfs-none-spool seqwrite-psync-multi 225(MiB/s)  56k 
> 
> vtfs-none-spool-numa seqwrite-psync-multi 270(MiB/s)  67k 
>   
> 
> vtfs-none-epool seqwrite-libaio 257(MiB/s)  64k   
>   
> vtfs-none-spool seqwrite-libaio 304(MiB/s)  76k   
>   
> vtfs-none-spool-numa seqwrite-libaio 267(MiB/s)  66k 
>   
> 
> 

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-22 Thread Dr. David Alan Gilbert
* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> Hi,
>   I've been doing some of my own perf tests and I think I agree
> about the thread pool size;  my test is a kernel build
> and I've tried a bunch of different options.
> 
> My config:
>   Host: 16 core AMD EPYC (32 thread), 128G RAM,
>  5.9.0-rc4 kernel, rhel 8.2ish userspace.
>   5.1.0 qemu/virtiofsd built from git.
>   Guest: Fedora 32 from cloud image with just enough extra installed for
> a kernel build.
> 
>   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> fresh before each test.  Then log into the guest, make defconfig,
> time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> The numbers below are the 'real' time in the guest from the initial make
> (the subsequent makes don't vary much)
> 
> Below are the details of what each of these means, but here are the
> numbers first
> 
> virtiofsdefault    4m0.978s
> 9pdefault  9m41.660s
> virtiofscache=none    10m29.700s
> 9pmmappass 9m30.047s
> 9pmbigmsize   12m4.208s
> 9pmsecnone 9m21.363s
> virtiofscache=noneT1   7m17.494s
> virtiofsdefaultT1  3m43.326s
> 
> So the winner there by far is the 'virtiofsdefaultT1' - that's
> the default virtiofs settings, but with --thread-pool-size=1 - so
> yes it gives a small benefit.
> But interestingly the cache=none virtiofs performance is pretty bad,
> but thread-pool-size=1 on that makes a BIG improvement.

Here are fio runs that Vivek asked me to run in my same environment
(there are some 0's in some of the mmap cases, and I've not investigated
why yet). virtiofs is looking good here in I think all of the cases;
there's some division over which config; cache=none
seems faster in some cases which surprises me.

Dave


NAME                    WORKLOAD                Bandwidth   IOPS
9pbigmsize  seqread-psync   108(MiB/s)  27k 
9pdefault   seqread-psync   105(MiB/s)  26k 
9pmmappass  seqread-psync   107(MiB/s)  26k 
9pmsecnone  seqread-psync   107(MiB/s)  26k 
virtiofscachenoneT1 seqread-psync   135(MiB/s)  33k 
virtiofscachenone   seqread-psync   115(MiB/s)  28k 
virtiofsdefaultT1   seqread-psync   2465(MiB/s) 616k
virtiofsdefault seqread-psync   2468(MiB/s) 617k

9pbigmsize  seqread-psync-multi 357(MiB/s)  89k 
9pdefault   seqread-psync-multi 358(MiB/s)  89k 
9pmmappass  seqread-psync-multi 347(MiB/s)  86k 
9pmsecnone  seqread-psync-multi 364(MiB/s)  91k 
virtiofscachenoneT1 seqread-psync-multi 479(MiB/s)  119k
virtiofscachenone   seqread-psync-multi 385(MiB/s)  96k 
virtiofsdefaultT1   seqread-psync-multi 5916(MiB/s) 1479k   
virtiofsdefault seqread-psync-multi 8771(MiB/s) 2192k   

9pbigmsize  seqread-mmap    111(MiB/s)  27k 
9pdefault   seqread-mmap    101(MiB/s)  25k 
9pmmappass  seqread-mmap    114(MiB/s)  28k 
9pmsecnone  seqread-mmap    107(MiB/s)  26k 
virtiofscachenoneT1 seqread-mmap    0(KiB/s)    0   
virtiofscachenone   seqread-mmap    0(KiB/s)    0   
virtiofsdefaultT1   seqread-mmap    2896(MiB/s) 724k
virtiofsdefault seqread-mmap    2856(MiB/s) 714k

9pbigmsize  seqread-mmap-multi  364(MiB/s)  91k 
9pdefault   seqread-mmap-multi  348(MiB/s)  87k 
9pmmappass  seqread-mmap-multi  354(MiB/s)  88k 
9pmsecnone  seqread-mmap-multi  340(MiB/s)  85k 
virtiofscachenoneT1 seqread-mmap-multi  0(KiB/s)    0   
virtiofscachenone   seqread-mmap-multi  0(KiB/s)    0   
virtiofsdefaultT1   seqread-mmap-multi  6057(MiB/s) 1514k   
virtiofsdefault seqread-mmap-multi  9585(MiB/s) 2396k   

9pbigmsize  seqread-libaio  109(MiB/s)  27k 
9pdefault   seqread-libaio  103(MiB/s)  25k 
9pmmappass  seqread-libaio  107(MiB/s)  26k 
9pmsecnone  seqread-libaio  107(MiB/s)  26k 
virtiofscachenoneT1 seqread-libaio  671(MiB/s)  167k
virtiofscachenone   seqread-libaio  538(MiB/s)  134k
virtiofsdefaultT1   seqread-libaio  

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Vivek Goyal
On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> Hi All,
> 
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
> 
> I ran virtiofs-tests.
> 
> https://github.com/rhvgoyal/virtiofs-tests

I spent more time debugging this. First thing I noticed is that we
are using "exclusive" glib thread pool.

https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new

This seems to run pre-determined number of threads dedicated to that
thread pool. Little instrumentation of code revealed that every new
request gets assigned to a new thread (despite the fact that the previous
thread finished its job). So internally there might be some kind of
round robin policy to choose next thread for running the job.

I decided to switch to "shared" pool instead where it seemed to spin
up new threads only if there is enough work. Also threads can be shared
between pools.

And looks like testing results are way better with "shared" pools. So
maybe we should switch to the shared pool by default. (Till somebody shows
in what cases exclusive pools are better).

Second thought which came to mind was what's the impact of NUMA. What
if qemu and virtiofsd process/threads are running on separate NUMA
node. That should increase memory access latency and add overhead.
So I used "numactl --cpubind=0" to bind both qemu and virtiofsd to node
0. My machine seems to have two numa nodes. (Each node is having 32
logical processors). Keeping both qemu and virtiofsd on same node
improves throughput further.
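
For reference, the *-numa rows below were produced by wrapping both
processes in the same binding, roughly like this (the virtiofsd arguments
are the ones used elsewhere in this thread; qemu is launched the same way
under numactl):

   numactl --cpunodebind=0 ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none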

So here are the results.

vtfs-none-epool --> cache=none, exclusive thread pool.
vtfs-none-spool --> cache=none, shared thread pool.
vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node


NAME                    WORKLOAD                Bandwidth   IOPS
vtfs-none-epool seqread-psync   36(MiB/s)   9392
vtfs-none-spool seqread-psync   68(MiB/s)   17k 
vtfs-none-spool-numa seqread-psync   73(MiB/s)   18k 

vtfs-none-epool seqread-psync-multi 210(MiB/s)  52k 
vtfs-none-spool seqread-psync-multi 260(MiB/s)  65k 
vtfs-none-spool-numa seqread-psync-multi 309(MiB/s)  77k 

vtfs-none-epool seqread-libaio  286(MiB/s)  71k 
vtfs-none-spool seqread-libaio  328(MiB/s)  82k 
vtfs-none-spool-numa seqread-libaio  332(MiB/s)  83k 

vtfs-none-epool seqread-libaio-multi 201(MiB/s)  50k 
vtfs-none-spool seqread-libaio-multi 254(MiB/s)  63k 
vtfs-none-spool-numa seqread-libaio-multi 276(MiB/s)  69k 

vtfs-none-epool randread-psync  40(MiB/s)   10k 
vtfs-none-spool randread-psync  64(MiB/s)   16k 
vtfs-none-spool-numa randread-psync  72(MiB/s)   18k 

vtfs-none-epool randread-psync-multi 211(MiB/s)  52k 
vtfs-none-spool randread-psync-multi 252(MiB/s)  63k 
vtfs-none-spool-numa randread-psync-multi 297(MiB/s)  74k 

vtfs-none-epool randread-libaio 313(MiB/s)  78k 
vtfs-none-spool randread-libaio 320(MiB/s)  80k 
vtfs-none-spool-numa randread-libaio 330(MiB/s)  82k 

vtfs-none-epool randread-libaio-multi   257(MiB/s)  64k 
vtfs-none-spool randread-libaio-multi   274(MiB/s)  68k 
vtfs-none-spool-numa randread-libaio-multi   319(MiB/s)  79k 

vtfs-none-epool seqwrite-psync  34(MiB/s)   8926
vtfs-none-spool seqwrite-psync  55(MiB/s)   13k 
vtfs-none-spool-numa seqwrite-psync  66(MiB/s)   16k 

vtfs-none-epool seqwrite-psync-multi 196(MiB/s)  49k 
vtfs-none-spool seqwrite-psync-multi 225(MiB/s)  56k 
vtfs-none-spool-numa seqwrite-psync-multi 270(MiB/s)  67k 

vtfs-none-epool seqwrite-libaio 257(MiB/s)  64k 
vtfs-none-spool seqwrite-libaio 304(MiB/s)  76k 
vtfs-none-spool-numa seqwrite-libaio 267(MiB/s)  66k 

vtfs-none-epool seqwrite-libaio-multi   312(MiB/s)  78k 
vtfs-none-spool seqwrite-libaio-multi   366(MiB/s)  91k 
vtfs-none-spool-numa seqwrite-libaio-multi   381(MiB/s)  95k 

vtfs-none-epool randwrite-psync 38(MiB/s)   9745
vtfs-none-spool randwrite-psync 55(MiB/s)   13k 

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Stefan Hajnoczi
On Mon, Sep 21, 2020 at 09:39:44AM -0400, Vivek Goyal wrote:
> On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> > On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > > And here are the comparison results. To me it seems that by default
> > > we should switch to 1 thread (Till we can figure out how to make
> > > multi thread performance better even when single process is doing
> > > I/O in client).
> > 
> > Let's understand the reason before making changes.
> > 
> > Questions:
> >  * Is "1-thread" --thread-pool-size=1?
> 
> Yes.

Okay, I wanted to make sure 1-thread is still going through the glib
thread pool. So it's the same code path regardless of the
--thread-pool-size= value. This suggests the performance issue is
related to timing side-effects like lock contention, thread scheduling,
etc.
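
(If scheduling effects are suspected, something like perf sched, run on the
host while fio is active, can show per-thread wakeup/latency behaviour; this
is a generic suggestion, not something tied to virtiofsd:

   perf sched record -- sleep 10
   perf sched latency --sort max
)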

> >  * How do the kvm_stat vmexit counters compare?
> 
> This should be the same, shouldn't it? Changing the number of threads
> serving requests should not change the number of vmexits.

There is batching at the virtio and eventfd levels. I'm not sure if it's
coming into play here but you would see it by comparing vmexits and
eventfd reads. Having more threads can increase the number of
notifications and completion interrupts, which can make overall
performance worse in some cases.
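
(A rough way to compare the two runs would be to count exits host-wide via
the kvm tracepoints while fio is running, once per --thread-pool-size
setting; this is just one possible method:

   perf stat -e 'kvm:kvm_exit' -a -- sleep 30
)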

> >  * How does host mpstat -P ALL compare?
> 
> Never used mpstat. Will try running it and see if I can get something
> meaningful.

Tools like top, vmstat, etc can give similar information. I'm wondering
what the host CPU utilization (guest/sys/user) looks like.
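
(For example, assuming sysstat is installed on the host:

   mpstat -P ALL 1 30    # per-CPU usr/sys/iowait/idle, 1-second samples for 30 seconds
)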

> But I suspect it has to do with the thread pool implementation and possibly
> extra cacheline bouncing.

I think perf can record cacheline bounces if you want to check.
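
(Something along these lines, host-wide while the benchmark runs:

   perf c2c record -a -- sleep 30
   perf c2c report
)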

Stefan




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Dr. David Alan Gilbert
Hi,
  I've been doing some of my own perf tests and I think I agree
about the thread pool size;  my test is a kernel build
and I've tried a bunch of different options.

My config:
  Host: 16 core AMD EPYC (32 thread), 128G RAM,
 5.9.0-rc4 kernel, rhel 8.2ish userspace.
  5.1.0 qemu/virtiofsd built from git.
  Guest: Fedora 32 from cloud image with just enough extra installed for
a kernel build.

  git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
fresh before each test.  Then log into the guest, make defconfig,
time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
The numbers below are the 'real' time in the guest from the initial make
(the subsequent makes don't vary much)

Below are the details of what each of these means, but here are the
numbers first

virtiofsdefault    4m0.978s
9pdefault  9m41.660s
virtiofscache=none    10m29.700s
9pmmappass 9m30.047s
9pmbigmsize   12m4.208s
9pmsecnone 9m21.363s
virtiofscache=noneT1   7m17.494s
virtiofsdefaultT1  3m43.326s

So the winner there by far is the 'virtiofsdefaultT1' - that's
the default virtiofs settings, but with --thread-pool-size=1 - so
yes it gives a small benefit.
But interestingly the cache=none virtiofs performance is pretty bad,
but thread-pool-size=1 on that makes a BIG improvement.


virtiofsdefault:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 
-cpu host -m 32G,maxmem=64G,slots=1 -object 
memory-backend-memfd,id=mem,size=32G,share=on -drive 
if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev 
socket,id=char0,path=/tmp/vhostqemu -device 
vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pdefault
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G 
-drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs 
local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=none
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o 
cache=none
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 
-cpu host -m 32G,maxmem=64G,slots=1 -object 
memory-backend-memfd,id=mem,size=32G,share=on -drive 
if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev 
socket,id=char0,path=/tmp/vhostqemu -device 
vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pmmappass
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G 
-drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs 
local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap

9pmbigmsize
   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G 
-drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs 
local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
   mount -t 9p -o trans=virtio kernel /mnt 
-oversion=9p2000.L,cache=mmap,msize=1048576

9pmsecnone
   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G 
-drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs 
local,path=/dev/shm/linux,mount_tag=kernel,security_model=none
   mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=noneT1
   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o 
cache=none --thread-pool-size=1
   mount -t virtiofs kernel /mnt

virtiofsdefaultT1
   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux 
--thread-pool-size=1
./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 
8 -cpu host -m 32G,maxmem=64G,slots=1 -object 
memory-backend-memfd,id=mem,size=32G,share=on -drive 
if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev 
socket,id=char0,path=/tmp/vhostqemu -device 
vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Daniel P . Berrangé
On Mon, Sep 21, 2020 at 09:35:16AM -0400, Vivek Goyal wrote:
> On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgo...@redhat.com) wrote:
> > > Hi All,
> > > 
> > > virtiofsd default thread pool size is 64. To me it feels that in most of
> > > the cases thread pool size 1 performs better than thread pool size 64.
> > > 
> > > I ran virtiofs-tests.
> > > 
> > > https://github.com/rhvgoyal/virtiofs-tests
> > > 
> > > And here are the comparison results. To me it seems that by default
> > > we should switch to 1 thread (Till we can figure out how to make
> > > multi thread performance better even when single process is doing
> > > I/O in client).
> > > 
> > > I am especially more interested in getting performance better for
> > > single process in client. If that suffers, then it is pretty bad.
> > > 
> > > Especially look at randread, randwrite, seqwrite performance. seqread
> > > seems pretty good anyway.
> > > 
> > > If I don't run the whole test suite and just run the randread-psync job,
> > > my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> > > jump I would say.
> > > 
> > > Thoughts?
> > 
> > What's your host setup; how many cores has the host got and how many did
> > you give the guest?
> 
> Got 2 processors on host with 16 cores in each processor. With
> hyperthreading enabled, it makes 32 logical cores on each processor and
> that makes 64 logical cores on host.
> 
> I have given 32 to guest.

FWIW, I'd be inclined to disable hyperthreading in the BIOS for one
test to validate whether it is impacting performance results seen.
Hyperthreads are weak compared to a real CPU, and could result in
misleading data even if you are limiting your guest to 1/2 the host
logical CPUs.
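
(On recent kernels this can also be toggled at runtime instead of in the
BIOS, assuming the host kernel exposes the sysfs SMT control interface:

   cat /sys/devices/system/cpu/smt/active      # 1 = SMT currently on
   echo off > /sys/devices/system/cpu/smt/control
)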

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Vivek Goyal
On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (Till we can figure out how to make
> > multi thread performance better even when single process is doing
> > I/O in client).
> 
> Let's understand the reason before making changes.
> 
> Questions:
>  * Is "1-thread" --thread-pool-size=1?

Yes.

>  * Was DAX enabled?

No.

>  * How does cache=none perform?

I just ran random read workload with cache=none.

cache-none  randread-psync  45(MiB/s)   11k 
cache-none-1-thread randread-psync  63(MiB/s)   15k

With 1 thread it offers more IOPS.
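
(For reference, the two rows above differ only in the virtiofsd invocation,
essentially the command lines shown elsewhere in this thread:

   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none
   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none --thread-pool-size=1
)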

>  * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s:
>Queue %d gave evalue: %zx available: in: %u out: %u\n") in
>fv_queue_thread help?

Will try that.

>  * How do the kvm_stat vmexit counters compare?

This should be the same, shouldn't it? Changing the number of threads
serving requests should not change the number of vmexits.

>  * How does host mpstat -P ALL compare?

Never used mpstat. Will try running it and see if I can get something
meaningful.

>  * How does host perf record -a compare?

Will try it. I feel this might be too big and too verbose to get
something meaningful.

>  * Does the Rust virtiofsd show the same pattern (it doesn't use glib
>thread pools)?

No idea. Never tried rust implementation of virtiofsd.

But I suspect it has to do with the thread pool implementation and possibly
extra cacheline bouncing.

Thanks
Vivek




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Vivek Goyal
On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Hi All,
> > 
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> > 
> > I ran virtiofs-tests.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests
> > 
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (Till we can figure out how to make
> > multi thread performance better even when single process is doing
> > I/O in client).
> > 
> > I am especially more interested in getting performance better for
> > single process in client. If that suffers, then it is pretty bad.
> > 
> > Especially look at randread, randwrite, seqwrite performance. seqread
> > seems pretty good anyway.
> > 
> > If I don't run the whole test suite and just run the randread-psync job,
> > my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> > jump I would say.
> > 
> > Thoughts?
> 
> What's your host setup; how many cores has the host got and how many did
> you give the guest?

Got 2 processors on host with 16 cores in each processor. With
hyperthreading enabled, it makes 32 logical cores on each processor and
that makes 64 logical cores on host.

I have given 32 to guest.

Vivek




Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Dr. David Alan Gilbert
* Vivek Goyal (vgo...@redhat.com) wrote:
> Hi All,
> 
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
> 
> I ran virtiofs-tests.
> 
> https://github.com/rhvgoyal/virtiofs-tests
> 
> And here are the comparison results. To me it seems that by default
> we should switch to 1 thread (Till we can figure out how to make
> multi thread performance better even when single process is doing
> I/O in client).
> 
> I am especially more interested in getting performance better for
> single process in client. If that suffers, then it is pretty bad.
> 
> Especially look at randread, randwrite, seqwrite performance. seqread
> seems pretty good anyway.
> 
> If I don't run the whole test suite and just run the randread-psync job,
> my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> jump I would say.
> 
> Thoughts?

What's your host setup; how many cores has the host got and how many did
you give the guest?

Dave

> Thanks
> Vivek
> 
> 
> NAME                WORKLOAD                Bandwidth   IOPS
>   
> cache-auto  seqread-psync   690(MiB/s)  172k  
>   
> cache-auto-1-thread seqread-psync   729(MiB/s)  182k  
>   
> 
> cache-auto  seqread-psync-multi 2578(MiB/s) 644k  
>   
> cache-auto-1-thread seqread-psync-multi 2597(MiB/s) 649k  
>   
> 
> cache-auto  seqread-mmap    660(MiB/s)  165k
> 
> cache-auto-1-thread seqread-mmap    672(MiB/s)  168k
>   
> 
> cache-auto  seqread-mmap-multi  2499(MiB/s) 624k  
>   
> cache-auto-1-thread seqread-mmap-multi  2618(MiB/s) 654k  
>   
> 
> cache-auto  seqread-libaio  286(MiB/s)  71k   
>   
> cache-auto-1-thread seqread-libaio  260(MiB/s)  65k   
>   
> 
> cache-auto  seqread-libaio-multi    1508(MiB/s) 377k
> 
> cache-auto-1-thread seqread-libaio-multi    986(MiB/s)  246k
>   
> 
> cache-auto  randread-psync  35(MiB/s)   9191  
>   
> cache-auto-1-thread randread-psync  55(MiB/s)   13k   
>   
> 
> cache-auto  randread-psync-multi    179(MiB/s)  44k
> 
> cache-auto-1-thread randread-psync-multi    209(MiB/s)  52k
>   
> 
> cache-auto  randread-mmap   32(MiB/s)   8273  
>   
> cache-auto-1-thread randread-mmap   50(MiB/s)   12k   
>   
> 
> cache-auto  randread-mmap-multi 161(MiB/s)  40k   
>   
> cache-auto-1-thread randread-mmap-multi 185(MiB/s)  46k   
>   
> 
> cache-auto  randread-libaio 268(MiB/s)  67k   
>   
> cache-auto-1-thread randread-libaio 254(MiB/s)  63k   
>   
> 
> cache-auto  randread-libaio-multi   256(MiB/s)  64k   
>   
> cache-auto-1-thread randread-libaio-multi   155(MiB/s)  38k   
>   
> 
> cache-auto  seqwrite-psync  23(MiB/s)   6026  
>   
> cache-auto-1-thread seqwrite-psync  30(MiB/s)   7925  
>   
> 
> cache-auto  seqwrite-psync-multi    100(MiB/s)  25k
> 
> cache-auto-1-thread seqwrite-psync-multi    154(MiB/s)  38k
>   
> 
> cache-auto  seqwrite-mmap   343(MiB/s)  85k   
>   
> cache-auto-1-thread seqwrite-mmap   355(MiB/s)  88k   
>   
> 
> cache-auto  seqwrite-mmap-multi 408(MiB/s)  102k  
>   
> cache-auto-1-thread seqwrite-mmap-multi 438(MiB/s)  109k  
>   
> 
> cache-auto  seqwrite-libaio 41(MiB/s)   10k   
>   
> cache-auto-1-thread seqwrite-libaio 65(MiB/s)   16k   
>   
> 
> cache-auto  seqwrite-libaio-multi   137(MiB/s)  34k   
>   
> cache-auto-1-thread seqwrite-libaio-multi   214(MiB/s)  53k   
>   
> 
> cache-auto  randwrite-psync 22(MiB/s)   5801  
>   
> cache-auto-1-thread randwrite-psync 30(MiB/s)   7927  
>   
> 
> cache-auto  randwrite-psync-multi   100(MiB/s)  25k   
>   
> cache-auto-1-thread randwrite-psync-multi   151(MiB/s)  37k   
>   
> 
> cache-auto  randwrite-mmap  31(MiB/s)   7984  
>   
> cache-auto-1-thread randwrite-mmap  55(MiB/s)   13k   
>   
> 
> cache-auto  randwrite-mmap-multi    124(MiB/s)  31k
> 
> cache-auto-1-thread randwrite-mmap-multi    213(MiB/s)  53k
>   
> 
> cache-auto

Re: tools/virtiofs: Multi threading seems to hurt performance

2020-09-21 Thread Stefan Hajnoczi
On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> And here are the comparison results. To me it seems that by default
> we should switch to 1 thread (Till we can figure out how to make
> multi thread performance better even when single process is doing
> I/O in client).

Let's understand the reason before making changes.

Questions:
 * Is "1-thread" --thread-pool-size=1?
 * Was DAX enabled?
 * How does cache=none perform?
 * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s:
   Queue %d gave evalue: %zx available: in: %u out: %u\n") in
   fv_queue_thread help?
 * How do the kvm_stat vmexit counters compare?
 * How does host mpstat -P ALL compare?
 * How does host perf record -a compare?
 * Does the Rust virtiofsd show the same pattern (it doesn't use glib
   thread pools)?

Stefan

> NAME                WORKLOAD                Bandwidth   IOPS
>   
> cache-auto  seqread-psync   690(MiB/s)  172k  
>   
> cache-auto-1-thread seqread-psync   729(MiB/s)  182k  
>   
> 
> cache-auto  seqread-psync-multi 2578(MiB/s) 644k  
>   
> cache-auto-1-thread seqread-psync-multi 2597(MiB/s) 649k  
>   
> 
> cache-auto  seqread-mmap    660(MiB/s)  165k
> 
> cache-auto-1-thread seqread-mmap    672(MiB/s)  168k
>   
> 
> cache-auto  seqread-mmap-multi  2499(MiB/s) 624k  
>   
> cache-auto-1-thread seqread-mmap-multi  2618(MiB/s) 654k  
>   
> 
> cache-auto  seqread-libaio  286(MiB/s)  71k   
>   
> cache-auto-1-thread seqread-libaio  260(MiB/s)  65k   
>   
> 
> cache-auto  seqread-libaio-multi    1508(MiB/s) 377k
> 
> cache-auto-1-thread seqread-libaio-multi    986(MiB/s)  246k
>   
> 
> cache-auto  randread-psync  35(MiB/s)   9191  
>   
> cache-auto-1-thread randread-psync  55(MiB/s)   13k   
>   
> 
> cache-auto  randread-psync-multi    179(MiB/s)  44k
> 
> cache-auto-1-thread randread-psync-multi    209(MiB/s)  52k
>   
> 
> cache-auto  randread-mmap   32(MiB/s)   8273  
>   
> cache-auto-1-thread randread-mmap   50(MiB/s)   12k   
>   
> 
> cache-auto  randread-mmap-multi 161(MiB/s)  40k   
>   
> cache-auto-1-thread randread-mmap-multi 185(MiB/s)  46k   
>   
> 
> cache-auto  randread-libaio 268(MiB/s)  67k   
>   
> cache-auto-1-thread randread-libaio 254(MiB/s)  63k   
>   
> 
> cache-auto  randread-libaio-multi   256(MiB/s)  64k   
>   
> cache-auto-1-thread randread-libaio-multi   155(MiB/s)  38k   
>   
> 
> cache-auto  seqwrite-psync  23(MiB/s)   6026  
>   
> cache-auto-1-thread seqwrite-psync  30(MiB/s)   7925  
>   
> 
> cache-auto  seqwrite-psync-multi    100(MiB/s)  25k
> 
> cache-auto-1-thread seqwrite-psync-multi    154(MiB/s)  38k
>   
> 
> cache-auto  seqwrite-mmap   343(MiB/s)  85k   
>   
> cache-auto-1-thread seqwrite-mmap   355(MiB/s)  88k   
>   
> 
> cache-auto  seqwrite-mmap-multi 408(MiB/s)  102k  
>   
> cache-auto-1-thread seqwrite-mmap-multi 438(MiB/s)  109k  
>   
> 
> cache-auto  seqwrite-libaio 41(MiB/s)   10k   
>   
> cache-auto-1-thread seqwrite-libaio 65(MiB/s)   16k   
>   
> 
> cache-auto  seqwrite-libaio-multi   137(MiB/s)  34k   
>   
> cache-auto-1-thread seqwrite-libaio-multi   214(MiB/s)  53k   
>   
> 
> cache-auto  randwrite-psync 22(MiB/s)   5801  
>   
> cache-auto-1-thread randwrite-psync 30(MiB/s)   7927  
>   
> 
> cache-auto  randwrite-psync-multi   100(MiB/s)  25k   
>   
> cache-auto-1-thread randwrite-psync-multi   151(MiB/s)  37k   
>   
> 
> cache-auto  randwrite-mmap  31(MiB/s)   7984  
>   
> cache-auto-1-thread randwrite-mmap  55(MiB/s)   13k   
>   
> 
> cache-auto  randwrite-mmap-multi    124(MiB/s)  31k
> 
> cache-auto-1-thread randwrite-mmap-multi    213(MiB/s)  53k
>   
> 
> cache-auto  randwrite-libaio    40(MiB/s)   10k
> 
> cache-auto-1-thread randwrite-libaio    64(MiB/s)   16k
>   
> 
> cache-auto  randwrite-libaio-multi  139(MiB/s)