Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-20 Thread Jason Wang


On 2020/7/20 7:16 PM, Eugenio Pérez wrote:

On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin  wrote:

On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:

On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin  wrote:

On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

How about playing with the batch size? Make it a mod parameter instead
of the hard coded 64, and measure for all values 1 to 64 ...

Right, according to the test result, 64 seems to be too aggressive in
the case of TX.


Got it, thanks both!

In particular I wonder whether with batch size 1
we get same performance as without batching
(would indicate 64 is too aggressive)
or not (would indicate one of the code changes
affects performance in an unexpected way).

--
MST


Hi!

Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

sorry this is not what I meant.

I mean something like this:


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0b509be8d7b1..b94680e5721d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
 handle_rx(net);
  }

+static int batch_num = 0;
+module_param(batch_num, int, 0644);
+MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");
+
  static int vhost_net_open(struct inode *inode, struct file *f)
  {
 struct vhost_net *n;
@@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 vhost_net_buf_init(&n->vqs[i].rxq);
 }
 vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-  UIO_MAXIOV + VHOST_NET_BATCH,
+  UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
NULL);


then you can try tweaking batching and playing with mod parameter without
recompiling.
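
(Editorial side note, an assumption rather than something stated in the mail: since the parameter is created with mode 0644, it should also be writable at runtime via /sys/module/vhost_net/parameters/batch_num, but in this sketch the value is only read in vhost_net_open(), so a change would presumably only affect devices opened afterwards.)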


VHOST_NET_BATCH affects lots of other things.


Ok, got it. Since they were aligned from the start, I thought it was a good
idea to keep them in sync.


and testing
the pps as the previous mail says. This means that we have either only
vhost_net batching (in the base testing, as before applying this
patch) or both batch sizes set to the same value.

I've checked that the vhost process (and pktgen) also goes to 100% CPU.

For tx: Batching always decreases the performance, in all cases. Not
sure why bufapi made things better the last time.

Batching makes improvements up to 64 bufs: I see increments of pps, but only around 1%.

For rx: Batching always improves performance. It seems that with a small
batch size, bufapi decreases performance, but beyond 64, bufapi is
much better. The bufapi version keeps improving until I set a batch size
of 1024. So I guess it is very good to have a bunch of buffers available to
receive.

Since with this test I cannot disable event_idx or things like that,
what would be the next step for testing?

Thanks!

--
Results:
# Buf size: 1,16,32,64,128,256,512

# Tx
# ===
# Base
2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
# Batch
2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
# Batch + Bufapi
2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

# Rx
# ==
# pktgen results (pps)
1223275,1668868,1728794,1769261,1808574,1837252,1846436
1456924,1797901,1831234,1868746,1877508,1931598,1936402
1368923,1719716,1794373,1865170,1884803,1916021,1975160

# Testpmd pps results
1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

pktgen was run again for rx with 1024 and 2048 buf size, giving
1988760.75 and 1978316 pps. Testpmd goes the same way.

Don't really understand what this data means.
Which number of descs is batched for each run?


Sorry, I should have explained better. I will expand here, but feel free to
skip it since we are going to discard the data anyway. Or feel free to propose
a better way to present it.

It is a CSV with the values I've obtained, in pps, from pktgen and testpmd. This
way it is easy to plot them.

Maybe it is easier to read as tables, if mail readers/Gmail do not misalign them.


# Tx
# ===

Base: the previous code, without integrating any patch. testpmd is in txonly
mode, and the tap interface drops everything with XDP_DROP.
We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put it in a previous mail:

TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP
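
(For reference, "XDP_DROP on TAP" presumably means a minimal XDP program along
these lines attached to the tap device, so every frame is dropped as early as
possible and only the virtio/vhost TX path is measured. This is an illustrative
sketch, not part of the original mail.)

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Illustrative sketch: drop every frame on the tap interface.
 * Build e.g. with: clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;        /* drop unconditionally */
}

char _license[] SEC("license") = "GPL";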


          1 |          16 |          32 |          64 |         128 |         256 |         512 |
2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 |     3689820 |

If we add the batching part of the series, but not the bufapi:

          1 |          16 |          32 |          64 |         128 |         256 |         512 |
2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 |   3431042.5 | 3440722.286 |

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-20 Thread Michael S. Tsirkin
On Mon, Jul 20, 2020 at 01:16:47PM +0200, Eugenio Pérez wrote:
> 
> On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin  wrote:
> > On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:
> > > On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin  
> > > wrote:
> > > > On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
> > > > > > > How about playing with the batch size? Make it a mod parameter 
> > > > > > > instead
> > > > > > > of the hard coded 64, and measure for all values 1 to 64 ...
> > > > > > 
> > > > > > Right, according to the test result, 64 seems to be too aggressive 
> > > > > > in
> > > > > > the case of TX.
> > > > > > 
> > > > > 
> > > > > Got it, thanks both!
> > > > 
> > > > In particular I wonder whether with batch size 1
> > > > we get same performance as without batching
> > > > (would indicate 64 is too aggressive)
> > > > or not (would indicate one of the code changes
> > > > affects performance in an unexpected way).
> > > > 
> > > > --
> > > > MST
> > > > 
> > > 
> > > Hi!
> > > 
> > > Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,
> > 
> > sorry this is not what I meant.
> > 
> > I mean something like this:
> > 
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 0b509be8d7b1..b94680e5721d 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
> > handle_rx(net);
> >  }
> > 
> > +static int batch_num = 0;
> > +module_param(batch_num, int, 0644);
> > +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");
> > +
> >  static int vhost_net_open(struct inode *inode, struct file *f)
> >  {
> > struct vhost_net *n;
> > @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > vhost_net_buf_init(&n->vqs[i].rxq);
> > }
> > vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
> > -  UIO_MAXIOV + VHOST_NET_BATCH,
> > +  UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
> >VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
> >NULL);
> > 
> > 
> > then you can try tweaking batching and playing with mod parameter without
> > recompiling.
> > 
> > 
> > VHOST_NET_BATCH affects lots of other things.
> > 
> 
> Ok, got it. Since they were aligned from the start, I thought it was a good 
> idea to maintain them in-sync.
> 
> > > and testing
> > > the pps as previous mail says. This means that we have either only
> > > vhost_net batching (in base testing, like previously to apply this
> > > patch) or both batching sizes the same.
> > > 
> > > I've checked that vhost process (and pktgen) goes 100% cpu also.
> > > 
> > > For tx: Batching decrements always the performance, in all cases. Not
> > > sure why bufapi made things better the last time.
> > > 
> > > Batching makes improvements until 64 bufs, I see increments of pps but 
> > > like 1%.
> > > 
> > > For rx: Batching always improves performance. It seems that if we
> > > batch little, bufapi decreases performance, but beyond 64, bufapi is
> > > much better. The bufapi version keeps improving until I set a batching
> > > of 1024. So I guess it is super good to have a bunch of buffers to
> > > receive.
> > > 
> > > Since with this test I cannot disable event_idx or things like that,
> > > what would be the next step for testing?
> > > 
> > > Thanks!
> > > 
> > > --
> > > Results:
> > > # Buf size: 1,16,32,64,128,256,512
> > > 
> > > # Tx
> > > # ===
> > > # Base
> > > 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
> > > # Batch
> > > 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
> > > # Batch + Bufapi
> > > 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538
> > > 
> > > # Rx
> > > # ==
> > > # pktgen results (pps)
> > > 1223275,1668868,1728794,1769261,1808574,1837252,1846436
> > > 1456924,1797901,1831234,1868746,1877508,1931598,1936402
> > > 1368923,1719716,1794373,1865170,1884803,1916021,1975160
> > > 
> > > # Testpmd pps results
> > > 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
> > > 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
> > > 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316
> > > 
> > > pktgen was run again for rx with 1024 and 2048 buf size, giving
> > > 1988760.75 and 1978316 pps. Testpmd goes the same way.
> > 
> > Don't really understand what does this data mean.
> > Which number of descs is batched for each run?
> > 
> 
> Sorry, I should have explained better. I will expand here, but feel free to 
> skip it since we are going to discard the
> data anyway. Or to propose a better way to tell them.
> 
> Is a CSV with the values I've obtained, in pps, from pktgen and testpmd. This 
> way is easy to plot them.
> 

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-20 Thread Michael S. Tsirkin
On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:
> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin  wrote:
> >
> > On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
> > > > > How about playing with the batch size? Make it a mod parameter instead
> > > > > of the hard coded 64, and measure for all values 1 to 64 ...
> > > >
> > > >
> > > > Right, according to the test result, 64 seems to be too aggressive in
> > > > the case of TX.
> > > >
> > >
> > > Got it, thanks both!
> >
> > In particular I wonder whether with batch size 1
> > we get same performance as without batching
> > (would indicate 64 is too aggressive)
> > or not (would indicate one of the code changes
> > affects performance in an unexpected way).
> >
> > --
> > MST
> >
> 
> Hi!
> 
> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,

sorry this is not what I meant.

I mean something like this:


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0b509be8d7b1..b94680e5721d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
 }
 
+static int batch_num = 0;
+module_param(batch_num, int, 0644);
+MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");
+
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
struct vhost_net *n;
@@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_net_buf_init(&n->vqs[i].rxq);
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-  UIO_MAXIOV + VHOST_NET_BATCH,
+  UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
   NULL);
 

then you can try tweaking batching and playing with mod parameter without
recompiling.


VHOST_NET_BATCH affects lots of other things.


> and testing
> the pps as previous mail says. This means that we have either only
> vhost_net batching (in base testing, like previously to apply this
> patch) or both batching sizes the same.
> 
> I've checked that vhost process (and pktgen) goes 100% cpu also.
> 
> For tx: Batching decrements always the performance, in all cases. Not
> sure why bufapi made things better the last time.
> 
> Batching makes improvements until 64 bufs, I see increments of pps but like 
> 1%.
> 
> For rx: Batching always improves performance. It seems that if we
> batch little, bufapi decreases performance, but beyond 64, bufapi is
> much better. The bufapi version keeps improving until I set a batching
> of 1024. So I guess it is super good to have a bunch of buffers to
> receive.
> 
> Since with this test I cannot disable event_idx or things like that,
> what would be the next step for testing?
> 
> Thanks!
> 
> --
> Results:
> # Buf size: 1,16,32,64,128,256,512
> 
> # Tx
> # ===
> # Base
> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
> # Batch
> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
> # Batch + Bufapi
> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538
> 
> # Rx
> # ==
> # pktgen results (pps)
> 1223275,1668868,1728794,1769261,1808574,1837252,1846436
> 1456924,1797901,1831234,1868746,1877508,1931598,1936402
> 1368923,1719716,1794373,1865170,1884803,1916021,1975160
> 
> # Testpmd pps results
> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316
> 
> pktgen was run again for rx with 1024 and 2048 buf size, giving
> 1988760.75 and 1978316 pps. Testpmd goes the same way.

Don't really understand what does this data mean.
Which number of descs is batched for each run?

-- 
MST



Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-20 Thread Jason Wang


On 2020/7/17 1:16 AM, Eugenio Perez Martin wrote:

On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin  wrote:

On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:

How about playing with the batch size? Make it a mod parameter instead
of the hard coded 64, and measure for all values 1 to 64 ...


Right, according to the test result, 64 seems to be too aggressive in
the case of TX.


Got it, thanks both!

In particular I wonder whether with batch size 1
we get same performance as without batching
(would indicate 64 is too aggressive)
or not (would indicate one of the code changes
affects performance in an unexpected way).

--
MST


Hi!

Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,



Did you mean varying the value of VHOST_NET_BATCH itself or the number 
of batched descriptors?




and testing
the pps as previous mail says. This means that we have either only
vhost_net batching (in base testing, like previously to apply this
patch) or both batching sizes the same.

I've checked that vhost process (and pktgen) goes 100% cpu also.

For tx: Batching decrements always the performance, in all cases. Not
sure why bufapi made things better the last time.

Batching makes improvements until 64 bufs, I see increments of pps but like 1%.

For rx: Batching always improves performance. It seems that if we
batch little, bufapi decreases performance, but beyond 64, bufapi is
much better. The bufapi version keeps improving until I set a batching
of 1024. So I guess it is super good to have a bunch of buffers to
receive.

Since with this test I cannot disable event_idx or things like that,
what would be the next step for testing?

Thanks!

--
Results:
# Buf size: 1,16,32,64,128,256,512

# Tx
# ===
# Base
2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820



What's the meaning of buf size in the context of "base"?

And I wonder whether perf diff can help.

Thanks



# Batch
2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
# Batch + Bufapi
2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

# Rx
# ==
# pktgen results (pps)
1223275,1668868,1728794,1769261,1808574,1837252,1846436
1456924,1797901,1831234,1868746,1877508,1931598,1936402
1368923,1719716,1794373,1865170,1884803,1916021,1975160

# Testpmd pps results
1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

pktgen was run again for rx with 1024 and 2048 buf size, giving
1988760.75 and 1978316 pps. Testpmd goes the same way.




Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-09 Thread Jason Wang


On 2020/7/10 1:39 PM, Eugenio Perez Martin wrote:

One thread is allocated in lcore 1 (F_THREAD=1), which belongs to the
same NUMA node as testpmd. Actually, it is the testpmd master core, so it
should be a good idea to move it to another lcore of the same NUMA
node.

Is this enough for pktgen to allocate the memory in that NUMA node?
Since the script only writes parameters to /proc, I assume that running it
under numactl/taskset has no effect, and pktgen will allocate memory
based on the lcore it is running on. Am I right?

Thanks!



I think you're right.

Thanks


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-09 Thread Michael S. Tsirkin
On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
> > > How about playing with the batch size? Make it a mod parameter instead
> > > of the hard coded 64, and measure for all values 1 to 64 ...
> >
> >
> > Right, according to the test result, 64 seems to be too aggressive in
> > the case of TX.
> >
> 
> Got it, thanks both!

In particular I wonder whether with batch size 1
we get same performance as without batching
(would indicate 64 is too aggressive)
or not (would indicate one of the code changes
affects performance in an unexpected way).

-- 
MST



Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-09 Thread Jason Wang


On 2020/7/10 1:37 AM, Michael S. Tsirkin wrote:

On Thu, Jul 09, 2020 at 06:46:13PM +0200, Eugenio Perez Martin wrote:

On Wed, Jul 1, 2020 at 4:10 PM Jason Wang  wrote:


On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:

On Wed, Jul 1, 2020 at 2:40 PM Jason Wang  wrote:

On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
 wrote:

On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin  wrote:

On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  wrote:

On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
 wrote:

On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
 wrote:

On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?


Hi Konrad.

I tested this version of the patch:
https://lkml.org/lkml/2019/10/13/42

It was tested for throughput with DPDK's testpmd (as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
and kernel pktgen. No latency tests were performed by me. Maybe it is
interesting to perform a latency test or just a different set of tests
over a recent version.

Thanks!

I have repeated the tests with v9, and results are a little bit different:
* If I test opening it with testpmd, I see no change between versions

OK that is testpmd on guest, right? And vhost-net on the host?


Hi Michael.

No, sorry, as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
But I could also test it in the guest.

These kinds of raw packets "bursts" do not show performance
differences, but I could test deeper if you think it would be worth
it.

Oh ok, so this is without guest, with virtio-user.
It might be worth checking dpdk within guest too just
as another data point.


Ok, I will do it!


* If I forward packets between two vhost-net interfaces in the guest
using a linux bridge in the host:

And here I guess you mean virtio-net in the guest kernel?

Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
the host. More precisely:
* Adding one of the interfaces to another namespace, assigning it an
IP, and starting netserver there.
* Assign another IP in the range manually to the other virtual net
interface, and start the desired test there.

If you think it would be better to perform them differently, please let me know.

Not sure why you bother with namespaces since you said you are
using L2 bridging. I guess it's unimportant.


Sorry, I think I should have provided more context about that.

The only reason to use namespaces is to force the traffic of these
netperf tests to go through the external bridge, and to test netperf
possibilities different from the testpmd (or pktgen, or other "blast
frames unconditionally") tests.

This way, I make sure that the same version of everything is used in the
guest, and it is a little bit easier to manage CPU affinity and to start and
stop testing...

I could use a different VM for sending and receiving, but I find this
way a faster one and it should not introduce a lot of noise. I can
test with two VM if you think that this use of network namespace
introduces too much noise.

Thanks!


 - netperf UDP_STREAM shows a performance increase of 1.8x, almost
doubling performance. This gets lower as frame size increases.

Regarding UDP_STREAM:
* with event_idx=on: The performance difference is reduced a lot if
affinity is applied properly (manually assigning CPUs on host/guest and
setting IRQs on the guest), making them perform equally with and without
the patch again. Maybe the batching makes the scheduler perform
better.

Note that for UDP_STREAM, the result is pretty tricky to analyze. E.g.
setting a sndbuf for TAP may help the performance (reduce the drops).
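
(For context, the TAP sndbuf mentioned here is the per-queue limit normally set
by the management layer via the TUNSETSNDBUF ioctl on the tap fd. A hedged
userspace sketch follows; the 1 MiB value is an arbitrary example, not taken
from this thread.)

#include <linux/if_tun.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Cap how many bytes may be in flight for this tap queue, so a slow
 * backend makes the sender block instead of silently dropping. */
static int tap_set_sndbuf(int tap_fd)
{
        int sndbuf = 1024 * 1024;       /* example value, in bytes */

        if (ioctl(tap_fd, TUNSETSNDBUF, &sndbuf) < 0) {
                perror("TUNSETSNDBUF");
                return -1;
        }
        return 0;
}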


Ok, will add that to the test. Thanks!


Actually, it's better to skip the UDP_STREAM test since:

- My understanding is that very few applications use raw UDP streams
- It's hard to analyze (usually you need to count the drop ratio etc)



 - the rest of the tests go noticeably worse: UDP_RR goes from ~6347
transactions/sec to 5830

* Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
them perform similarly again, with only a very small performance drop
observed. It could be just noise.
** All of them perform better than vanilla if event_idx=off, not sure
why. I can try to repeat them if you suspect it could be a test
failure.

* With testpmd and event_idx=off, if I send from the VM to the host, I see
a performance increment, especially with small packets. The buf api also
increases performance compared with batching only: sending the minimum
packet size in testpmd makes pps go from 356 kpps to 473 kpps.

What's your setup for this? The numbers look rather low. I'd expect
1-2 Mpps at least.


Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA 

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-09 Thread Michael S. Tsirkin
On Thu, Jul 09, 2020 at 06:46:13PM +0200, Eugenio Perez Martin wrote:
> On Wed, Jul 1, 2020 at 4:10 PM Jason Wang  wrote:
> >
> >
> > On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:
> > > On Wed, Jul 1, 2020 at 2:40 PM Jason Wang  wrote:
> > >>
> > >> On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:
> > >>> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
> > >>>  wrote:
> >  On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin  
> >  wrote:
> > > On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:
> > >> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  
> > >> wrote:
> > >>> On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin 
> > >>> wrote:
> >  On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
> >   wrote:
> > > On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
> > >  wrote:
> > >> On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin 
> > >> wrote:
> > >>> As testing shows no performance change, switch to that now.
> > >> What kind of testing? 100GiB? Low latency?
> > >>
> > > Hi Konrad.
> > >
> > > I tested this version of the patch:
> > > https://lkml.org/lkml/2019/10/13/42
> > >
> > > It was tested for throughput with DPDK's testpmd (as described in
> > > http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
> > > and kernel pktgen. No latency tests were performed by me. Maybe 
> > > it is
> > > interesting to perform a latency test or just a different set of 
> > > tests
> > > over a recent version.
> > >
> > > Thanks!
> >  I have repeated the tests with v9, and results are a little bit 
> >  different:
> >  * If I test opening it with testpmd, I see no change between 
> >  versions
> > >>> OK that is testpmd on guest, right? And vhost-net on the host?
> > >>>
> > >> Hi Michael.
> > >>
> > >> No, sorry, as described in
> > >> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
> > >> But I could add to test it in the guest too.
> > >>
> > >> These kinds of raw packets "bursts" do not show performance
> > >> differences, but I could test deeper if you think it would be worth
> > >> it.
> > > Oh ok, so this is without guest, with virtio-user.
> > > It might be worth checking dpdk within guest too just
> > > as another data point.
> > >
> >  Ok, I will do it!
> > 
> >  * If I forward packets between two vhost-net interfaces in the 
> >  guest
> >  using a linux bridge in the host:
> > >>> And here I guess you mean virtio-net in the guest kernel?
> > >> Yes, sorry: Two virtio-net interfaces connected with a linux bridge 
> > >> in
> > >> the host. More precisely:
> > >> * Adding one of the interfaces to another namespace, assigning it an
> > >> IP, and starting netserver there.
> > >> * Assign another IP in the range manually to the other virtual net
> > >> interface, and start the desired test there.
> > >>
> > >> If you think it would be better to perform then differently please 
> > >> let me know.
> > > Not sure why you bother with namespaces since you said you are
> > > using L2 bridging. I guess it's unimportant.
> > >
> >  Sorry, I think I should have provided more context about that.
> > 
> >  The only reason to use namespaces is to force the traffic of these
> >  netperf tests to go through the external bridge. To test netperf
> >  different possibilities than the testpmd (or pktgen or others "blast
> >  of frames unconditionally" tests).
> > 
> >  This way, I make sure that is the same version of everything in the
> >  guest, and is a little bit easier to manage cpu affinity, start and
> >  stop testing...
> > 
> >  I could use a different VM for sending and receiving, but I find this
> >  way a faster one and it should not introduce a lot of noise. I can
> >  test with two VM if you think that this use of network namespace
> >  introduces too much noise.
> > 
> >  Thanks!
> > 
> >  - netperf UDP_STREAM shows a performance increase of 1.8, 
> >  almost
> >  doubling performance. This gets lower as frame size increase.
> > >>> Regarding UDP_STREAM:
> > >>> * with event_idx=on: The performance difference is reduced a lot if
> > >>> applied affinity properly (manually assigning CPU on host/guest and
> > >>> setting IRQs on guest), making them perform equally with and without
> > >>> the patch again. Maybe the batching makes the scheduler perform
> > >>> better.
> > >>
> > >> Note that for UDP_STREAM, the result is pretty trick to be analyzed. E.g
> > >> setting a sndbuf for TAP may help for the performance (reduce the drop).
> > >>

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-01 Thread Jason Wang


On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:

On Wed, Jul 1, 2020 at 2:40 PM Jason Wang  wrote:


On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
 wrote:

On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin  wrote:

On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  wrote:

On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
 wrote:

On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
 wrote:

On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?


Hi Konrad.

I tested this version of the patch:
https://lkml.org/lkml/2019/10/13/42

It was tested for throughput with DPDK's testpmd (as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
and kernel pktgen. No latency tests were performed by me. Maybe it is
interesting to perform a latency test or just a different set of tests
over a recent version.

Thanks!

I have repeated the tests with v9, and results are a little bit different:
* If I test opening it with testpmd, I see no change between versions

OK that is testpmd on guest, right? And vhost-net on the host?


Hi Michael.

No, sorry, as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
But I could add to test it in the guest too.

These kinds of raw packets "bursts" do not show performance
differences, but I could test deeper if you think it would be worth
it.

Oh ok, so this is without guest, with virtio-user.
It might be worth checking dpdk within guest too just
as another data point.


Ok, I will do it!


* If I forward packets between two vhost-net interfaces in the guest
using a linux bridge in the host:

And here I guess you mean virtio-net in the guest kernel?

Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
the host. More precisely:
* Adding one of the interfaces to another namespace, assigning it an
IP, and starting netserver there.
* Assign another IP in the range manually to the other virtual net
interface, and start the desired test there.

If you think it would be better to perform then differently please let me know.

Not sure why you bother with namespaces since you said you are
using L2 bridging. I guess it's unimportant.


Sorry, I think I should have provided more context about that.

The only reason to use namespaces is to force the traffic of these
netperf tests to go through the external bridge. To test netperf
different possibilities than the testpmd (or pktgen or others "blast
of frames unconditionally" tests).

This way, I make sure that is the same version of everything in the
guest, and is a little bit easier to manage cpu affinity, start and
stop testing...

I could use a different VM for sending and receiving, but I find this
way a faster one and it should not introduce a lot of noise. I can
test with two VM if you think that this use of network namespace
introduces too much noise.

Thanks!


- netperf UDP_STREAM shows a performance increase of 1.8, almost
doubling performance. This gets lower as frame size increase.

Regarding UDP_STREAM:
* with event_idx=on: The performance difference is reduced a lot if
applied affinity properly (manually assigning CPU on host/guest and
setting IRQs on guest), making them perform equally with and without
the patch again. Maybe the batching makes the scheduler perform
better.


Note that for UDP_STREAM, the result is pretty trick to be analyzed. E.g
setting a sndbuf for TAP may help for the performance (reduce the drop).


Ok, will add that to the test. Thanks!



Actually, it's better to skip the UDP_STREAM test since:

- My understanding is very few application is using raw UDP stream
- It's hard to analyze (usually you need to count the drop ratio etc)





- rests of the test goes noticeably worse: UDP_RR goes from ~6347
transactions/sec to 5830

* Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
them perform similarly again, only a very small performance drop
observed. It could be just noise.
** All of them perform better than vanilla if event_idx=off, not sure
why. I can try to repeat them if you suspect that can be a test
failure.

* With testpmd and event_idx=off, if I send from the VM to host, I see
a performance increment especially in small packets. The buf api also
increases performance compared with only batching: Sending the minimum
packet size in testpmd makes pps go from 356kpps to 473 kpps.


What's your setup for this. The number looks rather low. I'd expected
1-2 Mpps at least.


Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory
each, and no device assigned to the NUMA node I'm testing in. Too low
for testpmd AF_PACKET driver too?



I don't test AF_PACKET, I guess it shoul

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-01 Thread Jason Wang


On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:

On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
 wrote:

On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin  wrote:

On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  wrote:

On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:

On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
 wrote:

On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
 wrote:

On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?


Hi Konrad.

I tested this version of the patch:
https://lkml.org/lkml/2019/10/13/42

It was tested for throughput with DPDK's testpmd (as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
and kernel pktgen. No latency tests were performed by me. Maybe it is
interesting to perform a latency test or just a different set of tests
over a recent version.

Thanks!

I have repeated the tests with v9, and results are a little bit different:
* If I test opening it with testpmd, I see no change between versions


OK that is testpmd on guest, right? And vhost-net on the host?


Hi Michael.

No, sorry, as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
But I could add to test it in the guest too.

These kinds of raw packets "bursts" do not show performance
differences, but I could test deeper if you think it would be worth
it.

Oh ok, so this is without guest, with virtio-user.
It might be worth checking dpdk within guest too just
as another data point.


Ok, I will do it!


* If I forward packets between two vhost-net interfaces in the guest
using a linux bridge in the host:

And here I guess you mean virtio-net in the guest kernel?

Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
the host. More precisely:
* Adding one of the interfaces to another namespace, assigning it an
IP, and starting netserver there.
* Assign another IP in the range manually to the other virtual net
interface, and start the desired test there.

If you think it would be better to perform then differently please let me know.


Not sure why you bother with namespaces since you said you are
using L2 bridging. I guess it's unimportant.


Sorry, I think I should have provided more context about that.

The only reason to use namespaces is to force the traffic of these
netperf tests to go through the external bridge. To test netperf
different possibilities than the testpmd (or pktgen or others "blast
of frames unconditionally" tests).

This way, I make sure that is the same version of everything in the
guest, and is a little bit easier to manage cpu affinity, start and
stop testing...

I could use a different VM for sending and receiving, but I find this
way a faster one and it should not introduce a lot of noise. I can
test with two VM if you think that this use of network namespace
introduces too much noise.

Thanks!


   - netperf UDP_STREAM shows a performance increase of 1.8, almost
doubling performance. This gets lower as frame size increase.

Regarding UDP_STREAM:
* with event_idx=on: The performance difference is reduced a lot if
applied affinity properly (manually assigning CPU on host/guest and
setting IRQs on guest), making them perform equally with and without
the patch again. Maybe the batching makes the scheduler perform
better.



Note that for UDP_STREAM, the result is pretty trick to be analyzed. E.g 
setting a sndbuf for TAP may help for the performance (reduce the drop).






   - rests of the test goes noticeably worse: UDP_RR goes from ~6347
transactions/sec to 5830

* Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
them perform similarly again, only a very small performance drop
observed. It could be just noise.
** All of them perform better than vanilla if event_idx=off, not sure
why. I can try to repeat them if you suspect that can be a test
failure.

* With testpmd and event_idx=off, if I send from the VM to host, I see
a performance increment especially in small packets. The buf api also
increases performance compared with only batching: Sending the minimum
packet size in testpmd makes pps go from 356kpps to 473 kpps.



What's your setup for this. The number looks rather low. I'd expected 
1-2 Mpps at least.




Sending
1024-byte UDP PDUs makes it go from 570 kpps to 64 kpps.

Something strange I observe in these tests: I get more pps the bigger
the transmitted buffer size is. Not sure why.

** Sending from the host to the VM does not make a big change with the
patches in small packets scenario (minimum, 64 bytes, about 645
without the patch, ~625 with batch and batch+buf api). If the packets
are bigger, I can see a performance increase: with 256 bits,



I think you meant bytes?



  it goes
from 590kpps to about 600kpps, and in case of 1500 bytes pay

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-07-01 Thread Michael S. Tsirkin
On Wed, Jul 01, 2020 at 12:43:09PM +0200, Eugenio Perez Martin wrote:
> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
>  wrote:
> >
> > On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:
> > > > On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin 
> > > > > > > > wrote:
> > > > > > > > > As testing shows no performance change, switch to that now.
> > > > > > > >
> > > > > > > > What kind of testing? 100GiB? Low latency?
> > > > > > > >
> > > > > > >
> > > > > > > Hi Konrad.
> > > > > > >
> > > > > > > I tested this version of the patch:
> > > > > > > https://lkml.org/lkml/2019/10/13/42
> > > > > > >
> > > > > > > It was tested for throughput with DPDK's testpmd (as described in
> > > > > > > http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
> > > > > > > and kernel pktgen. No latency tests were performed by me. Maybe 
> > > > > > > it is
> > > > > > > interesting to perform a latency test or just a different set of 
> > > > > > > tests
> > > > > > > over a recent version.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > I have repeated the tests with v9, and results are a little bit 
> > > > > > different:
> > > > > > * If I test opening it with testpmd, I see no change between 
> > > > > > versions
> > > > >
> > > > >
> > > > > OK that is testpmd on guest, right? And vhost-net on the host?
> > > > >
> > > >
> > > > Hi Michael.
> > > >
> > > > No, sorry, as described in
> > > > http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
> > > > But I could add to test it in the guest too.
> > > >
> > > > These kinds of raw packets "bursts" do not show performance
> > > > differences, but I could test deeper if you think it would be worth
> > > > it.
> > >
> > > Oh ok, so this is without guest, with virtio-user.
> > > It might be worth checking dpdk within guest too just
> > > as another data point.
> > >
> >
> > Ok, I will do it!
> >
> > > > > > * If I forward packets between two vhost-net interfaces in the guest
> > > > > > using a linux bridge in the host:
> > > > >
> > > > > And here I guess you mean virtio-net in the guest kernel?
> > > >
> > > > Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
> > > > the host. More precisely:
> > > > * Adding one of the interfaces to another namespace, assigning it an
> > > > IP, and starting netserver there.
> > > > * Assign another IP in the range manually to the other virtual net
> > > > interface, and start the desired test there.
> > > >
> > > > If you think it would be better to perform then differently please let 
> > > > me know.
> > >
> > >
> > > Not sure why you bother with namespaces since you said you are
> > > using L2 bridging. I guess it's unimportant.
> > >
> >
> > Sorry, I think I should have provided more context about that.
> >
> > The only reason to use namespaces is to force the traffic of these
> > netperf tests to go through the external bridge. To test netperf
> > different possibilities than the testpmd (or pktgen or others "blast
> > of frames unconditionally" tests).
> >
> > This way, I make sure that is the same version of everything in the
> > guest, and is a little bit easier to manage cpu affinity, start and
> > stop testing...
> >
> > I could use a different VM for sending and receiving, but I find this
> > way a faster one and it should not introduce a lot of noise. I can
> > test with two VM if you think that this use of network namespace
> > introduces too much noise.
> >
> > Thanks!
> >
> > > > >
> > > > > >   - netperf UDP_STREAM shows a performance increase of 1.8, almost
> > > > > > doubling performance. This gets lower as frame size increase.
> 
> Regarding UDP_STREAM:
> * with event_idx=on: The performance difference is reduced a lot if
> applied affinity properly (manually assigning CPU on host/guest and
> setting IRQs on guest), making them perform equally with and without
> the patch again. Maybe the batching makes the scheduler perform
> better.
> 
> > > > > >   - rests of the test goes noticeably worse: UDP_RR goes from ~6347
> > > > > > transactions/sec to 5830
> 
> * Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
> them perform similarly again, only a very small performance drop
> observed. It could be just noise.
> ** All of them perform better than vanilla if event_idx=off, not sure
> why. I can try to repeat them if you suspect that can be a test
> failure.
> 
> * With testpmd and event_idx=off, if I send from the VM to host, I see
> a performance increment 

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-23 Thread Michael S. Tsirkin
On Tue, Jun 23, 2020 at 09:00:57AM +0200, Eugenio Perez Martin wrote:
> On Tue, Jun 23, 2020 at 4:51 AM Jason Wang  wrote:
> >
> >
> > On 2020/6/23 12:00 AM, Michael S. Tsirkin wrote:
> > > On Wed, Jun 17, 2020 at 11:19:26AM +0800, Jason Wang wrote:
> > >> On 2020/6/11 7:34 PM, Michael S. Tsirkin wrote:
> > >>>static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> > >>>{
> > >>> kfree(vq->descs);
> > >>> @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev 
> > >>> *dev)
> > >>> for (i = 0; i < dev->nvqs; ++i) {
> > >>> vq = dev->vqs[i];
> > >>> vq->max_descs = dev->iov_limit;
> > >>> +   if (vhost_vq_num_batch_descs(vq) < 0) {
> > >>> +   return -EINVAL;
> > >>> +   }
> > >> This check breaks vdpa, which sets iov_limit to zero. Considering iov_limit is
> > >> meaningless to vDPA, I wonder whether we can skip the test when the device
> > >> doesn't use a worker.
> > >>
> > >> Thanks
> > > It doesn't need iovecs at all, right?
> > >
> > > -- MST
> >
> >
> > Yes, so we may choose to bypass the iovecs as well.
> >
> > Thanks
> >
> 
> I think that the kmalloc_array returns ZERO_SIZE_PTR for all of them
> in that case, so I didn't bother to skip the kmalloc_array parts.
> Would you prefer to skip them all and leave them as NULL? Or have I
> misunderstood what you mean?
> 
> Thanks!

Sorry about being unclear. I just meant that it seems cleaner
to check for iov_limit being 0 not for worker thread.
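
(For illustration only, a rough sketch of that direction; this is an assumption
about the final shape, not a patch from this thread. It also keeps the point
made above that zero-sized kmalloc_array() results are ZERO_SIZE_PTR and
therefore safe to kfree().)

static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
{
        struct vhost_virtqueue *vq;
        int i;

        for (i = 0; i < dev->nvqs; ++i) {
                vq = dev->vqs[i];
                vq->max_descs = dev->iov_limit;
                /* vDPA sets iov_limit to 0: no batching, no iovecs needed. */
                if (dev->iov_limit && vhost_vq_num_batch_descs(vq) < 0)
                        return -EINVAL;
                /*
                 * The existing kmalloc_array() calls can stay as they are:
                 * with iov_limit == 0 they return ZERO_SIZE_PTR, which
                 * kfree() accepts.
                 */
                /* ... descs/indirect/log allocations as before ... */
        }
        return 0;
}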

-- 
MST


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-23 Thread Jason Wang


On 2020/6/23 3:00 PM, Eugenio Perez Martin wrote:

On Tue, Jun 23, 2020 at 4:51 AM Jason Wang  wrote:


On 2020/6/23 12:00 AM, Michael S. Tsirkin wrote:

On Wed, Jun 17, 2020 at 11:19:26AM +0800, Jason Wang wrote:

On 2020/6/11 7:34 PM, Michael S. Tsirkin wrote:

static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
{
 kfree(vq->descs);
@@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 for (i = 0; i < dev->nvqs; ++i) {
 vq = dev->vqs[i];
 vq->max_descs = dev->iov_limit;
+   if (vhost_vq_num_batch_descs(vq) < 0) {
+   return -EINVAL;
+   }

This check breaks vdpa which set iov_limit to zero. Consider iov_limit is
meaningless to vDPA, I wonder we can skip the test when device doesn't use
worker.

Thanks

It doesn't need iovecs at all, right?

-- MST


Yes, so we may choose to bypass the iovecs as well.

Thanks


I think that the kmalloc_array returns ZERO_SIZE_PTR for all of them
in that case, so I didn't bother to skip the kmalloc_array parts.
Would you prefer to skip them all and let them NULL? Or have I
misunderstood what you mean?



I'm ok with either approach, but my understanding is that Michael wants 
to skip them all.


Thanks




Thanks!




Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-22 Thread Jason Wang


On 2020/6/23 12:00 AM, Michael S. Tsirkin wrote:

On Wed, Jun 17, 2020 at 11:19:26AM +0800, Jason Wang wrote:

On 2020/6/11 7:34 PM, Michael S. Tsirkin wrote:

   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
   {
kfree(vq->descs);
@@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->max_descs = dev->iov_limit;
+   if (vhost_vq_num_batch_descs(vq) < 0) {
+   return -EINVAL;
+   }

This check breaks vdpa which set iov_limit to zero. Consider iov_limit is
meaningless to vDPA, I wonder we can skip the test when device doesn't use
worker.

Thanks

It doesn't need iovecs at all, right?

-- MST



Yes, so we may choose to bypass the iovecs as well.

Thanks


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-22 Thread Michael S. Tsirkin
On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:
> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:
> > > On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
> > >  wrote:
> > > >
> > > > On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
> > > >  wrote:
> > > > >
> > > > > On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
> > > > > > As testing shows no performance change, switch to that now.
> > > > >
> > > > > What kind of testing? 100GiB? Low latency?
> > > > >
> > > >
> > > > Hi Konrad.
> > > >
> > > > I tested this version of the patch:
> > > > https://lkml.org/lkml/2019/10/13/42
> > > >
> > > > It was tested for throughput with DPDK's testpmd (as described in
> > > > http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
> > > > and kernel pktgen. No latency tests were performed by me. Maybe it is
> > > > interesting to perform a latency test or just a different set of tests
> > > > over a recent version.
> > > >
> > > > Thanks!
> > >
> > > I have repeated the tests with v9, and results are a little bit different:
> > > * If I test opening it with testpmd, I see no change between versions
> >
> >
> > OK that is testpmd on guest, right? And vhost-net on the host?
> >
> 
> Hi Michael.
> 
> No, sorry, as described in
> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
> But I could add to test it in the guest too.
> 
> These kinds of raw packets "bursts" do not show performance
> differences, but I could test deeper if you think it would be worth
> it.

Oh ok, so this is without guest, with virtio-user.
It might be worth checking dpdk within guest too just
as another data point.

> > > * If I forward packets between two vhost-net interfaces in the guest
> > > using a linux bridge in the host:
> >
> > And here I guess you mean virtio-net in the guest kernel?
> 
> Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
> the host. More precisely:
> * Adding one of the interfaces to another namespace, assigning it an
> IP, and starting netserver there.
> * Assign another IP in the range manually to the other virtual net
> interface, and start the desired test there.
> 
> If you think it would be better to perform then differently please let me 
> know.


Not sure why you bother with namespaces since you said you are
using L2 bridging. I guess it's unimportant.

> >
> > >   - netperf UDP_STREAM shows a performance increase of 1.8, almost
> > > doubling performance. This gets lower as frame size increase.
> > >   - rests of the test goes noticeably worse: UDP_RR goes from ~6347
> > > transactions/sec to 5830
> >
> > OK so it seems plausible that we still have a bug where an interrupt
> > is delayed. That is the main difference between pmd and virtio.
> > Let's try disabling event index, and see what happens - that's
> > the trickiest part of interrupts.
> >
> 
> Got it, will get back with the results.
> 
> Thank you very much!
> 
> >
> >
> > >   - TCP_STREAM goes from ~10.7 gbps to ~7Gbps
> > >   - TCP_RR from 6223.64 transactions/sec to 5739.44
> >



Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-22 Thread Michael S. Tsirkin
On Wed, Jun 17, 2020 at 11:19:26AM +0800, Jason Wang wrote:
> 
> On 2020/6/11 7:34 PM, Michael S. Tsirkin wrote:
> >   static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> >   {
> > kfree(vq->descs);
> > @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev 
> > *dev)
> > for (i = 0; i < dev->nvqs; ++i) {
> > vq = dev->vqs[i];
> > vq->max_descs = dev->iov_limit;
> > +   if (vhost_vq_num_batch_descs(vq) < 0) {
> > +   return -EINVAL;
> > +   }
> 
> 
> This check breaks vdpa which set iov_limit to zero. Consider iov_limit is
> meaningless to vDPA, I wonder we can skip the test when device doesn't use
> worker.
> 
> Thanks

It doesn't need iovecs at all, right?

-- 
MST


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-22 Thread Michael S. Tsirkin
On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:
> On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
>  wrote:
> >
> > On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
> >  wrote:
> > >
> > > On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
> > > > As testing shows no performance change, switch to that now.
> > >
> > > What kind of testing? 100GiB? Low latency?
> > >
> >
> > Hi Konrad.
> >
> > I tested this version of the patch:
> > https://lkml.org/lkml/2019/10/13/42
> >
> > It was tested for throughput with DPDK's testpmd (as described in
> > http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
> > and kernel pktgen. No latency tests were performed by me. Maybe it is
> > interesting to perform a latency test or just a different set of tests
> > over a recent version.
> >
> > Thanks!
> 
> I have repeated the tests with v9, and results are a little bit different:
> * If I test opening it with testpmd, I see no change between versions


OK that is testpmd on guest, right? And vhost-net on the host?

> * If I forward packets between two vhost-net interfaces in the guest
> using a linux bridge in the host:

And here I guess you mean virtio-net in the guest kernel?

>   - netperf UDP_STREAM shows a performance increase of 1.8, almost
> doubling performance. This gets lower as frame size increase.
>   - rests of the test goes noticeably worse: UDP_RR goes from ~6347
> transactions/sec to 5830

OK so it seems plausible that we still have a bug where an interrupt
is delayed. That is the main difference between pmd and virtio.
Let's try disabling event index, and see what happens - that's
the trickiest part of interrupts.



>   - TCP_STREAM goes from ~10.7 gbps to ~7Gbps
>   - TCP_RR from 6223.64 transactions/sec to 5739.44



Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-22 Thread Jason Wang


On 2020/6/20 2:07 AM, Eugenio Perez Martin wrote:

On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
 wrote:

On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
 wrote:

On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:

As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?


Hi Konrad.

I tested this version of the patch:
https://lkml.org/lkml/2019/10/13/42

It was tested for throughput with DPDK's testpmd (as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
and kernel pktgen. No latency tests were performed by me. Maybe it is
interesting to perform a latency test or just a different set of tests
over a recent version.

Thanks!

I have repeated the tests with v9, and results are a little bit different:
* If I test opening it with testpmd, I see no change between versions
* If I forward packets between two vhost-net interfaces in the guest
using a linux bridge in the host:
   - netperf UDP_STREAM shows a performance increase of 1.8, almost
doubling performance. This gets lower as frame size increase.
   - rests of the test goes noticeably worse: UDP_RR goes from ~6347
transactions/sec to 5830
   - TCP_STREAM goes from ~10.7 gbps to ~7Gbps



Which direction did you mean here? Guest TX or RX?



   - TCP_RR from 6223.64 transactions/sec to 5739.44



Perf diff might help. I think we can start from the RR result, which 
should be easier. Maybe you can test it for each patch; then you may see 
which patch is the source of the regression.
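
(For instance, presumably something like recording the vhost worker thread with
perf record -p <pid> during a TCP_RR run on each kernel, then comparing the two
profiles with perf diff; the exact invocation is an assumption, not something
specified in this thread.)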


Thanks


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-16 Thread Jason Wang


On 2020/6/11 7:34 PM, Michael S. Tsirkin wrote:

  static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
  {
kfree(vq->descs);
@@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->max_descs = dev->iov_limit;
+   if (vhost_vq_num_batch_descs(vq) < 0) {
+   return -EINVAL;
+   }



This check breaks vdpa which set iov_limit to zero. Consider iov_limit 
is meaningless to vDPA, I wonder we can skip the test when device 
doesn't use worker.


Thanks


Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-11 Thread Konrad Rzeszutek Wilk
On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
> As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?



[PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-11 Thread Michael S. Tsirkin
As testing shows no performance change, switch to that now.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Eugenio Pérez 
Link: https://lore.kernel.org/r/20200401183118.8334-3-epere...@redhat.com
Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/test.c  |   2 +-
 drivers/vhost/vhost.c | 314 --
 drivers/vhost/vhost.h |   7 +-
 3 files changed, 61 insertions(+), 262 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 0466921f4772..7d69778aaa26 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
dev = &n->dev;
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+   vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
   VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
 
f->private_data = n;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 11433d709651..dfcdb36d4227 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 {
vq->num = 1;
vq->ndescs = 0;
+   vq->first_desc = 0;
vq->desc = NULL;
vq->avail = NULL;
vq->used = NULL;
@@ -372,6 +373,11 @@ static int vhost_worker(void *data)
return 0;
 }
 
+static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
+{
+   return vq->max_descs - UIO_MAXIOV;
+}
+
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 {
kfree(vq->descs);
@@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->max_descs = dev->iov_limit;
+   if (vhost_vq_num_batch_descs(vq) < 0) {
+   return -EINVAL;
+   }
vq->descs = kmalloc_array(vq->max_descs,
  sizeof(*vq->descs),
  GFP_KERNEL);
@@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
vq->last_avail_idx = s.num;
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
+   vq->ndescs = vq->first_desc = 0;
break;
case VHOST_GET_VRING_BASE:
s.index = idx;
@@ -2078,253 +2088,6 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
return next;
 }
 
-static int get_indirect(struct vhost_virtqueue *vq,
-   struct iovec iov[], unsigned int iov_size,
-   unsigned int *out_num, unsigned int *in_num,
-   struct vhost_log *log, unsigned int *log_num,
-   struct vring_desc *indirect)
-{
-   struct vring_desc desc;
-   unsigned int i = 0, count, found = 0;
-   u32 len = vhost32_to_cpu(vq, indirect->len);
-   struct iov_iter from;
-   int ret, access;
-
-   /* Sanity check */
-   if (unlikely(len % sizeof desc)) {
-   vq_err(vq, "Invalid length in indirect descriptor: "
-  "len 0x%llx not multiple of 0x%zx\n",
-  (unsigned long long)len,
-  sizeof desc);
-   return -EINVAL;
-   }
-
-   ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
-UIO_MAXIOV, VHOST_ACCESS_RO);
-   if (unlikely(ret < 0)) {
-   if (ret != -EAGAIN)
-   vq_err(vq, "Translation failure %d in indirect.\n", 
ret);
-   return ret;
-   }
-   iov_iter_init(&from, READ, vq->indirect, ret, len);
-
-   /* We will use the result as an address to read from, so most
-* architectures only need a compiler barrier here. */
-   read_barrier_depends();
-
-   count = len / sizeof desc;
-   /* Buffers are chained via a 16 bit next field, so
-* we can have at most 2^16 of these. */
-   if (unlikely(count > USHRT_MAX + 1)) {
-   vq_err(vq, "Indirect buffer length too big: %d\n",
-  indirect->len);
-   return -E2BIG;
-   }
-
-   do {
-   unsigned iov_count = *in_num + *out_num;
-   if (unlikely(++found > count)) {
-   vq_err(vq, "Loop detected: last one at %u "
-  "indirect size %u\n",
-  i, count);
-   return -EINVAL;
-   }
-   if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
-   vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
-  i, (size_