Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-05-18 Thread Jon Kohler


> On May 18, 2025, at 5:38 PM, Michael S. Tsirkin  wrote:
> 
> On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
>> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
>> batches up to 64 messages before calling sock->sendmsg.
>> 
>> Currently, when there are no more messages on the ring to dequeue,
>> handle_tx_copy re-enables kicks on the ring *before* firing off the
>> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
>> especially if it needs to wake up a thread (e.g., another vhost worker).
>> 
>> If the guest submits additional messages immediately after the last ring
>> check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
>> kick the vhost worker. This may happen while the worker is still
>> processing the sendmsg, leading to wasteful exit(s).
>> 
>> This is particularly problematic for single-threaded guest submission
>> threads, as they must exit, wait for the exit to be processed
>> (potentially involving a TTWU), and then resume.
>> 
>> In scenarios like a constant stream of UDP messages, this results in a
>> sawtooth pattern where the submitter frequently vmexits, and the
>> vhost-net worker alternates between sleeping and waking.
>> 
>> A common solution is to configure vhost-net busy polling via userspace
>> (e.g., qemu poll-us). However, treating the sendmsg as the "busy"
>> period by keeping kicks disabled during the final sendmsg and
>> performing one additional ring check afterward provides a significant
>> performance improvement without any excess busy poll cycles.
>> 
>> If messages are found in the ring after the final sendmsg, requeue the
>> TX handler. This ensures fairness for the RX handler and allows
>> vhost_run_work_list to cond_resched() as needed.
>> 
>> Test Case
>> TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
>> RX VM: taskset -c 2 iperf3 -s -p 5200 -D
>> 6.12.0, each worker backed by tun interface with IFF_NAPI setup.
>> Note: TCP side is largely unchanged as that was copy bound
>> 
>> 6.12.0 unpatched
>> EPT_MISCONFIG/second: 5411
>> Datagrams/second: ~382k
>> Interval Transfer Bitrate Lost/Total Datagrams
>> 0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender
>> 
>> 6.12.0 patched
>> EPT_MISCONFIG/second: 58 (~93x reduction)
>> Datagrams/second: ~650k  (~1.7x increase)
>> Interval Transfer Bitrate Lost/Total Datagrams
>> 0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender
>> 
>> Acked-by: Jason Wang 
>> Signed-off-by: Jon Kohler 
> 
> 
> Jon, what are we doing with this patch? Are you still looking into
> some feedback?

Hey Michael,
This was already merged in the v3 patch:
https://lore.kernel.org/all/20250501020428.1889162-1-...@nutanix.com/
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/vhost/net.c?id=8c2e6b26ffe243be1e78f5a4bfb1a857d6e6f6d6


> 
>> ---
>> drivers/vhost/net.c | 19 +++
>> 1 file changed, 15 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index b9b9e9d40951..9b04025eea66 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> break;
>> /* Nothing new?  Wait for eventfd to tell us they refilled. */
>> if (head == vq->num) {
>> + /* If interrupted while doing busy polling, requeue
>> + * the handler to be fair to handle_rx as well as other
>> + * tasks waiting on cpu.
>> + */
>> if (unlikely(busyloop_intr)) {
>> vhost_poll_queue(&vq->poll);
>> - } else if (unlikely(vhost_enable_notify(&net->dev,
>> - vq))) {
>> - vhost_disable_notify(&net->dev, vq);
>> - continue;
>> }
>> + /* Kicks are disabled at this point, break loop and
>> + * process any remaining batched packets. Queue will
>> + * be re-enabled afterwards.
>> + */
>> break;
>> }
>> 
>> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> ++nvq->done_idx;
>> } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>> 
>> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>> vhost_tx_batch(net, nvq, sock, &msg);
>> +
>> + /* All of our work has been completed; however, before leaving the
>> + * TX handler, do one last check for work, and requeue handler if
>> + * necessary. If there is no work, queue will be reenabled.
>> + */
>> + vhost_net_busy_poll_try_queue(net, vq);
>> }
>> 
>> static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>> -- 
>> 2.43.0
> 



Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-05-18 Thread Michael S. Tsirkin
On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
> batches up to 64 messages before calling sock->sendmsg.
> 
> Currently, when there are no more messages on the ring to dequeue,
> handle_tx_copy re-enables kicks on the ring *before* firing off the
> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
> especially if it needs to wake up a thread (e.g., another vhost worker).
> 
> If the guest submits additional messages immediately after the last ring
> check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
> kick the vhost worker. This may happen while the worker is still
> processing the sendmsg, leading to wasteful exit(s).
> 
> This is particularly problematic for single-threaded guest submission
> threads, as they must exit, wait for the exit to be processed
> (potentially involving a TTWU), and then resume.
> 
> In scenarios like a constant stream of UDP messages, this results in a
> sawtooth pattern where the submitter frequently vmexits, and the
> vhost-net worker alternates between sleeping and waking.
> 
> A common solution is to configure vhost-net busy polling via userspace
> (e.g., qemu poll-us). However, treating the sendmsg as the "busy"
> period by keeping kicks disabled during the final sendmsg and
> performing one additional ring check afterward provides a significant
> performance improvement without any excess busy poll cycles.
> 
> If messages are found in the ring after the final sendmsg, requeue the
> TX handler. This ensures fairness for the RX handler and allows
> vhost_run_work_list to cond_resched() as needed.
> 
> Test Case
> TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
> RX VM: taskset -c 2 iperf3 -s -p 5200 -D
> 6.12.0, each worker backed by tun interface with IFF_NAPI setup.
> Note: TCP side is largely unchanged as that was copy bound
> 
> 6.12.0 unpatched
> EPT_MISCONFIG/second: 5411
> Datagrams/second: ~382k
> Interval Transfer Bitrate Lost/Total Datagrams
> 0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender
> 
> 6.12.0 patched
> EPT_MISCONFIG/second: 58 (~93x reduction)
> Datagrams/second: ~650k  (~1.7x increase)
> Interval Transfer Bitrate Lost/Total Datagrams
> 0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender
> 
> Acked-by: Jason Wang 
> Signed-off-by: Jon Kohler 


Jon, what are we doing with this patch? Are you still looking into
some feedback?

> ---
>  drivers/vhost/net.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index b9b9e9d40951..9b04025eea66 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   break;
>   /* Nothing new?  Wait for eventfd to tell us they refilled. */
>   if (head == vq->num) {
> + /* If interrupted while doing busy polling, requeue
> +  * the handler to be fair to handle_rx as well as other
> +  * tasks waiting on cpu.
> +  */
>   if (unlikely(busyloop_intr)) {
>   vhost_poll_queue(&vq->poll);
> - } else if (unlikely(vhost_enable_notify(&net->dev,
> - vq))) {
> - vhost_disable_notify(&net->dev, vq);
> - continue;
>   }
> + /* Kicks are disabled at this point, break loop and
> +  * process any remaining batched packets. Queue will
> +  * be re-enabled afterwards.
> +  */
>   break;
>   }
>  
> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   ++nvq->done_idx;
>   } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>  
> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>   vhost_tx_batch(net, nvq, sock, &msg);
> +
> + /* All of our work has been completed; however, before leaving the
> +  * TX handler, do one last check for work, and requeue handler if
> +  * necessary. If there is no work, queue will be reenabled.
> +  */
> + vhost_net_busy_poll_try_queue(net, vq);
>  }
>  
>  static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> -- 
> 2.43.0




Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-26 Thread Jon Kohler


> On Apr 26, 2025, at 3:06 PM, Jon Kohler  wrote:
> 
> 
> 
>> On Apr 24, 2025, at 8:11 AM, Michael S. Tsirkin  wrote:
>> 
>> On Thu, Apr 24, 2025 at 01:48:53PM +0200, Paolo Abeni wrote:
>>> On 4/20/25 3:05 AM, Jon Kohler wrote:
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index b9b9e9d40951..9b04025eea66 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 break;
 /* Nothing new?  Wait for eventfd to tell us they refilled. */
 if (head == vq->num) {
 + /* If interrupted while doing busy polling, requeue
 + * the handler to be fair to handle_rx as well as other
 + * tasks waiting on cpu.
 + */
 if (unlikely(busyloop_intr)) {
 vhost_poll_queue(&vq->poll);
 - } else if (unlikely(vhost_enable_notify(&net->dev,
 - vq))) {
 - vhost_disable_notify(&net->dev, vq);
 - continue;
 }
 + /* Kicks are disabled at this point, break loop and
 + * process any remaining batched packets. Queue will
 + * be re-enabled afterwards.
 + */
 break;
 }
>>> 
>>> It's not clear to me why the zerocopy path does not need a similar change.
>> 
>> It can have one, it's just that Jon has a separate patch to drop
>> it completely. A commit log comment mentioning this would be a good
>> idea, yes.
> 
> Yea, the utility of the ZC side is a head scratcher for me, I can’t get it
> to work well to save my life. I’ve got a separate thread I need to respond
> to Eugenio on, will try to circle back on that next week.
> 
> The reason this one works so well is that the last batch in the copy path
> can take a non-trivial amount of time, so it opens up the guest to a real
> sawtooth pattern. Getting rid of that, and all that comes with it (exits,
> stalls, etc), just pays off.
> 
>> 
 @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 ++nvq->done_idx;
 } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
 + /* Kicks are still disabled, dispatch any remaining batched msgs. */
 vhost_tx_batch(net, nvq, sock, &msg);
 +
 + /* All of our work has been completed; however, before leaving the
 + * TX handler, do one last check for work, and requeue handler if
 + * necessary. If there is no work, queue will be reenabled.
 + */
 + vhost_net_busy_poll_try_queue(net, vq);
>>> 
>>> This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
>>> value, while AFAICS prior to this patch vhost_poll_queue() is only
>>> performed with busyloop_intr == true. Why don't we need to take care of
>>> such flag here?
>> 
>> Hmm, I agree this is worth trying, a free, if possibly small, performance
>> gain, why not. Jon, want to try?
> 
> I mentioned in the commit msg that the reason we’re doing this is to be
> fair to handle_rx. If my read of vhost_net_busy_poll_try_queue is correct,
> we would only call vhost_poll_queue iff:
> 1. The TX ring is not empty, in which case we want to run handle_tx again
> 2. When we go to reenable kicks, it returns non-zero, which means we
> should run handle_tx again anyhow
> 
> If the ring is truly empty, and we can re-enable kicks with no drama, we
> would not run vhost_poll_queue.
> 
> That said, I think what you’re saying here is, we should check the busy
> flag and *not* try vhost_net_busy_poll_try_queue, right? If so, great, I did
> that in an internal version of this patch; however, it adds another
> conditional which for the vast majority of users is not going to add any
> value (I think).
> 
> Happy to dig deeper, either on this change series, or a follow up?

Sorry, I do not know why this email sent itself again when I opened my
laptop, my mistake somehow.

> 
>> 
>> 
>>> @Michael: I assume you prefer that this patch go through the
>>> net-next tree, right?
>>> 
>>> Thanks,
>>> 
>>> Paolo
>> 
>> I don't mind and this seems to be what Jon wants.
>> I could queue it too, but the extra review it gets in the net tree is good.
> 
> My apologies, I thought all non-bug fixes had to go thru net-next,
> which is why I sent the v2 to net-next; however if you want to queue
> right away, I’m good with either. It's a fairly well contained patch with
> a huge upside :) 
> 
>> 
>> -- 
>> MST
>> 
> 



Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-26 Thread Jon Kohler


> On Apr 24, 2025, at 8:11 AM, Michael S. Tsirkin  wrote:
> 
> On Thu, Apr 24, 2025 at 01:48:53PM +0200, Paolo Abeni wrote:
>> On 4/20/25 3:05 AM, Jon Kohler wrote:
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index b9b9e9d40951..9b04025eea66 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>> break;
>>> /* Nothing new?  Wait for eventfd to tell us they refilled. */
>>> if (head == vq->num) {
>>> + /* If interrupted while doing busy polling, requeue
>>> + * the handler to be fair to handle_rx as well as other
>>> + * tasks waiting on cpu.
>>> + */
>>> if (unlikely(busyloop_intr)) {
>>> vhost_poll_queue(&vq->poll);
>>> - } else if (unlikely(vhost_enable_notify(&net->dev,
>>> - vq))) {
>>> - vhost_disable_notify(&net->dev, vq);
>>> - continue;
>>> }
>>> + /* Kicks are disabled at this point, break loop and
>>> + * process any remaining batched packets. Queue will
>>> + * be re-enabled afterwards.
>>> + */
>>> break;
>>> }
>> 
>> It's not clear to me why the zerocopy path does not need a similar change.
> 
> It can have one, it's just that Jon has a separate patch to drop
> it completely. A commit log comment mentioning this would be a good
> idea, yes.

Yea, the utility of the ZC side is a head scratcher for me, I can’t get it
to work well to save my life. I’ve got a separate thread I need to respond
to Eugenio on, will try to circle back on that next week.

The reason this one works so well is that the last batch in the copy path
can take a non-trivial amount of time, so it opens up the guest to a real
sawtooth pattern. Getting rid of that, and all that comes with it (exits,
stalls, etc), just pays off.

> 
>>> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>> ++nvq->done_idx;
>>> } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>>> 
>>> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>>> vhost_tx_batch(net, nvq, sock, &msg);
>>> +
>>> + /* All of our work has been completed; however, before leaving the
>>> + * TX handler, do one last check for work, and requeue handler if
>>> + * necessary. If there is no work, queue will be reenabled.
>>> + */
>>> + vhost_net_busy_poll_try_queue(net, vq);
>> 
>> This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
>> value, while AFAICS prior to this patch vhost_poll_queue() is only
>> performed with busyloop_intr == true. Why don't we need to take care of
>> such flag here?
> 
> Hmm, I agree this is worth trying, a free, if possibly small, performance
> gain, why not. Jon, want to try?

I mentioned in the commit msg that the reason we’re doing this is to be
fair to handle_rx. If my read of vhost_net_busy_poll_try_queue is correct,
we would only call vhost_poll_queue iff:
1. The TX ring is not empty, in which case we want to run handle_tx again
2. When we go to reenable kicks, it returns non-zero, which means we
should run handle_tx again anyhow

If the ring is truly empty, and we can re-enable kicks with no drama, we
would not run vhost_poll_queue.
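
For reference, vhost_net_busy_poll_try_queue is roughly the following
(quoting the net-next code from memory, so treat this as a sketch):

static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
					  struct vhost_virtqueue *vq)
{
	if (!vhost_vq_avail_empty(&net->dev, vq))
		/* case 1: TX ring is not empty, run handle_tx again */
		vhost_poll_queue(&vq->poll);
	else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
		/* case 2: work appeared while re-enabling kicks, so
		 * disable them again and requeue the handler
		 */
		vhost_disable_notify(&net->dev, vq);
		vhost_poll_queue(&vq->poll);
	}
}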

That said, I think what you’re saying here is, we should check the busy
flag and *not* try vhost_net_busy_poll_try_queue, right? If so, great, I did
that in an internal version of this patch; however, it adds another conditional
which for the vast majority of users is not going to add any value (I think)

Happy to dig deeper, either on this change series, or a follow up?

> 
> 
>> @Michael: I assume you prefer that this patch go through the
>> net-next tree, right?
>> 
>> Thanks,
>> 
>> Paolo
> 
> I don't mind and this seems to be what Jon wants.
> I could queue it too, but the extra review it gets in the net tree is good.

My apologies, I thought all non-bug fixes had to go thru net-next,
which is why I sent the v2 to net-next; however if you want to queue
right away, I’m good with either. It's a fairly well contained patch with
a huge upside :) 

> 
> -- 
> MST
> 



Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-24 Thread Michael S. Tsirkin
On Thu, Apr 24, 2025 at 01:53:34PM +, Jon Kohler wrote:
> 
> 
> > On Apr 24, 2025, at 8:11 AM, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Apr 24, 2025 at 01:48:53PM +0200, Paolo Abeni wrote:
> >> On 4/20/25 3:05 AM, Jon Kohler wrote:
> >>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >>> index b9b9e9d40951..9b04025eea66 100644
> >>> --- a/drivers/vhost/net.c
> >>> +++ b/drivers/vhost/net.c
> >>> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >>> break;
> >>> /* Nothing new?  Wait for eventfd to tell us they refilled. */
> >>> if (head == vq->num) {
> >>> + /* If interrupted while doing busy polling, requeue
> >>> + * the handler to be fair to handle_rx as well as other
> >>> + * tasks waiting on cpu.
> >>> + */
> >>> if (unlikely(busyloop_intr)) {
> >>> vhost_poll_queue(&vq->poll);
> >>> - } else if (unlikely(vhost_enable_notify(&net->dev,
> >>> - vq))) {
> >>> - vhost_disable_notify(&net->dev, vq);
> >>> - continue;
> >>> }
> >>> + /* Kicks are disabled at this point, break loop and
> >>> + * process any remaining batched packets. Queue will
> >>> + * be re-enabled afterwards.
> >>> + */
> >>> break;
> >>> }
> >> 
> >> It's not clear to me why the zerocopy path does not need a similar change.
> > 
> > It can have one, it's just that Jon has a separate patch to drop
> > it completely. A commit log comment mentioning this would be a good
> > idea, yes.
> 
> Yea, the utility of the ZC side is a head scratcher for me, I can’t get it
> to work well to save my life. I’ve got a separate thread I need to respond
> to Eugenio on, will try to circle back on that next week.
> 
> The reason this one works so well is that the last batch in the copy path
> can take a non-trivial amount of time, so it opens up the guest to a real
> sawtooth pattern. Getting rid of that, and all that comes with it (exits,
> stalls, etc), just pays off.
> 
> > 
> >>> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >>> ++nvq->done_idx;
> >>> } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
> >>> 
> >>> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
> >>> vhost_tx_batch(net, nvq, sock, &msg);
> >>> +
> >>> + /* All of our work has been completed; however, before leaving the
> >>> + * TX handler, do one last check for work, and requeue handler if
> >>> + * necessary. If there is no work, queue will be reenabled.
> >>> + */
> >>> + vhost_net_busy_poll_try_queue(net, vq);
> >> 
> >> This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
> >> value, while AFAICS prior to this patch vhost_poll_queue() is only
> >> performed with busyloop_intr == true. Why don't we need to take care of
> >> such flag here?
> > 
> Hmm, I agree this is worth trying, a free, if possibly small, performance
> gain, why not. Jon, want to try?
> 
> I mentioned in the commit msg that the reason we’re doing this is to be
> fair to handle_rx. If my read of vhost_net_busy_poll_try_queue is correct,
> we would only call vhost_poll_queue iff:
> 1. The TX ring is not empty, in which case we want to run handle_tx again
> 2. When we go to reenable kicks, it returns non-zero, which means we
> should run handle_tx again anyhow
> 
> If the ring is truly empty, and we can re-enable kicks with no drama, we
> would not run vhost_poll_queue.
> 
> That said, I think what you’re saying here is, we should check the busy
> flag and *not* try vhost_net_busy_poll_try_queue, right?

yes

> If so, great, I did that in an internal version of this patch; however,
> it adds another conditional which for the vast majority of users is not
> going to add any value (I think).
> 
> Happy to dig deeper, either on this change series, or a follow up?

It just seems like a more conservative thing to do, given we already did
this in the past.

> > 
> > 
> >> @Michael: I assume you prefer that this patch go through the
> >> net-next tree, right?
> >> 
> >> Thanks,
> >> 
> >> Paolo
> > 
> > I don't mind and this seems to be what Jon wants.
> > I could queue it too, but the extra review it gets in the net tree is good.
> 
> My apologies, I thought all non-bug fixes had to go thru net-next,
> which is why I sent the v2 to net-next; however if you want to queue
> right away, I’m good with either. It's a fairly well contained patch with
> a huge upside :) 
> 
> > 
> > -- 
> > MST
> > 
> 




Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-24 Thread Jon Kohler


> On Apr 24, 2025, at 8:11 AM, Michael S. Tsirkin  wrote:
> 
> On Thu, Apr 24, 2025 at 01:48:53PM +0200, Paolo Abeni wrote:
>> On 4/20/25 3:05 AM, Jon Kohler wrote:
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index b9b9e9d40951..9b04025eea66 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>> break;
>>> /* Nothing new?  Wait for eventfd to tell us they refilled. */
>>> if (head == vq->num) {
>>> + /* If interrupted while doing busy polling, requeue
>>> + * the handler to be fair to handle_rx as well as other
>>> + * tasks waiting on cpu.
>>> + */
>>> if (unlikely(busyloop_intr)) {
>>> vhost_poll_queue(&vq->poll);
>>> - } else if (unlikely(vhost_enable_notify(&net->dev,
>>> - vq))) {
>>> - vhost_disable_notify(&net->dev, vq);
>>> - continue;
>>> }
>>> + /* Kicks are disabled at this point, break loop and
>>> + * process any remaining batched packets. Queue will
>>> + * be re-enabled afterwards.
>>> + */
>>> break;
>>> }
>> 
>> It's not clear to me why the zerocopy path does not need a similar change.
> 
> It can have one, it's just that Jon has a separate patch to drop
> it completely. A commit log comment mentioning this would be a good
> idea, yes.

Yea, the utility of the ZC side is a head scratcher for me, I can’t get it
to work well to save my life. I’ve got a separate thread I need to respond
to Eugenio on, will try to circle back on that next week.

The reason this one works so well is that the last batch in the copy path
can take a non-trivial amount of time, so it opens up the guest to a real
sawtooth pattern. Getting rid of that, and all that comes with it (exits,
stalls, etc), just pays off.

> 
>>> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>> ++nvq->done_idx;
>>> } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>>> 
>>> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>>> vhost_tx_batch(net, nvq, sock, &msg);
>>> +
>>> + /* All of our work has been completed; however, before leaving the
>>> + * TX handler, do one last check for work, and requeue handler if
>>> + * necessary. If there is no work, queue will be reenabled.
>>> + */
>>> + vhost_net_busy_poll_try_queue(net, vq);
>> 
>> This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
>> value, while AFAICS prior to this patch vhost_poll_queue() is only
>> performed with busyloop_intr == true. Why don't we need to take care of
>> such flag here?
> 
> Hmm, I agree this is worth trying, a free, if possibly small, performance
> gain, why not. Jon, want to try?

I mentioned in the commit msg that the reason we’re doing this is to be
fair to handle_rx. If my read of vhost_net_busy_poll_try_queue is correct,
we would only call vhost_poll_queue iff:
1. The TX ring is not empty, in which case we want to run handle_tx again
2. When we go to reenable kicks, it returns non-zero, which means we
should run handle_tx again anyhow

If the ring is truly empty, and we can re-enable kicks with no drama, we
would not run vhost_poll_queue.

That said, I think what you’re saying here is, we should check the busy
flag and *not* try vhost_net_busy_poll_try_queue, right? If so, great, I did
that in an internal version of this patch; however, it adds another conditional
which for the vast majority of users is not going to add any value (I think)

Happy to dig deeper, either on this change series, or a follow up?
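
Concretely, the guarded variant would be something like this (hypothetical
sketch, not what this patch does; busyloop_intr would also need to be
hoisted out of the do/while loop):

	/* Kicks are still disabled, dispatch any remaining batched msgs. */
	vhost_tx_batch(net, nvq, sock, &msg);

	/* hypothetical: if busy polling was interrupted, the handler was
	 * already requeued inside the loop, so skip the extra ring check
	 */
	if (!busyloop_intr)
		vhost_net_busy_poll_try_queue(net, vq);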

> 
> 
>> @Michael: I assume you prefer that this patch go through the
>> net-next tree, right?
>> 
>> Thanks,
>> 
>> Paolo
> 
> I don't mind and this seems to be what Jon wants.
> I could queue it too, but the extra review it gets in the net tree is good.

My apologies, I thought all non-bug fixes had to go thru net-next,
which is why I sent the v2 to net-next; however if you want to queue
right away, I’m good with either. It's a fairly well contained patch with
a huge upside :) 

> 
> -- 
> MST
> 



Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-24 Thread Michael S. Tsirkin
On Thu, Apr 24, 2025 at 01:48:53PM +0200, Paolo Abeni wrote:
> On 4/20/25 3:05 AM, Jon Kohler wrote:
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index b9b9e9d40951..9b04025eea66 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> > break;
> > /* Nothing new?  Wait for eventfd to tell us they refilled. */
> > if (head == vq->num) {
> > +   /* If interrupted while doing busy polling, requeue
> > +    * the handler to be fair to handle_rx as well as other
> > +    * tasks waiting on cpu.
> > +    */
> > if (unlikely(busyloop_intr)) {
> > vhost_poll_queue(&vq->poll);
> > -   } else if (unlikely(vhost_enable_notify(&net->dev,
> > -   vq))) {
> > -   vhost_disable_notify(&net->dev, vq);
> > -   continue;
> > }
> > +   /* Kicks are disabled at this point, break loop and
> > +* process any remaining batched packets. Queue will
> > +* be re-enabled afterwards.
> > +*/
> > break;
> > }
> 
> It's not clear to me why the zerocopy path does not need a similar change.

It can have one, it's just that Jon has a separate patch to drop
it completely. A commit log comment mentioning this would be a good
idea, yes.

> > @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> > ++nvq->done_idx;
> > } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
> >  
> > +   /* Kicks are still disabled, dispatch any remaining batched msgs. */
> > vhost_tx_batch(net, nvq, sock, &msg);
> > +
> > +   /* All of our work has been completed; however, before leaving the
> > +* TX handler, do one last check for work, and requeue handler if
> > +* necessary. If there is no work, queue will be reenabled.
> > +*/
> > +   vhost_net_busy_poll_try_queue(net, vq);
> 
> This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
> value, while AFAICS prior to this patch vhost_poll_queue() is only
> performed with busyloop_intr == true. Why don't we need to take care of
> such flag here?

Hmm, I agree this is worth trying, a free, if possibly small, performance
gain, why not. Jon, want to try?


> @Michael: I assume you prefer that this patch go through the
> net-next tree, right?
> 
> Thanks,
> 
> Paolo

I don't mind and this seems to be what Jon wants.
I could queue it too, but the extra review it gets in the net tree is good.

-- 
MST




Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-24 Thread Paolo Abeni
On 4/20/25 3:05 AM, Jon Kohler wrote:
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index b9b9e9d40951..9b04025eea66 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   break;
>   /* Nothing new?  Wait for eventfd to tell us they refilled. */
>   if (head == vq->num) {
> + /* If interrupted while doing busy polling, requeue
> +  * the handler to be fair to handle_rx as well as other
> +  * tasks waiting on cpu.
> +  */
>   if (unlikely(busyloop_intr)) {
>   vhost_poll_queue(&vq->poll);
> - } else if (unlikely(vhost_enable_notify(&net->dev,
> - vq))) {
> - vhost_disable_notify(&net->dev, vq);
> - continue;
>   }
> + /* Kicks are disabled at this point, break loop and
> +  * process any remaining batched packets. Queue will
> +  * be re-enabled afterwards.
> +  */
>   break;
>   }

It's not clear to me why the zerocopy path does not need a similar change.

> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   ++nvq->done_idx;
>   } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>  
> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>   vhost_tx_batch(net, nvq, sock, &msg);
> +
> + /* All of our work has been completed; however, before leaving the
> +  * TX handler, do one last check for work, and requeue handler if
> +  * necessary. If there is no work, queue will be reenabled.
> +  */
> + vhost_net_busy_poll_try_queue(net, vq);

This will call vhost_poll_queue() regardless of the 'busyloop_intr' flag
value, while AFAICS prior to this patch vhost_poll_queue() is only
performed with busyloop_intr == true. Why don't we need to take care of
such flag here?

@Michael: I assume you prefer that this patch go through the
net-next tree, right?

Thanks,

Paolo




Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-20 Thread Jon Kohler


> On Apr 20, 2025, at 3:32 AM, Michael S. Tsirkin  wrote:
> 
> On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
>> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
>> batches up to 64 messages before calling sock->sendmsg.
>> 
>> Currently, when there are no more messages on the ring to dequeue,
>> handle_tx_copy re-enables kicks on the ring *before* firing off the
>> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
>> especially if it needs to wake up a thread (e.g., another vhost worker).
>> 
>> If the guest submits additional messages immediately after the last ring
>> check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
>> kick the vhost worker. This may happen while the worker is still
>> processing the sendmsg, leading to wasteful exit(s).
>> 
>> This is particularly problematic for single-threaded guest submission
>> threads, as they must exit, wait for the exit to be processed
>> (potentially involving a TTWU), and then resume.
>> 
>> In scenarios like a constant stream of UDP messages, this results in a
>> sawtooth pattern where the submitter frequently vmexits, and the
>> vhost-net worker alternates between sleeping and waking.
>> 
>> A common solution is to configure vhost-net busy polling via userspace
>> (e.g., qemu poll-us). However, treating the sendmsg as the "busy"
>> period by keeping kicks disabled during the final sendmsg and
>> performing one additional ring check afterward provides a significant
>> performance improvement without any excess busy poll cycles.
>> 
>> If messages are found in the ring after the final sendmsg, requeue the
>> TX handler. This ensures fairness for the RX handler and allows
>> vhost_run_work_list to cond_resched() as needed.
>> 
>> Test Case
>> TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
>> RX VM: taskset -c 2 iperf3 -s -p 5200 -D
>> 6.12.0, each worker backed by tun interface with IFF_NAPI setup.
>> Note: TCP side is largely unchanged as that was copy bound
>> 
>> 6.12.0 unpatched
>> EPT_MISCONFIG/second: 5411
>> Datagrams/second: ~382k
>> Interval Transfer Bitrate Lost/Total Datagrams
>> 0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender
>> 
>> 6.12.0 patched
>> EPT_MISCONFIG/second: 58 (~93x reduction)
>> Datagrams/second: ~650k  (~1.7x increase)
>> Interval Transfer Bitrate Lost/Total Datagrams
>> 0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender
>> 
>> Acked-by: Jason Wang 
>> Signed-off-by: Jon Kohler 
> 
> sounds like the right approach.
> 
> Acked-by: Michael S. Tsirkin 
> 
> 
> 
>> ---
> 
> 
> in the future, pls put the changelog here as you progress v1->v2->v3.
> Thanks!

Thanks for checking out the patch, sorry about the lack of a
changelog. For posterity, the only change in v2 was sending the patch
to net-next, as I forgot to add the prefix on v1.

v1 patch with Jason’s ack is here:
https://lore.kernel.org/kvm/20250401043230.790419-1-...@nutanix.com/T/

> 
>> drivers/vhost/net.c | 19 +++
>> 1 file changed, 15 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index b9b9e9d40951..9b04025eea66 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> break;
>> /* Nothing new?  Wait for eventfd to tell us they refilled. */
>> if (head == vq->num) {
>> + /* If interrupted while doing busy polling, requeue
>> + * the handler to be fair to handle_rx as well as other
>> + * tasks waiting on cpu.
>> + */
>> if (unlikely(busyloop_intr)) {
>> vhost_poll_queue(&vq->poll);
>> - } else if (unlikely(vhost_enable_notify(&net->dev,
>> - vq))) {
>> - vhost_disable_notify(&net->dev, vq);
>> - continue;
>> }
>> + /* Kicks are disabled at this point, break loop and
>> + * process any remaining batched packets. Queue will
>> + * be re-enabled afterwards.
>> + */
>> break;
>> }
>> 
>> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> ++nvq->done_idx;
>> } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>> 
>> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>> vhost_tx_batch(net, nvq, sock, &msg);
>> +
>> + /* All of our work has been completed; however, before leaving the
>> + * TX handler, do one last check for work, and requeue handler if
>> + * necessary. If there is no work, queue will be reenabled.
>> + */
>> + vhost_net_busy_poll_try_queue(net, vq);
>> }
>> 
>> static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>> -- 
>> 2.43.0
> 



Re: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-20 Thread Michael S. Tsirkin
On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
> batches up to 64 messages before calling sock->sendmsg.
> 
> Currently, when there are no more messages on the ring to dequeue,
> handle_tx_copy re-enables kicks on the ring *before* firing off the
> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
> especially if it needs to wake up a thread (e.g., another vhost worker).
> 
> If the guest submits additional messages immediately after the last ring
> check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
> kick the vhost worker. This may happen while the worker is still
> processing the sendmsg, leading to wasteful exit(s).
> 
> This is particularly problematic for single-threaded guest submission
> threads, as they must exit, wait for the exit to be processed
> (potentially involving a TTWU), and then resume.
> 
> In scenarios like a constant stream of UDP messages, this results in a
> sawtooth pattern where the submitter frequently vmexits, and the
> vhost-net worker alternates between sleeping and waking.
> 
> A common solution is to configure vhost-net busy polling via userspace
> (e.g., qemu poll-us). However, treating the sendmsg as the "busy"
> period by keeping kicks disabled during the final sendmsg and
> performing one additional ring check afterward provides a significant
> performance improvement without any excess busy poll cycles.
> 
> If messages are found in the ring after the final sendmsg, requeue the
> TX handler. This ensures fairness for the RX handler and allows
> vhost_run_work_list to cond_resched() as needed.
> 
> Test Case
> TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
> RX VM: taskset -c 2 iperf3 -s -p 5200 -D
> 6.12.0, each worker backed by tun interface with IFF_NAPI setup.
> Note: TCP side is largely unchanged as that was copy bound
> 
> 6.12.0 unpatched
> EPT_MISCONFIG/second: 5411
> Datagrams/second: ~382k
> Interval Transfer Bitrate Lost/Total Datagrams
> 0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender
> 
> 6.12.0 patched
> EPT_MISCONFIG/second: 58 (~93x reduction)
> Datagrams/second: ~650k  (~1.7x increase)
> Interval Transfer Bitrate Lost/Total Datagrams
> 0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender
> 
> Acked-by: Jason Wang 
> Signed-off-by: Jon Kohler 

sounds like the right approach.

Acked-by: Michael S. Tsirkin 



> ---


in the future, pls put the changelog here as you progress v1->v2->v3.
Thanks!

>  drivers/vhost/net.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index b9b9e9d40951..9b04025eea66 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   break;
>   /* Nothing new?  Wait for eventfd to tell us they refilled. */
>   if (head == vq->num) {
> + /* If interrupted while doing busy polling, requeue
> +  * the handler to be fair to handle_rx as well as other
> +  * tasks waiting on cpu.
> +  */
>   if (unlikely(busyloop_intr)) {
>   vhost_poll_queue(&vq->poll);
> - } else if (unlikely(vhost_enable_notify(&net->dev,
> - vq))) {
> - vhost_disable_notify(&net->dev, vq);
> - continue;
>   }
> + /* Kicks are disabled at this point, break loop and
> +  * process any remaining batched packets. Queue will
> +  * be re-enabled afterwards.
> +  */
>   break;
>   }
>  
> @@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>   ++nvq->done_idx;
>   } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
>  
> + /* Kicks are still disabled, dispatch any remaining batched msgs. */
>   vhost_tx_batch(net, nvq, sock, &msg);
> +
> + /* All of our work has been completed; however, before leaving the
> +  * TX handler, do one last check for work, and requeue handler if
> +  * necessary. If there is no work, queue will be reenabled.
> +  */
> + vhost_net_busy_poll_try_queue(net, vq);
>  }
>  
>  static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
> -- 
> 2.43.0




[PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg

2025-04-19 Thread Jon Kohler
In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after the last ring
check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
kick the vhost worker. This may happen while the worker is still
processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for single-threaded guest submission
threads, as they must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.
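
In other words, the tail of handle_tx_copy becomes, in sketch form (the
literal change is in the diff below):

	/* the loop above exits with kicks still disabled */
	vhost_tx_batch(net, nvq, sock, &msg);	/* final sendmsg is the
						 * "busy" period */
	vhost_net_busy_poll_try_queue(net, vq);	/* one last ring check:
						 * requeue the handler if
						 * work appeared, else
						 * re-enable kicks */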

Test Case
TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
RX VM: taskset -c 2 iperf3 -s -p 5200 -D
6.12.0, each worker backed by tun interface with IFF_NAPI setup.
Note: TCP side is largely unchanged as that was copy bound

6.12.0 unpatched
EPT_MISCONFIG/second: 5411
Datagrams/second: ~382k
Interval Transfer Bitrate Lost/Total Datagrams
0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
EPT_MISCONFIG/second: 58 (~93x reduction)
Datagrams/second: ~650k  (~1.7x increase)
Interval Transfer Bitrate Lost/Total Datagrams
0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang 
Signed-off-by: Jon Kohler 
---
 drivers/vhost/net.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..9b04025eea66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
break;
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq->num) {
+   /* If interrupted while doing busy polling, requeue
+    * the handler to be fair to handle_rx as well as other
+    * tasks waiting on cpu.
+    */
if (unlikely(busyloop_intr)) {
vhost_poll_queue(&vq->poll);
-   } else if (unlikely(vhost_enable_notify(&net->dev,
-   vq))) {
-   vhost_disable_notify(&net->dev, vq);
-   continue;
}
+   /* Kicks are disabled at this point, break loop and
+* process any remaining batched packets. Queue will
+* be re-enabled afterwards.
+*/
break;
}
 
@@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
++nvq->done_idx;
} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
+   /* Kicks are still disabled, dispatch any remaining batched msgs. */
vhost_tx_batch(net, nvq, sock, &msg);
+
+   /* All of our work has been completed; however, before leaving the
+* TX handler, do one last check for work, and requeue handler if
+* necessary. If there is no work, queue will be reenabled.
+*/
+   vhost_net_busy_poll_try_queue(net, vq);
 }
 
 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
-- 
2.43.0