[virtio-dev] Re: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-24 Thread Jason Wang
On Thu, Jan 25, 2024 at 11:05 AM Heng Qi  wrote:
>
>
>
> > On 2024/1/24 9:18 PM, Parav Pandit wrote:
> >> From: Heng Qi 
> >> Sent: Wednesday, January 24, 2024 6:31 PM
> >>
> >>
> >> On 2024/1/22 1:03 PM, Parav Pandit wrote:
>  From: Heng Qi 
>  Sent: Monday, January 22, 2024 8:27 AM
> 
>  On 2024/1/20 5:59 PM, Parav Pandit wrote:
> >> From: Heng Qi 
> >> Sent: Wednesday, January 17, 2024 10:22 AM
> >>
> >> On 2024/1/15 9:21 PM, Parav Pandit wrote:
>  From: virtio-comm...@lists.oasis-open.org
>   On Behalf Of Heng Qi
>  Sent: Monday, January 15, 2024 6:36 PM
> 
>  Currently, each time the driver attempts to update the
>  coalescing parameters for a vq, it needs to kick the device and
>  wait for the ctrlq response to return.
> >>> It does not need to wait. This is a driver limitation: it does
> >>> not use
> >> the queue as a "queue".
> >>> Such a driver limitation should be removed from the driver. It does
> >>> not qualify
> >> as a limitation.
> >>
> >> Yes, we don't have to wait.
> >>
> >> But in general, for user commands, it is necessary to obtain the
> >> final results synchronously.
> > Yes. A user-initiated command can enqueue the request to the cvq, then
> > go to sleep
>  for several microseconds to milliseconds.
> >> The user command cannot return before the final result is obtained.
> >> And waiting is not the problem this patch solves.
> >>
> > By not holding the rtnl lock, the rest of the contexts that need to
> > enqueue
>  requests can progress, such as that of netdim.
> 
>  I would like to see the use of the rtnl lock changed.
> 
> >>> Inside virtnet_rx_dim_work() there should be no rtnl lock call.
> >>> A virtio_device-level lock should be used for the cvq instead. :)
> >>>
>  In addition, I have made the netdim commands batched and asynchronous;
>  you can refer to this patch:
>  https://lore.kernel.org/all/1705410693-118895-4-git-send-email-
>  hen...@linux.alibaba.com/
> 
> >>> In the driver patch listed above, the stated motivation "to optimize the CPU
> >>> overhead of the DIM worker caused by the guest being busy waiting for
> >>> the command response result"
> >>>
> >>> is not right,
> >>> because the guest is still busy waiting.
> >> There is no busy waiting for guests; see get_cvq_work().
> >>
> > OK, it does not always busy wait, but sometimes it does.
>
> Busy waiting only occurs when a user command or dim command cannot
> find an available buffer for the cvq.
>
> User commands still use polling mode for now; I have not tried to
> optimize this. The current focus is improving dim performance.
>
> > virtnet_cvq_response() should not have a flag.
>
> The flag is mainly used to identify whether the command is a user
> command. If so, the previous polling mode is still maintained.
>
> >
> > Whoever gets the OS-global rtnl lock calls virtnet_cvq_response(),
> > then checks and releases.
> > It shouldn't be done this way with trylock, etc.
> > The rtnl lock is not supposed to protect some low-level driver ctrlvq
> > response flag.
>
> To summarize, we now want to make some improvements to the cvq:
> 1. A reasonable timeout in busy-waiting mode, or an interrupt-based mode, etc.

Let's use interrupt to avoid tricky code.
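A minimal sketch of what that could look like (hypothetical code: it
assumes a ctrl_done completion in struct virtnet_info, initialized at
probe, and the callback registered for the cvq via find_vqs(); the
in-tree driver instead busy-polls virtqueue_get_buf()):

/* Hypothetical cvq callback. */
static void ctrlq_done_cb(struct virtqueue *cvq)
{
        struct virtnet_info *vi = cvq->vdev->priv;

        complete(&vi->ctrl_done);
}

static int ctrlq_send_and_wait(struct virtnet_info *vi,
                               struct scatterlist *sgs[],
                               unsigned int out_num, unsigned int in_num)
{
        int err;

        reinit_completion(&vi->ctrl_done);
        err = virtqueue_add_sgs(vi->cvq, sgs, out_num, in_num, vi, GFP_KERNEL);
        if (err)
                return err;
        virtqueue_kick(vi->cvq);

        /* Sleep instead of busy-polling; the timeout also gives the
         * "reasonable timeout" from item 1. */
        if (!wait_for_completion_timeout(&vi->ctrl_done, HZ))
                return -ETIMEDOUT;
        return 0;
}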

> 2. Batch processing (the core problem is how to get the results of user
> commands back synchronously)
> 3. Remove rtnl_lock’s protection for ctrlq.

Virtio-comment is not the right place to discuss these. Let's move it to netdev.

Thanks





RE: [virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-24 Thread Parav Pandit

> From: Heng Qi 
> Sent: Wednesday, January 24, 2024 6:31 PM
> 
> 
> On 2024/1/22 1:03 PM, Parav Pandit wrote:
> >
> >> From: Heng Qi 
> >> Sent: Monday, January 22, 2024 8:27 AM
> >>
> >> On 2024/1/20 5:59 PM, Parav Pandit wrote:
>  From: Heng Qi 
>  Sent: Wednesday, January 17, 2024 10:22 AM
> 
>  On 2024/1/15 9:21 PM, Parav Pandit wrote:
> >> From: virtio-comm...@lists.oasis-open.org
> >>  On Behalf Of Heng Qi
> >> Sent: Monday, January 15, 2024 6:36 PM
> >>
> >> Currently, each time the driver attempts to update the
> >> coalescing parameters for a vq, it needs to kick the device and
> >> wait for the ctrlq response to return.
> > It does not need to wait. This is a driver limitation: it does
> > not use
>  the queue as a "queue".
> > Such a driver limitation should be removed from the driver. It does
> > not qualify
>  as a limitation.
> 
>  Yes, we don't have to wait.
> 
>  But in general, for user commands, it is necessary to obtain the
>  final results synchronously.
> >>> Yes. A user-initiated command can enqueue the request to the cvq, then
> >>> go to sleep
> >> for several microseconds to milliseconds.
>  The user command cannot return before the final result is obtained.
>  And waiting is not the problem this patch solves.
> 
> >>> By not holding the rtnl lock, the rest of the contexts that need to
> >>> enqueue
> >> requests can progress, such as that of netdim.
> >>
> >> I would like to see the use of the rtnl lock changed.
> >>
> > Inside virtnet_rx_dim_work() there should be no rtnl lock call.
> > A virtio_device-level lock should be used for the cvq instead. :)
> >
> >> In addition, I have made the netdim commands batched and asynchronous;
> >> you can refer to this patch:
> >> https://lore.kernel.org/all/1705410693-118895-4-git-send-email-
> >> hen...@linux.alibaba.com/
> >>
> > In the driver patch listed above, the stated motivation "to optimize the CPU
> > overhead of the DIM worker caused by the guest being busy waiting for
> > the command response result"
> >
> > is not right,
> > because the guest is still busy waiting.
> 
> There is no busy waiting for guests; see get_cvq_work().
> 
OK, it does not always busy wait, but sometimes it does. virtnet_cvq_response()
should not have a flag.

Whoever gets the OS-global rtnl lock calls virtnet_cvq_response(), then
checks and releases.
It shouldn't be done this way with trylock, etc.
The rtnl lock is not supposed to protect some low-level driver ctrlvq
response flag.


> > Without batching, due to the rtnl lock, every VQ command is serialized as one
> outstanding command at a time in virtnet_rx_dim_work().
> > Because of this, the device is unable to benefit from DMA batching at large scale.
> 
> Adding dim commands is now asynchronous, and the device will receive
> batches of commands.
> 
I am comparing code that has no rtnl lock with the code you posted.
The reference code of patch [1] needs to start without the OS-global rtnl
lock as the baseline.

[1] 
https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hen...@linux.alibaba.com/
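As an illustration of the kind of batched enqueue being discussed, a
hypothetical helper (not the code of patch [1]; a real ctrlq command also
carries a class/cmd header and an ack status buffer, omitted here for
brevity; struct virtio_net_ctrl_coal_vq is from linux/virtio_net.h):

static int queue_coal_cmds(struct virtqueue *cvq,
                           struct virtio_net_ctrl_coal_vq *cmds, int n)
{
        struct scatterlist sg;
        int i, err;

        for (i = 0; i < n; i++) {
                sg_init_one(&sg, &cmds[i], sizeof(cmds[i]));
                /* No wait between adds; completions are reaped later
                 * from the cvq callback. */
                err = virtqueue_add_outbuf(cvq, &sg, 1, &cmds[i], GFP_ATOMIC);
                if (err)
                        return err;
        }
        virtqueue_kick(cvq);    /* one notification for the whole batch */
        return 0;
}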

> >
> > This will enable the driver to enqueue multiple cvq commands without
> > waiting
>  for the previous one.
> > This will also enable the device to find natural coalescing done on
> > multiple
>  commands.
> 
>  When user commands are batched, ensuring synchronization is a
> concern.
> 
> >> The following path is observed: 1. The driver kicks the device; 2.
> >> After the device receives the kick, CPU scheduling occurs and it DMAs
> >> multiple buffers multiple times; 3. The device completes
> >> processing and replies
>  with a response.
> >> When large-queue devices issue multiple requests and kick the
> >> device frequently, this often interrupts the work of the device-side
> >> CPU.
> > When there are large devices and multiple driver notifications by a
> > CPU that is N times faster than the device-side CPU, the device
> > may find natural
>  coalescing on the commands of a given cvq.
> 
>  First we have to solve batching user (ethtool) commands on the ctrlq.
>  Even if they are processed in a batched way on the device side, the number of
>  kicks and the number of backend DMAs is not reduced.
> >>> Driver notifications are PCI writes, so they should not hamper the device
> >>> side,
> >> which can ignore them when it does not care about them.
> >>
> >> Driver notifications need to be processed by the DPU, which
> >> interferes with the CPU on the DPU.
> >>
> > I was asking if there is any way for your DPU to ignore these
> notifications while the previous one is pending.
> > From your above description, it seems there isn't.
> 
> While the device is processing the request, additional kicks are ignored.
> 
OK, that is good. In that case the kicks are not interfering, right?
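For reference, this matches the standard suppression mechanism: with
VIRTIO_RING_F_EVENT_IDX negotiated, the driver checks whether the device
actually wants a notification before touching the doorbell, so
back-to-back enqueues typically produce a single kick:

/* Generic driver-side pattern (virtqueue_kick() does exactly this
 * internally): only notify if the device asked for it. */
if (virtqueue_kick_prepare(cvq))
        virtqueue_notify(cvq);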


Re: [virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-24 Thread Heng Qi




On 2024/1/22 1:03 PM, Parav Pandit wrote:



From: Heng Qi 
Sent: Monday, January 22, 2024 8:27 AM

On 2024/1/20 5:59 PM, Parav Pandit wrote:

From: Heng Qi 
Sent: Wednesday, January 17, 2024 10:22 AM

On 2024/1/15 9:21 PM, Parav Pandit wrote:

From: virtio-comm...@lists.oasis-open.org
 On Behalf Of Heng Qi
Sent: Monday, January 15, 2024 6:36 PM

Currently, each time the driver attempts to update the
coalescing parameters for a vq, it needs to kick the device and
wait for the ctrlq response to return.

It does not need to wait. This is a driver limitation: it does not use
the queue as a "queue".

Such a driver limitation should be removed from the driver. It does not
qualify as a limitation.

Yes, we don't have to wait.

But in general, for user commands, it is necessary to obtain the
final results synchronously.

Yes. A user-initiated command can enqueue the request to the cvq, then go
to sleep for several microseconds to milliseconds.

The user command cannot return before the final result is obtained.
And waiting is not the problem this patch solves.


By not holding the rtnl lock, the rest of the contexts that need to enqueue
requests can progress, such as that of netdim.

I would like to see the use of the rtnl lock changed.


Inside virtnet_rx_dim_work() there should be no rtnl lock call.
A virtio_device-level lock should be used for the cvq instead. :)


In addition, I have made the netdim commands batched and asynchronous;
you can refer to this patch:
https://lore.kernel.org/all/1705410693-118895-4-git-send-email-
hen...@linux.alibaba.com/


In the driver patch listed above, the stated motivation "to optimize the
CPU overhead of the DIM worker caused by the guest being busy
waiting for the command response result"

is not right,
because the guest is still busy waiting.


There is no busy waiting for guests; see get_cvq_work().


Without batching, due to the rtnl lock, every VQ command is serialized as
one outstanding command at a time in virtnet_rx_dim_work().
Because of this, the device is unable to benefit from DMA batching at
large scale.


Adding dim commands is now asynchronous, and the device will receive 
batches of commands.


  

This will enable the driver to enqueue multiple cvq commands without
waiting for the previous one.

This will also enable the device to find natural coalescing done on
multiple commands.

When user commands are batched, ensuring synchronization is a concern.


The following path is observed: 1. The driver kicks the device; 2.
After the device receives the kick, CPU scheduling occurs and it DMAs
multiple buffers multiple times; 3. The device completes processing
and replies with a response.

When large-queue devices issue multiple requests and kick the
device frequently, this often interrupts the work of the device-side CPU.

When there are large devices and multiple driver notifications by a
CPU that is N times faster than the device-side CPU, the device may
find natural coalescing on the commands of a given cvq.

First we have to solve batching user (ethtool) commands on the ctrlq.
Even if they are processed in a batched way on the device side, the number
of kicks and the number of backend DMAs is not reduced.

Driver notifications are PCI writes, so they should not hamper the device
side, which can ignore them when it does not care about them.

Driver notifications need to be processed by the DPU, which interferes with
the CPU on the DPU.


I was asking if there is any way for your DPU to ignore these
notifications while the previous one is pending.
From your above description, it seems there isn't.


While the device is processing the request, additional kicks are ignored.

Thanks.




Backend DMAs should be reduced by avoiding the LIFO pattern followed by
the splitq driver.

Placing the descriptors contiguously, as packedq does, naturally reduces
the amount of DMA.

splitq is widely used; migrating to packedq is not that easy, especially
when many components and much hardware are involved.

I am not suggesting packedq.
I am suggesting fixing the driver not to do LIFO on descriptors for splitq.
In other words, using a contiguous set of descriptors on splitq will improve
splitq for DMA.
This will allow using splitq more efficiently for DMA as a short-term
solution until more efficient queues are defined.


The second predictable DMA to avoid: carry 8 bytes of data inline in the
descriptor, instead of a 16B indirection and an extra DMA.
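For context, the split-ring descriptor layout (struct vring_desc in the
virtio spec; the Linux uapi header uses __virtio64/__virtio32/__virtio16
types) shows why an 8-byte payload today costs a 16-byte descriptor fetch
plus a second DMA for the payload itself:

struct vring_desc {
        __le64 addr;    /* payload lives behind this address */
        __le32 len;     /* buffer length */
        __le16 flags;   /* NEXT / WRITE / INDIRECT */
        __le16 next;    /* next descriptor in the chain */
};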

Looking forward to the inline work!
But I think this does not conflict with the batching work, and combining
the two will be more beneficial.


It does not conflict. However, batching for a large number of queues may
not be able to use the inline approach, as the data bytes may not fit inline.

For multiple DMAs, we need a way to send 8 bytes of data without 16
bytes of indirection via a descriptor.

This is what we discussed a while back to do in the txq, and Stefan
suggested generalizing it to more queues, which is also a good idea.

Yes, this sounds good.


This is the next item to focus on as soon as flow filters are stable/merged.


Re: [virtio-dev] Re: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-23 Thread Heng Qi




On 2024/1/23 3:15 PM, Michael S. Tsirkin wrote:

On Tue, Jan 23, 2024 at 05:55:02AM +, Parav Pandit wrote:

From: Michael S. Tsirkin 
Sent: Monday, January 22, 2024 1:06 PM

On Mon, Jan 22, 2024 at 05:03:38AM +, Parav Pandit wrote:

The right test on Linux is to do it without the rtnl lock, which is anyway
ugly and the wrong semantic to use, as it blocks the whole netdev stack
(in case you used that).

Do you have any good directions or known attempts to remove rtnl_lock?


I think a per-device lock instead of rtnl is the first step that we can
start with.

Will check internally whether someone has already started working on it.
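A minimal sketch of that first step (the cvq_lock field and the wrapper
are hypothetical; virtnet_send_command() is the existing cvq helper in
drivers/net/virtio_net.c):

struct virtnet_info {
        struct mutex cvq_lock;  /* assumed new field, protects the ctrlq */
        /* ... existing fields ... */
};

static bool virtnet_send_command_locked(struct virtnet_info *vi, u8 class,
                                        u8 cmd, struct scatterlist *out)
{
        bool ok;

        mutex_lock(&vi->cvq_lock);
        ok = virtnet_send_command(vi, class, cmd, out);
        mutex_unlock(&vi->cvq_lock);
        return ok;
}

With this, netdim workers on different devices stop serializing on the
global rtnl lock.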

I feel the issue is at the conceptual level.

Not for requests that are initiated by the kernel stack (i.e., not user
initiated).

So how is this different? Is it basically just because
tweaking coalescing in unexpected ways is considered mostly
harmless?


DIM sends configurations frequently, and it is best effort.
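For context, a sketch of how a driver typically feeds netdim from its NAPI
poll loop (the receive_queue fields are assumed for illustration;
dim_update_sample() and net_dim() are the linux/dim.h API):

#include <linux/dim.h>

/* Called at the end of NAPI poll. net_dim() may schedule dim->work,
 * whose handler then issues a coalescing command for this VQ; this is
 * the frequent, best-effort traffic being discussed. */
static void rq_update_dim(struct receive_queue *rq)
{
        struct dim_sample sample = {};

        dim_update_sample(rq->calls, rq->packets, rq->bytes, &sample);
        net_dim(&rq->dim, sample);
}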




Yes, some drivers will take a command
and just queue it for execution later, but this means that errors cannot be
propagated back at all. Imagine a device with MAC
0x123 in promiscuous mode. Now the commands:

1- program MAC 0xabcdef
2- disable promiscuous mode


Errors for user-initiated commands can be propagated when the command
completes. Enqueuing a command happens at a different, lower level in the
driver.
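One hypothetical shape for this: give each queued command its own status
and completion, so the enqueue path stays asynchronous while the user
context still gets the real result (a sketch, not existing driver code):

struct ctrl_cmd_ctx {
        struct virtio_net_ctrl_hdr hdr;  /* class + cmd, from the uapi */
        u8 ack;                          /* VIRTIO_NET_OK / VIRTIO_NET_ERR */
        struct completion done;          /* completed by the cvq callback */
};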


If command 1 fails but command 2 proceeds, then packets with MAC 0xabcdef
will be dropped.

Any attempt to batch arbitrary commands will have this issue, be it at the
driver or device level.


There is no suggestion to batch arbitrary commands from the driver side.
The suggestion is to batch VQ notification coalescing commands from the
driver side.


So, here's my question: what exactly is the guest behaviour that is driving
this work? Is it with a Linux guest?

At least it looks like yes to me, based on the partial patches, which take
the rtnl lock in netdim's worker callbacks.


Which commands does userspace issue that require us
to send multiple vq coalescing commands?

None.


If all you want is to send the
same config to all VQs, then why not just use
VIRTIO_NET_CTRL_NOTF_COAL_RX_SET as opposed to
VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET?

Only the kernel stack initiates VQ notification coalescing changes.
Since every VQ has different values, VIRTIO_NET_CTRL_NOTF_COAL_RX_SET is
not sufficient.
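For reference, the two payloads as defined in the Linux uapi header
(linux/virtio_net.h): RX_SET carries one global setting, while VQ_SET
targets a single vqn, so per-queue netdim results need one VQ_SET command
per queue unless a multi-vq command like this patch is added:

/* VIRTIO_NET_CTRL_NOTF_COAL_RX_SET: one setting for all RX queues. */
struct virtio_net_ctrl_coal_rx {
        __le32 rx_max_packets;
        __le32 rx_usecs;
};

/* VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET: a setting for one queue. */
struct virtio_net_ctrl_coal {
        __le32 max_packets;
        __le32 max_usecs;
};

struct virtio_net_ctrl_coal_vq {
        __le16 vqn;
        __le16 reserved;
        struct virtio_net_ctrl_coal coal;
};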


