RE: [virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-24 Thread Parav Pandit

> From: Heng Qi 
> Sent: Wednesday, January 24, 2024 6:31 PM
> 
> 
> On 2024/1/22 1:03 PM, Parav Pandit wrote:
> >
> >> From: Heng Qi 
> >> Sent: Monday, January 22, 2024 8:27 AM
> >>
> >> On 2024/1/20 5:59 PM, Parav Pandit wrote:
> >>>> From: Heng Qi 
> >>>> Sent: Wednesday, January 17, 2024 10:22 AM
> >>>>
> >>>> On 2024/1/15 9:21 PM, Parav Pandit wrote:
> >>>>>> From: virtio-comm...@lists.oasis-open.org
> >>>>>>  On Behalf Of Heng Qi
> >>>>>> Sent: Monday, January 15, 2024 6:36 PM
> >>>>>>
> >>>>>> Currently, when each time the driver attempts to update the
> >>>>>> coalescing parameters for a vq, it needs to kick the device and
> >>>>>> wait for the ctrlq response to return.
> >>>>> It does not need to wait. This is some driver limitation that does
> >>>>> not use
> >>>> the queue as "queue".
> >>>>> Such driver limitation should be removed in the driver. It does
> >>>>> not qualify
> >>>> as limitation.
> >>>>
> >>>> Yes, we don't have to wait.
> >>>>
> >>>> But in general, for user commands, it is necessary to obtain the
> >>>> final results synchronously.
> >>> Yes. A user-initiated command can enqueue the request to the cvq and go
> >>> to sleep for several micro- to milliseconds.
> >>>> The user command cannot return before the final result is obtained.
> >>>> And wait is not the problem this patch solves.
> >>>>
> >>> By not holding the rtnl lock, the rest of the contexts that need to
> >>> enqueue requests, such as netdim, can make progress.
> >>
> >> I would like to see the use of the rtnl lock changed.
> >>
> > Today there is an rtnl lock call inside virtnet_rx_dim_work().
> > A virtio_device-level lock should be used for the cvq instead. :)
> >
> >> In addition, I have made the netdim command batched and asynchronous;
> >> you can refer to this patch:
> >> https://lore.kernel.org/all/1705410693-118895-4-git-send-email-
> >> hen...@linux.alibaba.com/
> >>
> > In the above-listed driver patch, the stated motivation "to optimize the
> > CPU overhead of the DIM worker caused by the guest being busy waiting for
> > the command response result"
> >
> > is not right,
> > because the guest is still busy waiting.
> 
> There is no busy wait for guests, see get_cvq_work().
> 
Ok. It is not always busy waiting, but sometimes it is. virtnet_cvq_response()
should not have such a flag.

Whoever gets the OS-global rtnl lock calls virtnet_cvq_response(), checks the
flag and releases it.
It shouldn't be done this way, with try-lock etc.
The rtnl lock is not supposed to protect a low-level driver's ctrl-vq response
flag.


> > Without batching, due to the rtnl lock, every VQ command is serialized as
> > one outstanding command at a time in virtnet_rx_dim_work().
> > Due to this, the device is unable to benefit from DMA batching at large scale.
> 
> Adding dim commands is now asynchronous, and the device will receive
> batches of commands.
> 
I am comparing code that takes no rtnl lock with the code you posted.
The reference code of patch [1] needs to start from a baseline without the
OS-global rtnl lock.

[1] 
https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hen...@linux.alibaba.com/

> >
> >>>>> This will enable driver to enqueue multiple cvq commands without
> >>>>> waiting
> >>>> for previous one.
> >>>>> This will also enable device to find natural coalescing done on
> >>>>> multiple
> >>>> commands.
> >>>>
> >>>> When batch user commands occur, ensuring synchronization is a
> concern.
> >>>>
> >>>>>> The following path is observed: 1. Driver kicks the device; 2.
> >>>>>> After the device receives the kick, CPU scheduling occurs and DMA
> >>>>>> multiple buffers multiple times; 3. The device completes
> >>>>>> processing and replies
> >>>> with a response.
> >>>>>> When large-queue devices issue multiple requests and kick the
> >>>>>> device frequently, this often interrupt the work of the device-side 
> >>>>>> CPU.
> >>>>> When there is large devices and multiple driver notifications by a
> >>>>> cpu that is N times faster than the device side cpu, the device
> >>>>> may find natural
> >>>> coalescing on the commands of a given cvq.
> >>>>
> >>>> First we have to solve the ctrlq batch adding user (ethtool) command.
> >>>> Even if processed in a batch way on device side, the number of
> >>>> kicks and the number of backend DMAs has not been reduced.
> >>> Driver notifications are PCI writes so it should not hamper device
> >>> side,
> >> which can ignore them when they do not bother about it.
> >>
> >> Driver notifications need to be processed by the DPU, which
> >> interferes with the CPU on the DPU.
> >>
> > I was asking if there is any way for your DPU to ignore these
> > notifications while the previous one is pending.
> > From your above description, it seems there isn't.
> 
> While the device is processing the request, additional kicks are ignored.
> 
Ok. that is good. In that case the kicks are not interfering, right?


[virtio-dev] RE: [virtio] [PATCH] virtio-net: clarify error handling for coalescing

2024-01-24 Thread Parav Pandit



> From: vir...@lists.oasis-open.org  On Behalf Of
> Michael S. Tsirkin
> Sent: Wednesday, January 24, 2024 4:46 PM
> To: virtio-comm...@lists.oasis-open.org; virtio-dev@lists.oasis-open.org
> Cc: vir...@lists.oasis-open.org

Can we please avoid posting to three different lists?

> Subject: [virtio] [PATCH] virtio-net: clarify error handling for coalescing
> 
> This is not a huge deal since it's a SHOULD anyway, so make the new
> requirement SHOULD too.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  device-types/net/description.tex | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/device-types/net/description.tex b/device-
> types/net/description.tex
> index aff5e08..d1d25fe 100644
> --- a/device-types/net/description.tex
> +++ b/device-types/net/description.tex
> @@ -1792,7 +1792,11 @@ \subsubsection{Control
> Virtqueue}\label{sec:Device Types / Network Device / Devi
> 
>  The device MUST ignore \field{reserved}.
> 
> -The device SHOULD respond to VIRTIO_NET_CTRL_NOTF_COAL_TX_SET and
> VIRTIO_NET_CTRL_NOTF_COAL_RX_SET commands with VIRTIO_NET_ERR if
> it was not able to change the parameters.
> +The device SHOULD respond to VIRTIO_NET_CTRL_NOTF_COAL_TX_SET and
> +VIRTIO_NET_CTRL_NOTF_COAL_RX_SET commands with VIRTIO_NET_ERR
> if it was
> +not able to change coalescing parameters. In this case, the parameters
> +SHOULD remain unchanged, for all VQs.

Is the newly added normative line okay without naming the device?

Or should it be written as
... In this case, the device SHOULD not update the parameters for all VQs.

?

> +
> 
>  The device MUST respond to the VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET
> command with VIRTIO_NET_ERR if it was not able to change the parameters.
> 
> --
> MST
For the functionality itself, ignoring the above normative suggestion,

Reviewed-by: Parav Pandit  





[virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-22 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Tuesday, January 23, 2024 12:45 PM
> 
> On Tue, Jan 23, 2024 at 05:55:02AM +, Parav Pandit wrote:
> > > From: Michael S. Tsirkin 
> > > Sent: Monday, January 22, 2024 1:06 PM
> > >
> > > On Mon, Jan 22, 2024 at 05:03:38AM +, Parav Pandit wrote:
> > > > > >>> The right test on Linux to do without rtnl lock which is
> > > > > >>> anyway ugly and
> > > > > >> wrong semantic to use blocking the whole netdev stack.
> > > > > >>> (in case if you used that).
> > > > > >> Do you have any good directions and attempts to remove rtnl_lock?
> > > > > >>
> > > > > > I think per device lock instead of rtnl is first step that we can 
> > > > > > start
> with.
> > > > >
> > > > Will check internally whether someone has already started working on it.
> > >
> > > I feel the issue is at the conceptual level.
> > Not for requests which are initiated by the kernel stack (non user 
> > initiated).
> 
> So how is this different? Is it basically just because tweaking coalescing in
> unexpected ways is considered mostly harmless?
> 
It is different in three ways.

1. The requests are unrelated to each other (unlike the promisc/mac example).
For example, changing vq1.params has no effect on vq2.params.

2. Therefore they don't need to block one another.

3. There is no user expecting a success/failure result for this command.
If it fails, the queues can possibly be marked in error for
recovery/re-enablement.


> > > Yes some drivers will take a command and just queue it for execution
> > > later, but this means that errors can not be propagated back at all.
> > > Imagine device with mac
> > > 0x123 in promisc mode. Now commands:
> > >
> > > 1- program MAC 0xabcdef
> > > 2- disable promisc mode
> > >
> > Errors for user-initiated commands can be propagated when the command
> > completes.
> > Enqueuing the command happens at a different, lower level in the driver.
> >
> > > If command 1 fails but 2 proceeds then packets with MAC 0xabc will
> > > be dropped.
> > >
> > > Any attempts to batch arbitrary commands will have this issue - be
> > > it at driver or device level.
> > >
> > There is no suggestion to batch arbitrary commands on the driver side.
> > The suggestion is to batch VQ notification coalescing commands on the driver side.
> >
> > > So, here's my question: what exactly is the guest behaviour that is
> > > driving this work? Is it with a linux guest?
> > It at least looks like yes to me, based on the partial patches which take
> > the rtnl lock in netdim's worker callbacks.
> >
> > > which commands does userspace issue that we need to send multiple vq
> > > coalescing commands?
> > None.
> >
> > >  If all you want is to send
> > > same config to all VQs then why not just use
> > > VIRTIO_NET_CTRL_NOTF_COAL_RX_SET as opposed to
> > > VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET ?
> > Only kernel-stack-initiated VQ notification coalescing changes.
> > Since every VQ has different values, VIRTIO_NET_CTRL_NOTF_COAL_RX_SET
> > is not sufficient.





[virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-22 Thread Parav Pandit
> From: Michael S. Tsirkin 
> Sent: Monday, January 22, 2024 1:06 PM
> 
> On Mon, Jan 22, 2024 at 05:03:38AM +, Parav Pandit wrote:
> > > >>> The right test on Linux to do without rtnl lock which is anyway
> > > >>> ugly and
> > > >> wrong semantic to use blocking the whole netdev stack.
> > > >>> (in case if you used that).
> > > >> Do you have any good directions and attempts to remove rtnl_lock?
> > > >>
> > > > I think per device lock instead of rtnl is first step that we can start 
> > > > with.
> > >
Will check internally whether someone has already started working on it.
> 
> I feel the issue is at the conceptual level. 
Not for requests that are initiated by the kernel stack (i.e., not user-initiated).

> Yes some drivers will take a command
> and just queue it for execution later, but this means that errors can not be
> propagated back at all. Imagine device with mac
> 0x123 in promisc mode. Now commands:
> 
> 1- program MAC 0xabcdef
> 2- disable promisc mode
> 
Errors for user-initiated commands can be propagated when the command completes.
Enqueuing the command happens at a different, lower level in the driver.

> If command 1 fails but 2 proceeds then packets with MAC 0xabc will be
> dropped.
> 
> Any attempts to batch arbitrary commands will have this issue - be it at 
> driver
> or device level.
> 
There is no suggestion to batch arbitrary commands on the driver side.
The suggestion is to batch VQ notification coalescing commands on the driver side.

> So, here's my question: what exactly is the guest behaviour that is driving 
> this
> work? Is it with a linux guest? 
It at least looks like yes to me, based on the partial patches which take the
rtnl lock in netdim's worker callbacks.

> which commands does userspace issue that we
> need to send multiple vq coalescing commands?
None.

>  If all you want is to send
> same config to all VQs then why not just use
> VIRTIO_NET_CTRL_NOTF_COAL_RX_SET as opposed to
> VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET ?
Only kernel-stack-initiated VQ notification coalescing changes.
Since every VQ has different values, VIRTIO_NET_CTRL_NOTF_COAL_RX_SET is not
sufficient.
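For illustration, a minimal C sketch of building one batched command carrying
distinct per-VQ parameters, following the VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET
layout proposed in the patch under discussion. The struct layouts mirror the
quoted spec patch; the helper name is hypothetical and host-endian types are
used here where the spec defines little-endian fields.

#include <stdint.h>
#include <stdlib.h>

struct virtio_net_ctrl_coal {
    uint32_t max_packets;
    uint32_t max_usecs;
};

struct virtio_net_ctrl_coal_vq {
    uint16_t vqn;
    uint16_t reserved;
    struct virtio_net_ctrl_coal coal;
};

struct virtio_net_ctrl_mrg_coal_vq {
    uint16_t num_entries;                   /* number of valid entries */
    struct virtio_net_ctrl_coal_vq entries[];
};

/* Build a single VQS_SET command carrying distinct values for n vqs. */
static struct virtio_net_ctrl_mrg_coal_vq *
build_vqs_set(const uint16_t *vqns,
              const struct virtio_net_ctrl_coal *per_vq, uint16_t n)
{
    struct virtio_net_ctrl_mrg_coal_vq *cmd =
        calloc(1, sizeof(*cmd) + n * sizeof(cmd->entries[0]));

    if (!cmd)
        return NULL;
    cmd->num_entries = n;                   /* le16 on the wire */
    for (uint16_t i = 0; i < n; i++) {
        cmd->entries[i].vqn = vqns[i];      /* per-vq values, unlike RX_SET */
        cmd->entries[i].coal = per_vq[i];
    }
    return cmd;                             /* enqueue on the ctrlq, kick once */
}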




[virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-21 Thread Parav Pandit


> From: Heng Qi 
> Sent: Monday, January 22, 2024 8:27 AM
> 
> On 2024/1/20 5:59 PM, Parav Pandit wrote:
> >> From: Heng Qi 
> >> Sent: Wednesday, January 17, 2024 10:22 AM
> >>
> >> On 2024/1/15 9:21 PM, Parav Pandit wrote:
> >>>> From: virtio-comm...@lists.oasis-open.org
> >>>>  On Behalf Of Heng Qi
> >>>> Sent: Monday, January 15, 2024 6:36 PM
> >>>>
> >>>> Currently, when each time the driver attempts to update the
> >>>> coalescing parameters for a vq, it needs to kick the device and
> >>>> wait for the ctrlq response to return.
> >>> It does not need to wait. This is some driver limitation that does
> >>> not use
> >> the queue as "queue".
> >>> Such driver limitation should be removed in the driver. It does not
> >>> qualify
> >> as limitation.
> >>
> >> Yes, we don't have to wait.
> >>
> >> But in general, for user commands, it is necessary to obtain the
> >> final results synchronously.
> > Yes. A user-initiated command can enqueue the request to the cvq and go to
> > sleep for several micro- to milliseconds.
> >
> >> The user command cannot return before the final result is obtained.
> >> And wait is not the problem this patch solves.
> >>
> > By not holding the rtnl lock, the rest of the contexts that need to enqueue
> > requests, such as netdim, can make progress.
> 
> I would like to see the use of the rtnl lock changed.
> 
Today there is an rtnl lock call inside virtnet_rx_dim_work().
A virtio_device-level lock should be used for the cvq instead. :)
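As a rough sketch of that direction, assuming Linux kernel primitives; the
field name cvq_lock and the helper virtnet_send_command_locked() are
hypothetical, not the actual driver code:

struct virtnet_info {
    struct virtio_device *vdev;
    struct mutex cvq_lock;  /* assumed: serializes ctrl-vq enqueue/completion */
};

static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
                                 struct scatterlist *out)
{
    bool ok;

    /* Per-device scope: does not block the whole netdev stack like rtnl. */
    mutex_lock(&vi->cvq_lock);
    ok = virtnet_send_command_locked(vi, class, cmd, out);
    mutex_unlock(&vi->cvq_lock);
    return ok;
}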

> In addition, I have made the netdim command batched and asynchronous;
> you can refer to this patch:
> https://lore.kernel.org/all/1705410693-118895-4-git-send-email-
> hen...@linux.alibaba.com/
>
In the above-listed driver patch, the stated motivation "to optimize the
CPU overhead of the DIM worker caused by the guest being busy
waiting for the command response result"

is not right,
because the guest is still busy waiting.
Without batching, due to the rtnl lock, every VQ command is serialized as one
outstanding command at a time in virtnet_rx_dim_work().
Due to this, the device is unable to benefit from DMA batching at large scale.
 
> >
> >>> This will enable driver to enqueue multiple cvq commands without
> >>> waiting
> >> for previous one.
> >>> This will also enable device to find natural coalescing done on
> >>> multiple
> >> commands.
> >>
> >> When batch user commands occur, ensuring synchronization is a concern.
> >>
> >>>> The following path is observed: 1. Driver kicks the device; 2.
> >>>> After the device receives the kick, CPU scheduling occurs and DMA
> >>>> multiple buffers multiple times; 3. The device completes processing
> >>>> and replies
> >> with a response.
> >>>> When large-queue devices issue multiple requests and kick the
> >>>> device frequently, this often interrupt the work of the device-side CPU.
> >>> When there is large devices and multiple driver notifications by a
> >>> cpu that is N times faster than the device side cpu, the device may
> >>> find natural
> >> coalescing on the commands of a given cvq.
> >>
> >> First we have to solve the ctrlq batch adding user (ethtool) command.
> >> Even if processed in a batch way on device side, the number of kicks
> >> and the number of backend DMAs has not been reduced.
> > Driver notifications are PCI writes, so they should not hamper the device
> > side, which can ignore them when it does not care about them.
> 
> Driver notifications need to be processed by the DPU, which interferes with
> the CPU on the DPU.
> 
I was asking if there is any way for your DPU to ignore these notifications
while the previous one is pending.
From your above description, it seems there isn't.

> > Backend DMAs should be reduced by avoiding the LIFO pattern followed by
> > the splitq driver.
> > Placing the descriptors contiguously, like packedq does, naturally reduces
> > the amount of DMA.
> 
> splitq is widely used, migrating to packedq is not that easy, especially when
> there are many components and hardware involved.
I am not suggesting moving to packed VQ.
I am suggesting fixing the driver not to do LIFO allocation of descriptors on
splitq.
In other words, using a contiguous set of descriptors on splitq will improve
splitq's DMA behavior.
This will allow using splitq more efficiently for DMA as a short-term solution
until more efficient queue formats are defined.
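To illustrate the difference, a simplified sketch; the ring bookkeeping below
is invented for illustration and is not the actual driver code:

#include <stdint.h>

#define RING_SIZE 256

struct ring {
    uint16_t free_head;               /* LIFO: most recently freed index */
    uint16_t next_free[RING_SIZE];    /* free-list links */
    uint16_t next_avail;              /* ring-order allocation cursor */
};

/* splitq driver today: pop from the free list (LIFO), so the descriptors
 * of one request end up scattered across the table and the device needs
 * several DMA reads to fetch them. */
static uint16_t alloc_lifo(struct ring *r)
{
    uint16_t d = r->free_head;
    r->free_head = r->next_free[d];
    return d;
}

/* suggested: allocate in ring order, as packedq does implicitly, so a
 * request's descriptors sit in consecutive slots and can be fetched in
 * one contiguous DMA read. */
static uint16_t alloc_contiguous(struct ring *r)
{
    return r->next_avail++ % RING_SIZE;
}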


RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-20 Thread Parav Pandit

> From: Heng Qi 
> Sent: Wednesday, January 17, 2024 10:22 AM
> 
> On 2024/1/15 9:21 PM, Parav Pandit wrote:
> >
> >> From: virtio-comm...@lists.oasis-open.org
> >>  On Behalf Of Heng Qi
> >> Sent: Monday, January 15, 2024 6:36 PM
> >>
> >> Currently, each time the driver attempts to update the
> >> coalescing parameters for a vq, it needs to kick the device and wait
> >> for the ctrlq response to return.
> > It does not need to wait. This is a driver limitation where the queue is
> > not used as a "queue".
> > Such a driver limitation should be removed in the driver. It does not
> > qualify as a limitation of the interface.
> 
> Yes, we don't have to wait.
> 
> But in general, for user commands, it is necessary to obtain the final results
> synchronously.
Yes. A user-initiated command can enqueue the request to the cvq and go to
sleep for several micro- to milliseconds.
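A minimal sketch of that flow, assuming Linux completion primitives; the types
and the enqueue helper cvq_enqueue_and_kick() are hypothetical, not real driver
functions:

struct cvq_cmd {
    struct completion done;   /* completed from the ctrl-vq callback */
    u8 status;                /* VIRTIO_NET_OK / VIRTIO_NET_ERR */
};

static u8 cvq_submit_and_sleep(struct virtnet_info *vi, struct cvq_cmd *cmd)
{
    init_completion(&cmd->done);
    cvq_enqueue_and_kick(vi, cmd);     /* hypothetical: add buffers, notify */
    wait_for_completion(&cmd->done);   /* sleep; no busy polling */
    return cmd->status;
}

/* Called from the cvq callback when the device returns the buffer. */
static void cvq_on_response(struct cvq_cmd *cmd, u8 status)
{
    cmd->status = status;
    complete(&cmd->done);
}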

> The user command cannot return before the final result is obtained.
> And wait is not the problem this patch solves.
> 
By not holding the rtnl lock, the rest of the contexts that need to enqueue
requests, such as netdim, can make progress.

> >
> > This will enable the driver to enqueue multiple cvq commands without
> > waiting for the previous one.
> > This will also enable the device to find natural coalescing across multiple
> > commands.
> 
> When batch user commands occur, ensuring synchronization is a concern.
> 
> >
> >> The following path is observed: 1. Driver kicks the device; 2. After
> >> the device receives the kick, CPU scheduling occurs and DMA multiple
> >> buffers multiple times; 3. The device completes processing and replies
> with a response.
> >>
> >> When large-queue devices issue multiple requests and kick the device
> >> frequently, this often interrupts the work of the device-side CPU.
> > When there are large devices and multiple driver notifications from a CPU
> > that is N times faster than the device-side CPU, the device may find
> > natural coalescing on the commands of a given cvq.
> 
> First we have to solve batching of user (ethtool) commands added to the ctrlq.
> Even if they are processed in a batched way on the device side, the number of
> kicks and the number of backend DMAs is not reduced.
Driver notifications are PCI writes, so they should not hamper the device side,
which can ignore them when it does not care about them.
Backend DMAs should be reduced by avoiding the LIFO pattern followed by the
splitq driver.
Placing the descriptors contiguously, like packedq does, naturally reduces the
amount of DMA.

The second predictable DMA to avoid is the 16-byte descriptor indirection and
the extra DMA it implies; 8 bytes of data could instead be carried inline in
the descriptor.
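For reference, the split-ring descriptor layout from the virtio spec shows the
cost: describing even an 8-byte payload takes a 16-byte descriptor that the
device must fetch with its own DMA before fetching the data. Types are shown
as fixed-width host types; the spec defines the fields as little-endian.

#include <stdint.h>

struct virtq_desc {
    uint64_t addr;    /* guest-physical buffer address: the indirection */
    uint32_t len;     /* buffer length */
    uint16_t flags;   /* NEXT / WRITE / INDIRECT */
    uint16_t next;    /* chaining */
};                    /* 16 bytes of metadata for 8 bytes of payload */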

> 
> >
> > For multiple DMAs, we need a way to send 8 bytes of data without 16
> > bytes of indirection via a descriptor.
> > This is what we discussed a while back for the txq, and Stefan suggested
> > generalizing it for more queues, which is also a good idea.
> 
> Yes, this sounds good.
>
This is the next item to focus on as soon as flow filters are stable/merged.
 
> >
> >> In addition,
> >> each vq request is processed separately, causing more delays for the
> >> CPU to wait for the DMA request to complete.
> >>
> >> These interruptions and overhead will strain the CPU responsible for
> >> controlling the path of the DPU, especially in multi-device and
> >> large-queue scenarios.
> >>
> >> To solve the above problems, we internally tried batch request, which
> >> merges requests from multiple queues and sends them at once. We
> >> conservatively tested
> > The batching done may end up modifying the given VQ's parameters
> multiple times.
> 
> In practice, we do not try to accumulate multiple parameter modifications for
> a specific vqn.
I meant accumulating the parameters of multiple VQs to batch them in a single command.

> 
> > The right test on Linux is one done without the rtnl lock, which is anyway
> > an ugly and wrong semantic to use, since it blocks the whole netdev stack
> > (in case you used it).
> 
> Do you have any good directions and attempts to remove rtnl_lock?
>
I think a per-device lock instead of rtnl is the first step that we can start with.
 
> >
> >> 8 queue commands and sent them together. The DPU processing
> >> efficiency can be improved by 8 times, which greatly eases the DPU's
> >> support for multi- device and multi-queue DIM.
> >>
> > This is good.
> 
> YES. Makes sense for our DPUs.
> 
> >
> >> Maintainers may be concerned about whether the batch command
> method
> >> can optimize the above problems: accumulate multiple request
> commands
> 

[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-16 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Tuesday, January 16, 2024 1:51 PM
> 
> On 2024/1/16 15:19, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Tuesday, January 16, 2024 12:08 PM
> >>
> >> On 2024/1/15 19:10, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, January 15, 2024 4:37 PM
> >>>>
> >>>> On 2024/1/15 18:52, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Monday, January 15, 2024 4:17 PM
> >>>>>>
> >>>>>> On 2024/1/15 17:46, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>> From: Chen, Jiqian 
> >>>>>>>> Sent: Monday, January 15, 2024 3:10 PM
> >>>>>>>
> >>>>>>>>> Display resources should be only restored when following
> >>>>>>>>> conditions are
> >>>>>>>> met.
> >>>>>>>>> 1. PCI PM cap is reported.
> >>>>>>>>> 2. PCI PM cap has non_soft_reset=1 3. virtio guest driver do
> >>>>>>>>> not perform reset when transport offers a restore
> >>>>>>>> capability using #1 and #2.
> >>>>>>>>>
> >>>>>>>>> Do you agree? Yes?
> >>>>>>>> Yes, I think this problem (display resources are destroyed
> >>>>>>>> during
> >>>>>>>> S3) can be sorted to two situations:
> >>>>>>>> First is what you said above, in this situation, the
> >>>>>>>> devices_reset of qemu is unreasonable, if a device has PM cap
> >>>>>>>> and non_soft_reset=1, qemu should not do resetting.
> >>>>>>>
> >>>>>>>> Second is without #1 or #2, the reset behavior is fine. My
> >>>>>>>> patch is to fix this situation, that sending a new virtio-gpu
> >>>>>>>> command to notify qemu to prevent destroying display resources
> during S3.
> >>>>>>>
> >>>>>>> I still have hard time following "My patch is to fix this 
> >>>>>>> situation...".
> >>>>>>>
> >>>>>>> When #1 and #2 is not done, there is nothing to restore.
> >>>>>>> Driver should not send some new virtio specific command when #1
> >>>>>>> and
> >>>>>>> #2 is
> >>>>>> not there.
> >>>>>>> Instead, if the device wants to restore the context, all #1, #2
> >>>>>>> and
> >>>>>>> #3 should
> >>>>>> be done to implement the restore functionality.
> >>>>>> When #1 and #2 is done. The "device reset behavior" should not
> >>>>>> happen. And then the display resources are not destroyed.
> >>>>>> I didn’t say send command to restore context.
> >>>>>
> >>>>>> I mean when #1 and #2 is not done. Driver and qemu both reset
> >>>>>> devices when resuming, at that time,
> >>>>> Above behavior is as per the spec guidelines. Hence it is fine.
> >>>>>
> >>>>>> we need a method to prevent the display resources from destroying.
> >>>>> I disagree to above as it is not as per the spec guidelines.
> >>>>> Just follow #1, #2 and #3 to not destroy the display resource 
> >>>>> destroying.
> >>>> I agree that follow #1, #2 and #3 will not destroy display resource.
> >>>> So the question goes back: If there is no #2 and you say that the
> >>>> reset behavior complies with the spec guidelines, how can I make
> >>>> the virtio gpu work properly with display resources destroyed?
> >>> Implement #2. :)
> >> We can't simply implement #2 if a device doesn't have #2,
> > How do you define "We" above?
> > Isn't we == device?
> >
> >> see Section 7.5.2.2
> >> in PCI Express Spec, about No_Soft_Reset:
> >> " If a VF implements the Power Management Capability, the VF's value
> >> of this field must be identical to the associated PF's value. "
> >> What's more, a device doesn't have #2 is allowed, we should consider
> >> how to

[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Tuesday, January 16, 2024 12:08 PM
> 
> On 2024/1/15 19:10, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 4:37 PM
> >>
> >> On 2024/1/15 18:52, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, January 15, 2024 4:17 PM
> >>>>
> >>>> On 2024/1/15 17:46, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Monday, January 15, 2024 3:10 PM
> >>>>>
> >>>>>>> Display resources should be only restored when following
> >>>>>>> conditions are
> >>>>>> met.
> >>>>>>> 1. PCI PM cap is reported.
> >>>>>>> 2. PCI PM cap has non_soft_reset=1 3. virtio guest driver do not
> >>>>>>> perform reset when transport offers a restore
> >>>>>> capability using #1 and #2.
> >>>>>>>
> >>>>>>> Do you agree? Yes?
> >>>>>> Yes, I think this problem (display resources are destroyed during
> >>>>>> S3) can be sorted to two situations:
> >>>>>> First is what you said above, in this situation, the
> >>>>>> devices_reset of qemu is unreasonable, if a device has PM cap and
> >>>>>> non_soft_reset=1, qemu should not do resetting.
> >>>>>
> >>>>>> Second is without #1 or #2, the reset behavior is fine. My patch
> >>>>>> is to fix this situation, that sending a new virtio-gpu command
> >>>>>> to notify qemu to prevent destroying display resources during S3.
> >>>>>
> >>>>> I still have hard time following "My patch is to fix this situation...".
> >>>>>
> >>>>> When #1 and #2 is not done, there is nothing to restore.
> >>>>> Driver should not send some new virtio specific command when #1
> >>>>> and
> >>>>> #2 is
> >>>> not there.
> >>>>> Instead, if the device wants to restore the context, all #1, #2
> >>>>> and
> >>>>> #3 should
> >>>> be done to implement the restore functionality.
> >>>> When #1 and #2 is done. The "device reset behavior" should not
> >>>> happen. And then the display resources are not destroyed.
> >>>> I didn’t say send command to restore context.
> >>>
> >>>> I mean when #1 and #2 is not done. Driver and qemu both reset
> >>>> devices when resuming, at that time,
> >>> Above behavior is as per the spec guidelines. Hence it is fine.
> >>>
> >>>> we need a method to prevent the display resources from destroying.
> >>> I disagree to above as it is not as per the spec guidelines.
> >>> Just follow #1, #2 and #3 to not destroy the display resource destroying.
> >> I agree that follow #1, #2 and #3 will not destroy display resource.
> >> So the question goes back: If there is no #2 and you say that the
> >> reset behavior complies with the spec guidelines, how can I make the
> >> virtio gpu work properly with display resources destroyed?
> > Implement #2. :)
> We can't simply implement #2 if a device doesn't have #2, 
How do you define "We" above?
Isn't "we" == the device?

> see Section 7.5.2.2
> in PCI Express Spec, about No_Soft_Reset:
> " If a VF implements the Power Management Capability, the VF's value of this
> field must be identical to the associated PF's value. "
> What's more, a device that doesn't have #2 is allowed; we should consider how
> to handle both situations (No_Soft_Reset=1 and No_Soft_Reset=0), rather than
> simply implementing #2 for the device when it does not have #2.
A device implementation (sw or hw) that wants to offer maintaining the function
context will implement No_Soft_Reset=1.

> Also in PCI Express Spec:
> " Functional context is required to be maintained by Functions in the D3Hot
> state if the No_Soft_Reset field in the PMCSR is Set. In this case, System
> Software is not required to re-initialize the Function after a transition from
> D3Hot to D0 (the Function will be in the D0active state). If the No_Soft_Reset
> bit is Clear, functional context is not required to be maintained by the 
> Function
> in the D3Hot state."
Right. It is NOT required to maintain it. Hence, let's not maintain it.


[virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs

2024-01-15 Thread Parav Pandit



> From: virtio-comm...@lists.oasis-open.org  On Behalf Of Heng Qi
> Sent: Monday, January 15, 2024 6:36 PM
> 
> Currently, each time the driver attempts to update the coalescing
> parameters for a vq, it needs to kick the device and wait for the ctrlq
> response to return.
It does not need to wait. This is a driver limitation where the queue is not
used as a "queue".
Such a driver limitation should be removed in the driver. It does not qualify
as a limitation of the interface.

This will enable the driver to enqueue multiple cvq commands without waiting
for the previous one.
This will also enable the device to find natural coalescing across multiple
commands.

> The following path is observed: 1. Driver kicks the device; 2. After the
> device receives the kick, CPU scheduling occurs and the device DMAs multiple
> buffers multiple times; 3. The device completes processing and replies with a
> response.
> 
> When large-queue devices issue multiple requests and kick the device
> frequently, this often interrupts the work of the device-side CPU.
When there are large devices and multiple driver notifications from a CPU that
is N times faster than the device-side CPU,
the device may find natural coalescing on the commands of a given cvq.

For multiple DMAs, we need a way to send 8 bytes of data without 16 bytes of
indirection via a descriptor.
This is what we discussed a while back for the txq, and Stefan suggested
generalizing it for more queues, which is also a good idea.

> In addition,
> each vq request is processed separately, causing more delays for the CPU to
> wait for the DMA request to complete.
> 
> These interruptions and overhead will strain the CPU responsible for
> controlling the path of the DPU, especially in multi-device and large-queue
> scenarios.
> 
> To solve the above problems, we internally tried batch request, which merges
> requests from multiple queues and sends them at once. We conservatively
> tested
The batching done may end up modifying the given VQ's parameters multiple times.
The right test on Linux is one done without the rtnl lock, which is anyway an
ugly and wrong semantic to use, since it blocks the whole netdev stack
(in case you used it).

> 8 queue commands and sent them together. The DPU processing efficiency
> can be improved by 8 times, which greatly eases the DPU's support for multi-
> device and multi-queue DIM.
>
This is good.
 
> Maintainers may be concerned about whether the batch command method
> can optimize the above problems: accumulate multiple request commands to
> kick the device once, and obtain the processing results of the corresponding
> commands asynchronously.
This is unlikely to improve things; rather, it will have a negative impact, as
it only means that the moderation parameters are delayed by the driver.

> The batch command method is used by us to optimize the CPU overhead of
> the DIM worker caused by the guest being busy waiting for the command
> response result.
In that case, fixing the guest driver (which is not yet written) is the right fix.

> This is a different focus than batch request to solve the problem.
> 
> Suggested-by: Xiaoming Zhao 
> Signed-off-by: Heng Qi 
> ---
> v1->v2: Updated commit log. Due to sensitivity, sorry that we cannot give the
> absolute value directly. @Michael
> 
>  device-types/net/description.tex | 26 --
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/device-types/net/description.tex b/device-
> types/net/description.tex
> index aff5e08..b3766c4 100644
> --- a/device-types/net/description.tex
> +++ b/device-types/net/description.tex
> @@ -1667,8 +1667,8 @@ \subsubsection{Control
> Virtqueue}\label{sec:Device Types / Network Device / Devi  for notification
> coalescing.
> 
>  If the VIRTIO_NET_F_VQ_NOTF_COAL feature is negotiated, the driver can -
> send commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET and
> VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET -for virtqueue notification
> coalescing.
> +send commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET,
> +VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET and
> VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET for virtqueue notification
> coalescing.
> 
A new feature bit is needed for this extra functionality.

>  \begin{lstlisting}
>  struct virtio_net_ctrl_coal {
> @@ -1682,11 +1682,17 @@ \subsubsection{Control
> Virtqueue}\label{sec:Device Types / Network Device / Devi
>  struct virtio_net_ctrl_coal coal;
>  };
> 
> +struct virtio_net_ctrl_mrg_coal_vq {
> +le16 num_entries; /* indicates number of valid entries */
> +struct virtio_net_ctrl_coal_vq entries[]; };
> +
>  #define VIRTIO_NET_CTRL_NOTF_COAL 6
>   #define VIRTIO_NET_CTRL_NOTF_COAL_TX_SET  0
>   #define VIRTIO_NET_CTRL_NOTF_COAL_RX_SET 1
>   #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET 2
>   #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET 3
> + #define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET 4
>  \end{lstlisting}
> 
>  Coalescing parameters:
> @@ -1706,6 +1712,7 @@ \subsubsection{Control
> Virtqueue}\label{sec:Device Types / Network Device / Devi  \item 

[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 4:37 PM
> 
> On 2024/1/15 18:52, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 4:17 PM
> >>
> >> On 2024/1/15 17:46, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, January 15, 2024 3:10 PM
> >>>
> >>>>> Display resources should be only restored when following
> >>>>> conditions are
> >>>> met.
> >>>>> 1. PCI PM cap is reported.
> >>>>> 2. PCI PM cap has non_soft_reset=1 3. virtio guest driver do not
> >>>>> perform reset when transport offers a restore
> >>>> capability using #1 and #2.
> >>>>>
> >>>>> Do you agree? Yes?
> >>>> Yes, I think this problem (display resources are destroyed during
> >>>> S3) can be sorted to two situations:
> >>>> First is what you said above, in this situation, the devices_reset
> >>>> of qemu is unreasonable, if a device has PM cap and
> >>>> non_soft_reset=1, qemu should not do resetting.
> >>>
> >>>> Second is without #1 or #2, the reset behavior is fine. My patch is
> >>>> to fix this situation, that sending a new virtio-gpu command to
> >>>> notify qemu to prevent destroying display resources during S3.
> >>>
> >>> I still have hard time following "My patch is to fix this situation...".
> >>>
> >>> When #1 and #2 is not done, there is nothing to restore.
> >>> Driver should not send some new virtio specific command when #1 and
> >>> #2 is
> >> not there.
> >>> Instead, if the device wants to restore the context, all #1, #2 and
> >>> #3 should
> >> be done to implement the restore functionality.
> >> When #1 and #2 is done. The "device reset behavior" should not
> >> happen. And then the display resources are not destroyed.
> >> I didn’t say send command to restore context.
> >
> >> I mean when #1 and #2 is not done. Driver and qemu both reset devices
> >> when resuming, at that time,
> > Above behavior is as per the spec guidelines. Hence it is fine.
> >
> >> we need a method to prevent the display resources from destroying.
> > I disagree to above as it is not as per the spec guidelines.
> > Just follow #1, #2 and #3 to not destroy the display resource destroying.
> I agree that follow #1, #2 and #3 will not destroy display resource.
> So the question goes back: If there is no #2 and you say that the reset
> behavior complies with the spec guidelines, how can I make the virtio gpu
> work properly with display resources destroyed?
Implement #2. :)

> What my patch does is not to prevent resetting; it is to prevent destroying
> resources during resetting. Gerd Hoffmann has agreed on this point in my
> qemu patch review.
> Even if I use the PM cap, I still need to prevent destroying resources during
> resetting.
Do you mean you want to prevent this when the virtio device is reset?
If so, that is yet another virtio spec hack that must be avoided, as there
should be a single reset flow.
Please do #3 to avoid resetting the device.


[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 4:17 PM
> 
> On 2024/1/15 17:46, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 3:10 PM
> >
> >>> Display resources should be only restored when following conditions
> >>> are
> >> met.
> >>> 1. PCI PM cap is reported.
> >>> 2. PCI PM cap has non_soft_reset=1
> >>> 3. virtio guest driver do not perform reset when transport offers a
> >>> restore
> >> capability using #1 and #2.
> >>>
> >>> Do you agree? Yes?
> >> Yes, I think this problem (display resources are destroyed during S3)
> >> can be sorted to two situations:
> >> First is what you said above, in this situation, the devices_reset of
> >> qemu is unreasonable, if a device has PM cap and non_soft_reset=1,
> >> qemu should not do resetting.
> >
> >> Second is without #1 or #2, the reset behavior is fine. My patch is
> >> to fix this situation, that sending a new virtio-gpu command to
> >> notify qemu to prevent destroying display resources during S3.
> >
> > I still have hard time following "My patch is to fix this situation...".
> >
> > When #1 and #2 is not done, there is nothing to restore.
> > Driver should not send some new virtio specific command when #1 and #2 is
> not there.
> > Instead, if the device wants to restore the context, all #1, #2 and #3 
> > should
> be done to implement the restore functionality.
> When #1 and #2 is done. The "device reset behavior" should not happen. And
> then the display resources are not destroyed.
> I didn’t say send command to restore context.

> I mean when #1 and #2 is not done. Driver and qemu both reset devices when
> resuming, at that time, 
Above behavior is as per the spec guidelines. Hence it is fine.

> we need a method to prevent the display resources from destroying.
I disagree with the above, as it is not per the spec guidelines.
Just follow #1, #2 and #3 to avoid destroying the display resources.
No need for any check command etc.

> I don't mean anything else, I just feel like you may not have understood the
> issues I encountered during the S3 process and the implementation logic of
> my patch from the beginning.
> In the current kernel and qemu code, when we do S3 (suspend and resume) for a
> guest, during the resume qemu and the guest driver will reset the virtio-gpu
> (because #2 is not done), but once the reset happens, the display resources
> will be destroyed and we can't restore them, so we must add a method to
> prevent destroying resources in that situation. What my patch does is add a
> new virtio-gpu command to notify qemu to check whether the display resources
> should be destroyed during resetting.
> 
> >
> > In other words, if one wants to use device context restore on PM state
> transition it should do #1, #2 and #3.
> > (and avoid inventing new infrastructure because PCI PM has necessary
> things).
> 
> 
> --
> Best regards,
> Jiqian Chen.


[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 3:10 PM

> > Display resources should be only restored when following conditions are
> met.
> > 1. PCI PM cap is reported.
> > 2. PCI PM cap has non_soft_reset=1
> > 3. virtio guest driver do not perform reset when transport offers a restore
> capability using #1 and #2.
> >
> > Do you agree? Yes?
> Yes, I think this problem (display resources are destroyed during S3) can be
> sorted to two situations:
> First is what you said above, in this situation, the devices_reset of qemu is
> unreasonable, if a device has PM cap and non_soft_reset=1, qemu should not
> do resetting.

> Second is without #1 or #2, the reset behavior is fine. My patch is to fix 
> this
> situation, that sending a new virtio-gpu command to notify qemu to prevent
> destroying display resources during S3.

I still have a hard time following "My patch is to fix this situation...".

When #1 and #2 are not done, there is nothing to restore.
The driver should not send some new virtio-specific command when #1 and #2 are
not there.
Instead, if the device wants to restore the context, all of #1, #2 and #3
should be done to implement the restore functionality.

In other words, if one wants to use device context restore on a PM state
transition, it should do #1, #2 and #3
(and avoid inventing new infrastructure, because PCI PM has the necessary
pieces).
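A minimal driver-side sketch of the #1/#2 checks, assuming the Linux PCI API;
the function name is hypothetical:

#include <linux/pci.h>

static bool virtio_pci_context_preserved(struct pci_dev *pdev)
{
    u16 pmcsr;

    if (!pdev->pm_cap)                  /* #1: PCI PM capability reported */
        return false;

    pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
    /* #2: No_Soft_Reset=1, function context maintained across D3hot->D0 */
    return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
}

/* #3: on resume, a driver seeing this return true can skip the device
 * reset and re-initialization, so the device context survives. */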


[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 2:40 PM
> 
> On 2024/1/15 16:52, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 1:51 PM
> >>
> >> On 2024/1/15 15:55, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, January 15, 2024 1:18 PM
> >>>>
> >>>> On 2024/1/15 15:37, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Monday, January 15, 2024 1:03 PM
> >>>>>>
> >>>>>> On 2024/1/12 17:44, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> From: Chen, Jiqian 
> >>>>>>>> Sent: Friday, January 12, 2024 2:55 PM
> >>>>>>>>
> >>>>>>>> On 2024/1/12 16:47, Parav Pandit wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> From: Chen, Jiqian 
> >>>>>>>>>> Sent: Friday, January 12, 2024 1:55 PM
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 2024/1/12 16:02, Parav Pandit wrote:
> >>>>>>>>>>> Hi Jiqian,
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Chen, Jiqian 
> >>>>>>>>>>>> Sent: Friday, January 12, 2024 1:11 PM
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>> Sorry to reply late.
> >>>>>>>>>>>> I don't know if you still remember this problem, let me
> >>>>>>>>>>>> briefly descript
> >>>>>>>> it.
> >>>>>>>>>>>> I am working to implement virtgpu S3 function on Xen.
> >>>>>>>>>>>> Currently on Xen, if we start a guest through qemu with
> >>>>>>>>>>>> enabling virtgpu, and then suspend and resume guest. We can
> >>>>>>>>>>>> find that the guest kernel comes back, but the display doesn't.
> >>>>>>>>>>>> It just shown a black
> >>>>>>>>>> screen.
> >>>>>>>>>>>> That is because during suspending, guest called into qemu
> >>>>>>>>>>>> and qemu destroyed all display resources and reset
> >>>>>>>>>>>> renderer. This made the display gone after guest resumed.
> >>>>>>>>>>>> So, I add a new command for virtio-gpu that when guest is
> >>>>>>>>>>>> suspending, it will notify qemu and set
> >>>>>>>>>>>> parameter(preserver_resource) to 1 and then qemu will
> >>>>>>>>>>>> preserve resources, and when resuming, guest will notify
> >>>>>>>>>>>> qemu to set parameter to 0, and then qemu will keep the
> >>>>>>>>>>>> normal
> >> actions.
> >>>>>>>>>>>> That can
> >>>>>>>> help guest's display come back.
> >>>>>>>>>>>> When I upstream above implementation, Parav and MST
> suggest
> >>>>>>>>>>>> me
> >>>>>> to
> >>>>>>>>>> use
> >>>>>>>>>>>> the PM capability to fix this problem instead of adding a
> >>>>>>>>>>>> new command or state bit.
> >>>>>>>>>>>> Now, I have tried the PM capability of virtio-gpu, it can't
> >>>>>>>>>>>> be used to solve this problem.
> >>>>>>>>>>>> The reason is:
> >>>>>>>>>>>> during guest suspending, it will write D3 state through PM
> >>>>>>>>>>>> cap, then I can save the resources of virtio-gpu on Qemu
> >>>>>>>>>>>> side(set preserver_resource to 1), but during the process
> >>>>>>>>>>>> of resuming, t

[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-15 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 1:51 PM
> 
> On 2024/1/15 15:55, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 1:18 PM
> >>
> >> On 2024/1/15 15:37, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, January 15, 2024 1:03 PM
> >>>>
> >>>> On 2024/1/12 17:44, Parav Pandit wrote:
> >>>>>
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Friday, January 12, 2024 2:55 PM
> >>>>>>
> >>>>>> On 2024/1/12 16:47, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> From: Chen, Jiqian 
> >>>>>>>> Sent: Friday, January 12, 2024 1:55 PM
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2024/1/12 16:02, Parav Pandit wrote:
> >>>>>>>>> Hi Jiqian,
> >>>>>>>>>
> >>>>>>>>>> From: Chen, Jiqian 
> >>>>>>>>>> Sent: Friday, January 12, 2024 1:11 PM
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>> Sorry to reply late.
> >>>>>>>>>> I don't know if you still remember this problem, let me
> >>>>>>>>>> briefly descript
> >>>>>> it.
> >>>>>>>>>> I am working to implement virtgpu S3 function on Xen.
> >>>>>>>>>> Currently on Xen, if we start a guest through qemu with
> >>>>>>>>>> enabling virtgpu, and then suspend and resume guest. We can
> >>>>>>>>>> find that the guest kernel comes back, but the display doesn't.
> >>>>>>>>>> It just shown a black
> >>>>>>>> screen.
> >>>>>>>>>> That is because during suspending, guest called into qemu and
> >>>>>>>>>> qemu destroyed all display resources and reset renderer. This
> >>>>>>>>>> made the display gone after guest resumed.
> >>>>>>>>>> So, I add a new command for virtio-gpu that when guest is
> >>>>>>>>>> suspending, it will notify qemu and set
> >>>>>>>>>> parameter(preserver_resource) to 1 and then qemu will
> >>>>>>>>>> preserve resources, and when resuming, guest will notify qemu
> >>>>>>>>>> to set parameter to 0, and then qemu will keep the normal
> actions.
> >>>>>>>>>> That can
> >>>>>> help guest's display come back.
> >>>>>>>>>> When I upstream above implementation, Parav and MST suggest
> >>>>>>>>>> me
> >>>> to
> >>>>>>>> use
> >>>>>>>>>> the PM capability to fix this problem instead of adding a new
> >>>>>>>>>> command or state bit.
> >>>>>>>>>> Now, I have tried the PM capability of virtio-gpu, it can't
> >>>>>>>>>> be used to solve this problem.
> >>>>>>>>>> The reason is:
> >>>>>>>>>> during guest suspending, it will write D3 state through PM
> >>>>>>>>>> cap, then I can save the resources of virtio-gpu on Qemu
> >>>>>>>>>> side(set preserver_resource to 1), but during the process of
> >>>>>>>>>> resuming, the state of PM cap will be cleared by qemu
> >>>>>>>>>> resetting(qemu_system_wakeup-> qemu_devices_reset->
> >>>>>>>>>> virtio_vga_base_reset-> virtio_gpu_gl_reset), it causes that
> >>>>>>>>>> when guest reads state from PM cap, it will find the
> >>>>>>>>>> virtio-gpu has already been D0 state,
> >>>>>>>>> This behavior needs to be fixed. As the spec listed out, "
> >>>>>>>>> This 2-bit field is
> >>>>>>>> used both to determine the current power state of a Function"
> >>>>>

[virtio-dev] RE: [virtio-comment] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-14 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 1:18 PM
> 
> On 2024/1/15 15:37, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, January 15, 2024 1:03 PM
> >>
> >> On 2024/1/12 17:44, Parav Pandit wrote:
> >>>
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Friday, January 12, 2024 2:55 PM
> >>>>
> >>>> On 2024/1/12 16:47, Parav Pandit wrote:
> >>>>>
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Friday, January 12, 2024 1:55 PM
> >>>>>>
> >>>>>>
> >>>>>> On 2024/1/12 16:02, Parav Pandit wrote:
> >>>>>>> Hi Jiqian,
> >>>>>>>
> >>>>>>>> From: Chen, Jiqian 
> >>>>>>>> Sent: Friday, January 12, 2024 1:11 PM
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>> Sorry to reply late.
> >>>>>>>> I don't know if you still remember this problem, let me briefly
> >>>>>>>> descript
> >>>> it.
> >>>>>>>> I am working to implement virtgpu S3 function on Xen.
> >>>>>>>> Currently on Xen, if we start a guest through qemu with
> >>>>>>>> enabling virtgpu, and then suspend and resume guest. We can
> >>>>>>>> find that the guest kernel comes back, but the display doesn't.
> >>>>>>>> It just shown a black
> >>>>>> screen.
> >>>>>>>> That is because during suspending, guest called into qemu and
> >>>>>>>> qemu destroyed all display resources and reset renderer. This
> >>>>>>>> made the display gone after guest resumed.
> >>>>>>>> So, I add a new command for virtio-gpu that when guest is
> >>>>>>>> suspending, it will notify qemu and set
> >>>>>>>> parameter(preserver_resource) to 1 and then qemu will preserve
> >>>>>>>> resources, and when resuming, guest will notify qemu to set
> >>>>>>>> parameter to 0, and then qemu will keep the normal actions.
> >>>>>>>> That can
> >>>> help guest's display come back.
> >>>>>>>> When I upstream above implementation, Parav and MST suggest me
> >> to
> >>>>>> use
> >>>>>>>> the PM capability to fix this problem instead of adding a new
> >>>>>>>> command or state bit.
> >>>>>>>> Now, I have tried the PM capability of virtio-gpu, it can't be
> >>>>>>>> used to solve this problem.
> >>>>>>>> The reason is:
> >>>>>>>> during guest suspending, it will write D3 state through PM cap,
> >>>>>>>> then I can save the resources of virtio-gpu on Qemu side(set
> >>>>>>>> preserver_resource to 1), but during the process of resuming,
> >>>>>>>> the state of PM cap will be cleared by qemu
> >>>>>>>> resetting(qemu_system_wakeup-> qemu_devices_reset->
> >>>>>>>> virtio_vga_base_reset-> virtio_gpu_gl_reset), it causes that
> >>>>>>>> when guest reads state from PM cap, it will find the virtio-gpu
> >>>>>>>> has already been D0 state,
> >>>>>>> This behavior needs to be fixed. As the spec listed out, " This
> >>>>>>> 2-bit field is
> >>>>>> used both to determine the current power state of a Function"
> >>>>>> Do you mean it is wrong to reset PM cap when vritio_gpu reset in
> >>>>>> current qemu code? Why?
> >>>>> Because PM is implementing the support from D3->d0 transition and
> >>>>> if it
> >>>> device in D3, the device needs to respond that it is in D3 to match
> >>>> the PCI spec.
> >>>>>
> >>>>>> Shouldn't all device states, including PM registers, be reset
> >>>>>> during the process of virtio-gpu reset?
> >>>>> If No_Soft_Reset== 0, no device context to be preserved.
> >>>>> If No_Soft_Reset == 1, device to preserve minimal reg

[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-14 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, January 15, 2024 1:03 PM
> 
> On 2024/1/12 17:44, Parav Pandit wrote:
> >
> >
> >> From: Chen, Jiqian 
> >> Sent: Friday, January 12, 2024 2:55 PM
> >>
> >> On 2024/1/12 16:47, Parav Pandit wrote:
> >>>
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Friday, January 12, 2024 1:55 PM
> >>>>
> >>>>
> >>>> On 2024/1/12 16:02, Parav Pandit wrote:
> >>>>> Hi Jiqian,
> >>>>>
> >>>>>> From: Chen, Jiqian 
> >>>>>> Sent: Friday, January 12, 2024 1:11 PM
> >>>>>
> >>>>>>
> >>>>>> Hi all,
> >>>>>> Sorry to reply late.
> >>>>>> I don't know if you still remember this problem, let me briefly
> >>>>>> descript
> >> it.
> >>>>>> I am working to implement virtgpu S3 function on Xen.
> >>>>>> Currently on Xen, if we start a guest through qemu with enabling
> >>>>>> virtgpu, and then suspend and resume guest. We can find that the
> >>>>>> guest kernel comes back, but the display doesn't. It just shown a
> >>>>>> black
> >>>> screen.
> >>>>>> That is because during suspending, guest called into qemu and
> >>>>>> qemu destroyed all display resources and reset renderer. This
> >>>>>> made the display gone after guest resumed.
> >>>>>> So, I add a new command for virtio-gpu that when guest is
> >>>>>> suspending, it will notify qemu and set
> >>>>>> parameter(preserver_resource) to 1 and then qemu will preserve
> >>>>>> resources, and when resuming, guest will notify qemu to set
> >>>>>> parameter to 0, and then qemu will keep the normal actions. That
> >>>>>> can
> >> help guest's display come back.
> >>>>>> When I upstream above implementation, Parav and MST suggest me
> to
> >>>> use
> >>>>>> the PM capability to fix this problem instead of adding a new
> >>>>>> command or state bit.
> >>>>>> Now, I have tried the PM capability of virtio-gpu, it can't be
> >>>>>> used to solve this problem.
> >>>>>> The reason is:
> >>>>>> during guest suspending, it will write D3 state through PM cap,
> >>>>>> then I can save the resources of virtio-gpu on Qemu side(set
> >>>>>> preserver_resource to 1), but during the process of resuming, the
> >>>>>> state of PM cap will be cleared by qemu
> >>>>>> resetting(qemu_system_wakeup-> qemu_devices_reset->
> >>>>>> virtio_vga_base_reset-> virtio_gpu_gl_reset), it causes that when
> >>>>>> guest reads state from PM cap, it will find the virtio-gpu has
> >>>>>> already been D0 state,
> >>>>> This behavior needs to be fixed. As the spec listed out, " This
> >>>>> 2-bit field is
> >>>> used both to determine the current power state of a Function"
> >>>> Do you mean it is wrong to reset PM cap when vritio_gpu reset in
> >>>> current qemu code? Why?
> >>> Because PM is implementing the support from D3->d0 transition and if
> >>> it
> >> device in D3, the device needs to respond that it is in D3 to match
> >> the PCI spec.
> >>>
> >>>> Shouldn't all device states, including PM registers, be reset
> >>>> during the process of virtio-gpu reset?
> >>> If No_Soft_Reset== 0, no device context to be preserved.
> >>> If No_Soft_Reset == 1, device to preserve minimal registers and
> >>> other
> >> things listed in pci spec.
> >>>
> >>>>
> >>>>>
> >>>>> So device needs to return D3 in the PowerState field.  This is must.
> >>>> But current behavior is a kind of soft reset, I
> >>>> think(!PCI_PM_CTRL_NO_SOFT_RESET). The pm cap is reset by qemu is
> >>>> reasonable.
> >>> What you described is, you need to do No_Soft_Reset=1, so please set
> >>> the
> >> cap accordingly to achieve the restore.
> >>>
> >>> Also in your case the if the QEMU knows that it will have to resume
> >>> the

[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-12 Thread Parav Pandit


> From: Chen, Jiqian 
> Sent: Friday, January 12, 2024 2:55 PM
> 
> On 2024/1/12 16:47, Parav Pandit wrote:
> >
> >
> >> From: Chen, Jiqian 
> >> Sent: Friday, January 12, 2024 1:55 PM
> >>
> >>
> >> On 2024/1/12 16:02, Parav Pandit wrote:
> >>> Hi Jiqian,
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Friday, January 12, 2024 1:11 PM
> >>>
> >>>>
> >>>> Hi all,
> >>>> Sorry to reply late.
> >>>> I don't know if you still remember this problem, let me briefly descript
> it.
> >>>> I am working to implement virtgpu S3 function on Xen.
> >>>> Currently on Xen, if we start a guest through qemu with enabling
> >>>> virtgpu, and then suspend and resume guest. We can find that the
> >>>> guest kernel comes back, but the display doesn't. It just shown a
> >>>> black
> >> screen.
> >>>> That is because during suspending, guest called into qemu and qemu
> >>>> destroyed all display resources and reset renderer. This made the
> >>>> display gone after guest resumed.
> >>>> So, I added a new command for virtio-gpu: when the guest is
> >>>> suspending, it notifies qemu and sets the
> >>>> parameter (preserver_resource) to 1 so that qemu preserves
> >>>> resources, and when resuming, the guest notifies qemu to set the
> >>>> parameter back to 0, and qemu resumes its normal actions. That can
> help the guest's display come back.
> >>>> When I upstreamed the above implementation, Parav and MST suggested I
> >> use
> >>>> the PM capability to fix this problem instead of adding a new
> >>>> command or state bit.
> >>>> Now, I have tried the PM capability of virtio-gpu, but it can't be used
> >>>> to solve this problem.
> >>>> The reason is:
> >>>> during guest suspending, it will write D3 state through PM cap,
> >>>> then I can save the resources of virtio-gpu on Qemu side(set
> >>>> preserver_resource to 1), but during the process of resuming, the
> >>>> state of PM cap will be cleared by qemu
> >>>> resetting(qemu_system_wakeup-> qemu_devices_reset->
> >>>> virtio_vga_base_reset-> virtio_gpu_gl_reset), so that when the
> >>>> guest reads the state from the PM cap, it finds the virtio-gpu is
> >>>> already in D0 state,
> >>> This behavior needs to be fixed. As the spec listed out, " This
> >>> 2-bit field is
> >> used both to determine the current power state of a Function"
> >> Do you mean it is wrong to reset the PM cap when virtio_gpu is reset
> >> in the current qemu code? Why?
> > Because PM implements support for the D3->D0 transition, and if the
> device is in D3, the device needs to report that it is in D3 to match the PCI
> spec.
> >
> >> Shouldn't all device states, including PM registers, be reset during
> >> the process of virtio-gpu reset?
> > If No_Soft_Reset == 0, no device context is to be preserved.
> > If No_Soft_Reset == 1, the device is to preserve minimal registers and the
> other things listed in the PCI spec.
> >
> >>
> >>>
> >>> So device needs to return D3 in the PowerState field.  This is must.
> >> But the current behavior is a kind of soft reset, I
> >> think (!PCI_PM_CTRL_NO_SOFT_RESET). So it is reasonable that the PM cap
> >> is reset by qemu.
> > What you described means you need No_Soft_Reset=1, so please set the
> cap accordingly to achieve the restore.
> >
> > Also, in your case, QEMU knows that it will have to resume the
> device soon.
> > Hence, it can well prepare the gpu context before resuming the vcpus.
> > In such a case, the extra register wouldn't be necessary.
> > Having the PM control register is anyway needed to drive the device properly.
> Even as you said, the reset behavior of PM in the current QEMU code is
> incorrect, and even if it is modified, it also cannot fix this problem with 
> S3.
You meant D3?

> Because the device is reset twice during resuming.
> The first is when we trigger a resume command to qemu; qemu will reset the
> device(qemu_system_wakeup-> qemu_devices_reset->
> virtio_vga_base_reset-> virtio_gpu_gl_reset).
A device implementation that supports PM should not reset the device. Please 
fix it.

> After the first time, the second time happens in the guest kernel,
> virtio_device_restore-> virtio_reset_device. But the PM state change (D3 to
> D0) happens between those two, so the display resources will still be
> destroyed by the second device reset.
A driver that understands that the device supports PM will skip the reset steps.
Hence the 2nd reset will also not occur.
Therefore, device state will be restored.
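
For illustration, a minimal sketch of such a PM-aware restore path on the Linux
side (virtio_device_restore() and the PCI PM register definitions exist in the
kernel; the skip-reset policy shown here is an assumption for this discussion,
not current kernel code):

/* Sketch: skip the full reset/re-init when the device kept its
 * context across D3hot->D0 (PCI PM No_Soft_Reset = 1). */
static int virtio_pci_restore_sketch(struct device *dev)
{
        struct pci_dev *pci_dev = to_pci_dev(dev);
        struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
        u16 pmcsr;

        pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
        if (pmcsr & PCI_PM_CTRL_NO_SOFT_RESET) {
                /* Context survived D3hot: return to D0 without resetting. */
                pci_set_power_state(pci_dev, PCI_D0);
                return 0;
        }

        /* Otherwise fall back to the existing full re-initialization. */
        return virtio_device_restore(&vp_dev->vdev);
}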


[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-12 Thread Parav Pandit


> From: Chen, Jiqian 
> Sent: Friday, January 12, 2024 1:55 PM
> 
> 
> On 2024/1/12 16:02, Parav Pandit wrote:
> > Hi Jiqian,
> >
> >> From: Chen, Jiqian 
> >> Sent: Friday, January 12, 2024 1:11 PM
> >
> >>
> >> Hi all,
> >> Sorry to reply late.
> >> I don't know if you still remember this problem, let me briefly describe
> >> it.
> >> I am working to implement virtgpu S3 function on Xen.
> >> Currently on Xen, if we start a guest through qemu with virtgpu
> >> enabled, and then suspend and resume the guest, we find that the
> >> guest kernel comes back, but the display doesn't. It just shows a black
> screen.
> >> That is because during suspend, the guest called into qemu and qemu
> >> destroyed all display resources and reset the renderer. This made the
> >> display disappear after the guest resumed.
> >> So, I added a new command for virtio-gpu: when the guest is suspending,
> >> it notifies qemu and sets the parameter (preserver_resource) to 1 so
> >> that qemu preserves resources, and when resuming, the guest notifies
> >> qemu to set the parameter back to 0, and qemu resumes its normal
> >> actions. That can help the guest's display come back.
> >> When I upstreamed the above implementation, Parav and MST suggested I
> use
> >> the PM capability to fix this problem instead of adding a new command
> >> or state bit.
> >> Now, I have tried the PM capability of virtio-gpu, but it can't be used
> >> to solve this problem.
> >> The reason is:
> >> during guest suspending, it will write D3 state through PM cap, then
> >> I can save the resources of virtio-gpu on Qemu side(set
> >> preserver_resource to 1), but during the process of resuming, the
> >> state of PM cap will be cleared by qemu
> >> resetting(qemu_system_wakeup-> qemu_devices_reset->
> >> virtio_vga_base_reset-> virtio_gpu_gl_reset), so that when the
> >> guest reads the state from the PM cap, it finds the virtio-gpu is
> >> already in D0 state,
> > This behavior needs to be fixed. As the spec listed out, " This 2-bit field 
> > is
> used both to determine the current power state of a Function"
> Do you mean it is wrong to reset the PM cap when virtio_gpu is reset in the
> current qemu code? Why?
Because PM implements support for the D3->D0 transition, and if the device is
in D3, the device needs to report that it is in D3 to match the PCI spec.

> Shouldn't all device states, including PM registers, be reset during the
> process of virtio-gpu reset?
If No_Soft_Reset == 0, no device context is to be preserved.
If No_Soft_Reset == 1, the device is to preserve minimal registers and the
other things listed in the PCI spec.
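
For illustration, a driver could distinguish the two cases with the standard
Linux PCI PM definitions (a sketch; the policy around it is an assumption for
this discussion):

u16 pmcsr;

pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
if (pmcsr & PCI_PM_CTRL_NO_SOFT_RESET) {
        /* No_Soft_Reset == 1: device context survives D3hot->D0; only
         * the PCI-mandated minimal state is affected. */
} else {
        /* No_Soft_Reset == 0: an internal reset occurs on D3hot->D0,
         * so no device context can be assumed to be preserved. */
}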

> 
> >
> > So device needs to return D3 in the PowerState field.  This is must.
> But the current behavior is a kind of soft reset, I
> think (!PCI_PM_CTRL_NO_SOFT_RESET). So it is reasonable that the PM cap is
> reset by qemu.
What you described means you need No_Soft_Reset=1, so please set the cap
accordingly to achieve the restore.

Also, in your case, QEMU knows that it will have to resume the device
soon.
Hence, it can well prepare the gpu context before resuming the vcpus.
In such a case, the extra register wouldn't be necessary.
Having the PM control register is anyway needed to drive the device properly.

Having the extra busy_poll register is flexible. So please evaluate if you need 
it or not.

> 
> >
> > In addition to it, an additional busy_poll register is helpful to
> > indicate that the
> device is ready to use.
> > This is because 10msec is the timeout set by the PCI spec.
> > This can be hard for devices to implement if a large GPU state is
> > being
> read from files or slow media, or for other hw devices in large quantities.
> >
> > This limit comes from the PCI spec normative text below.
> >
> > After transitioning a VF from D3Hot to D0, at least one of the following is
> true:
> > ◦ At least 10 ms has passed since the request to enter D0 was issued.
> >
> > So Readiness Time Reporting capability is not so useful.
> >
> > Hence, after PM state to D3->D0 transition is successful, virtio level PCI
> register is useful to ensure that device is resumed to drive rest of the
> registers.
> >
> >> so
> >> guest will not write D0 through the PM cap, and I can't know when to
> >> restore the resources (set preserver_resource to 0).
> >> Do you have any other suggestions?
> >> Or can I just fall back to the version that adds a new
> >> command(VIRTIO_GPU_CMD_PRESERVE_RESOURCE) in virtio-gpu? I think
>

[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2024-01-12 Thread Parav Pandit
Hi Jiqian,

> From: Chen, Jiqian 
> Sent: Friday, January 12, 2024 1:11 PM

> 
> Hi all,
> Sorry to reply late.
> I don't know if you still remember this problem, let me briefly describe it.
> I am working to implement virtgpu S3 function on Xen.
> Currently on Xen, if we start a guest through qemu with virtgpu enabled, and
> then suspend and resume the guest, we find that the guest kernel comes
> back, but the display doesn't. It just shows a black screen.
> That is because during suspend, the guest called into qemu and qemu
> destroyed all display resources and reset the renderer. This made the display
> disappear after the guest resumed.
> So, I added a new command for virtio-gpu: when the guest is suspending, it
> notifies qemu and sets the parameter (preserver_resource) to 1 so that qemu
> preserves resources, and when resuming, the guest notifies qemu to set the
> parameter back to 0, and qemu resumes its normal actions. That can help the
> guest's display come back.
> When I upstreamed the above implementation, Parav and MST suggested I use
> the PM capability to fix this problem instead of adding a new command or
> state bit.
> Now, I have tried the PM capability of virtio-gpu, but it can't be used to solve
> this
> problem.
> The reason is:
> during guest suspending, it will write D3 state through PM cap, then I can 
> save
> the resources of virtio-gpu on Qemu side(set preserver_resource to 1), but
> during the process of resuming, the state of PM cap will be cleared by qemu
> resetting(qemu_system_wakeup-> qemu_devices_reset->
> virtio_vga_base_reset-> virtio_gpu_gl_reset), so that when the guest reads the
> state from the PM cap, it finds the virtio-gpu is already in D0 state,
This behavior needs to be fixed. As the spec listed out, " This 2-bit field is 
used both to determine the current power state of a Function"

So device needs to return D3 in the PowerState field.  This is must.

In addition to it, an additional busy_poll register is helpful to indicate that
the device is ready to use.
This is because 10 msec is the timeout set by the PCI spec.
This can be hard for devices to implement if a large GPU state is being
read from files or slow media, or for other hw devices in large quantities.

This limit comes from the PCI spec normative text below.

After transitioning a VF from D3Hot to D0, at least one of the following is 
true:
◦ At least 10 ms has passed since the request to enter D0 was issued.

So Readiness Time Reporting capability is not so useful.

Hence, after the PM D3->D0 state transition is successful, a virtio level PCI
register is useful to ensure that the device has resumed before driving the
rest of the registers.
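
To make that concrete, a driver-side sketch of such a readiness register (the
busy_poll field and its ready value are hypothetical; no such field exists in
the current common configuration layout):

/* After requesting D0, poll a virtio-level register instead of
 * relying only on the PCI 10 ms readiness bound. */
pci_set_power_state(pci_dev, PCI_D0);
while (vp_ioread16(&cfg->busy_poll) != 0)   /* hypothetical: 0 = ready */
        usleep_range(100, 200);             /* a real driver would bound this loop */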

> so
> guest will not write D0 through the PM cap, and I can't know when to restore the
> resources (set preserver_resource to 0).
> Do you have any other suggestions?
> Or can I just fall back to the version that adds a new
> command(VIRTIO_GPU_CMD_PRESERVE_RESOURCE) in virtio-gpu? I think
> that way is more reasonable and feasible for virtio-gpu to protect display
> resources during S3. As for other devices, if necessary, they can also refer 
> to
> the implementation of the virtio-gpu to add new commands to prevent
> resource loss during S3.
> 
> On 2023/10/27 11:03, Chen, Jiqian wrote:
> > Hi Michael S. Tsirkin and Parav Pandit, Thank you for your detailed
> > explanation. I will try to use PM cap to fix this issue.
> >
> > On 2023/10/26 18:30, Michael S. Tsirkin wrote:
> >> On Thu, Oct 26, 2023 at 10:24:26AM +, Chen, Jiqian wrote:
> >>>
> >>> On 2023/10/25 11:51, Parav Pandit wrote:
> >>>>
> >>>>
> >>>>> From: Chen, Jiqian 
> >>>>> Sent: Tuesday, October 24, 2023 5:44 PM
> >>>>>
> >>>>> On 2023/10/24 18:51, Parav Pandit wrote:
> >>>>>>
> >>>>>>> From: Chen, Jiqian 
> >>>>>>> Sent: Tuesday, October 24, 2023 4:06 PM
> >>>>>>>
> >>>>>>> On 2023/10/23 21:35, Parav Pandit wrote:
> >>>>>>>>
> >>>>>>>>> From: Chen, Jiqian 
> >>>>>>>>> Sent: Monday, October 23, 2023 4:09 PM
> >>>>>>>>
> >>>>>>>>>> I think this can be done without introducing the new register.
> >>>>>>>>>> Can you please check if the PM register itself can serve the
> >>>>>>>>>> purpose instead
> >>>>>>>>> of new virtio level register?
> >>>>>>>>> Do you mean the system PM register?
> >>>>>>>> No, the device's PM regist

[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2023-10-24 Thread Parav Pandit


> From: Chen, Jiqian 
> Sent: Tuesday, October 24, 2023 5:44 PM
> 
> On 2023/10/24 18:51, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Tuesday, October 24, 2023 4:06 PM
> >>
> >> On 2023/10/23 21:35, Parav Pandit wrote:
> >>>
> >>>> From: Chen, Jiqian 
> >>>> Sent: Monday, October 23, 2023 4:09 PM
> >>>
> >>>>> I think this can be done without introducing the new register.
> >>>>> Can you please check if the PM register itself can serve the
> >>>>> purpose instead
> >>>> of new virtio level register?
> >>>> Do you mean the system PM register?
> >>> No, the device's PM register at transport level.
> >> I tried to find this register(pci level or virtio pci level or virtio
> >> driver level), but I didn't find it in Linux kernel or Qemu codes.
> >> May I know which register you are referring to specifically? Or which
> >> PM state bit you mentioned below?
> >>
> > PCI spec's PCI Power Management Capability Structure in section 7.5.2.
> Yes, what you point to is PM capability for PCIe device.
> But the problem is still that the Qemu code checks the
> condition (pci_bus_is_express or pci_is_express) for all virtio-pci devices in
> the function virtio_pci_realize(); if a virtio device isn't a PCIe device, it
> will not
> add the PM capability for it.
The PCI PM capability is a must for PCIe devices, so maybe the QEMU code has put
it only under the is_pcie check.
But it can be done outside of that check as well, because this capability has
existed on PCI too for a long time and it is backward compatible.
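
As a sketch of that direction in QEMU's virtio_pci_realize() (pci_add_capability()
and PCI_PM_SIZEOF are existing QEMU helpers, though the exact signature varies
across QEMU versions; moving this outside the is_pcie check is the suggested
change, not current code):

/* Sketch: register the PM capability for conventional PCI too,
 * instead of only under the pci_is_express() branch. */
int pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF);
if (pos > 0) {
    /* The PMC/PMCSR fields (D3hot support, No_Soft_Reset, PowerState)
     * would then be initialized and emulated by the device model. */
    pci_set_word(pci_dev->config + pos + PCI_PM_PMC, 0);
}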

> And another problem is how about the MMIO transport devices? Since
> preserving resources is needed for all transport type devices.
> 
MMIO lacks such rich PM definitions. If in the future MMIO wants to support
this, it will be extended to match other transports like PCI.

> >>>
> >>>> I think it is unreasonable to let a virtio device listen to the PM
> >>>> state of the guest system.
> >>> The guest driver performs any such work on the guest system's PM
> >>> callback events in
> >> the virtio driver.
> >> I didn't find any PM state callback in the virtio driver.
> >>
> > There are virtio_suspend and virtio_resume in case of Linux.
> I think what you call virtio_suspend/resume are the freeze/restore callbacks from
> "struct virtio_driver" or the suspend/resume callbacks from "static const struct
> dev_pm_ops virtio_pci_pm_ops".
> And yes, I agree, if virtio devices have PM capability, maybe we can set PM 
> state
> in those callback functions.
> 
> >
> >>>
> >>>> It's more suitable that each device gets notifications from driver,
> >>>> and then do preserving resources operation.
> >>> I agree that each device gets the notification from the driver.
> >>> The question is, should it be the virtio driver, or is the existing
> >>> pci driver
> >> which transitions the state from d0->d3 and d3->d0 enough.
> >> It seems there isn't an existing pci driver to transition the d0 or d3
> >> state. Could you please tell me which one it is specifically? I am very
> >> willing to
> give it a try.
> >>
> > Virtio-pci modern driver of Linux should be able to.
> Yes, I know, it is the VIRTIO_PCI_FLAG_INIT_PM_BIT. But still the two
> problems I
> said above remain.
>
Both can be resolved without switching to pcie.
 
> >
> >>> Can you please check that?
> >>>
> >>>>>> --- a/transport-pci.tex
> >>>>>> +++ b/transport-pci.tex
> >>>>>> @@ -325,6 +325,7 @@ \subsubsection{Common configuration
> structure
> >>>>>> layout}\label{sec:Virtio Transport
> >>>>>>  /* About the administration virtqueue. */
> >>>>>>  le16 admin_queue_index; /* read-only for driver */
> >>>>>>  le16 admin_queue_num; /* read-only for driver */
> >>>>>> +le16 preserve_resources;/* read-write */
> >>>>> Preserving these resources in the device implementation takes
> >>>>> finite amount
> >>>> of time.
> >>>>> Possibly more than 40nsec (time of PCIe write TLP).
> >>>>> Hence this register must be a polling register to indicate that
> >>>> preservation_done.
> >>>>> This will tell the guest when the preservation is done, and when
> >>>>> restoration is
> >>>

[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2023-10-24 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Tuesday, October 24, 2023 4:06 PM
> 
> On 2023/10/23 21:35, Parav Pandit wrote:
> >
> >> From: Chen, Jiqian 
> >> Sent: Monday, October 23, 2023 4:09 PM
> >
> >>> I think this can be done without introducing the new register.
> >>> Can you please check if the PM register itself can serve the purpose
> >>> instead
> >> of new virtio level register?
> >> Do you mean the system PM register?
> > No, the device's PM register at transport level.
> I tried to find this register(pci level or virtio pci level or virtio driver 
> level), but I
> didn't find it in Linux kernel or Qemu codes.
> May I know which register you are referring to specifically? Or which PM state
> bit you mentioned below?
> 
PCI spec's PCI Power Management Capability Structure in section 7.5.2.
> >
> >> I think it is unreasonable to let a virtio device listen to the PM state
> >> of the guest system.
> > The guest driver performs any such work on the guest system's PM callback events in
> the virtio driver.
> I didn't find any PM state callback in the virtio driver.
> 
There are virtio_suspend and virtio_resume in case of Linux.
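
For reference, a simplified sketch of how those hooks are wired in the Linux
virtio-pci driver (modeled on drivers/virtio/virtio_pci_common.c, with details
elided):

#ifdef CONFIG_PM_SLEEP
static int virtio_pci_freeze(struct device *dev)
{
        struct pci_dev *pci_dev = to_pci_dev(dev);
        struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);

        /* Invokes the virtio_driver's freeze() callback. */
        return virtio_device_freeze(&vp_dev->vdev);
}

static int virtio_pci_restore(struct device *dev)
{
        struct pci_dev *pci_dev = to_pci_dev(dev);
        struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);

        /* Re-initializes the device and invokes restore(). */
        return virtio_device_restore(&vp_dev->vdev);
}

static const struct dev_pm_ops virtio_pci_pm_ops = {
        SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore)
};
#endif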

> >
> >> It's more suitable that each device
> >> gets notifications from the driver, and then does the preserve-resources operation.
> > I agree that each device gets the notification from the driver.
> > The question is, should it be the virtio driver, or is the existing pci driver which
> transitions the state from d0->d3 and d3->d0 enough.
> It seems there isn't an existing pci driver to transition the d0 or d3 state. Could
> you
> please tell me which one it is specifically? I am very willing to give it a try.
> 
Virtio-pci modern driver of Linux should be able to.

> > Can you please check that?
> >
> >>>> --- a/transport-pci.tex
> >>>> +++ b/transport-pci.tex
> >>>> @@ -325,6 +325,7 @@ \subsubsection{Common configuration structure
> >>>> layout}\label{sec:Virtio Transport
> >>>>  /* About the administration virtqueue. */
> >>>>  le16 admin_queue_index; /* read-only for driver */
> >>>>  le16 admin_queue_num; /* read-only for driver */
> >>>> +le16 preserve_resources;/* read-write */
> >>> Preserving these resources in the device implementation takes finite
> >>> amount
> >> of time.
> >>> Possibly more than 40nsec (time of PCIe write TLP).
> >>> Hence this register must be a polling register to indicate that
> >> preservation_done.
> >>> This will tell the guest when the preservation is done, and when
> >>> restoration is
> >> done, so that it can resume upper layers.
> >>>
> >>> Please refer to queue_reset definition to learn more about such
> >>> register
> >> definition.
> >> Thanks, I will refer to "queue_reset". So, I need three values,
> >> driver write 1 to let device do preserving resources, driver write 2
> >> to let device do restoring resources, device write 0 to tell driver
> >> that preserving or restoring done, am I right?
> >>
> > Right.
> >
> > And if the existing pcie pm state bits can do, we can leverage that.
> > If it cannot be used, lets add that reasoning in the commit log to describe 
> > this
> register.
> >
> >>>
> >>> Let's please make sure that the PCIe PM level registers are
> >>> sufficient/not-sufficient
> >> to decide on the addition of this register.
> >> But if the device is not a PCIe device, it doesn't have PM
> >> capability, then this will not work. Actually in my local
> >> environment, pci_is_express() return false in Qemu, they are not PCIe
> device.
> > It is reasonable in 2023 to ask to plug the device in as a PCIe device to get
> > new functionality, especially for a gpu device,
> > which does not have a very long history of backward compatibility.
> Do you suggest I add the PM capability for virtio-gpu or change virtio-gpu to
> a
> PCIe device?
> 
PCI Power Management Capability Structure does not seem to be limited to PCIe.



[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2023-10-23 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Monday, October 23, 2023 4:09 PM

> > I think this can be done without introducing the new register.
> > Can you please check if the PM register itself can serve the purpose instead
> of new virtio level register?
> Do you mean the system PM register? 
No, the device's PM register at transport level.

> I think it is unreasonable to let a virtio
> device listen to the PM state of the guest system.
The guest driver performs any such work on the guest system's PM callback events
in the virtio driver.

> It's more suitable that each device
> gets notifications from the driver, and then does the preserve-resources operation.
I agree that each device gets the notification from the driver.
The question is, should it be the virtio driver, or is the existing pci driver
which transitions the state from d0->d3 and d3->d0 enough.
Can you please check that?

> >> --- a/transport-pci.tex
> >> +++ b/transport-pci.tex
> >> @@ -325,6 +325,7 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport
> >>  /* About the administration virtqueue. */
> >>  le16 admin_queue_index; /* read-only for driver */
> >>  le16 admin_queue_num; /* read-only for driver */
> >> +le16 preserve_resources;/* read-write */
> > Preserving these resources in the device implementation takes finite amount
> of time.
> > Possibly more than 40nsec (time of PCIe write TLP).
> > Hence this register must be a polling register to indicate that
> preservation_done.
> > This will tell the guest when the preservation is done, and when 
> > restoration is
> done, so that it can resume upper layers.
> >
> > Please refer to queue_reset definition to learn more about such register
> definition.
> Thanks, I will refer to "queue_reset". So, I need three values: the driver writes
> 1 to
> let the device preserve resources, the driver writes 2 to let the device restore
> resources, and the device writes 0 to tell the driver that preserving or restoring
> is done, am I
> right?
> 
Right.

And if the existing pcie pm state bits can do this, we can leverage them.
If they cannot be used, let's add that reasoning in the commit log to describe
this register.
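
For illustration, a driver-side sketch of that three-value handshake
(preserve_resources is the field proposed in the patch; the helper name,
constants and polling interval are assumptions):

#define VIRTIO_PRESERVE 1   /* proposed: driver asks device to preserve */
#define VIRTIO_RESTORE  2   /* proposed: driver asks device to restore */

static void preserve_resources_sketch(struct virtio_pci_modern_device *mdev,
                                      u16 op)
{
        /* preserve_resources is the field proposed for virtio_pci_common_cfg. */
        vp_iowrite16(op, &mdev->common->preserve_resources);

        /* The device writes 0 back once the operation completes. */
        while (vp_ioread16(&mdev->common->preserve_resources) != 0)
                usleep_range(100, 200); /* a real driver would add a timeout */
}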

> >
> > Let's please make sure that the PCIe PM level registers are
> > sufficient/not-sufficient
> to decide on the addition of this register.
> But if the device is not a PCIe device, it doesn't have the PM capability, so
> this
> will not work. Actually in my local environment, pci_is_express() returns
> false in
> Qemu; they are not PCIe devices.
It is reasonable in 2023 to ask to plug the device in as a PCIe device to get new
functionality, especially for a gpu device,
which does not have a very long history of backward compatibility.
Please explore whether the d0<->d3 PM state bits can be used.

Thanks a lot.
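
For reference, a minimal sketch of exercising that transition from a Linux guest
with the standard PCI PM helpers (where exactly this belongs in the virtio
suspend/resume path is the open question):

u16 pmcsr;

/* Enter D3hot on suspend. */
pci_set_power_state(pci_dev, PCI_D3hot);

/* Per the PCI spec, PMCSR must now report D3 in the PowerState field. */
pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
WARN_ON((pmcsr & PCI_PM_CTRL_STATE_MASK) != PCI_D3hot);

/* Return to D0 on resume; a PM-aware device restores its context here. */
pci_set_power_state(pci_dev, PCI_D0);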


[virtio-dev] RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES

2023-10-23 Thread Parav Pandit
Hi Jiqian,

> From: Jiqian Chen 
> Sent: Saturday, October 21, 2023 9:21 AM
> To: Michael S . Tsirkin ; Gerd Hoffmann
> ; Parav Pandit ; Jason Wang
> ; Xuan Zhuo ; David
> Airlie ; Gurchetan Singh
> ; Chia-I Wu ; Marc-
> André Lureau ; Robert Beckett
> ; Mikhail Golubev-Ciuchea  ciuc...@opensynergy.com>; virtio-comm...@lists.oasis-open.org; virtio-
> d...@lists.oasis-open.org
> Cc: Honglei Huang ; Julia Zhang
> ; Huang Rui ; Jiqian Chen
> 
> Subject: [PATCH v6 1/1] content: Add new feature
> VIRTIO_F_PRESERVE_RESOURCES
> 
> In some scenes, Qemu may reset or destroy resources of virtio device, but some
> of them can't be re-created, so that causes some problems.
> 
It can be re-created. It is just that the guest driver lost the previous
resource values.
So to me, a better wording for the motivation, combined with the line below, is:

Currently guest drivers destroy and re-create virtio resources during power 
management suspend/resume sequence respectively.
For example, for a PCI transport, even if the device offers the D3 to D0 state
transition, a virtio guest driver has no way to know if
the device can store the state during the D0->D3 transition and the same state
can be restored on the D3->D0 transition.

> For example, when we do S3 for the guest, the guest will set device_status to 0, which
> causes Qemu to reset the virtio-gpu device, and then all render resources of 
> virtio-
> gpu will be destroyed. As a result, after guest resuming, the display can't 
> come
> back, and we only see a black screen.
> 
To make guest drivers aware of the above device capability, introduce a new feature
bit to indicate such a capability.
This patch adds...

> In order to deal with the above scene, we need a mechanism that allows guest
> and Qemu to negotiate their behaviors for resources. So, this patch adds a new
> feature named VIRTIO_F_PRESERVE_RESOURCES. It allows guest to tell Qemu
> when there is a need to preserve resources, guest must preserve resources 
> until
> 0 is set.
> 
I think this can be done without introducing the new register.
Can you please check if the PM register itself can serve the purpose instead of a
new virtio level register?

> Signed-off-by: Jiqian Chen 
> ---
>  conformance.tex   |  2 ++
>  content.tex   | 25 +
>  transport-pci.tex |  6 ++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/conformance.tex b/conformance.tex index dc00e84..60cc0b1
> 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -91,6 +91,7 @@ \section{Conformance Targets}\label{sec:Conformance /
> Conformance Targets}  \item \ref{drivernormative:Basic Facilities of a Virtio
> Device / Packed Virtqueues / The Virtqueue Descriptor Table / Indirect
> Descriptors}  \item \ref{drivernormative:Basic Facilities of a Virtio Device /
> Packed Virtqueues / Supplying Buffers to The Device / Updating flags}  \item
> \ref{drivernormative:Basic Facilities of a Virtio Device / Packed Virtqueues /
> Supplying Buffers to The Device / Sending Available Buffer Notifications}
> +\item \ref{drivernormative:Basic Facilities of a Virtio Device /
> +Preserve Resources}
>  \item \ref{drivernormative:General Initialization And Device Operation /
> Device Initialization}  \item \ref{drivernormative:General Initialization And
> Device Operation / Device Cleanup}  \item \ref{drivernormative:Reserved
> Feature Bits} @@ -172,6 +173,7 @@ \section{Conformance
> Targets}\label{sec:Conformance / Conformance Targets}  \item
> \ref{devicenormative:Basic Facilities of a Virtio Device / Packed Virtqueues /
> The Virtqueue Descriptor Table}  \item \ref{devicenormative:Basic Facilities 
> of
> a Virtio Device / Packed Virtqueues / Scatter-Gather Support}  \item
> \ref{devicenormative:Basic Facilities of a Virtio Device / Shared Memory
> Regions}
> +\item \ref{devicenormative:Basic Facilities of a Virtio Device /
> +Preserve Resources}
>  \item \ref{devicenormative:Reserved Feature Bits}  \end{itemize}
> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..b6b1859 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -502,6 +502,27 @@ \section{Exporting Objects}\label{sec:Basic Facilities
> of a Virtio Device / Expo  types. It is RECOMMENDED that devices generate
> version 4  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
> 
> +\section{Preserve Resources}\label{sec:Basic Facilities of a Virtio
> +Device / Preserve Resources}
> +
> +As virtio devices are paravirtualization devices by design.
This is not true and not relevant for the spec. Please remove this line.

> +There are various devices resources created by sending commands from
> +frontend and stored in backend.
> +
> +In some scenes, resources may be destroyed or reset, some of them can
> +be re-created 

[virtio-dev] RE: [virtio-comment] Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-10-11 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, October 11, 2023 4:09 PM

> I am sure I have not ignored any questions.
What about the one below?

https://lore.kernel.org/virtio-dev/20230921011221-mutt-send-email-...@kernel.org/



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-10-09 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Monday, October 9, 2023 3:36 PM
> 
> On 9/27/2023 6:39 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan 
> >> Sent: Wednesday, September 27, 2023 1:50 PM I see his proposal can:
> >> 1) meet some customers requirements without nested and bare-metal
> >> 2) align with Nvidia production
> > Slightly inaccurate.
> > The work produced is for the virtio spec update for the users.
> >
> > I missed adding the Sign-offs of other contributors who also share similar use
> cases, which I will add in v1.
> >
> >> 3) easier to emulate by onboard SOC
> >>
> >> The general purpose of his proposal and mine are aligned: migrate virtio
> 
> >> devices.
> >>
> > Great.
> >
> >> Jason has ever proposed to collaborate, please allow me quote his proposal:
> >>
> >> "
> >> Let me repeat once again here for the possible steps to collaboration:
> >>
> >> 1) define virtqueue state, inflight descriptors in the section of
> >> basic facility but not under the admin commands
> >> 2) define the dirty page tracking, device context/states in the
> >> section of basic facility but not under the admin commands
> >> 3) define transport specific interfaces or admin commands to access
> >> them "
> >>
> >> I totally agree with his proposal.
> > We started discussing some of it.
> > If I draw parallels, one should not say "detach descriptors from virtqueue" 
> > for
> the infrastructure that exists in the basic facilities.
> > If so, one should explain the technical design reason and it would make 
> > sense.
> not sure what is  "detach descriptors from virtqueue", but admin vq carries
> commands anyway.
> >
> > So let's discuss it.
> > I would like to better understand the _real_ technical reason for detaching it.
> so please to or cc me in your series.
Sure, will do it in the v2.
Jason already added you in the v1.
It is already in the virtio-comment mailing list, so for now you can respond to 
v1.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-27 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Wednesday, September 27, 2023 1:50 PM

> I see his proposal can:
> 1) meet some customers requirements without nested and bare-metal
> 2) align with Nvidia production
Slightly inaccurate.
The work produced is for the virtio spec update for the users.

I missed adding the Sign-offs of other contributors who also share similar use
cases, which I will add in v1.

> 3) easier to emulate by onboard SOC
> 
> The general purpose of his proposal and mine are aligned: migrate virtio  
> devices.
> 
Great.

> Jason has ever proposed to collaborate, please allow me quote his proposal:
> 
> "
> Let me repeat once again here for the possible steps to collaboration:
> 
> 1) define virtqueue state, inflight descriptors in the section of
> basic facility but not under the admin commands
> 2) define the dirty page tracking, device context/states in the
> section of basic facility but not under the admin commands
> 3) define transport specific interfaces or admin commands to access them
> "
> 
> I totally agree with his proposal.

We started discussing some of it.
If I draw parallels, one should not say "detach descriptors from virtqueue" for 
the infrastructure that exists in the basic facilities.
If so, one should explain the technical design reason and it would make sense.

So let's discuss it.
I would like to better understand the _real_ technical reason for detaching it.


RE: [virtio-dev] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-27 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Tuesday, September 26, 2023 5:28 PM

[..]

> > When new feature bit of suspend is offered, guest will _not_ do reset, hence
> no need to mix reset with suspend.
> >
> > [1] 
> > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00260.html
> 
> I frankly don't understand what you proposed there.
> 
> reset is a simple operation, it just erases all virtio state from the device. 
> It is very
> commonly is performed at OS/driver startup. Making it not reset some virtio
> state is a bad idea since suddenly you kexec a guest that attempts a reset and
> device is stuck in some weird state.

Just exactly what you said.
Reset simply resets the device including the suspended state and any other bit 
being set.

The reset flow cannot be influenced by some suspend bit setting.
I said the same in [1] above.


RE: [virtio-dev] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-26 Thread Parav Pandit
Hi Jiqian,

> From: Chen, Jiqian 
> Sent: Tuesday, September 26, 2023 4:30 PM

> In my scenario, setting unfreeze is not meant to trigger resume behavior on the
> device.
> You may have misunderstood my intention.
> In the current kernel and qemu code, setting device_status to 0 during
> resuming
> triggers a device reset, and I found some reset operations are
> not
> reasonable during resuming; for example, virtio_gpu_reset will destroy render
> resources and cause the display to be gone after the guest finishes resuming.

When suspend is done, resume must not be done.
I don’t see a need to mix these operations from the guest, especially when suspend is
done.

Did I miss your response for my suggestion few days ago in [1] for not mixing 
reset with suspend?

When new feature bit of suspend is offered, guest will _not_ do reset, hence no 
need to mix reset with suspend.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00260.html


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-26 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Tuesday, September 26, 2023 11:07 AM

> > 1. cover letter is missing the problem statement and use case
> I only reply to this section of comments, this does not mean I agree with you 
> on
> your other statements, Instead I agree with Jason on his replies to you.
> 
> In my cover letter:
> "The main usecase of these new facilities is Live Migration."
>
:)
Two words do not explain the use case of why one is asking to mediate a
native virtio device.

And yet you call it the basic facilities.
Anyway, the misalignment between the email responses and the cover letter is evident.

> Did you miss it?
No.
It misses the detail of the theory of operation I described in [1].

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html

> > 2. why queue suspend and both device suspend are introduced, only one
> should be there. The design description is missing.
> there are no queue suspend, they are device suspend and vq state accessors.
> please read the patch if you want to comment.
> > 3. Even though it is claimed under some random basic facility, the cover letter
> > clearly
> states the main use case is "live migration".
> it is not random, they are precisely defined virtio basic facilities.
> > 4. Patch 4 is not needed at all. When device is suspended, it is 
> > _suspended_. It
> does not do any bifurcation.
> The device should not accept vq reset and the driver should reset vqs, please
> read previous discussions with MST and please don't ignore the conclusions
> > 5. only suspend bit of patch2 is not enough to cover P2P. One needs suspend
> and freeze both covered in series [1].
> we have discussed this many times, P2P is out of the virtio spec, do you want
> to
> mediate every PCI state/functionality?
> > 6. Finally the whole description of 1 to 4 need to be split in the device
> operation, so that both passthrough and medication can utilize it using admin
> cmd and otherwise.
> Do you see any reasons this solution can not be used for passthrough and
> mediation?
Right. The proposed solution does not meet the following requirements addressed in
[1].

[1] 
https://lore.kernel.org/virtio-comment/20230906081637.32185-1-lingshan@intel.com/T/#m7efbaadbc73f033c2793d9eb1eb0afa210aae4be

I replied to Jason in a previous email.
I will repeat here. They are covered in [1]:

1. Missing P2P support
2. Missing dirty page tracking
3. Incremental device context framework for short downtime
3.a Ability to do inflight descriptor tracking
4. Ability to do the work for multiple member devices in parallel.


> Or does features_OK work for passthrough or mediation? Any difference?
It does not work. Passthrough devices are not trapped by the hypervisor.

> > Since Zhu told me that dirty tracking and inflight descriptors will be done, I
> presume he will propose to do them over the admin q or command interface.
> > And since all can run over the admin commands, the plumbing done in 1 to 4
> can be made using admin commands.
> No
Such a negative assertion does not help.
Explain the why part, like how I explained above.

> >
> > Until now we could not establish that creating yet another DMA interface is
> better than the q interface.
> > So ...
> > To me both the methods will start looking more converged to me over admin
> command and queues.
> I don't think so, again, we are introducing basic facilities and these 
> facilities
> don't depend on or rely on admin vq.
If so, drop the words "live migration" from the cover letter.
It relies on admin commands (again, not the vq; be careful what you constantly claim).
Reliance or non-reliance on an admin queue or admin command does not make it a basic
facility.

Admin commands and queues are already in the basic facilities section today.
So claiming that because one is using admin commands it is a non-basic
facility is not correct.

> >
> > Passthrough will use them over owner device.
> > Mediation somehow needs to do it over the member device.
> > Mediation will not use any device suspend command because it needs to
> keep bisecting everything.
> please read QEMU vhost live migration solution
Can you please share the pointer to it?

I am familiar with [2] and it does not require device suspend flow as things 
are bisected.

[2] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#introduction


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-25 Thread Parav Pandit


> From: Jason Wang 
> Sent: Tuesday, September 26, 2023 10:07 AM
> 
> On Tue, Sep 26, 2023 at 11:40 AM Parav Pandit  wrote:
> >
> >
> >
> > > From: Jason Wang 
> > > Sent: Tuesday, September 26, 2023 8:16 AM
> > >
> > > On Mon, Sep 25, 2023 at 6:41 PM Parav Pandit  wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang 
> > > > > Sent: Friday, September 22, 2023 8:38 AM
> > > >
> > > > > > Device context has no overlap.
> > > > >
> > > > > I can give you one example, e.g debugging.
> > > > >
> > > > Almost every feature needs debugging. :) So I am omitting it for
> > > > the time being.
> > >
> > > Well, I don't think so. We have a lot of handy tools for that (ethtool 
> > > -d?).
> > Sure, add something specific for debug and explicitly mention that it is for
> debug, like -d.
> > Every feature and functionality needs debug, not specifically device 
> > context.
> > So add infra for debug.
> 
> Why do you think it's an infra? All you need to do is a simple decoupling.
>
That is too vague a comment for me.
Everything you need to debug can be queried from the device if needed.
So please add the debug infrastructure for it.

It may be useful outside of debug too.

Practically any query interface added can be used by the member driver for
debug purposes.

> > Device migration series is not the vehicle to piggy back on.
> >
> > > > > 1) define virtqueue state, inflight descriptors in the section
> > > > > of basic facility but not under the admin commands
> > > > It will be part of the device context in such a way that one
> > > > can read only
> > > the vq state instead of the full device context.
> > > > This will work.
> > >
> > > I'm not sure what it looks like, but I think they are well decoupled in 
> > > this
> series.
> > > E.g driver can choose to just read e.g last_avail_idx and report to
> > > ethtool or watchdog.
> > >
> > Once it's done, it will be visible what it looks like.
> > The key is it needs to cover BOTH use cases.
> >
> > > As I replied in other thread, I see several problems:
> > >
> > > 1) layer violation, PCI specific state were mixed into the basic
> > > facilities
> >
> > After agreeing to see a merged device context, now you are hinting that you don’t
> agree to merge the two.
> 
> It's not a merging, it's about decoupling. I'm fine if you don't do coupling 
> and
> layer violation.
>
Admin command is decoupled from admin virtqueue already.
Device context is decoupled from admin command already.
Dirty page tracking is decoupled from admin command already.
Device mode is decoupled from admin command already.

 
> > I disagree if you are leaning towards that direction.
> > I hope my deduction from your above comment is incorrect.
> 
> I agree to seek a way to unify but it doesn't mean everything in your current
> proposal is correct. Basic facility part should be transport independent.
>
Can you please comment in my series what is incorrect? I would like to discuss 
there, similar to your ask here.

 
> >
> > There is no violation. PCI specific device context will be captured in PCI
> specific section.
> 
> Is that what you've done in your series now?
>
It will be added in the v1 of my series.
 
> > Device type contexts will be captured in those device type sections.
> >
> > TLVs will cover many of the device context information.
> >
> > > 2) I don't see a good definition on "device context"
> > > 3) TLV works fine for queue but not registers
> > >
> > Please see the definition of device context in [1].
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00064.h
> > tml
> >
> > > What needs to be done first is to describe what device context means
> > > and what it contains. Not the actual data structure since it may vary.
> > >
> > Sure, it is already defined in device migration theory of operation section 
> > in
> [2].
> 
> It's too vague, for example it's not easy to infer if the following belong to 
> device
> context:
> 
I am 100% confident, and challenge you, that the theory of operation explained in my
series is significantly better than Lingshan's commit log saying "main use case
is live migration".

> 1) dirty pages

The above does not. Once you read each patch it should be clear, because dirty
pages are not part of the device context.

> 2) vir

RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-25 Thread Parav Pandit


> From: Jason Wang 
> Sent: Tuesday, September 26, 2023 8:16 AM
> 
> On Mon, Sep 25, 2023 at 6:41 PM Parav Pandit  wrote:
> >
> >
> >
> > > From: Jason Wang 
> > > Sent: Friday, September 22, 2023 8:38 AM
> >
> > > > Device context has no overlap.
> > >
> > > I can give you one example, e.g debugging.
> > >
> > Almost every feature needs debugging. :) So I am omitting it for the time
> > being.
> 
> Well, I don't think so. We have a lot of handy tools for that (ethtool -d?).
Sure, add something specific for debug and explicitly mention that it is for
debug, like -d.
Every feature and functionality needs debug, not specifically device context.
So add infra for debug. The device migration series is not the vehicle to
piggy-back on.

> > > 1) define virtqueue state, inflight descriptors in the section of
> > > basic facility but not under the admin commands
> > It will be part of the device context in such a way that one can
> > read
> only the vq state instead of the full device context.
> > This will work.
> 
> I'm not sure what it looks like, but I think they are well decoupled in this 
> series.
> E.g driver can choose to just read e.g last_avail_idx and report to ethtool or
> watchdog.
>
Once it's done, it will be visible what it looks like.
The key is it needs to cover BOTH use cases.

> As I replied in other thread, I see several problems:
> 
> 1) layer violation, PCI specific state were mixed into the basic facilities

After agreeing to see a merged device context, now you are hinting that you don’t agree
to merge the two.
I disagree if you are leaning towards that direction.
I hope my deduction from your above comment is incorrect.

There is no violation. PCI specific device context will be captured in PCI 
specific section.
Device type contexts will be captured in those device type sections.

TLVs will cover much of the device context information.
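
For illustration, a minimal sketch of how a TLV-framed device context record
could look, in the same struct notation the spec uses (the field names and type
values are assumptions for discussion, not the actual proposal):

struct virtio_dev_ctx_tlv {
        le16 type;      /* e.g. vq state, PCI-specific, device-type specific */
        le16 reserved;
        le32 length;    /* length of value[] in bytes */
        u8 value[];     /* field-specific payload */
};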

> 2) I don't see a good definition on "device context"
> 3) TLV works fine for queue but not registers
> 
Please see the definition of device context in [1].

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00064.html

> What needs to be done first is to describe what device context means and what
> it contains. Not the actual data structure since it may vary.
> 
Sure, it is already defined in the device migration theory of operation section in
[2].
I will try to take it out and put it in the device management section, so that device
migration can refer to it and
some other basic facility can also refer to it (which must explain a
use case beyond the silly debug point).

[2] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00062.html

> >
> > > 3) define transport specific interfaces or admin commands to access
> > > them
> > >
> > As you also envisioned, it is done using admin commands to access it.
> 
> That's fine, but it should allow other ways.
>
For passthrough, admin commands fit the design.
Sure, you need to draft how to do it other ways.
 > >
> > > Does this work? It seems you refused such steps in the past.
> > >
> > I didn’t.
> > There must be some confusion in the many emails we both exchanged, because the
> already posted v0 has it split, such as the device context.
> >
> > For dirty page tracking I couldn't find a solid use case without device
> migration, so I asked, which you already replied to above.
> 
> First, you never explain why such coupling gives us any benefit.
It is not coupled. Device migration uses this facility. So it is a matter of text
organization in the spec,
not the design.

> Second, I've given you sufficient examples but you tend to ignore them. Why
> not go through Qemu codes then you will see the answer.
You gave the examples of debug and profiling. You didn't explain the use case of
how to actually connect and how to profile, etc.

> > > Actually, I would like to leave 2) as it's very complicated which
> > > might not converge easily.
> > >
> > I will split the current "device migration" section into two.
> > 1. device management
> > 2. device migration
> >
> > Device management covers device mode, device context and dirty page
> tracking.
> 
> I don't see any connection between "management" and "device context".
>
Do you have a better name than device management?
Maybe device operation.
Maybe it is just better to have the device context just the way it is in [1]
under basic facility.
 
> > Device migration refers to the device management section.
> >
> > We can omit the dirty page tracking commands for a moment and first close
> on the device mode and device context as the first step.
> > Since it is

RE: [virtio-dev] [PATCH v2] virtio-tee: Reserve device ID 46 for TEE device

2023-09-25 Thread Parav Pandit


> From: virtio-dev@lists.oasis-open.org  On
> Behalf Of jeshwank
> Sent: Monday, September 25, 2023 4:47 PM
> To: virtio-dev@lists.oasis-open.org

I notice this repeatedly occurring error of choosing virtio-dev as the mailing list
for spec work.
I will fix the OASIS document references.

In the meantime, the mailing list should be virtio-comment@lists... for this
patch.


> 
> In a virtual environment, an application running in guest VM may want to
> delegate security sensitive tasks to a Trusted Application (TA) running 
> within a
> Trusted Execution Environment (TEE). A TEE is a trusted OS running in some
> secure environment, for example, TrustZone on ARM CPUs, or a separate secure
> co-processor etc.
> 
> A virtual TEE device emulates a TEE within a guest VM. Such a virtual TEE 
> device
> supports multiple operations such as:
> 
> VIRTIO_TEE_CMD_OPEN_DEVICE – Open a communication channel with virtio
>  TEE device.
> VIRTIO_TEE_CMD_CLOSE_DEVICE – Close communication channel with virtio
>   TEE device.
> VIRTIO_TEE_CMD_GET_VERSION – Get version of virtio TEE.
> VIRTIO_TEE_CMD_OPEN_SESSION – Open a session to communicate with
>   trusted application running in TEE.
> VIRTIO_TEE_CMD_CLOSE_SESSION – Close a session to end communication
>with trusted application running in TEE.
> VIRTIO_TEE_CMD_INVOKE_FUNC – Invoke a command or function in trusted
>  application running in TEE.
> VIRTIO_TEE_CMD_CANCEL_REQ – Cancel an ongoing command within TEE.
> 
> We would like to reserve device ID 46 for Virtio-TEE device.
> 
> Signed-off-by: Jeshwanth Kumar 
Reviewed-by: Parav Pandit 
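
For illustration, the command set could be encoded along these lines (the
numeric values are assumptions; the actual encoding would be defined by the
device specification):

enum virtio_tee_cmd {
        VIRTIO_TEE_CMD_OPEN_DEVICE   = 1, /* open a channel to the TEE device */
        VIRTIO_TEE_CMD_CLOSE_DEVICE  = 2, /* close the channel */
        VIRTIO_TEE_CMD_GET_VERSION   = 3, /* query the virtio TEE version */
        VIRTIO_TEE_CMD_OPEN_SESSION  = 4, /* open a session to a trusted app */
        VIRTIO_TEE_CMD_CLOSE_SESSION = 5, /* close the session */
        VIRTIO_TEE_CMD_INVOKE_FUNC   = 6, /* invoke a function in the TA */
        VIRTIO_TEE_CMD_CANCEL_REQ    = 7, /* cancel an ongoing command */
};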

> ---
>  content.tex | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..644aa4a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -739,6 +739,8 @@ \chapter{Device Types}\label{sec:Device Types}  \hline
>  45 &   SPI master \\
>  \hline
> +46 &   TEE device \\
> +\hline
>  \end{tabular}
> 
>  Some of the devices above are unspecified by this document,
> --
> 2.25.1
> 
> 



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-25 Thread Parav Pandit


> From: Jason Wang 
> Sent: Friday, September 22, 2023 8:38 AM

> > Device context has no overlap.
> 
> I can give you one example, e.g debugging.
>
Almost every feature needs debugging. :)
So I am omitting it for the time being.

 
> >
> > Dirty page tracking has no overlap. What do you want to profile and monitor?
> In case you want to profile, it can be used without the migration command
> anyway?
> 
> It works like a dirty bit of PTE. We all know it has a broader use case than
> logging. For example, tracking working set and do optimization on
> IOMMU/IOTLB or even device IOTLB.
> 
> 1) Try to prove your facility can only work for one specific cases
> 2) Try to prove your facility can work for more than one cases
> 
> Which one is easier and more beneficial to virtio?
> 
> 
> > If you describe it, maybe I can split the "device migration" chapter into
> > two pieces, device management and device migration.
> >
> > Device migration will use these basic facilities.
> > Would that help you?
> 
> Definitely, but it needs to be done by not making it under the subsection of
> admin commands, that's it.
> 
> Let me repeat once again here for the possible steps to collaboration:
> 
> 1) define virtqueue state, inflight descriptors in the section of basic 
> facility but
> not under the admin commands
It will be part of the device context in such a way that one can read only
the vq state instead of the full device context.
This will work.

> 2) define the dirty page tracking, device context/states in the section of 
> basic
> facility but not under the admin commands
Great.
Device context is already defined in the basic facility outside of the admin 
commands in [1].
The current text is around device migration; as the spec evolves it can adopt more
generic text without device migration.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00064.html

> 3) define transport specific interfaces or admin commands to access them
>
As you also envisioned, it is done using admin commands to access it.

> Does this work? It seems you refused such steps in the past.
> 
I didn’t.
There must be some confusion in the many emails we both exchanged, because the
already posted v0 has it split, such as the device context.

For dirty page tracking I couldn't find a solid use case without device
migration, so I asked, which you already replied to above.

> Actually, I would like to leave 2) as it's very complicated which might not
> converge easily.
>
I will split the current "device migration" section into two.
1. device management
2. device migration

Device management covers device mode, device context and dirty page tracking.
Device migration refers to the device management section.

We can omit the dirty page tracking commands for a moment and first close on
the device mode and device context as the first step.
Since it is already part of v0 and it is needed, I will keep it in the subsequent
v1 but moved to the device management section.

Michael,
Are you ok with this approach to move forward?


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 5:53 PM

> 
> Again the thread has been side-tracked, which linux module does what is not
> what it was talking about.  By the way I wonder who decided to drop virtio
> comment from this and copy virtio-dev. Guys please don't do this and generally
> just send spec patches to virtio-comment, implementation discussion on virtio-
> dev.

I have no idea who dropped the virtio-comment.
For sure it is not me, as I am fully aware that virtio-dev is not the place to
discuss this.

> 
> 
> > >
> > > But for example, to migrate cross-vendor you need the pci config
> > > space to look the same and for some devices this might mean that pci
> > > config space will have to be mediated.
> > > Or maybe not. vdpa is a good framework in that it gives us
> > > flexibility, it is not opinionated.
> >
> > Sure, as you list, both has its pros and cons.
> > So both solutions has its space and trade off.
> 
> You can thinkably write a vfio based driver for PF and VF in userspace, sure. 
> But
> I think this is just making things unnecessarily complex for users who will 
> have
> to know which device to use with which driver. I think that e.g. if we have 
> two
> ways to submit admin commands, vdpa can just support both of them and that
> is all.
> When mediation is not needed then vdpa will just get out of the way.
> 
Well, enough documentation already exists to tell users when to use
what.

> > Hence, if both can converge to the same set of commands, => good.
> > When the two stacks operate differently for those items, we will
> > have
> spec extensions for both models, right?
> 
> My intiution says a modest amount of duplication isn't too bad.
> E.g. I can see two ways to submit admin commands as being acceptable.
> Are the SUSPEND bit and vq state as defined by these patches acceptable in
> addition to vq commands as defined by your patches? For sure it seems
> inelegant to say the least.

Right, my patches do not bifurcate the device during device migration.
They have a generic device migration concept, be it config space, shared
memory, admin queue, config interrupts, some other dma interface, or other
things.
All are covered under device migration so that software does not need to bisect them.

So maybe instead of suspending the VQ, it can be resetting the VQ by using
existing functionality, and when enabling the VQ, it can start from a newly
supplied avail index.
This way, it can be elegant too.
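
For illustration, a sketch of that flow over the existing queue_reset machinery
in the PCI common configuration, plus a hypothetical field to seed the available
index on re-enable (queue_avail_state does not exist in the current spec):

/* Reset the queue using the existing queue_reset register. */
vp_iowrite16(vq_index, &cfg->queue_select);
vp_iowrite16(1, &cfg->queue_reset);
while (vp_ioread16(&cfg->queue_reset) != 1)
        cpu_relax(); /* device presents 1 once the reset has completed */

/* Hypothetical extension: supply the starting avail index before
 * re-enabling, so the queue resumes where it left off. */
vp_iowrite16(last_avail_idx, &cfg->queue_avail_state); /* not in the spec */
vp_iowrite16(1, &cfg->queue_enable);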

These patches do not explain the motivation for suspending individual
queues. Do you know?
There is a device suspend bit as well, so it is unclear why to have both.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Thursday, September 21, 2023 3:25 PM
> 
> On 9/21/2023 5:26 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan 
> >> Sent: Thursday, September 21, 2023 2:49 PM TDISP devices cannot be
> >> migrated for now, and the TDISP spec makes clear examples of attacking
> >> models; your admin vq LM on the PF exactly matches the model.
> > I gave you a hint yesterday to consult Ravi at Intel, who showed TDISP
> migration using a dedicated TVM with a mechanism similar to admin commands.
> > But you sadly ignored it...
> >
> > So let me make another attempt to explain,
> >
> > When in future TDISP device migration to be supported, the admin command
> will be done through a dedicated PF or a VF that resides in another trust
> domain, for example another TVM.
> > Such admin virtio device will not be located in the hypervisor.
> > Thereby, it will be secure.
> > The admin commands pave the road to make this happen. The only thing that changes
> is the delegation of admin commands to another admin device instead of a PF.
> if you plan to do it in the future, then let's discuss it in the future.
> 
> And the fact that TDISP can be migrated in the future does not mean admin vq LM is secure; I
> have repeated the attacking model so many times, and I will not repeat it
> again.

> > There are other solutions too that will arise.
> > I have seen another one too, maybe a DPU.
> >
> > In both approaches, TDISP is migratable, and the spec will evolve as
> > multiple vendors including Intel, AMD and others showed the path towards
> > it without mediation.
> > Virtio will be able to leverage that as well using admin commands.
> >
> > I want to emphasize again, do not keep repeating AQ in your comments.
> > It is admin commands in proposal [1].
> we are discussing LM, right? Can TDISP help you here? TDISP spec gives
> examples of attacking models, and your admin vq matches it, I gave you quote
> of the spec yesterday.
> 
> This thread is about live migration anyway, not TDISP.
> >
> > As Michael also requested, I kindly request to co-operate on doing joint
> > technical work, share ideas and knowledge, and improve the spec.
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/20230909142911.524407-7-parav@n
> > vidia.com/T/#mf15b68617f772770c6bf79f70e8ddc6fea834cfa
> see other threads, I propose to reuse the basic facilities of live migration 
> in
> admin vq.
> >

I don't see a point in repeating anything anymore given your constant 
repetitions and disregard for ideas.

I am happy to collaborate to drive the virtio spec when you can engage with an 
open mind to address, converge and discuss two use cases:

1. virtio device migration using mediation approach
2. virtio member passthrough device migration


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 3:32 PM
> 
> On Thu, Sep 21, 2023 at 09:17:53AM +, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin 
> > > Sent: Thursday, September 21, 2023 1:41 PM
> >
> > > > I mean to say,
> > > > Virtio spec has not achieved mediation less, device migration.
> > > > Virtio spec has not achieved device migration using mediation.
> > >
> > > But yes it has - it was implemented with shadow vq.
> > >
> > Shadow vq + several other trap points on config space, cvq and more.
> 
> exactly.
> 
Ok.

> > we cannot suspend the whole device and resume from where it was left off.
> > Those extensions are happening now.
> >
> > > > And two proposals are trying to do it.
> > > >
> > > > >
> > > > > > >
> > > > > > > > > 1. For mediation something that works within existing
> > > > > > > > > mediation framework - e.g. reusing as he does feature
> > > > > > > > > bits - will require less support than a completely separate 
> > > > > > > > > facility.
> > > > > > > > > I think Zhu Lingshan also believes that since there will
> > > > > > > > > be less code -> less security issues.
> > > > > > > > >
> > > > > > > > With approach of [1], there is less code in the core
> > > > > > > > device migration flow
> > > > > > > because none of those fields etc are parsed/read/written by
> > > > > > > the driver software.
> > > > > > >
> > > > > > > What is or is not executed in a specific flow is a separate 
> > > > > > > question.
> > > > > > > But the point is vdpa and any mediation have to talk virtio
> > > > > > > things such as feature bits. So reusing e.g. feature bits
> > > > > > > needs less code than operating the admin command machinery
> > > > > > > to check what is supported. Yes, you can operate this
> > > > > > > machinery during setup and not during migration itself. It's 
> > > > > > > still less
> code to maintain.
> > > > > > >
> > > > > > I wouldn't go down the path of code comparison.
> > > > > > But if you want to: we can take a concrete example of what is
> > > > > > done by similar
> > > > > device who uses admin command approach.
> > > > > > The admin command-based approach migration driver is likely
> > > > > > 10x smaller
> > > > > than the actual driver driving the feature bits and rest of the 
> > > > > config.
> > > > >
> > > > > yes but mediation driver already has to do feature bits. so if
> > > > > doing mediation then the cost of adding this specific extension is 
> > > > > low.
> > > > >
> > > > I thought first you were counting the cost of the code and not the
> > > > spec in your
> > > point "feature bits needs less code than operating".
> > >
> > > yes - with vdpa it's mostly just
> > >   vdev->status |= SUSPEND
> > >   vdev->status &= ~SUSPEND
> > > all over the place.
> > >
> > + inflight descriptors.
> 
> for sure, this is just stopping it.
> 
> > > > > > If one needs more precise numbers of number of lines of code,
> > > > > > I can
> > > derive it.
> > > > > > As features and functionality grows, every line of code gets
> > > > > > added there in
> > > > > mediation too.
> > > > > > I agree such mediation has value and use case, as we know it
> > > > > > is not the only
> > > > > approach fitting all use cases.
> > > > >
> > > > > Do you see how this extension is easier for mediation than
> > > > > driving admin queue though?
> > > > >
> > > > If we count the total cost of code than building the mediation
> > > > framework +
> > > extensions, than it is not.
> > > > But as I said, I wouldn't compare two solutions as they are
> > > > addressing a slightly
> > > different requirement.
> > >
> > > yes they are. the point of comparis

RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 3:40 PM
> 
> On Thu, Sep 21, 2023 at 09:17:53AM +, Parav Pandit wrote:
> > Vdpa stack make total sense when the underlying device is not virtio and
> hence emulation.
> 
> Which linux framework is used is kind of beside the point but since you bring
> this up - not necessarily.
> 
> E.g. I personally don't care much about "stack" but clearly we need a virtio
> driver on the host to be involved, teaching vfio about virtio is probably a 
> much
> worse idea than creating a mode in the vdpa driver which mostly sets up the
> IOMMU and otherwise gets out of the way of using the VF and just drives the
> PF.
Well, vdpa has to drive many things including cvq, config space, msix and 
more.
It can help to overcome some issues as you listed below.
So that way vdpa over virtio is useful.

In the vfio world, there is nothing significant to teach about virtio.
Vfio already has the model to extend the stack to do only specific work and 
reuse the rest.

> 
> But for example, to migrate cross-vendor you need the pci config space to look
> the same and for some devices this might mean that pci config space will have
> to be mediated.
> Or maybe not. vdpa is a good framework in that it gives us flexibility, it is 
> not
> opinionated.

Sure, as you list, both have their pros and cons.
So both solutions have their place and trade-offs.

Hence, if both can converge to the same set of commands, => good.
When the two stacks operate differently for those items, we will have spec 
extensions for both models, right?




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Thursday, September 21, 2023 2:49 PM

> TDISP devices can not be migrated for now, and the TDISP spec makes clear
> examples of attacking models; your admin vq LM on the PF exactly matches the
> model.

I gave a hint yesterday to consult Ravi at Intel, who showed TDISP migration 
using a dedicated TVM with a similar mechanism as admin commands.
But you sadly ignored it...

So let me make another attempt to explain.

When TDISP device migration is supported in the future, the admin commands 
will be done through a dedicated PF or a VF that resides in another trust 
domain, for example another TVM.
Such an admin virtio device will not be located in the hypervisor.
Thereby, it will be secure.
The admin commands pave the road to make this happen. The only thing that 
changes is delegation of admin commands to another admin device instead of a 
PF.

There are other solutions too that will arise.
I have seen another one too, maybe a DPU.

In both approaches, TDISP is migratable, and the spec will evolve as multiple 
vendors including Intel, AMD and others showed the path towards it without 
mediation.
Virtio will be able to leverage that as well using admin commands.

I want to emphasize again: do not keep repeating AQ in your comments.
It is admin commands in proposal [1].

As Michael also requested, I kindly request to co-operate on doing joint 
technical work, share ideas and knowledge, and improve the spec.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#mf15b68617f772770c6bf79f70e8ddc6fea834cfa



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit



> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 1:41 PM

> > I mean to say,
> > Virtio spec has not achieved mediation less, device migration.
> > Virtio spec has not achieved device migration using mediation.
> 
> But yes it has - it was implemented with shadow vq.
>
Shadow vq + several other trap points on config space, cvq and more.

We cannot suspend the whole device and resume from where it was left off.
Those extensions are happening now.

> > And two proposals are trying to do it.
> >
> > >
> > > > >
> > > > > > > 1. For mediation something that works within existing
> > > > > > > mediation framework - e.g. reusing as he does feature bits -
> > > > > > > will require less support than a completely separate facility.
> > > > > > > I think Zhu Lingshan also believes that since there will be
> > > > > > > less code -> less security issues.
> > > > > > >
> > > > > > With approach of [1], there is less code in the core device
> > > > > > migration flow
> > > > > because none of those fields etc are parsed/read/written by the
> > > > > driver software.
> > > > >
> > > > > What is or is not executed in a specific flow is a separate question.
> > > > > But the point is vdpa and any mediation have to talk virtio
> > > > > things such as feature bits. So reusing e.g. feature bits needs
> > > > > less code than operating the admin command machinery to check
> > > > > what is supported. Yes, you can operate this machinery during
> > > > > setup and not during migration itself. It's still less code to 
> > > > > maintain.
> > > > >
> > > > I wouldn't go down the path of code comparison.
> > > > But if you want to: we can take a concrete example of what is done
> > > > by similar
> > > device who uses admin command approach.
> > > > The admin command-based approach migration driver is likely 10x
> > > > smaller
> > > than the actual driver driving the feature bits and rest of the config.
> > >
> > > yes but mediation driver already has to do feature bits. so if doing
> > > mediation then the cost of adding this specific extension is low.
> > >
> > I thought first you were counting the cost of the code and not the spec in
> > your point "feature bits needs less code than operating".
> 
> yes - with vdpa it's mostly just
>   vdev->status |= SUSPEND
>   vdev->status &= ~SUSPEND
> all over the place.
> 
+ inflight descriptors.
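
For illustration, a minimal sketch of that mediation path in C (the SUSPEND 
bit value, the ops table and the per-vq state layout are all assumptions made 
for this discussion, not an existing vdpa API):

    #include <stdint.h>

    #define VIRTIO_STATUS_SUSPEND 0x40      /* assumed bit value */

    struct vq_state {
            uint16_t last_avail_idx;        /* where the vq would resume */
            uint16_t num_inflight;          /* fetched but not yet used */
    };

    struct vdev_ops {
            uint8_t (*get_status)(void *priv);
            void (*set_status)(void *priv, uint8_t status);
            void (*get_vq_state)(void *priv, int qid, struct vq_state *st);
    };

    /* Sketch: stop the device via the status bit, then save per-vq state
     * including inflight descriptors, as a mediation layer would. */
    static void suspend_and_save(const struct vdev_ops *ops, void *priv,
                                 struct vq_state *st, int nvqs)
    {
            ops->set_status(priv,
                            ops->get_status(priv) | VIRTIO_STATUS_SUSPEND);
            while (!(ops->get_status(priv) & VIRTIO_STATUS_SUSPEND))
                    ;                       /* wait for the device to ack */

            for (int i = 0; i < nvqs; i++)
                    ops->get_vq_state(priv, i, &st[i]);
    }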

> > > > If one needs more precise numbers of number of lines of code, I can
> derive it.
> > > > As features and functionality grows, every line of code gets added
> > > > there in
> > > mediation too.
> > > > I agree such mediation has value and use case, as we know it is
> > > > not the only
> > > approach fitting all use cases.
> > >
> > > Do you see how this extension is easier for mediation than driving
> > > admin queue though?
> > >
> > If we count the total cost of code than building the mediation framework +
> extensions, than it is not.
> > But as I said, I wouldn't compare two solutions as they are addressing a 
> > slightly
> different requirement.
> 
> yes they are. the point of comparison was explaining why people who use
> mediation anyway might not want to also use aq. can i assume that's clear?
>
I am not fully sure.
I frankly don't find it right for the member virtio device itself to be 
mediated.
The vdpa stack makes total sense when the underlying device is not virtio and 
hence needs emulation.
But when there is a native virtio member device, further mediation is overkill 
for certain scenarios.
But I understand that it helps to utilize the vdpa stack and thereby overcome 
some limitations, while it introduces other limitations...

> > What to compare is what can be reused between two solutions.
> >
> > > > >
> > > > > > > 2. With or without mediation, the mapping of commands to VFs
> > > > > > > is simpler, allowing more control - for example, let's say
> > > > > > > you want to reset a VF - you do not need to flush the queue
> > > > > > > of existing commands, which might potentially take a long
> > > > > > > time because some other VFs are very busy - you just reset
> > > > > > > the VF which any unmap flow will
> > > > > already do.
> > > > > > >
> > > > > > If I understand you right, to reset a VF, no need to flush the
> > > > > > queues without
> > > > > mediation too.
> > > > > > Just do VF FLR or do device reset, both will be fine.
> > > > >
> > > > > Not to reset the VF - that's a narrow definition. To put it back
> > > > > in its original state unrelated to any VMs.  Nope, FLR is not
> > > > > enough - there could be commands in queue addressing the VF. If
> > > > > they take effect after FLR VF state changed and it can't be
> > > > > cleanly assigned to a new
> > > VM.
> > > > >
> > > > If you mean the SR-PCIM command, than it is virtio spec to define them.
> > > > We ratified this recently in the PCI-SIG of what gets cleared on
> > > > VF FLR and
> > > what stays that SR-PCIM interface to do.
> > >
> > > Interesting. Which ECN exactly do you refer 

RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit



> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 12:51 PM

> > > As I said, doable without mediation but already done with.
> > >
> > It is not done in the virtio spec that can work without mediation right?
>
> have trouble parsing this sentence

I mean to say,
the virtio spec has not achieved mediation-less device migration.
The virtio spec has not achieved device migration using mediation.
And the two proposals are trying to do it.

> 
> > >
> > > > > 1. For mediation something that works within existing mediation
> > > > > framework - e.g. reusing as he does feature bits - will require
> > > > > less support than a completely separate facility.
> > > > > I think Zhu Lingshan also believes that since there will be less
> > > > > code -> less security issues.
> > > > >
> > > > With approach of [1], there is less code in the core device
> > > > migration flow
> > > because none of those fields etc are parsed/read/written by the
> > > driver software.
> > >
> > > What is or is not executed in a specific flow is a separate question.
> > > But the point is vdpa and any mediation have to talk virtio things
> > > such as feature bits. So reusing e.g. feature bits needs less code
> > > than operating the admin command machinery to check what is
> > > supported. Yes, you can operate this machinery during setup and not
> > > during migration itself. It's still less code to maintain.
> > >
> > I wouldn't go down the path of code comparison.
> > But if you want to: we can take a concrete example of what is done by 
> > similar
> device who uses admin command approach.
> > The admin command-based approach migration driver is likely 10x smaller
> than the actual driver driving the feature bits and rest of the config.
> 
> yes but mediation driver already has to do feature bits. so if doing mediation
> then the cost of adding this specific extension is low.
>
I thought at first you were counting the cost of the code, and not the spec, 
in your point "feature bits needs less code than operating".
 
> > If one needs more precise numbers of number of lines of code, I can derive 
> > it.
> > As features and functionality grows, every line of code gets added there in
> mediation too.
> > I agree such mediation has value and use case, as we know it is not the only
> approach fitting all use cases.
> 
> Do you see how this extension is easier for mediation than driving admin queue
> though?
>
If we count the total cost of the code, including building the mediation 
framework plus extensions, then it is not.
But as I said, I wouldn't compare the two solutions as they address slightly 
different requirements.

What to compare is what can be reused between the two solutions.
 
> > >
> > > > > 2. With or without mediation, the mapping of commands to VFs is
> > > > > simpler, allowing more control - for example, let's say you want
> > > > > to reset a VF - you do not need to flush the queue of existing
> > > > > commands, which might potentially take a long time because some
> > > > > other VFs are very busy - you just reset the VF which any unmap
> > > > > flow will
> > > already do.
> > > > >
> > > > If I understand you right, to reset a VF, no need to flush the
> > > > queues without
> > > mediation too.
> > > > Just do VF FLR or do device reset, both will be fine.
> > >
> > > Not to reset the VF - that's a narrow definition. To put it back in
> > > its original state unrelated to any VMs.  Nope, FLR is not enough -
> > > there could be commands in queue addressing the VF. If they take
> > > effect after FLR VF state changed and it can't be cleanly assigned to a 
> > > new
> VM.
> > >
> > If you mean the SR-PCIM command, then it is for the virtio spec to define
> > them. We recently ratified in PCI-SIG what gets cleared on VF FLR and what
> > stays for the SR-PCIM interface to handle.
> 
> Interesting. Which ECN exactly do you refer to?
> 
B405.

> > So if other commands are in pipe, only after they are done, such VF will be
> assigned to new VM.
> 
> Exactly. this is exactly what I meant when I said "flush the queue" - you 
> have to
> wait until these commands are done, then do reset.
>
Not exactly. VF reset is fully controlled by the guest. Hence, it does not 
collide with the admin side of commands.
Similarly, the admin commands for dirty page tracking and device context do 
not mess with FLR.
This is critical because VF FLR cannot clear the page addresses already 
reported but not yet read by the driver.
It is covered in the admin proposal.
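
A rough ordering sketch of that point (the command names mirror the proposal's 
own terms as described in this thread; the C helpers and signatures are 
assumptions):

    #include <stdint.h>

    struct admin_ctx;   /* owner-device admin queue handle (assumed) */

    /* hypothetical wrappers around the proposal's admin commands */
    void admin_cmd_stop_write_reporting(struct admin_ctx *aq, uint16_t vf_id);
    void admin_cmd_read_reported_pages(struct admin_ctx *aq, uint16_t vf_id);
    void admin_cmd_discard(struct admin_ctx *aq, uint16_t vf_id);

    /* Reassign a VF to a new VM without losing dirty-page state. */
    static void release_vf_for_new_vm(struct admin_ctx *aq, uint16_t vf_id)
    {
            /* 1. stop dirty page tracking: no new addresses get reported */
            admin_cmd_stop_write_reporting(aq, vf_id);

            /* 2. drain page addresses already reported but not yet read;
             *    VF FLR alone cannot clear these */
            admin_cmd_read_reported_pages(aq, vf_id);

            /* 3. DISCARD the residual migration state of this member */
            admin_cmd_discard(aq, vf_id);

            /* 4. guest-controlled VF FLR proceeds independently; the VF is
             *    handed to the new VM only after steps 1-3 complete */
    }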
 
> > This aspect is also covered in the proposal [2] in the DISCARD command and
> > stop dirty page tracking (aka stop write reporting).
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/20230909142911.524407-7-parav@n
> > vidia.com/T/#mf15b68617f772770c6bf79f70e8ddc6fea834cfa
> >
> > To me, frankly, both methods are addressing two slightly different 
> > requirement
> due to which it may not be worth the comparison.
> > It is not either or.
> > And so far, technically what I understand is, both methods has its 

RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-21 Thread Parav Pandit



> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 11:37 AM

> > We should be enhancing the device context so that more and more items can
> be annotated.
> > I started with small to get the design and idea through and will expand the
> device context so that cross vendors can migrate.
> 
> As I said, doable without mediation but already done with.
>
It is not done, in the virtio spec, in a way that can work without mediation, 
right?
 
> 
> > > 1. For mediation something that works within existing mediation
> > > framework - e.g. reusing as he does feature bits - will require less
> > > support than a completely separate facility.
> > > I think Zhu Lingshan also believes that since there will be less
> > > code -> less security issues.
> > >
> > With approach of [1], there is less code in the core device migration flow
> because none of those fields etc are parsed/read/written by the driver
> software.
> 
> What is or is not executed in a specific flow is a separate question.
> But the point is vdpa and any mediation have to talk virtio things such as 
> feature
> bits. So reusing e.g. feature bits needs less code than operating the admin
> command machinery to check what is supported. Yes, you can operate this
> machinery during setup and not during migration itself. It's still less code 
> to
> maintain.
>
I wouldn't go down the path of code comparison.
But if you want to: we can take a concrete example of what is done by a 
similar device that uses the admin command approach.
The admin command-based migration driver is likely 10x smaller than the actual 
driver driving the feature bits and the rest of the config.
If one needs more precise numbers of lines of code, I can derive them.

As features and functionality grow, every line of code gets added there in 
mediation too.
I agree such mediation has value and a use case; as we know, it is not the 
only approach fitting all use cases.

> 
> > > 2. With or without mediation, the mapping of commands to VFs is
> > > simpler, allowing more control - for example, let's say you want to
> > > reset a VF - you do not need to flush the queue of existing
> > > commands, which might potentially take a long time because some
> > > other VFs are very busy - you just reset the VF which any unmap flow will
> already do.
> > >
> > If I understand you right, to reset a VF, no need to flush the queues 
> > without
> mediation too.
> > Just do VF FLR or do device reset, both will be fine.
> 
> Not to reset the VF - that's a narrow definition. To put it back in its 
> original state
> unrelated to any VMs.  Nope, FLR is not enough - there could be commands in
> queue addressing the VF. If they take effect after FLR VF state changed and it
> can't be cleanly assigned to a new VM.
> 
If you mean the SR-PCIM command, then it is for the virtio spec to define 
them.
We recently ratified in PCI-SIG what gets cleared on VF FLR and what stays for 
the SR-PCIM interface to handle.

So if other commands are in the pipe, such a VF will be assigned to a new VM 
only after they are done.

This aspect is also covered in the proposal [2] in the DISCARD command and 
stop dirty page tracking (aka stop write reporting).

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#mf15b68617f772770c6bf79f70e8ddc6fea834cfa

To me, frankly, both methods address two slightly different requirements, due 
to which it may not be worth the comparison.
It is not either/or.
And so far, technically what I understand is that both methods have their 
pros, cons and limitations, due to which the two requirements (passthrough and 
nested) cannot be addressed uniformly.
Both have their space for the solution they offer.

At best, if the two methods can find some common ground of commands or 
plumbing to reuse, it will be great.
If my understanding is wrong, I would like to learn and discuss.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit
Hi Michael,

> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 11:12 AM

> I was attempting to have each of you see other's point of view.
> It seems clear I was right, at least one way communication was not getting
> through. Let me try to help.
> 
> 
> First, clearly Zhu Lingshan cares about the mediation use-case, not the un-
> mediated one.  Mediation is clearly heavier but also more powerful in many
> use-cases - is that obvious or do I need to list the reasons?

I agree with it.

> To mention one example, it supports cross-vendor migration. Which the
> unmediated variant maybe can in theory support too, and when it does maybe
> in a better and more structured way - but that will require standartization 
> effort
> that didn't happen yet. With mediation it was already demonstrated more than
> once.
>
We should be enhancing the device context so that more and more items can be 
annotated.
I started small to get the design and idea through, and will expand the device 
context so that cross-vendor migration is possible.

> 1. For mediation something that works within existing mediation framework -
> e.g. reusing as he does feature bits - will require less support than a 
> completely
> separate facility.
> I think Zhu Lingshan also believes that since there will be less code -> less
> security issues.
>
With the approach of [1], there is less code in the core device migration flow 
because none of those fields etc. are parsed/read/written by the driver 
software.
 
> 2. With or without mediation, the mapping of commands to VFs is simpler,
> allowing more control - for example, let's say you want to reset a VF - you 
> do not
> need to flush the queue of existing commands, which might potentially take a
> long time because some other VFs are very busy - you just reset the VF which
> any unmap flow will already do.
> 
If I understand you right, to reset a VF there is no need to flush the queues 
without mediation either.
Just do VF FLR or a device reset; both will be fine.

> 
> 
> But Zhu Lingshan, all this will be pointless if you also do not try to do 
> this and list
> what are reasonable points that Parav made. Please do not mistake what I'm
> doing here for taking sides I just want the communication to start working. 
> And
> that means everyone tries to take all use-cases into account even if working 
> for
> a vendor that does not care about this use-case. Otherwise we will just keep
> getting into these flamewars.

Right. I'd like to take this opportunity to ask again: let's sit together and 
see if we can utilize a common framework between the two methods.
For example,
1. device context, for mediation and without mediation, can provide common 
structures that both solutions can use
2. device provisioning (not in either of our series; it can find common ways)

Maybe more can be merged once we are open to collaboration.
If we face technical issues in unifying the methods, it will be self-evident 
why both methods need to exist, or we can create something different for two 
different use cases.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Jason Wang 
> Sent: Thursday, September 21, 2023 9:50 AM

> Parav, I think I've clarified several times:
> 
> migration using the admin command is probably fine in some use cases.
>
This definitely was not clear to me.
I am 100% clear now.
 
> What's not fine, is:
> 
> Mandate the admin command to be the only way for migration.
>
For sure, my series did not mandate that either.
I kept asking: if we both can converge, it will be really good to merge the 
two use cases, and we should.
If we cannot because of technical issues, then both methods exist to address 
two different use cases.

> Are we on the same page for my concern now?
> 
Yes.

> > What is the advantage of descriptor posting using virtqueue. It is the way 
> > of
> virtio spec...
> >
> > > > Well, it is one way to achieve it.
> > > > There may be different way to do all bulk data transfer without
> > > > admin
> > > commands.
> > >
> > > Why is virtqueue the only way to do bulk data transferring? Can't
> > > DMA be initiated by other-way?
> > >
> >
> > Sure, what is the disadvantage of existing mechanism of virtqueue that can 
> > do
> following.
> > 1. Ability to do DMA
> > 2. agnostic of the DMA who do not want to do DMA
> 
> I don't understand this.
>
Admin commands can work without DMA, right, because they are transported using 
the admin queue.
 
> > 3. Ability to multiple command executions in parallel
> 
> Each device has their self-contained interface, why can't the commands be
> executed in parallel.
> 
Within the device, they cannot be executed in parallel if the interface is 
synchronous.

> > 4. Non blocking interface for driver that does not require any kind of
> > polling
> 
> Are you saying the interrupt can only work for virtqueue?
>
No. I am saying that if one has to invent an interface that satisfies the 
above needs, it will become a virtqueue.
And if it is not, one should list the disadvantages and cost of the new 
interface, and explain its benefits.
Such an interface should be a generic one too.
 
> >
> > Why to invent new DMA scheme which at does all the 4 tasks?
> 
> It's simply because admin virtqueue can not work for all the cases. I think 
> you've
> agreed on this, no?
>
I think it may work for the nested case as well, at the cost of replicating it 
on each device and adding special plumbing to isolate it, so that the guest 
cannot issue driver notifications to it.

> > First please list down disadvantages of admin queue + show all 4 things are
> achieved using new DMA interface.
> > That will help to understand why new dma interface is needed.
> 
> I can give you a simple example. For example, what happens if we want to
> migrate the owner? Having another owner for this owner is not a good answer.

That is nesting; I don't see any difference from other nesting.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 21, 2023 8:56 AM


> My understanding is it might be better that each side do a summary of the both
> proposals.

I will summarize it soon in reply to [1].
Thanks.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 21, 2023 9:39 AM
> 
> On Thu, Sep 21, 2023 at 12:01 PM Parav Pandit  wrote:
> >
> >
> >
> > > From: Jason Wang 
> > > Sent: Thursday, September 21, 2023 8:48 AM
> >
> > > As replied in another thread, the issues for BAR are:
> > >
> > > 1) Not sure it can have an efficient interface, it would be
> > > something like VIRTIO_PCI_CAP_PCI_CFG which is very slow compared to
> > > single register accessing
> > > 2) There's no owner/group/member for MMIO, most of the time, we only
> > > need a single MMIO device. If we want the owner to manage itself, it
> > > seems redundant as is implied in all the existing transports (without 
> > > admin
> commands).
> > > Even if we had, it might still suffer from bootstrap issues.
> > > 3) For live migration, it means the admin commands needs to start
> > > from duplicating every existing transport specific interface it can
> > > give us. One example is that we may end up with two interfaces to
> > > access virtqueue addresses etc. This results in extra complicity and
> > > it is actually a full transport (driver can just use admin commands to 
> > > drive
> the device).
> > In [1] there is no duplication. The live migration driver never parses the 
> > device
> context either while reading or write.
> > Hence no code and no complexity in driver and no duplicate work.
> > Therefore, those admin commands are not to drive the guest device either.
> 
> I'm not sure how this is related to the duplication issue.
> 
You commented that the admin virtqueue duplicates some things.
And I explained above that it does not.

> >
> > > 4) Admin commands itself may not be capable of doing things like
> > > dirty page logging, it requires the assistance from the transport
> > >
> > Admin command in [1] is capable of dirty page logging.
> 
> In your design, the logging is done via DMA not the virtqueue.
> 
No, it is done via an admin command, not DMA, in [2].

[2] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#m17b09acd8c73d374e98ad84764b315afa94f59c9

> The only job for virtqueue is to initiate the DMA. But if DMA can be 
> initiated via
> virtqueue, it can be done in other ways.
> 
Let's first establish the 4 things in the alternative way, and any motivation 
to do so, with the 5th point being no need for giant registers in the device.

> >
> > > 1) Parav's proposal does several couplings: couple basic build
> > > blocks (suspend, dirty page tracking) with live migration, couple
> > > live migration with admin commands.
> > In which use case you find dirty page tracking useful without migration for
> which you like to see it detached from device migration flow?
> 
> Is it only the dirty page tracking? It's the combinations of
> 
> 1) suspending
> 2) device states
> 3) dirty page tracking
> 
> Each of those will have use cases other than live migration: VM stop, power
> management in VM, profiling and monitoring, failover etc.
> 
Suspend/resume with different power states is driven by the guest directly.
So it may have some overlap.

Device context has no overlap.

Dirty page tracking has no overlap. What do you want to profile and monitor? 
In case you want to profile, it can be used without the migration command 
anyway.
If you describe it, maybe I can split the "device migration" chapter into two 
pieces:
device management and device migration.

Device migration will use these basic facilities.
Would that help you?

Also, those can be split later when the actual use case can be described.


> > One can always use these commands if they wish to as_is.
> >
> > > 2) It's still not clear that Parav's proposal can work, a lot of
> > > corner cases needs to be examined
> > >
> > Please let me know which part can be improved in [1].
> 
> I will do that but it may take time. It's near to the public holiday.

I understand. No problem. Take your time.
I will proceed with the v1 enhancements for [1] regardless.



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Jason Wang 
> Sent: Thursday, September 21, 2023 9:32 AM
> 
> On Thu, Sep 21, 2023 at 11:51 AM Parav Pandit  wrote:
> >
> >
> >
> > > From: Jason Wang 
> > > Sent: Thursday, September 21, 2023 8:45 AM The main issue I see so
> > > far is that you want to couple migration with admin commands but I
> > > don't see much advantages to doing this.
> > >
> > The way I read above comment is, to draw a parallel line: descriptor 
> > posting in
> virtio spec is tied to virtqueues. What is the advantage of it?
> 
> Are you saying virtio can't live without admin commands? Again, let's not 
> shift
> concepts.
>
No, I did not say that.
I just don't see how the functionalities proposed in [1] can be done without 
admin commands by the _device_ for the member device passthrough requirement.

You made the point as "don't see much advantage with migration done using 
admin commands".
What is the advantage of descriptor posting using a virtqueue? It is the way 
of the virtio spec...
 
> > Well, it is one way to achieve it.
> > There may be different way to do all bulk data transfer without admin
> commands.
> 
> Why is virtqueue the only way to do bulk data transferring? Can't DMA be
> initiated by other-way?
>

Sure. What is the disadvantage of the existing virtqueue mechanism, which can 
do the following?
1. Ability to do DMA
2. Agnostic to DMA for those who do not want to do DMA
3. Ability to execute multiple commands in parallel
4. Non-blocking interface for the driver that does not require any kind of 
polling

Why invent a new DMA scheme that does all 4 tasks? (A sketch of points 3 and 4 
follows below.)
First, please list the disadvantages of the admin queue and show that all 4 
things are achieved using the new DMA interface.
That will help in understanding why a new DMA interface is needed.
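
As an illustration of points 3 and 4, a minimal sketch of non-blocking command 
submission over a virtqueue (virtqueue_add_sgs() and virtqueue_kick() are the 
existing Linux virtio core APIs; struct admin_cmd and its completion callback 
are assumptions):

    #include <linux/virtio.h>
    #include <linux/scatterlist.h>

    struct admin_cmd {
            struct scatterlist out;         /* driver-written command */
            struct scatterlist in;          /* device-written result */
            void (*done)(struct admin_cmd *cmd); /* assumed callback */
    };

    /* Enqueue a command and return immediately; multiple commands can be
     * outstanding, and completion arrives via the vq interrupt callback. */
    static int admin_cmd_submit(struct virtqueue *avq, struct admin_cmd *cmd)
    {
            struct scatterlist *sgs[2] = { &cmd->out, &cmd->in };
            int rc;

            rc = virtqueue_add_sgs(avq, sgs, 1, 1, cmd, GFP_ATOMIC);
            if (rc)
                    return rc;              /* ring full; retry later */

            virtqueue_kick(avq);            /* notify device, do not wait */
            return 0;                       /* cmd->done() runs on completion */
    }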


 
> Thanks
> 
> > What it the advantage of it, please list down them in [1] for the command
> where you can find alternative.
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/20230909142911.524407-7-parav@n
> > vidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Jason Wang 
> Sent: Thursday, September 21, 2023 8:48 AM


> I'm fine. But TDISP is something that needs to be considered. The earlier we
> realize the possible issue the better.

[1] has considered this in the design.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Jason Wang 
> Sent: Thursday, September 21, 2023 8:48 AM

> As replied in another thread, the issues for BAR are:
> 
> 1) Not sure it can have an efficient interface, it would be something like
> VIRTIO_PCI_CAP_PCI_CFG which is very slow compared to single register
> accessing
> 2) There's no owner/group/member for MMIO, most of the time, we only need
> a single MMIO device. If we want the owner to manage itself, it seems
> redundant as is implied in all the existing transports (without admin 
> commands).
> Even if we had, it might still suffer from bootstrap issues.
> 3) For live migration, it means the admin commands needs to start from
> duplicating every existing transport specific interface it can give us. One
> example is that we may end up with two interfaces to access virtqueue
> addresses etc. This results in extra complexity and it is actually a full
> transport (driver can just use admin commands to drive the device).
In [1] there is no duplication. The live migration driver never parses the 
device context, either while reading or writing it.
Hence there is no code and no complexity in the driver, and no duplicate work.
Therefore, those admin commands are not there to drive the guest device 
either.

> 4) Admin commands itself may not be capable of doing things like dirty page
> logging, it requires the assistance from the transport
>
The admin command in [1] is capable of dirty page logging.
 
> 1) Parav's proposal does several couplings: couple basic build blocks 
> (suspend,
> dirty page tracking) with live migration, couple live migration with admin
> commands. 
In which use case do you find dirty page tracking useful without migration, 
such that you would like to see it detached from the device migration flow?
One can always use these commands as-is if they wish.

> 2) It's still not clear that Parav's proposal can work, a lot of corner cases 
> needs
> to be examined
> 
Please let me know which part can be improved in [1].

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead


> >
> > > the bar is only a proxy, doesn't fix anything. and even larger side
> > > channel attacking surface: vf-->pf-->vf
> >
> > In this model there's no pf. BAR belongs to vf itself and you submit
> > commands for the VF through its BAR.
> > Just separate from the pci config space.
> >
> > The whole attacking surface discussion is also puzzling.  We either
> > are or are not discussing confidential computing/TDI.  I couldn't
> > figure it out. This needs a separate thread I think.
> 
> Anyhow, it's not bad to take it into consideration. But we can do it 
> elsewhere for
> sure.
Thanks.
Please have comments in [1].

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Jason Wang 
> Sent: Thursday, September 21, 2023 8:45 AM
> The main issue I see so far is that
> you want to couple migration with admin commands but I don't see much
> advantages to doing this.
> 
The way I read the above comment is, to draw a parallel: descriptor posting in 
the virtio spec is tied to virtqueues. What is the advantage of that?
Well, it is one way to achieve it.
There may be different ways to do all the bulk data transfer without admin 
commands.
Whatever their advantage is, please list them in [1] for each command where 
you can find an alternative.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit



> From: Michael S. Tsirkin 
> Sent: Thursday, September 21, 2023 1:34 AM
> 
> On Wed, Sep 20, 2023 at 05:21:52PM +, Parav Pandit wrote:
> > > OK so with this summary in mind, can you find any advantages to
> > > inband+mediation that are real or do you just see disadvantages? And
> > > it's a tricky question because I can see some advantages ;)
> >
> > inband + mediation may be useful for nested case.
> 
> Hint: there's more.

Can you please list them?

The starting point of the discussion is: there is a passthrough member device 
without mediation in the virtio interface layers.
How should device migration work for it?





RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Wednesday, September 20, 2023 7:46 PM
> 
> > Details of his position in my view:
> >
> > 1. Device migration must be done through VF itself by suspending specific 
> > vqs
> and the VF device both.
> > 2. When device migration is done using #1, it must be done using mediation
> approach in hypervisor.
> >
> > 3. When migration is done using inband mediation it is more secure than AQ
> approach.
> > (as opposed to AQ of the owner device who enables/disables SR-IOV).
> >
> > 4. AQ is not secure.
> > But,
> > 5. AQ and admin commands can be built on top of his proposal #1, even if AQ
> is less secure. Opposing statements...
> >
> > 6. Dirty page tracking and inflight descriptors tracking to be done in his 
> > v1. but
> he does not want to review such coverage in [1].
> >
> > 8. Since his series does not cover any device context migration and
> > does not talk anything about it, I deduce that he plans to use cvq for 
> > setting
> ups RSS and other fields using inband CVQ of the VF.
> > This further limit the solution to only net device, ignoring rest of the 
> > other
> 20+ device types, where all may not have the CVQ.
> >
> > 9. trapping and emulation of following objects: AQ, CVQ, virtio config 
> > space,
> PCI FLR flow in hypervisor is secure, but when if AQ of the PF do far small 
> work
> of it, AQ is not secure.
> >
> > 10. Any traps proposed in #9 mostly do not work with future TDISP as TDISP 
> > do
> not bifurcate the device, so ignore them for now to promote inband migration.
> >
> > 11. He do not show interest in collaboration (even after requesting few 
> > times)
> to see if we can produce common commands that may work for both
> passthrough (without mediation) and using mediation for nested case.
> >
> > 12. Some how register access on single physical card for the PFs and VFs 
> > gives
> better QoS guarantee than virtqueue as registers can scale infinitely no 
> matter
> how many VFs or for multiple VQs because it is per VF.
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/20230909142911.524407-7-parav@n
> > vidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead
> 
> 
> OK so with this summary in mind, can you find any advantages to
> inband+mediation that are real or do you just see disadvantages? And
> it's a tricky question because I can see some advantages ;)

inband + mediation may be useful for the nested case.

In attempting inband + mediation, many critical pieces are let go.
It may be fine to let them go for some cases, but not for passthrough.

The fundamental advantages of the owner-based approach that I see are:

1. The nesting use case usually involves a large number of VMs to be hosted in 
one VM.
For this purpose, it is better to hand over a PF to the level-0 VM that hosts 
the VFs and level-1 VMs, and avoid two levels of device nesting.

2. Supports P2P natively.

3. A single non-replicated resource (AQ) manages the less frequent work of 
device migration.
There is no need to replicate AQs to thousands of VFs that rarely do the 
migration work.
Overall this gains system, device, and memory efficiency.

5. Passthrough simply does not work at all without an owner device.
This is because dirty page tracking, device context management, CVQ, FLR, 
config space, device status and MSIX config all must be trapped otherwise.
Many systems do not prefer this involvement of the hypervisor even if the 
hypervisor is trusted (to avoid moving parts).
New-generation TEE and TPM devices are on the horizon, and they would not like 
things being trapped either.
The security audit surface is very large for them.

6. Any new basic functionality added to the device always also requires 
constant software updates at a few layers in the mediation entities.

7. TDISP is inherently covered. Without an owner device, TDISP is broken, as 
the device cannot be bifurcated.

To me, #2, #5 and #7 are the critical pieces that a device migration solution 
must support/work with.
The rest is of second-level importance.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Wednesday, September 20, 2023 6:12 PM

> And Parav same goes for you - can you summarize Zhu Lingshan's position?

Below is my summary of Zhu Lingshan's position:

One-line summary of his position, in my view:

0. Use inband device migration only, use mediation; mediation is secure, but 
the AQ is not secure.

Details of his position, in my view:

1. Device migration must be done through the VF itself, by suspending specific 
vqs and the VF device both.
2. When device migration is done using #1, it must be done using the mediation 
approach in the hypervisor.

3. When migration is done using inband mediation, it is more secure than the 
AQ approach
(as opposed to the AQ of the owner device, which enables/disables SR-IOV).

4. The AQ is not secure.
But,
5. The AQ and admin commands can be built on top of his proposal #1, even if 
the AQ is less secure. Opposing statements...

6. Dirty page tracking and inflight descriptor tracking are to be done in his 
v1, but he does not want to review such coverage in [1].

8. Since his series does not cover any device context migration and does not 
talk about it at all,
I deduce that he plans to use the inband CVQ of the VF for setting up RSS and 
other fields.
This further limits the solution to only the net device, ignoring the rest of 
the 20+ device types, which may not all have a CVQ.

9. Trapping and emulation of the following objects in the hypervisor: AQ, CVQ, 
virtio config space and the PCI FLR flow is secure, but when the AQ of the PF 
does a far smaller part of this work, the AQ is not secure.

10. Any traps proposed in #9 mostly do not work with future TDISP, as TDISP 
does not bifurcate the device, so ignore them for now to promote inband 
migration.

11. He does not show interest in collaboration (even after being requested a 
few times) to see if we can produce common commands that may work both for 
passthrough (without mediation) and using mediation for the nested case.

12. Somehow, register access on a single physical card for the PFs and VFs 
gives better QoS guarantees than a virtqueue, as registers can scale 
infinitely no matter how many VFs or how many VQs, because it is per VF.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Wednesday, September 20, 2023 5:36 PM
> 
> OK so we are ignoring TDISP applications for now? Everyone agrees on that?

We are actively considering supporting TDISP applications in the (unknown) 
future, in a way that the new spec additions we make for new features do not 
block future TDISP support.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit
Random words like "malicious SW" to describe an attack do not make sense.

Refer to the patches, the series and the usage model to describe the SW 
attack, if any.
I disagree, and I will not repeat all the points anymore.
If you have comments on [1], please reply in [1].

Series [1] clearly describes the usage model, at least for one widely used OS: 
Linux.

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead

From: Zhu, Lingshan 
Sent: Wednesday, September 20, 2023 4:41 PM
To: Parav Pandit ; Michael S. Tsirkin 
Cc: virtio-dev@lists.oasis-open.org; Jason Wang 
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state

On 9/20/2023 5:52 PM, Parav Pandit wrote:

Hi Lingshan,

The last two email replies in non-text format are getting hard to follow.
Can you please revert to text-based emails?

When one wants to use the PF for live migration in a trusted hypervisor, the 
PF is in the trust zone.
Even without live migration, it can be an attack surface while the guest is 
running.
As repeated many times, it can be used by malicious SW to dump guest memory


In future, when the hypervisor is not trusted, the task of LM will be 
delegated to another infrastructure TVM.
Ravi at Intel already explained this a year ago using a migration TD.
This fits very well without bifurcating the member device, which is extremely 
hard.
TD, TDX or TDX-IO are more complex topics, and we should
focus on our live migration solution, not CC.

My point is: using a BAR cap as a proxy for admin-vq-based LM is still 
problematic.

Maybe we can close this.


Parav

From: Zhu, Lingshan
Sent: Wednesday, September 20, 2023 3:15 PM
To: Parav Pandit; Michael S. Tsirkin
Cc: virtio-dev@lists.oasis-open.org; Jason Wang
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state


On 9/20/2023 4:34 PM, Parav Pandit wrote:
> There can be malicious SW on the host, and the host may be hacked and 
> compromised.
> For example:
> 1) SUSPEND a running guest by admin vq
> 2) dumping guest memory through admin vq dirty page tracking.



No. The hypervisor is a trusted entity that is hosting the VM.
The PF may not be owned by the hypervisor, and the host can be hacked and 
compromised.


The device migration is initiated by the hypervisor.

I am omitting the TDISP question for now, as discussed before.
The TDISP spec will evolve for hypercalls when we get there.
Confidential computing is out of the spec, as we discussed and agreed.

This is to demonstrate why even using a BAR cap as a proxy for admin vq LM is 
still problematic.
TDISP gives examples of the attacking models, and admin-vq-based LM conforms 
to the models.



From: virtio-dev@lists.oasis-open.org On Behalf Of Zhu, Lingshan
Sent: Wednesday, September 20, 2023 12:01 PM
To: Parav Pandit; Michael S. Tsirkin
Cc: virtio-dev@lists.oasis-open.org; Jason Wang
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state


On 9/20/2023 2:08 PM, Parav Pandit wrote:



From: Zhu, Lingshan

Sent: Wednesday, September 20, 2023 11:36 AM



On 9/19/2023 2:49 AM, Michael S. Tsirkin wrote:

On Mon, Sep 18, 2023 at 06:41:55PM +, Parav Pandit wrote:

Please refer to the code for setting FEATURES_OK.

It won't work when one needs to suspend the device.

There is no point of doing such work over registers as fundamental

framework is over the AQ.

Well not really. It's over admin commands. When these we

RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit


> From: Michael S. Tsirkin 
> Sent: Wednesday, September 20, 2023 4:06 PM

> 
> I freely admit the finer points of this extended flamewar have been lost on
> me, and I wager I'm not the only one. I thought you wanted to migrate the
> device just by accessing the device itself (e.g. the VF) without accessing
> other devices (e.g. the PF), while Parav wants it in a separate device so
> the whole of the device itself can be passed through to guest. Isn't this,
> fundamentally, the issue?
Right. An admin device does the work of device migration. Today it is the 
owner PF.
In future it can be another admin device to which this migration task is 
delegated, and which can be the group owner.
All the admin commands that we plumb here just work great in that CC/TDI 
future, because the only thing that changes is the admin device issuing the 
commands.

> 
> > the bar is only a proxy, doesn't fix anything. and even larger side
> > channel attacking surface: vf-->pf-->vf
> 
> In this model there's no pf. BAR belongs to vf itself and you submit commands
> for the VF through its BAR.
> Just separate from the pci config space.
> 
> The whole attacking surface discussion is also puzzling.  We either are or are
> not discussing confidential computing/TDI.  I couldn't figure it out. This 
> needs a
> separate thread I think.

True. Many of Lingshan's thoughts/comments get mixed up, I feel.
He proposes a trap+emulation/mediation-based solution in the hypervisor, and 
none of that is secure anyway in the CC/TDI concept.
He keeps attacking the AQ as some side channel attack, while somehow 
trap+emulation, also done by the hypervisor, is secure; that obviously does 
not make sense in the CC/TDI concept.
Both score equally where hypervisor trust is the concern.

And the admin command approach [1] has a clear direction for CC: delegate 
those admin commands to a dedicated trusted entity instead of the hypervisor.

I tried to explain this a few times, but..

Anyway, if there are comments on the AQ, it is better to reply in its thread 
at [1].

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead

I will post v1 for [1] this week, with a more mature device context, along 
with a note on the future provisioning item.




RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit
Hi Lingshan,

The last two email replies in non-text format are getting hard to follow.
Can you please revert to text-based emails?

When one wants to use the PF for live migration in a trusted hypervisor, the 
PF is in the trust zone.

In future, when the hypervisor is not trusted, the task of LM will be 
delegated to another infrastructure TVM.
Ravi at Intel already explained this a year ago using a migration TD.
This fits very well without bifurcating the member device, which is extremely 
hard.

Parav

From: Zhu, Lingshan 
Sent: Wednesday, September 20, 2023 3:15 PM
To: Parav Pandit ; Michael S. Tsirkin 
Cc: virtio-dev@lists.oasis-open.org; Jason Wang 
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state


On 9/20/2023 4:34 PM, Parav Pandit wrote:
> There can be malicious SW on the host, and the host may be hacked and 
> compromised.
> For example:
> 1) SUSPEND a running guest via the admin vq
> 2) dump guest memory through admin vq dirty page tracking.


No. The hypervisor is a trusted entity hosting the VM.
The PF may not be owned by the hypervisor, and the host can be hacked and 
compromised.

The device migration is initiated by the hypervisor.

I am omitting the TDISP question for now, as discussed before.
The TDISP spec will evolve for hypercalls when we get there.
Confidential computing is out of scope for the spec, as we discussed and agreed.

This is to demonstrate why even using a BAR capability as a proxy for admin vq 
LM is still problematic.
TDISP gives examples of the attack models, and admin vq based LM matches those 
models.


From: virtio-dev@lists.oasis-open.org  On Behalf Of Zhu, Lingshan
Sent: Wednesday, September 20, 2023 12:01 PM
To: Parav Pandit ; Michael S. Tsirkin 
Cc: virtio-dev@lists.oasis-open.org; Jason Wang 
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state


On 9/20/2023 2:08 PM, Parav Pandit wrote:

From: Zhu, Lingshan 
Sent: Wednesday, September 20, 2023 11:36 AM

On 9/19/2023 2:49 AM, Michael S. Tsirkin wrote:
On Mon, Sep 18, 2023 at 06:41:55PM +0000, Parav Pandit wrote:
Please refer to the code for setting FEATURES_OK.
It won't work when one needs to suspend the device.
There is no point in doing such work over registers, as the fundamental 
framework is over the AQ.
Well not really. It's over admin commands. When these were built the intent 
always was that it's possible to use admin commands through another interface, 
other than admin queue. Is there a problem implementing admin commands over a 
memory BAR? For example, I can see an "admin command" capability pointing at a 
BAR where commands are supplied, and using a new group type referring to device 
itself.
I am not sure; if a BAR capability were implemented as a proxy for the admin vq 
based live migration, then the problems of admin vq LM that we have discussed 
still exist. The BAR is only a proxy; it doesn't fix anything, and it is an even 
larger side channel attack surface: vf-->pf-->vf

AQ LM using the PF has no side channel attack, as the hypervisor and owner 
device are trusted entities, as already discussed.
I believe we have discussed this many times, and I even provided you some 
examples.

Let me repeat for the last time.

There can be malicious SW on the host, and the host may be hacked and 
compromised.
For example:
1) SUSPEND a running guest via the admin vq
2) dump guest memory through admin vq dirty page tracking.

The above can happen, right?

You cite TDISP as an example, but have you really read the TDISP spec?
In the spec:

Device Security Architecture - Administrative interfaces (e.g., a PF) may be 
used to influence the security properties of the TDI used by the TVM.

RE: [virtio-dev] [PATCH] virtio-tee: Reserve device ID 46 for TEE device

2023-09-20 Thread Parav Pandit
Hi,

> From: virtio-dev@lists.oasis-open.org  On
> Behalf Of jeshwank
> Sent: Wednesday, September 6, 2023 3:13 PM
> 
> The virtio-tee device allows the guest OS to access the TEE present in the
> host system, to perform secure operations.
> 
> This patch is to reserve a device ID 46 for virtio-tee device.
> 

For those who breathe TEE this may be an obvious thing.
For me, a humble request:
can you please expand this patch to contain a short description of the basic 
functionality of the TEE device covered under this device ID?

> Signed-off-by: jeshwank 
> ---
>  content.tex | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..644aa4a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -739,6 +739,8 @@ \chapter{Device Types}\label{sec:Device Types}  \hline
>  45 &   SPI master \\
>  \hline
> +46 &   TEE device \\
> +\hline
>  \end{tabular}
> 
>  Some of the devices above are unspecified by this document,
> --
> 2.25.1
> 
> 



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit
> There can be malicious SW on the host, and the host may be hacked and 
> compromised.
> For example:
> 1) SUSPEND a running guest via the admin vq
> 2) dump guest memory through admin vq dirty page tracking.

No. The hypervisor is a trusted entity hosting the VM.
The device migration is initiated by the hypervisor.

I am omitting the TDISP question for now, as discussed before.
The TDISP spec will evolve for hypercalls when we get there.

From: virtio-dev@lists.oasis-open.org  On 
Behalf Of Zhu, Lingshan
Sent: Wednesday, September 20, 2023 12:01 PM
To: Parav Pandit ; Michael S. Tsirkin 
Cc: virtio-dev@lists.oasis-open.org; Jason Wang 
Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq 
state


On 9/20/2023 2:08 PM, Parav Pandit wrote:

From: Zhu, Lingshan 
Sent: Wednesday, September 20, 2023 11:36 AM

On 9/19/2023 2:49 AM, Michael S. Tsirkin wrote:
On Mon, Sep 18, 2023 at 06:41:55PM +0000, Parav Pandit wrote:
Please refer to the code for setting FEATURES_OK.
It won't work when one needs to suspend the device.
There is no point in doing such work over registers, as the fundamental 
framework is over the AQ.
Well not really. It's over admin commands. When these were built the intent 
always was that it's possible to use admin commands through another interface, 
other than admin queue. Is there a problem implementing admin commands over a 
memory BAR? For example, I can see an "admin command" capability pointing at a 
BAR where commands are supplied, and using a new group type referring to device 
itself.
I am not sure; if a BAR capability were implemented as a proxy for the admin vq 
based live migration, then the problems of admin vq LM that we have discussed 
still exist. The BAR is only a proxy; it doesn't fix anything, and it is an even 
larger side channel attack surface: vf-->pf-->vf

AQ LM using the PF has no side channel attack, as the hypervisor and owner 
device are trusted entities, as already discussed.
I believe we have discussed this many times, and I even provided you some 
examples.

Let me repeat for the last time.

There can be malicious SW on the host, and the host may be hacked and 
compromised.
For example:
1) SUSPEND a running guest via the admin vq
2) dump guest memory through admin vq dirty page tracking.

The above can happen, right?

You cite TDISP as an example, but have you really read the TDISP spec?
In the spec:

Device Security Architecture - Administrative interfaces (e.g., a PF) may be 
used to influence the security properties of the TDI used by the TVM.

TEE-I/O requires the device to organize its hardware/software interfaces such 
that the PF cannot be used to affect the security of a TDI when it is in use by 
a TVM.

Clear?






RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit
Hi Jiquian,

> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 1:24 PM
> 
> Hi Lingshan,
> Please reply to your own email thread, below are not related to my patches.
> Thanks a lot.

They are related to your patch.
Both patches have overlapping functionality.

You probably missed my previous response explaining the same at [1].

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00255.html



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:17 PM

> > This is not live or device migration. This is restoring the device context
> initiated by the driver owning the device.
> restore the device context should be done by the hypervisor before setting
> DRIVER_OK and waking up the guest, not a concern here, out of spec

The framework is generic for PCI devices; hence, there may not be any 
hypervisor at all. The restore operation therefore triggers on setting 
DRIVER_OK when the device was previously suspended.

Since we already agreed in a previous email that the driver re-reads until the 
device sets DRIVER_OK, it is fine with me. It covers the above restore 
condition.
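A minimal driver-side sketch of that re-read, assuming hypothetical register 
accessors (get_status()/set_status()/udelay() are illustrative stand-ins, not a 
real driver API):

#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 0x04  /* standard device status bit */

extern uint8_t get_status(void);         /* assumed MMIO read helper */
extern void set_status(uint8_t status);  /* assumed MMIO write helper */
extern void udelay(unsigned int usec);   /* assumed delay helper */

/* Write DRIVER_OK, then re-read until the device reflects it, which
 * signals that the device finished restoring its context after suspend. */
static int resume_and_verify(void)
{
	set_status(get_status() | VIRTIO_STATUS_DRIVER_OK);
	for (int i = 0; i < 10000; i++) {   /* bounded wait, ~1 sec here */
		if (get_status() & VIRTIO_STATUS_DRIVER_OK)
			return 0;           /* device resumed */
		udelay(100);
	}
	return -1;                          /* device failed to resume */
}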

Thanks.



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:16 PM

[..]
> > In my view, setting the DRIVER_OK is the signal regardless of hypervisor or
> physical device.
> > Hence the re-read is must.
> Yes, as I said below, should verify by re-read.
> >
Thanks.


RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 1:00 PM
> 
> On 9/20/2023 3:24 PM, Chen, Jiqian wrote:
> > Hi Lingshan,
> > It seems you reply to the wrong email thread. They are not related to my
> patch.
> These reply to Parav's comments.
> @Parav, if you want to discuss more about live migration, please reply in my
> thread; let's not flood here.
You made the point that "this is not live migration".
I am not discussing live migration in this thread.

I replied to the point that device restoration from suspend, for a physical or 
hypervisor-based device, is not 40 nsec worth of work that completes with just 
a register write.
This is not live or device migration. This is restoring the device context, 
initiated by the driver owning the device.


RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 12:58 PM
> 
> On 9/20/2023 3:10 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Wednesday, September 20, 2023 12:37 PM
> >>> The problem to overcome in [1] is, resume operation needs to be
> >>> synchronous
> >> as it involves large part of context to resume back, and hence just
> >> asynchronously setting DRIVER_OK is not enough.
> >>> The sw must verify back that device has resumed the operation and
> >>> ready to
> >> answer requests.
> >> this is not live migration, all device status and other information
> >> still stay in the device, no need to "resume" context, just resume running.
> >>
> > I am aware that it is not live migration. :)
> >
> > "Just resuming" involves lot of device setup task. The device implementation
> does not know for how long a device is suspended.
> > So for example, a VM is suspended for 6 hours, hence the device context
> could be saved in a slow disk.
> > Hence, when the resume is done, it needs to setup things again and driver 
> > got
> to verify before accessing more from the device.
> The restore procedures should perform by the hypervisor and done before set
> DRIVER_OK and wake up the guest.
What is the signal to trigger the restore? What is the trigger in a physical 
device when there is no hypervisor?

In my view, setting DRIVER_OK is the signal, regardless of hypervisor or 
physical device.
Hence the re-read is a must.

> And the hypervisor/driver needs to check the device status by re-reading.
> >
> >> Like resume from a failed LM.
> >>> This is slightly different flow than setting the DRIVER_OK for the
> >>> first time
> >> device initialization sequence as it does not involve large restoration.
> >>> So, to merge two ideas, instead of doing DRIVER_OK to resume, the
> >>> driver
> >> should clear the SUSPEND bit and verify that it is out of SUSPEND.
> >>> Because driver is still in _OK_ driving the device flipping the SUSPEND 
> >>> bit.
> >> Please read the spec, it says:
> >> The driver MUST NOT clear a device status bit
> >>
> > Yes, this is why either DRIVER_OK validation by the driver is needed, or
> > Jiqian's synchronous new register..
> so re-read
Yes, re-read until set. Thanks.



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 12:37 PM

> > The problem to overcome in [1] is, resume operation needs to be synchronous
> as it involves large part of context to resume back, and hence just
> asynchronously setting DRIVER_OK is not enough.
> > The sw must verify back that device has resumed the operation and ready to
> answer requests.
> this is not live migration, all device status and other information still 
> stay in the
> device, no need to "resume" context, just resume running.
> 
I am aware that it is not live migration. :)

"Just resuming" involves lot of device setup task. The device implementation 
does not know for how long a device is suspended.
So for example, a VM is suspended for 6 hours, hence the device context could 
be saved in a slow disk.
Hence, when the resume is done, it needs to setup things again and driver got 
to verify before accessing more from the device.
 
> Like resume from a failed LM.
> >
> > This is slightly different flow than setting the DRIVER_OK for the first 
> > time
> device initialization sequence as it does not involve large restoration.
> >
> > So, to merge two ideas, instead of doing DRIVER_OK to resume, the driver
> should clear the SUSPEND bit and verify that it is out of SUSPEND.
> >
> > Because driver is still in _OK_ driving the device flipping the SUSPEND bit.
> Please read the spec, it says:
> The driver MUST NOT clear a device status bit
> 
Yes, this is why either DRIVER_OK validation by the driver is needed, or 
Jiqian's synchronous new register..



RE: [virtio-dev] Re: [virtio-comment] Re: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-20 Thread Parav Pandit

> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 12:03 PM

> If driver write 0 to reset device, can the SUSPEND bit be cleared?
It must; the reset operation resets everything else, and so the suspend too.

> (pci_pm_resume->virtio_pci_restore->virtio_device_restore-
> >virtio_reset_device)
> If SUSPEND is cleared, then during the reset process in Qemu, I can't judge if
> the reset request is from guest restore process or not, and then I can't 
> change
> the reset behavior.
Reset should not be influenced by suspend.
Suspend should do the work of suspend, and reset should do the reset.
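A device-side sketch of that separation, under the assumptions of this 
discussion (reset clears everything, including SUSPEND); the SUSPEND bit value 
and names below are hypothetical, not from the current spec:

#include <stdint.h>
#include <stdbool.h>

#define STATUS_SUSPEND 0x10  /* hypothetical value for the proposed SUSPEND bit */

struct device_state {
	uint8_t status;
	bool context_saved;
};

/* Device-side handling of a driver write to the status register. */
static void on_status_write(struct device_state *dev, uint8_t val)
{
	if (val == 0) {
		/* Reset: clears everything, including SUSPEND. */
		dev->status = 0;
		dev->context_saved = false;
		return;
	}
	/* Suspend does only the work of suspend; it does not reset. */
	if ((val & STATUS_SUSPEND) && !(dev->status & STATUS_SUSPEND))
		dev->context_saved = true;  /* quiesce queues, save context */
	dev->status = val;
}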

The problem to overcome in [1] is that the resume operation needs to be 
synchronous, as it involves a large part of the context to resume, and hence 
just asynchronously setting DRIVER_OK is not enough.
The SW must verify back that the device has resumed operation and is ready to 
answer requests.

This is a slightly different flow from setting DRIVER_OK for the first time in 
the device initialization sequence, as that does not involve a large 
restoration.

So, to merge the two ideas: instead of doing DRIVER_OK to resume, the driver 
should clear the SUSPEND bit and verify that the device is out of SUSPEND.

Because the driver is still in _OK_ driving the device, flipping the SUSPEND bit.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-20 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 20, 2023 11:36 AM
> 
> On 9/19/2023 2:49 AM, Michael S. Tsirkin wrote:
> > On Mon, Sep 18, 2023 at 06:41:55PM +0000, Parav Pandit wrote:
> >>> Please refer to the code for setting FEATURES_OK.
> >> It won't work when one needs to suspend the device.
> >> There is no point of doing such work over registers as fundamental
> framework is over the AQ.
> > Well not really. It's over admin commands. When these were built the
> > intent always was that it's possible to use admin commands through
> > another interface, other than admin queue. Is there a problem
> > implementing admin commands over a memory BAR? For example, I can see
> > an "admin command" capability pointing at a BAR where commands are
> > supplied, and using a new group type referring to device itself.
> I am not sure, if a bar cap would be implemented as a proxy for the admin vq
> based live migration. then the problems of admin vq LM that we have discussed
> still exist. the bar is only a proxy, doesn't fix anything. and even larger 
> side
> channel attacking surface: vf-->pf-->vf

AQ LM using the PF has no side channel attack, as the hypervisor and the owner 
device are trusted entities, as already discussed.


RE: [virtio-dev] RE: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-19 Thread Parav Pandit


> From: Chen, Jiqian 
> Sent: Wednesday, September 20, 2023 9:28 AM

> >> For above purpose, we need a mechanism that allows guests and QEMU to
> >> negotiate their reset behavior. So this patch add a new parameter
> >> named
> > Freeze != reset. :)
> > Please fix it to say freeze or suspend.
> But in my virtio-gpu scene, I want to prevent Qemu destroying resources when
> Guest do resuming(pci_pm_resume-> virtio_pci_restore->
> virtio_device_restore-> virtio_reset_device-> vp_modern_set_status->Qemu
> virtio_pci_reset->virtio_gpu_gl_reset-> virtio_gpu_reset). And I add check in
> virtio_gpu_gl_reset and virtio_gpu_reset, if freeze_mode was set to FREEZE_S3
> during Guest suspending, Qemu will not destroy resources. So the reason why I
> add this mechanism is to affect the reset behavior. And I think this also can 
> help
> other virtio devices to affect their behavior, like the issue of virtio-video 
> which
> Mikhail Golubev-Ciuchea encountered.
>
The point is, when the driver tells the device to freeze, it is a freeze 
command and not a reset.
So resume() should not invoke device_reset() when FREEZE+RESUME is supported.
 
> >
> >> freeze_mode to struct virtio_pci_common_cfg. And when guest suspends,
> >> it can write freeze_mode to be FREEZE_S3, and then virtio devices can
> >> change their reset behavior on Qemu side according to freeze_mode.
> >> What's more,
> > Not reset, but suspend behavior.
> The same reason as above.
>
Reset should not be done by the guest driver when the device supports unfreeze.
 
> >
> >> freeze_mode can be used for all virtio devices to affect the behavior
> >> of Qemu, not just virtio gpu device.
> >>
> >> Signed-off-by: Jiqian Chen 
> >> ---
> >>  transport-pci.tex | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/transport-pci.tex b/transport-pci.tex index
> >> a5c6719..2543536 100644
> >> --- a/transport-pci.tex
> >> +++ b/transport-pci.tex
> >> @@ -319,6 +319,7 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport
> >>  le64 queue_desc;/* read-write */
> >>  le64 queue_driver;  /* read-write */
> >>  le64 queue_device;  /* read-write */
> >> +le16 freeze_mode;   /* read-write */
> >>  le16 queue_notif_config_data;   /* read-only for driver */
> >>  le16 queue_reset;   /* read-write */
> >>
> > The new field cannot be in the middle of the structure.
> > Otherwise, the location of the queue_notif_config_data depends on
> completely unrelated feature bit, breaking the backward compatibility.
> > So please move it at the end.
> I am confused about this. I found in the latest kernel code (master branch):
> struct virtio_pci_common_cfg {
>   /* About the whole device. */
>   __le32 device_feature_select;   /* read-write */
>   __le32 device_feature;  /* read-only */
>   __le32 guest_feature_select;/* read-write */
>   __le32 guest_feature;   /* read-write */
>   __le16 msix_config; /* read-write */
>   __le16 num_queues;  /* read-only */
>   __u8 device_status; /* read-write */
>   __u8 config_generation; /* read-only */
> 
>   /* About a specific virtqueue. */
>   __le16 queue_select;/* read-write */
>   __le16 queue_size;  /* read-write, power of 2. */
>   __le16 queue_msix_vector;   /* read-write */
>   __le16 queue_enable;/* read-write */
>   __le16 queue_notify_off;/* read-only */
>   __le32 queue_desc_lo;   /* read-write */
>   __le32 queue_desc_hi;   /* read-write */
>   __le32 queue_avail_lo;  /* read-write */
>   __le32 queue_avail_hi;  /* read-write */
>   __le32 queue_used_lo;   /* read-write */
>   __le32 queue_used_hi;   /* read-write */
> 
>   __le16 freeze_mode; /* read-write */
> };
> There is no queue_notif_config_data or queue_reset, and freeze_mode I added
> is at the end. Why is it different from virtio-spec?
>
Because notify config data may not be used by the Linux driver, so its 
structure may be shorter.
I didn't dig into the code yet.
 
> >
> >> @@ -393,6 +394,12 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport  \item[\field{queue_device}]
> >>  The driver writes the physical address of Device Area here.
> >> See section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}.
> >>
> >> +\item[\field{freeze_mode}]
> >> +The driver writes this to set the freeze mode of virtio pci.
> >> +VIRTIO_PCI_FREEZE_MODE_UNFREEZE - virtio-pci is running;
> >> +VIRTIO_PCI_FREEZE_MODE_FREEZE_S3 - guest vm is doing S3, and
> >> +virtio-
> > For above names, please define the actual values in the spec.
> Ok, I will add them.
> 
> >
> >> pci enters S3 suspension;
> >> +Other values are reserved for future use, like S4, etc.

[virtio-dev] RE: [VIRTIO PCI PATCH v5 1/1] transport-pci: Add freeze_mode to virtio_pci_common_cfg

2023-09-19 Thread Parav Pandit
Hi Jiqian,

> From: Jiqian Chen 
> Sent: Tuesday, September 19, 2023 5:13 PM
> 
> When guest vm does S3, Qemu will reset and clear some things of virtio
> devices, but guest can't aware that, so that may cause some problems.
It is not true that the guest VM is not aware of it.
As you show in your kernel patch, it is freeze/unfreeze in the guest VM's PCI 
PM driver callbacks. So please update the commit log.

> For excample, Qemu calls virtio_reset->virtio_gpu_gl_reset when guest
s/excample/example

> resumes; that function will destroy the render resources of virtio-gpu. As a
> result, after guest resume, the display can't come back and we only see a
> black screen. Since the guest can't re-create all the resources, we need to
> let Qemu not destroy them during S3.
The above QEMU-specific details should go in the cover letter instead of the 
commit log, but I have no strong opinion.
Explaining the use case is good.

> 
> For above purpose, we need a mechanism that allows guests and QEMU to
> negotiate their reset behavior. So this patch add a new parameter named
Freeze != reset. :)
Please fix it to say freeze or suspend.

> freeze_mode to struct virtio_pci_common_cfg. And when guest suspends, it can
> write freeze_mode to be FREEZE_S3, and then virtio devices can change their
> reset behavior on Qemu side according to freeze_mode. What's more,
Not reset, but suspend behavior.

> freeze_mode can be used for all virtio devices to affect the behavior of Qemu,
> not just virtio gpu device.
> 
> Signed-off-by: Jiqian Chen 
> ---
>  transport-pci.tex | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/transport-pci.tex b/transport-pci.tex index a5c6719..2543536 
> 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -319,6 +319,7 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport
>  le64 queue_desc;/* read-write */
>  le64 queue_driver;  /* read-write */
>  le64 queue_device;  /* read-write */
> +le16 freeze_mode;   /* read-write */
>  le16 queue_notif_config_data;   /* read-only for driver */
>  le16 queue_reset;   /* read-write */
> 
The new field cannot be in the middle of the structure.
Otherwise, the location of queue_notif_config_data would depend on a completely 
unrelated feature bit, breaking backward compatibility.
So please move it to the end.
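To make the requested placement concrete, a sketch of the structure tail (only 
the ordering matters here; this is not final spec text):

le64 queue_device;              /* read-write */
le16 queue_notif_config_data;   /* read-only for driver */
le16 queue_reset;               /* read-write */
le16 freeze_mode;               /* read-write, new field appended at the end */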

> @@ -393,6 +394,12 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport  \item[\field{queue_device}]
>  The driver writes the physical address of Device Area here.  See 
> section
> \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}.
> 
> +\item[\field{freeze_mode}]
> +The driver writes this to set the freeze mode of virtio pci.
> +VIRTIO_PCI_FREEZE_MODE_UNFREEZE - virtio-pci is running;
> +VIRTIO_PCI_FREEZE_MODE_FREEZE_S3 - guest vm is doing S3, and virtio-
For above names, please define the actual values in the spec.

> pci enters S3 suspension;
> +Other values are reserved for future use, like S4, etc.
> +
It cannot be just one-way communication from driver to device, as freezing a 
device with a few hundred MB to GBs of GPU memory or other device memory can 
take several msec.
Hence the driver must poll to get the acknowledgement from the device that the 
freeze functionality has completed.

Please refer to the queue_reset register definition for achieving such a 
scheme, and reframe the wording accordingly.
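A driver-side sketch of such a handshake, modeled loosely on queue_reset; the 
register offset, value encodings, and accessor names below are all hypothetical:

#include <stdint.h>

#define FREEZE_MODE_UNFREEZE   0  /* value names follow this patch */
#define FREEZE_MODE_FREEZE_S3  1  /* actual encodings to be defined */

#define OFF_FREEZE_MODE 0x38      /* hypothetical config offset of freeze_mode */

extern void     cfg_write16(unsigned int off, uint16_t val); /* assumed helpers */
extern uint16_t cfg_read16(unsigned int off);
extern void     msleep(unsigned int msec);

/* Request the S3 freeze, then poll until the device reflects the value
 * back, which in this assumed scheme means the freeze has completed. */
static int freeze_for_s3(void)
{
	cfg_write16(OFF_FREEZE_MODE, FREEZE_MODE_FREEZE_S3);
	for (int i = 0; i < 100; i++) {   /* bounded wait: ~1 second */
		if (cfg_read16(OFF_FREEZE_MODE) == FREEZE_MODE_FREEZE_S3)
			return 0;          /* device acknowledged the freeze */
		msleep(10);
	}
	return -1;                         /* device did not acknowledge */
}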

Also kindly add the device and driver normative on how/when this register is 
accessed.

Also please fix the description to not talk about the guest VM; that largely 
belongs only in theory-of-operation text.

You need to describe what exactly should happen in the device when it is frozen.
Please refer to my series, which adds the infrastructure for device migration 
and defines the FREEZE mode behavior.
It is similar to what you define, but it is a management plane operation 
controlled outside of the guest VM.
Still, it is a good reference for what to define in spec language.
https://lore.kernel.org/virtio-comment/20230909142911.524407-7-pa...@nvidia.com/T/#u

You are missing the feature bit to indicate to the driver that the device 
supports this functionality.
Please add one.

>  \item[\field{queue_notif_config_data}]
>  This field exists only if VIRTIO_F_NOTIF_CONFIG_DATA has been
> negotiated.
>  The driver will use this value when driver sends available buffer
> --
> 2.34.1





RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-19 Thread Parav Pandit
> From: Zhu, Lingshan 
> Sent: Tuesday, September 19, 2023 2:09 PM
> 
> On 9/19/2023 4:31 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Tuesday, September 19, 2023 1:34 PM
> >>> In your previous email you wrote,
> >>>
> >>> 1. "so lets focus on LM topic, other than confidential computing."
> >>> 2. "again, TDISP is out of spec and TDISP devices are not migratable."
> >>>
> >>>   From above two comments from you I understood it that way, you
> >>> want to
> >> focus now on the LM _other_than_ CC.
> >>> How to read it differently?
> >> You said:"Lingshan wanted to take this TDISP extension in future."
> >>
> > Based on above two comments from you this is what I understand.
> I am not a native speaker, but do you see I ever mentioned that I want to take
> TDISP in future? Any verbs I used?
> >> How do you conclude this statement? Did I ever said I want to take
> >> TDISP extension in future?
> > CC and TDISP goes hand in hand.
> >
> > So you want to consider TDISP extension now (and not in future)?
> As I said before, CC and TDISP is out of spec, that means we should ignore 
> them
> for now.
> And I don't plan to take TDISP in future, you should not confuse the 
> sentences.
Ah, I understand now that you do not plan TDISP in the future.
Thanks.



RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-19 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Tuesday, September 19, 2023 1:32 PM
> 
> On 9/19/2023 2:41 AM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Monday, September 18, 2023 3:05 PM
> >>
> >> On 9/18/2023 2:54 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan 
> >>>> Sent: Monday, September 18, 2023 12:19 PM so admin vq based LM
> >>>> solution can be a side channel attacking surface
> >>> It will be part of the DSM whenever it will be used in future.
> >>> Hence, it is not attack surface.
> >> I am not sure, why we have to trust the PF?
> >> This is out of virtio scope anyway.
> >>
> >> I have explained many times how it can be a attack surface, and examples.
> >>
> > And none of that make any sense as fundamentally, hypervisor is trusted
> regardless of the approach.
> this is not about hypervisors, I am saying admin vq based LM solution can be a
> side channel attacking surface Please refer to my previously listed examples
> and the TDISP spec is FYI.
> >
In a previous email you wrote: "As I said before, CC and TDISP is out of spec, 
that means we should ignore them for now."
So I am ignoring it now, and hence I am ignoring the above comment.
Let's reach common ground for the simplified case and then consider more 
complex cases.

> >> What happen if malicious SW dump guest memory by admin vq dirty page
> >> tracking feature?
> > What??
> > Where is this malicious SW is located, in guest VM?
> host, in this attacking model.
> >
> >>>>>>>> For untrusted hypervisor, same set of attack surface is present
> >>>>>>>> with
> >>>>>>>> trap+emulation.
> >>>>>>>> So both method score same. Hence its not relevant point for
> discussion.
> >>>>>>> this is not hypervisor, Do you see any modern hypervisor have
> >>>>>>> these issues?
> >>>>>>>
> >>>>>>> This is admin vq for LM can be a side channel attacking surface.
> >>>>> It is not.
> >>>>> Hypervisor is trusted entity.
> >>>>> For untrusted hypervisor the TDISP is unified solution build by
> >>>>> the various
> >>>> industry bodies including DMTF, PCI for last few years.
> >>>>> We want to utilize that.
> >>>> first, TDISP is out of virtio spec.
> >>> Sure, hence, untrusted hypervisor are out of scope.
> >>> Otherwise, trap+emulation is equally dead which relies on the
> >>> hypervisor to
> >> do things.
> >> so lets focus on LM topic, other than confidential computing.
> > ok.
> >
> >>> Just because data transfer is not done, it does not mean that
> >>> thousands of
> >> polling register writes complete in stipulated time.
> >> 1) again, they are per-device facilities
> > That does not satisfy that it can somehow do work in < x usec time.
> why? Do you mind take examples of basic PCI virtio common config space
> registers?
> >> 2) we use very few registers, even status byte does not require
> >> polling, just re- read with delay.
> >>
> >> Please refer to the code for setting FEATURES_OK.
> > It wont work when one needs to suspend the device.
> > There is no point of doing such work over registers as fundamental framework
> is over the AQ.
> why it doesn't work?

For the following two reasons:
1. All the things needed cannot be communicated over registers efficiently, 
such as (a) the device context and (b) dirty pages.
2. Synchronous registers on the VF cannot interoperate with the FLR and device 
reset flows.
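For reference, the register-side verify-after-write pattern being debated above 
looks roughly like the following; this is a generic sketch, not the Linux 
implementation, and the accessor names are assumed:

#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK 0x08  /* standard device status bit */

extern uint8_t get_status(void);         /* assumed MMIO read helper */
extern void set_status(uint8_t status);  /* assumed MMIO write helper */

/* Set FEATURES_OK, then re-read the status byte to confirm the device
 * accepted the negotiated features. */
static int confirm_features(void)
{
	set_status(get_status() | VIRTIO_STATUS_FEATURES_OK);
	if (!(get_status() & VIRTIO_STATUS_FEATURES_OK))
		return -1;  /* device rejected the feature set */
	return 0;
}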


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-19 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Tuesday, September 19, 2023 1:34 PM

> > In your previous email you wrote,
> >
> > 1. "so lets focus on LM topic, other than confidential computing."
> > 2. "again, TDISP is out of spec and TDISP devices are not migratable."
> >
> >  From above two comments from you I understood it that way, you want to
> focus now on the LM _other_than_ CC.
> > How to read it differently?
> You said:"Lingshan wanted to take this TDISP extension in future."
> 
Based on the above two comments from you, this is what I understand.
> How do you conclude this statement? Did I ever said I want to take TDISP
> extension in future?
CC and TDISP go hand in hand.

So you want to consider the TDISP extension now (and not in the future)?


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-19 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Tuesday, September 19, 2023 1:16 PM
> 
> On 9/19/2023 3:32 PM, Parav Pandit wrote:
> >> From: Jason Wang 
> >> Sent: Tuesday, September 19, 2023 9:58 AM
> >>
> >> On Mon, Sep 18, 2023 at 2:55 PM Parav Pandit  wrote:
> >>>> From: Zhu, Lingshan 
> >>>> Sent: Monday, September 18, 2023 12:19 PM
> >>>
> >>>> so admin vq based LM solution can be a side channel attacking
> >>>> surface
> >>> It will be part of the DSM whenever it will be used in future.
> >>> Hence, it is not attack surface.
> >> DSM is not a part of TVM. So it really depends on what kind of work
> >> did the admin virtqueue do. For commands that can't be self-contained
> >> like provisioning, it is fine, since it is done before the TDI
> >> assignment. But it not necessarily for your migration proposal. It
> >> seems you've found another case that self-containing is important:
> >> allowing the owner to access the member after TDI is attached to TVM
> >> is a side channel attack.
> > TVM and DSM specs will be extended in future when we get there, so core
> hypervisor will not be involved.
> > With trap+mediation, it is involved.
> >
> > Lingshan wanted to take this TDISP extension in future.
> > So are you both aligned or not yet?
> I didn't say that, never ever.

In your previous email you wrote,

1. "so lets focus on LM topic, other than confidential computing."
2. "again, TDISP is out of spec and TDISP devices are not migratable."

From the above two comments I understood it that way: you want to focus now on 
LM _other_than_ CC.
How should I read it differently?




[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-19 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, September 19, 2023 10:02 AM

> > Migration compatibility is topic in itself regardless of device migration 
> > series.
> 
> Why? Without compatibility support, migration can't work in the production
> environment.

As I said, it is part of a future series. We don't cook all features at once.
Orchestration knows when not to migrate, in an out-of-band manner.

> 
> > It is part of the feature provisioning phase needed regardless.
> 
> Definitely not, it is something that must be considered even without any 
> feature.
I disagree with making it a must.
It is not a must, but it is greatly useful for sure.

> It's about the robustness of the migration protocol.
> Sometimes you need to do that since some states were lost in the previous
> version of protocols or formats .
> 
> > Like how you and Lingshan wanted to keep the suspend bit series small and
> logical, device migration series is also logically split for the 
> functionality.
> > I don’t see a need to mention the long known missing functionality and
> common to both approaches.
> 
> Again, your proposal needs to describe at least the plan for dealing with
> migration compatibility since you want a passthrough based solution. That's 
> the
> point.
> 
Whether passthrough or not, migration compatibility cannot be achieved if the 
device does not provide a way to query and configure features.
Migration will fail when features mismatch.

So, I will add a note in the commit log to add this in the future.

> >
> > > > I will include the notes of future follow up work items in v1,
> > > > which will be
> > > taken care post this series.
> > > >
> > > > > > > Dirty page tracking in virtio is not a must for live
> > > > > > > migration to work. It can be done via platform facilities or
> > > > > > > even software. And to make it more efficient, it needs to
> > > > > > > utilize transport facilities instead of a
> > > > > general one.
> > > > > > >
> > > > > > It is also optional in the spec proposal.
> > > > > > Most platforms claimed are not able to do efficiently either,
> > > > >
> > > > > Most platforms are working towards an efficient way. But we are
> > > > > talking about different things, hardware based dirty page
> > > > > logging is not a must, that is what I'm saying. For example, KVM
> > > > > doesn't use hardware
> > > to log dirty pages.
> > > > >
> > > > I also said same, that hw based dirty page logging is not must. :)
> > > > One day hw mmu will be able to track everything efficiently. I
> > > > have not seen it
> > > happening yet.
> > >
> > > How do you define efficiency? KVM uses page fault and most modern
> > > IOMMU support PRI now.
> > >
> > One cannot define PRI as mandatory feature.
> 
> There's no way to mandate PRI, it's a PCI specific facility.
> 
You proposed using PRI for migration; it becomes mandatory at that point.

> > In our research and experiments we see that PRI is significantly slower to
> handle page faults.
> > Yet different topic...
> 
> PRI's performance is definitely another topic, it's just an example that 
> tracking
> dirty pages by device is optional and transport (PCI) can evolve for sure. 
> What's
> more important, it demonstrates the basic design of virtio, which is trying to
> leverage the transport instead of a mandatory reveinting of everything.
> 
An example that does not work today is not a dependable technology to rely on 
now.
Anyway, not everyone will use PRI all the time.

> >
> > Efficiency is defined by the downtime of the multiple devices in a VM.
> 
> Ok, but you tend to ignore my question regarding the downtime.
> 
What is the question?
Admin commands can achieve the desired downtime, if that is what you are asking.

> > And leading OS allowed device advancements by allowing device to report
> dirty pages in cpu and platform agnostic way...
> >
> 
> It has many things that I don't see a good answer for. For example, the QOS
> raised by Ling Shan.
> 
I am not going to repeat the QoS discussion anymore. :)
He is questioning virtqueue semantics itself; by that logic he had better 
rewrite the spec to not use virtqueues.

> > One can use post-copy approach as well, current device migration is around
> established pre-copy approach.
> 
> Another drawback of your proposal. With transport specific assistance like 
> PRI,
The PRI page fault rate in our research is 20x slower than the CPU page fault rate.

> you can do both pre and post. But the point is we need to make sure pre-copy
> downtime can satisfy the requirement instead of switching to another.
> 
In our work we see it satisfy the downtime requirements.

Again, dirty page tracking is optional, so when PRI catches up in the next few 
years, the driver can stop relying on it.

> >
> > > >
> > > > > > hence the vfio subsystem added the support for it.
> > > > >
> > > > > As an open standard, if it is designed for a specific software
> > > > > subsystem on a specific OS, it's a failure.
> > > > >
> > > > It is not.
> > > > One need accept that, in certain areas 

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-19 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, September 19, 2023 10:06 AM

> > Current spec is not the steering point to define new methods.
> > So we will build the spec infra to support passthrough.
> >
> 
> Passthrough migration actually, passthrough is already supported now.
Yes, a device migration basic facility for passthrough devices.
My series adds the basic facility section extension.

> 
> > Mediation/trap-emulation where hypervisor is involved is also second use
> case that you are addressing.
> >
> > And hence, both are not mutually exclusive.
> > Hence we should not debate that anymore.
> >
> > >
> > > > And for sure virtio do not need to live in the dark shadow of mediation
> always.
> > >
> > > 99% of virtio devices are implemented in this way (which is what you
> > > call dark and shadow) now.
> > >
> > What I am saying is one should not say mediation/trap-emulation is the only
> way for virtio.
> 
> Then using things like "dark shadow" is not fair.
I apologize. Let's work towards supporting device migration for passthrough as 
well.
The comments I hear from you hint that virtio must live its life through 
mediation.

> 
> > So let passthrough device migration to progress.
> 
> Then you need to answer or address the concerns.
> 
Sure. Will do once I receive the comments in the patches of [1].

[1] 
https://lore.kernel.org/virtio-comment/20230909142911.524407-4-pa...@nvidia.com/T/#md9fcfa1ba997463de8c7fb8c6d1786b224b0bead

> >
> > > > For nesting use case sure one can do mediation related mode.
> > > >
> > > > So only mediation is not the direction.
> > >
> > > CPU and MMU virtualization were all built in this way.
> > >
> > Not anymore. Both of them have vcpus and viommu where many things are not
> > trapped.
> 
> We are talking about different things. I'm saying trap is a must but you say 
> not
> all are trapped.
> 
To be clear, all I am saying is that a virtio interface level trap is not a 
must to achieve device migration.

> > So as I said both has pros and cons and users will pick what fits their 
> > need and
> use case.
> >
> > > >
> > > > > > So for such N and M being > 1, one can use software base
> > > > > > emulation
> > > anyway.
> > > > >
> > > > > No, only the control path is trapped, the datapath is still 
> > > > > passthrough.
> > > > >
> > > > Again, it depends on the use case.
> > >
> > > No matter what use case, the definition and methodology of
> > > virtualization stands still.
> > >
> > I will stop debating this because the core technical question is not 
> > answered.
> > I don’t see a technology available that virtio can utilize to it.
> > That is interface that can work without messing with device status and flr
> while device migration is ongoing.
> 
> Again, you need to justify it. For example, why does it mess up device status?
With FLR and device reset in place, many things like queues, device registers, 
etc. do not work, because they are reset while migration is going on, and the 
in-flight descriptor info is lost.

> Why is reset ok but not suspending?
> 
Reset is not OK either, in my view.
I am not suggesting trapping the CVQ or AQ or other virtio registers either.

> At least so far, I don't see good answers for thoses.
> 
> > Hence, methodology for passthrough and mediation/trap-emulation is
> fundamentally different.
> > And that is just fine.
> >
> > > >
> > > > > >
> > > > > > >
> > > > > > > And exposing the whole device to the guest drivers will have
> > > > > > > security implications, your proposal has demonstrated that
> > > > > > > you need a workaround for
> > > > > > There is no security implications in passthrough.
> > > > >
> > > > > How can you prove this or is it even possible for you to prove this?
> > > > Huh, when you claim that it is not secure, please point out
> > > > exactly what is not
> > > secure.
> > > > Please take with PCI SIG and file CVE to PCI sig.
> > >
> > > I am saying it has security implications. That is why you need to
> > > explain why you think it doesn't. What's more, the implications are
> > > obviously nothing related to PCI SIG but a vendor virtio hardware
> implementation.
> > >
> > PCI passthough for virtio member devices and non virtio devices with P2P, 
> > and
> their interaction is already there in the VM.
> > Device migration is not adding/removing anything, nor touching any security
> aspect of it.
> > Because it does not need to it either.
> > Device migration is making sure that it continue to exists.
> 
> Since we are discussing in the virtio community, what we care about is the
> chance that guest(driver) can explore device security vulnerabilities. In this
> context, exposing more means the increasing of the attacking surfaces since we
> (cloud vendor) can't control guests but the hypervisor.
> 
The guest (driver) can explore device security vulnerabilities in many areas, 
not just the device status and CVQ.
So it is not a good answer to me.

> >
> > > >
> > > > > You expose all device details to guests (especially the
> > > > > transport specific 

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-19 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, September 19, 2023 10:04 AM
> 
> On Sun, Sep 17, 2023 at 1:25 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Thursday, September 14, 2023 8:41 AM
> > >
> > > On Wed, Sep 13, 2023 at 12:37 PM Parav Pandit 
> wrote:
> > > >
> > > >
> > > >
> > > > > From: Zhu, Lingshan 
> > > > > Sent: Wednesday, September 13, 2023 9:51 AM
> > > >
> > > > > we plan to implement a self-contain solution
> > > > Make sure that works with device reset and FLR.
> > >
> > > We don't need to do that. It's out of the spec.
> > >
> > It is not. For the PCI member device, it needs to work reliably.
> 
> We never mentioned FLR in the PCI transport layer before and vendors have
> produced tons of hardware PCI devices for several years.
It is not mentioned, like many other PCI things, because it is native.
What I was saying is: if you claim that suspend, resume, etc. are basic 
facilities that can also work with passthrough, please show how they work with 
FLR in place.

> 
> If it's important, please describe it in detail in your series but it doesn't.
> 
It is mentioned. Please review it there.

> > Not doing means it relies on the trap+emulation, hence it just cannot
> complete.
> > And it is ok to me.
> > I just wont claim that trap+emulation is _complete_ method.
> >
> > > > And if not, explain that it is for mediation mode related tricks.
> > >
> > > It's not the tricks and again, it's not mediation but trap and
> > > emulation. It's the fundamental methodology used in virtualization, so 
> > > does
> the virtio spec.
> >
> > Not the virto spec of 2023 and more for new features.
> > The base for virtio spec 1.x was 0.9.5, but not the QEMU or other mediation
> based software AFAIK.
> 
> Are you saying those new features will not be suitable for software devices? 
> If
> yes, please explain why.
> 
> Or are you saying the virtio spec is not capable for hardware devices?

No, you were hinting that trap+emulation is the fundamental technology of 
virtualization and of the virtio spec.

And I replied that for virtio spec 1.x the only baseline was spec 0.9.5; 
trap+emulation was not the baseline when the 1.x spec was drafted.

The virtio spec, with a few caveats, is capable of hw and new sw, and it needs 
to continue building new features that can work without a trap+emulation mode.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-19 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, September 19, 2023 9:56 AM

> On Sun, Sep 17, 2023 at 1:29 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Thursday, September 14, 2023 8:42 AM
> > >
> > > On Wed, Sep 13, 2023 at 12:46 PM Parav Pandit 
> wrote:
> > > >
> > > > > From: Jason Wang 
> > > > > Sent: Wednesday, September 13, 2023 10:14 AM
> > > >
> > > > > It's not about how many states in a single state machine, it's
> > > > > about how many state machines that exist for device status.
> > > > > Having more than one creates big obstacles and complexity in the
> > > > > device. You need to define the interaction of each state
> > > > > otherwise you leave undefined
> > > behaviours.
> > > > The device mode has zero relation to the device status.
> > >
> > > You will soon get this issue when you want to do nesting.
> > >
> > I don’t think so. One needs to intercept it when one wants to do
> trap+emulation which seems to fullfil the nesting use case.
> 
> Well, how can you trap it? You have admin vq in L0, it means the suspending is
> never exposed to L1 unless you assign the owner to L1.
> Is this what you want?
> 
When nesting is not done, it is not needed.
Only the nesting cases need to trap it.
So when one wants to do the nesting use case, one should also place the peer 
admin PF in that guest.
Right, assign one VF and its peer admin VF to L1.

> >
> > > > It does not mess with it at all.
> > > > In fact the new bits in device status is making it more complex
> > > > for the device
> > > to handle.
> > >
> > > Are you challenging the design of the device status? It's definitely
> > > too late to do this.
> > >
> > No. I am saying the extending device_status with yet another state is 
> > equally
> complex and its core of the device.
> 
> You never explain why.
If you are comparing two methods, then a new feature adds complexity either way.
Hence, they both score equally in adding complexity for a new feature.
In the case of device_status, one needs to do things synchronously.
This adds complexity on the device side to answer those registers in the hot 
downtime path.
When done over admin commands, they happen in parallel.
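A sketch of that parallelism claim, with a made-up admin queue API 
(aq_enqueue()/aq_process_completions() are illustrative, not a real driver 
interface):

#include <stdint.h>
#include <stdbool.h>

struct admin_cmd {
	uint16_t opcode;
	uint64_t member_id;
	bool done;  /* set when the device completes the command */
};

extern void aq_enqueue(struct admin_cmd *cmd);  /* assumed AQ helpers */
extern void aq_process_completions(void);

/* Enqueue all per-member commands back to back, then reap completions;
 * nothing forces one-command-at-a-time serialization the way a
 * synchronous register handshake does, so the device may batch them. */
static void submit_member_commands(struct admin_cmd *cmds, int n)
{
	for (int i = 0; i < n; i++)
		aq_enqueue(&cmds[i]);
	for (int i = 0; i < n; i++)
		while (!cmds[i].done)
			aq_process_completions();
}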
 
> 
> >
> > > This proposal increases just one bit and that worries you? Or you
> > > think one more state is much more complicated than a new state
> > > machine with two states?
> >
> > It is mode and not state. And two modes are needed for supporting P2P
> device.
> 
> You keep saying you are migrating the core virtio devices but then you are
> saying it is required for PCI. And you never explain why it can't be done by
> reusing the device status bit.
It cannot be done using device status bits, because the hypervisor is not 
involved in trapping and parsing them.
We had better discuss this in the actual series where things are posted.

> 
> > When one wants to do with mediation, there also two states are needed.
> >
> > The key is modes are not interacting
> 
> You need to explain why they are not interacting. It touches the virtio 
> facility
> which (partially) overlaps the function of the device status for sure. You 
> invent a
> new state machine, and leave the vendors to guess how or why they are not
> interacting with the existing one.
Huh, something needs explanation only when there is interaction.
I explained that device_status is just another virtio register.

I missed adding the other vendors' Sign-offs. I will add them.

> There are just too many corner cases that need to be figured out.
> 
> For example:
> 
> How do you define stop? Is it a virtio level stop, transport level or a 
> mixing of
> them both? 
It is defined in the series, in the device and driver requirements sections, 
and also in the theory of operation.

> Is the device allowed to stop in the middle or reset, feature
> negotiation or even transport specific things like FLR?
Yes.

> If yes, how about other operations and who defines and maintains those
> transitional states? If not, why and how long would a stop wait for an 
> operation?
> Can a stop fail? What happens if the driver wants to reset but the device is
> stopped by the admin commands? Who suppresses who and why?
All are described in the normative sections of the series. If something is 
missing, please put a comment there and I will fix it in v1.

> 
> This demonstrates the complexity of your proposal and I don't see any of the
> above were clearly stated in your series. Reusing the existing device status
> machine, everything would be simplified.
You see, it is too early a conclusion to say things are missing; take mine...
I will fix missing items in v1; please put review comments there.

> 
> > with the device_status because device_status is just another register of the
> virtio.
> 
> Let's don't do layer violation, device status is the basic facility of the 
> virtio
> device which is not coupled with any transport so it is not necessarily
> implemented via registers.

There is no violation. device_status is already part of the transport-specific 
context.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-19 Thread Parav Pandit

> From: Jason Wang 
> Sent: Tuesday, September 19, 2023 9:58 AM
> 
> On Mon, Sep 18, 2023 at 2:55 PM Parav Pandit  wrote:
> >
> > > From: Zhu, Lingshan 
> > > Sent: Monday, September 18, 2023 12:19 PM
> >
> >
> > > so admin vq based LM solution can be a side channel attacking
> > > surface
> > It will be part of the DSM whenever it will be used in future.
> > Hence, it is not attack surface.
> 
> DSM is not a part of TVM. So it really depends on what kind of work did the
> admin virtqueue do. For commands that can't be self-contained like
> provisioning, it is fine, since it is done before the TDI assignment. But it 
> not
> necessarily for your migration proposal. It seems you've found another case
> that self-containing is important:
> allowing the owner to access the member after TDI is attached to TVM is a side
> channel attack.

The TVM and DSM specs will be extended in the future when we get there, so the 
core hypervisor will not be involved.
With trap+mediation, it is involved.

Lingshan wanted to take up this TDISP extension in the future.
So are you both aligned or not yet?


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-18 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 3:05 PM
> 
> On 9/18/2023 2:54 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Monday, September 18, 2023 12:19 PM
> >
> >> so admin vq based LM solution can be a side channel attacking surface
> > It will be part of the DSM whenever it will be used in future.
> > Hence, it is not attack surface.
> I am not sure, why we have to trust the PF?
> This is out of virtio scope anyway.
> 
> I have explained many times how it can be a attack surface, and examples.
> 
And none of that makes any sense, as fundamentally the hypervisor is trusted 
regardless of the approach.

> What happen if malicious SW dump guest memory by admin vq dirty page
> tracking feature?
What??
Where is this malicious SW located, in the guest VM?

> >
> >>>>>> For untrusted hypervisor, same set of attack surface is present
> >>>>>> with
> >>>>>> trap+emulation.
> >>>>>> So both method score same. Hence its not relevant point for discussion.
> >>>>> this is not hypervisor, Do you see any modern hypervisor have
> >>>>> these issues?
> >>>>>
> >>>>> This is admin vq for LM can be a side channel attacking surface.
> >>> It is not.
> >>> Hypervisor is trusted entity.
> >>> For untrusted hypervisor the TDISP is unified solution build by the
> >>> various
> >> industry bodies including DMTF, PCI for last few years.
> >>> We want to utilize that.
> >> first, TDISP is out of virtio spec.
Sure; hence, untrusted hypervisors are out of scope.
Otherwise, trap+emulation, which relies on the hypervisor to do things, is 
equally dead.
> so lets focus on LM topic, other than confidential computing.
ok.

> > Just because data transfer is not done, it does not mean that thousands of
> polling register writes complete in stipulated time.
> 1) again, they are per-device facilities
That does not establish that it can somehow do the work in < x usec time.
> 2) we use very few registers, even status byte does not require polling, just 
> re-
> read with delay.
> 
> Please refer to the code for setting FEATURES_OK.
It won't work when one needs to suspend the device.
There is no point in doing such work over registers, as the fundamental 
framework is over the AQ.
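To put rough numbers on that scale argument (illustrative assumptions only; the 
per-read latency and poll counts are made up for the arithmetic, not measured):

    1,000 member devices x 100 status polls each x ~1 usec per MMIO read
    = 100,000 reads ≈ 100 msec of serialized register polling,

whereas commands for all members can be enqueued on the AQ at once and 
completed by the device in parallel.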


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-18 Thread Parav Pandit
> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 12:19 PM


> so admin vq based LM solution can be a side channel attacking surface
It will be part of the DSM whenever it is used in the future.
Hence, it is not an attack surface.

> >
>  For untrusted hypervisor, same set of attack surface is present
>  with
>  trap+emulation.
>  So both method score same. Hence its not relevant point for discussion.
> >>> this is not hypervisor, Do you see any modern hypervisor have these
> >>> issues?
> >>>
> >>> This is admin vq for LM can be a side channel attacking surface.
> > It is not.
> > Hypervisor is trusted entity.
> > For untrusted hypervisor the TDISP is unified solution build by the various
> industry bodies including DMTF, PCI for last few years.
> > We want to utilize that.
> first, TDISP is out of virtio spec.
Sure; hence, untrusted hypervisors are out of scope.
Otherwise, trap+emulation, which relies on the hypervisor to do things, is 
equally dead.

> second, TDISP devices can not be migrated for now third, admin vq can be an
> side channel attacking surface as explained above.
When TDISP is not used, the hypervisor is a trusted entity, period.
Hence, it cannot be considered an attack surface.
A hypervisor can even disable SR-IOV.

> >
> >> #3 There is no QoS issue with admin commands and queues. If you
> >> claim that,
> > then the whole virtio spec based on virtqueues is broken.
> >> And it is certainly not the case.
> > Please do not confuse the concepts and purposes of the data queues
> > and the admin vq.
> >
>  I am not confused.
>  There is no guarantee that a register placed on the VF will be
>  serviced by the device in the exact same time regardless of VF count =
>  1 or 4000.
>  Yet again, not a relevant comparison.
> >>> please read my previous replies in other threads.
> > It does not answer the question.
> > The claim that a polling register somehow ensures a downtime guarantee at the
> scale of thousands of member devices is some specific device implementation,
> offered without explanation.
> the registers and the LM facilities are per-device.
> >
> > For data queues, it can be slow without mq or rss; that means
> > performance overhead, but it can work.
>  No, it does not work. The application fails because of jitter in
>  the video and audio due to missing the latency budget.
>  A financial application is terminated due to timeouts and packet loss.
> 
>  Device migration is just a third such application.
> 
>  It is also the same.
>  My last reply on this vague argument.
> >>> I think the points are clear, and you already understand the points,
> >>> so no need to argue anymore
> > Yes, I have been clear for a long time: neither AQ, nor registers, nor RSS
> queues can guarantee any performance characteristics.
> > It is pretty clear to me.
> > Any performance guarantees are explicitly requested when desired.
> >
> > For the admin vq, if it doesn't meet QoS requirements, it fails to
> > migrate guests.
> >
> > I have replied to the same question so many times, and this is the
> > last time.
>  I also replied many times that the QoS argument is not valid anymore.
>  The same can happen with register writes.
>  Perf characteristics for 30+ devices are not in the virtio spec. They
>  are implementation details.
> >>> as replied many times, registers only serve the device itself and
> >>> registers are not the DATA PATH, meaning the device doesn't transfer data
> >>> through registers.
> > It does not matter whether it is data path or control path; the fact is that
> > downtime assurance
> cannot be guaranteed by the register interface design, it is an implementation
> detail.
> > And the same goes for admin commands and/or the AQ.
> the registers do not perform any data transfers, e.g., we don't migrate
> dirty
> pages through registers.
> But you do these via the admin vq
So what?
Just because data transfer is not done, it does not mean that thousands of 
polling register writes complete in the stipulated time.


[virtio-dev] RE: [virtio-comment] Re: [PATCH 2/5] virtio: introduce SUSPEND bit in device status

2023-09-18 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 12:09 PM

> > There is a parent object.
> > There is the VQ, for which you propose SUSPEND_RESET while the parent virtio
> device is already SUSPENDED.
> that is why we plan to implement a new feature bit to control this behavior.
> 
Adding a feature bit does not matter when the semantics themselves are wrong.

> However, in the next version, as MST suggested, I will forbid resetting vqs after
> SUSPEND.
Ok. That is good.
I was confused by the proposed bit above.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-18 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 10:55 AM
> To: Parav Pandit ; virtio-dev@lists.oasis-open.org; Michael
> S. Tsirkin ; Jason Wang 
> Subject: Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq
> state
> 
> CC MST and Jason
> 
> On 9/18/2023 1:21 PM, Zhu, Lingshan wrote:
> >
> >
> > On 9/18/2023 12:32 PM, Parav Pandit wrote:
> >>
> >>> From: Zhu, Lingshan 
> >>> Sent: Monday, September 18, 2023 8:40 AM
> >>>
> >>> On 9/17/2023 1:32 PM, Parav Pandit wrote:
> >>>>> From: virtio-dev@lists.oasis-open.org
> >>>>>  On Behalf Of Zhu, Lingshan
> >>>>> Sent: Friday, September 15, 2023 9:59 AM
> >>>>>
> >>>>> On 9/14/2023 7:14 PM, Michael S. Tsirkin wrote:
> >>>>>> On Wed, Sep 06, 2023 at 04:16:32PM +0800, Zhu Lingshan wrote:
> >>>>>>> This series introduces
> >>>>>>> 1) a new SUSPEND bit in the device status, which is used to
> >>>>>>> suspend the device, so that the device states and virtqueue
> >>>>>>> states are stabilized.
> >>>>>>>
> >>>>>>> 2)virtqueue state and its accessor, to get and set
> >>>>>>> last_avail_idx and last_used_idx of virtqueues.
> >>>>>>>
> >>>>>>> The main usecase of these new facilities is Live Migration.
> >>>>>>>
> >>>>>>> Future work: dirty page tracking and in-flight descriptors.
> >>>>>>> This series addresses many comments from Jason, Stefan and
> >>>>>>> Eugenio from RFC series.
> >>>>>> Compared to Parav's patchset this is much less functional.
> >>>>> we will add dirty page tracking and an in-flight IO tracker in V2,
> >>>>> then it will be a full featured LM solution.
> >>>>>
> >>>>> They are not in this series because we want this series to be
> >>>>> small and focused.
> >>>>>> Assuming that one goes in, can't we add ability to submit admin
> >>>>>> commands through MMIO on the device itself and be done with it?
> >>>>> I am not sure, IMHO, if we use admin vq as back-ends for MMIO
> >>>>> based live migration, then the issues in admin vq still exist, for 
> >>>>> example:
> >>>>> 1)nested virtualization
> >>>>> 2)bare-metal live migration
> >>>>> 3)QOS
> >>>>> 4)introduce more attacking surfaces.
> >>>>>
> >>>> #4 is just random without.
> >>> I failed to process "random without".
> >>>
> >>> If you expect admin vq to perform live migration, it can certainly
> >>> be a side channel attacking surface, for example:
> >>> a) a malicious SW can stop the device from running
> >>> b) a malicious SW can sniff guest memory by tracking guest dirty
> >>> pages, then speculate on guest operations and steal secrets.
> >> This is the mode when the hypervisor is trusted.
> > The PF is not always owned by the hypervisor, right?
> > And you don't pass through the PF to any guests, right?
> >> When the hypervisor is untrusted, with a CC-model TDISP-enabled device, the TSM
> >> will delegate the tasks to the DSM.
> > TDISP devices cannot be migrated for now.
That is fine; the infra is built so that they can be migrated one day.
And at that point the proposed admin command-based model also fits fine.

> >>
> >> For an untrusted hypervisor, the same attack surface is present with
> >> trap+emulation.
> >> So both methods score the same. Hence it is not a relevant point for discussion.
> > this is not the hypervisor. Do you see any modern hypervisor with these
> > issues?
> >
> > This is how the admin vq for LM can be a side channel attack surface.
It is not.
The hypervisor is a trusted entity.
For an untrusted hypervisor, TDISP is the unified solution built by the various 
industry bodies, including DMTF and PCI-SIG, over the last few years.
We want to utilize that.

> >>
> >>>> #3 There is no QoS issue with admin commands and queues. If you
> >>>> claim that
> >>> then the whole virtio spec based on virtqueues is broken.
> >>>> And it is certainly not the case.
> >>> Please do not confuse the concepts and purposes of the data queues
> >>> and the admin vq.
> >>>
> >> I am not confused.

[virtio-dev] RE: [virtio-comment] Re: [PATCH 2/5] virtio: introduce SUSPEND bit in device status

2023-09-18 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 10:45 AM
> 
> On 9/18/2023 12:42 PM, Parav Pandit wrote:
> >> From: virtio-comm...@lists.oasis-open.org
> >>  On Behalf Of Zhu, Lingshan
> >> Sent: Monday, September 18, 2023 8:27 AM
> >
> >> a new feature bit: VIRTIO_F_RING_SUSPEND_RESET. If this feature bit
> >> has been negotiated, then the device allows resetting a vq after SUSPEND.
> > These are simply the wrong semantics to build: operating on an individual object
> > after its
> parent object is suspended.
> A device can choose to respond to a set of signals and ignore others, right?
> 
> And, this is not your admin vq based LM solution; therefore there are NO PARENT
> objects.
There is a parent object.
There is the VQ, for which you propose SUSPEND_RESET while the parent virtio device 
is already SUSPENDED.

Admin commands and the admin vq exist in the spec to do admin work.
The admin vq series is split from its users because it is hard to do everything 
in one go.


[virtio-dev] RE: [virtio-comment] Re: [PATCH 2/5] virtio: introduce SUSPEND bit in device status

2023-09-17 Thread Parav Pandit

> From: virtio-comm...@lists.oasis-open.org On Behalf Of Zhu, Lingshan
> Sent: Monday, September 18, 2023 8:27 AM


> a new feature bit: VIRTIO_F_RING_SUSPEND_RESET. If this feature bit has been
> negotiated, then the device allows resetting a vq after SUSPEND.

These are simply the wrong semantics to build: operating on an individual object 
after its parent object is suspended.
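
For illustration only, a minimal driver-side sketch of what the proposed bit would 
gate. The bit number and helper names are assumptions (the bit was never finalized), 
and the thread later drops the idea in favor of forbidding vq reset after SUSPEND:

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_RING_SUSPEND_RESET 41ULL /* assumed bit number, never finalized */

struct virtio_dev {
    uint64_t negotiated_features;
    bool suspended;
};

static bool has_feature(const struct virtio_dev *vdev, uint64_t bit)
{
    return (vdev->negotiated_features >> bit) & 1;
}

/* Reset the vq only if allowed while SUSPENDED; -1 means the device would
 * not serve the reset. The actual reset is transport specific and elided. */
static int try_reset_vq(struct virtio_dev *vdev, uint16_t vq_index)
{
    if (vdev->suspended && !has_feature(vdev, VIRTIO_F_RING_SUSPEND_RESET))
        return -1;
    /* ... issue the transport-specific queue reset for vq_index ... */
    (void)vq_index;
    return 0;
}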


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-17 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Monday, September 18, 2023 8:40 AM
> 
> On 9/17/2023 1:32 PM, Parav Pandit wrote:
> >> From: virtio-dev@lists.oasis-open.org
> >>  On Behalf Of Zhu, Lingshan
> >> Sent: Friday, September 15, 2023 9:59 AM
> >>
> >> On 9/14/2023 7:14 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Sep 06, 2023 at 04:16:32PM +0800, Zhu Lingshan wrote:
> >>>> This series introduces
> >>>> 1) a new SUSPEND bit in the device status, which is used to suspend
> >>>> the device, so that the device states and virtqueue states are
> >>>> stabilized.
> >>>>
> >>>> 2)virtqueue state and its accessor, to get and set last_avail_idx
> >>>> and last_used_idx of virtqueues.
> >>>>
> >>>> The main usecase of these new facilities is Live Migration.
> >>>>
> >>>> Future work: dirty page tracking and in-flight descriptors.
> >>>> This series addresses many comments from Jason, Stefan and Eugenio
> >>>> from RFC series.
> >>> Compared to Parav's patchset this is much less functional.
> >> we will add dirty page tracking and an in-flight IO tracker in V2, then
> >> it will be a full featured LM solution.
> >>
> >> They are not in this series because we want this series to be small and
> >> focused.
> >>> Assuming that one goes in, can't we add ability to submit admin
> >>> commands through MMIO on the device itself and be done with it?
> >> I am not sure, IMHO, if we use admin vq as back-ends for MMIO based
> >> live migration, then the issues in admin vq still exist, for example:
> >> 1)nested virtualization
> >> 2)bare-metal live migration
> >> 3)QOS
> >> 4)introduce more attacking surfaces.
> >>
> > #4 is just random without.
> I failed to process "random without".
> 
> If you expect admin vq to perform live migration, it can certainly be a side
> channel attacking surface, for example:
> a) a malicious SW can stop the device from running
> b) a malicious SW can sniff guest memory by tracking guest dirty pages, then
> speculate on guest operations and steal secrets.
This is the mode when the hypervisor is trusted.
When the hypervisor is untrusted, with a CC-model TDISP-enabled device, the TSM 
will delegate the tasks to the DSM.

For an untrusted hypervisor, the same attack surface is present with 
trap+emulation.
So both methods score the same. Hence it is not a relevant point for discussion.

> > #3 There is no QoS issue with admin commands and queues. If you claim that,
> then the whole virtio spec based on virtqueues is broken.
> > And it is certainly not the case.
> Please do not confuse the concepts and purposes of the data queues and
> admin vq.
> 
I am not confused.
There is no guarantee that a register placed on the VF will be serviced by the 
device in the exact same time regardless of VF count = 1 or 4000.
Yet again, not a relevant comparison.

> For data queues, it can be slow without mq or rss; that means performance
> overhead, but it can work.
No, it does not work. The application fails because of jitter in the video and 
audio due to missing the latency budget.
A financial application is terminated due to timeouts and packet loss.

Device migration is just a third such application.

It is also the same.
My last reply on this vague argument.

> For the admin vq, if it doesn't meet QoS requirements, it fails to migrate guests.
> 
> I have replied to the same question so many times, and this is the last time.
> >
I also replied many times that the QoS argument is not valid anymore. The same can 
happen with register writes.
Perf characteristics for 30+ devices are not in the virtio spec. They are 
implementation details.


RE: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-09-16 Thread Parav Pandit

> From: virtio-dev@lists.oasis-open.org  On
> Behalf Of Zhu, Lingshan
> Sent: Friday, September 15, 2023 9:59 AM
> 
> On 9/14/2023 7:14 PM, Michael S. Tsirkin wrote:
> > On Wed, Sep 06, 2023 at 04:16:32PM +0800, Zhu Lingshan wrote:
> >> This series introduces
> >> 1) a new SUSPEND bit in the device status, which is used to suspend the
> >> device, so that the device states and virtqueue states are
> >> stabilized.
> >>
> >> 2)virtqueue state and its accessor, to get and set last_avail_idx and
> >> last_used_idx of virtqueues.
> >>
> >> The main usecase of these new facilities is Live Migration.
> >>
> >> Future work: dirty page tracking and in-flight descriptors.
> >> This series addresses many comments from Jason, Stefan and Eugenio
> >> from RFC series.
> > Compared to Parav's patchset this is much less functional.
> we will add dirty page tracking and an in-flight IO tracker in V2, then it will
> be a
> full featured LM solution.
> 
> They are not in this series because we want this series to be small and focused.
> >
> > Assuming that one goes in, can't we add ability to submit admin
> > commands through MMIO on the device itself and be done with it?
> I am not sure, IMHO, if we use admin vq as back-ends for MMIO based live
> migration, then the issues in admin vq still exist, for example:
> 1)nested virtualization
> 2)bare-metal live migration
> 3)QOS
> 4)introduce more attacking surfaces.
> 
#4 is just random without.
#3 There is no QoS issue with admin commands and queues. If you claim that then 
whole virtio spec based on the virtqueues is broken.
And it is certainly not the case.

> And what's more, if we want to implement a new capability on behalf of the
> admin vq, does the capability need to store at least one descriptor buffer,
> that is,
> should the capability length be at least the max_length_of_buffer?
> 
> If that is not possible, do we need to implement extra fields like length and
> remaining_length, with the device repeatedly updating the cap data and the
> driver repeatedly reading? That is way too complex and introduces significant downtime.
> 
At least, I do not understand. Maybe you can explain more?

We observe less than 300 msec downtime using admin commands over the admin queue 
in internal tests in the virtio area.
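
As context for the quoted series, a minimal sketch of the virtqueue state it 
proposes to get and set (last_avail_idx and last_used_idx). The accessor names 
here are illustrative, not taken from the proposal:

#include <stdint.h>

/* Split-ring indices are 16-bit; a packed ring would also need wrap counters. */
struct vq_state {
    uint16_t last_avail_idx; /* next available-ring entry the device will read */
    uint16_t last_used_idx;  /* next used-ring entry the device will write */
};

/* Assumed accessors: saved on the source once SUSPEND stabilizes the device,
 * restored on the destination before the device is resumed. */
extern struct vq_state vq_state_get(uint16_t queue_index);
extern void vq_state_set(uint16_t queue_index, struct vq_state state);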


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-16 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 14, 2023 8:42 AM
> 
> On Wed, Sep 13, 2023 at 12:46 PM Parav Pandit  wrote:
> >
> > > From: Jason Wang 
> > > Sent: Wednesday, September 13, 2023 10:14 AM
> >
> > > It's not about how many states in a single state machine, it's about
> > > how many state machines that exist for device status. Having more
> > > than one creates big obstacles and complexity in the device. You
> > > need to define the interaction of each state otherwise you leave undefined
> behaviours.
> > The device mode has zero relation to the device status.
> 
> You will soon get this issue when you want to do nesting.
> 
I don't think so. One needs to intercept it when one wants to do trap+emulation, 
which seems to fulfil the nesting use case.

> > It does not mess with it at all.
> > In fact the new bits in the device status are making it more complex for the
> > device
> to handle.
> 
> Are you challenging the design of the device status? It's definitely too late 
> to do
> this.
> 
No. I am saying that extending device_status with yet another state is equally 
complex, and it is the core of the device.

> This proposal increases just one bit and that worries you? Or you think one
> more state is much more complicated than a new state machine with two
> states?

It is a mode, not a state. And two modes are needed to support P2P devices.
When one wants to do it with mediation, two states are also needed there.

The key is that the modes do not interact with device_status, because 
device_status is just another virtio register.
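
To make the mode-vs-status distinction concrete, here is a hedged sketch of the 
device migration modes as I read series [1]; the names and values are illustrative, 
and, as argued above, none of this touches the device_status register:

/* Migration modes, deliberately kept separate from device_status. */
enum virtio_dev_mode {
    VIRTIO_MODE_ACTIVE = 0, /* normal operation */
    VIRTIO_MODE_STOP   = 1, /* device stops initiating operations; an assumed
                               intermediate step for P2P-capable devices */
    VIRTIO_MODE_FREEZE = 2, /* device stops processing everything; the device
                               context is stable and can be read for migration */
};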


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-16 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 14, 2023 8:41 AM
> 
> On Wed, Sep 13, 2023 at 12:37 PM Parav Pandit  wrote:
> >
> >
> >
> > > From: Zhu, Lingshan 
> > > Sent: Wednesday, September 13, 2023 9:51 AM
> >
> > > we plan to implement a self-contained solution
> > Make sure that works with device reset and FLR.
> 
> We don't need to do that. It's out of the spec.
> 
It is not. For the PCI member device, it needs to work reliably.
Not doing so means it relies on trap+emulation; hence it just cannot be complete.
And that is ok with me.
I just won't claim that trap+emulation is a _complete_ method.

> > And if not, explain that it is for mediation mode related tricks.
> 
> It's not tricks and, again, it's not mediation but trap and emulation.
> It's the
> fundamental methodology used in virtualization, and the virtio spec is built on it.

Not the virtio spec of 2023 and later, for new features.
The base for virtio spec 1.x was 0.9.5, but not QEMU or other mediation 
based software, AFAIK.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-16 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 14, 2023 8:39 AM
> 
> On Wed, Sep 13, 2023 at 2:39 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Wednesday, September 13, 2023 10:15 AM
> >
> > [..]
> > > > > > [1]
> > > > > > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
> > > > >
> > > > > The series works for stateless devices. Before we introduce
> > > > > device states in the spec, we can't migrate stateful devices. So
> > > > > the device context doesn't make much sense right now.
> > > > The series works for stateful devices too. The device context covers it.
> > >
> > > How? Can it be used for migrating any existing stateful devices?
> > > Don't we need to define what context means for a specific stateful
> > > device before you can introduce things like device context? Please
> > > go through the archives for the relevant discussions (e.g
> > > virtio-FS), it's not as simple as introducing a device context API.
> > >
> > A device will have its own context, for example the RSS definition, or flow
> > filters
> tomorrow.
> 
> If you know there are things that are missing when posting the patches, please
> use the RFC tag.
> 
They are not missing. They are optional, which is why they are not needed in this 
series.

> > The device context will be extended post the first series.
> >
> > > And what's more, how can it handle the migration compatibility?
It will be taken care of in a follow-on, as we all know this needs to be checked.
> 
> You don't even mention it anywhere in your series.
> 
Migration compatibility is a topic in itself, regardless of the device migration 
series.
It is part of the feature provisioning phase, which is needed regardless.
Just as you and Lingshan wanted to keep the suspend bit series small and 
logical, the device migration series is also logically split by functionality.
I don't see a need to mention long-known missing functionality that is common 
to both approaches.

> > I will include notes on the future follow-up work items in v1, which will be
> taken care of post this series.
> >
> > > > > Dirty page tracking in virtio is not a must for live migration
> > > > > to work. It can be done via platform facilities or even
> > > > > software. And to make it more efficient, it needs to utilize
> > > > > transport facilities instead of a
> > > general one.
> > > > >
> > > > It is also optional in the spec proposal.
> > > > Most of the claimed platforms are not able to do it efficiently either,
> > >
> > > Most platforms are working towards an efficient way. But we are
> > > talking about different things, hardware based dirty page logging is
> > > not a must, that is what I'm saying. For example, KVM doesn't use hardware
> to log dirty pages.
> > >
> > I also said the same, that hw based dirty page logging is not a must. :) One
> > day the hw MMU will be able to track everything efficiently. I have not seen it
> happening yet.
> 
> How do you define efficiency? KVM uses page fault and most modern IOMMU
> support PRI now.
>
One cannot define PRI as a mandatory feature. In our research and experiments we 
see that PRI is significantly slower at handling page faults.
Yet that is a different topic...

Efficiency is defined by the downtime of the multiple devices in a VM.
And a leading OS enabled device advancements by allowing the device to report 
dirty pages in a CPU and platform agnostic way...

One can use a post-copy approach as well; the current device migration is built 
around the established pre-copy approach.
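
For clarity, a hedged sketch of the pre-copy loop implied here, with the device 
reporting dirty pages. Every function is a hypothetical placeholder (assuming 
read-and-clear query semantics), not a spec or kernel API:

#include <stddef.h>
#include <stdint.h>

/* Assumed read-and-clear query: fills the bitmap, returns #dirty pages. */
extern size_t query_dirty_bitmap(uint8_t *bitmap, size_t len);
extern void send_dirty_pages(const uint8_t *bitmap, size_t len);
extern void stop_device_and_send_remainder(void);

static void precopy_migrate(uint8_t *bitmap, size_t len, size_t budget)
{
    /* Iterate while the device dirties more pages than the downtime
     * budget allows us to transfer in one final stop-and-copy pass. */
    while (query_dirty_bitmap(bitmap, len) > budget)
        send_dirty_pages(bitmap, len);

    /* Final pass: quiesce the device, then transfer what is left. */
    stop_device_and_send_remainder();
}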

> >
> > > > hence the vfio subsystem added the support for it.
> > >
> > > As an open standard, if it is designed for a specific software
> > > subsystem on a specific OS, it's a failure.
> > >
> > It is not.
> > One needs to accept that, in certain areas, virtio is following the trail of
> advancements already made in the sw stack,
> > so that virtio spec advancements fit in to supply such use cases.
> > And blocking such advancement of the virtio spec to promote an only-mediation
> approach is not good either.
> >
> > BTW: one can say the mediation approach is also designed for a specific
> software subsystem and is hence a failure.
> > I will stay away from quoting it, as I don't see it this way.
> 
> The proposal is based on well known technology since the birth of 
> virtualization.
Sure, but that does not change the fact that such

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-16 Thread Parav Pandit

> From: Jason Wang 
> Sent: Thursday, September 14, 2023 8:41 AM
> 
> On Wed, Sep 13, 2023 at 2:06 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Wednesday, September 13, 2023 10:14 AM
> > > To: Parav Pandit 
> >
> > > > One can build infinite levels of nesting to avoid passthrough; at
> > > > the end, user
> > > applications remain slow.
> > >
> > > We are talking about nested virtualization, not nested emulation. I
> > > won't repeat the definition of virtualization, but no matter how many
> > > levels of nesting, the hypervisor will try hard to let the
> > > application run natively for most of the time, otherwise it's not the 
> > > nested
> virtualization at all.
> > >
> > > Nested virtualization has been supported by all major cloud vendors,
> > > please read the relevant documentation for the performance
> > > implications. Virtio community is not the correct place to debate
> > > whether a nest is useful. We need to make sure the datapath could be
> > > assigned to any nest layers without losing any fundamental facilities like
> migration.
> > >
> > I am not debating. You or Lingshan claim or imply that mediation is the only
> way to progress.
> 
> Let me correct your terminology again. It's "trap and emulation". It means the
> workload runs mostly natively but is sometimes trapped by the hypervisor.
>
 
> And it's not the only way. It's the starting point, since all of the current virtio
> spec is built
> upon this methodology.
The current spec is not the steering point for defining new methods.
So we will build the spec infra to support passthrough.

Mediation/trap-emulation, where the hypervisor is involved, is also a second use 
case that you are addressing.

Hence, the two are not mutually exclusive.
So we should not debate that anymore.

> 
> > And for sure virtio does not need to live in the dark shadow of mediation
> > always.
> 
> 99% of virtio devices are implemented in this way (which is what you call dark
> and shadow) now.
> 
What I am saying is that one should not say mediation/trap-emulation is the only 
way for virtio.
So let passthrough device migration progress.

> > For nesting use case sure one can do mediation related mode.
> >
> > So only mediation is not the direction.
> 
> CPU and MMU virtualization were all built in this way.
> 
Not anymore. Both of them have vcpus and a viommu, where many things are not 
trapped.
So, as I said, both have pros and cons, and users will pick what fits their needs 
and use case.

> >
> > > > So for such N and M being > 1, one can use software based emulation
> anyway.
> > >
> > > No, only the control path is trapped, the datapath is still passthrough.
> > >
> > Again, it depends on the use case.
> 
> No matter what use case, the definition and methodology of virtualization
> stands still.
> 
I will stop debating this because the core technical question is not answered.
I don't see a technology available that virtio can utilize for it:
an interface that can work without messing with device status and FLR 
while device migration is ongoing.
Hence, the methodologies for passthrough and mediation/trap-emulation are 
fundamentally different.
And that is just fine.

> >
> > > >
> > > > >
> > > > > And exposing the whole device to the guest drivers will have
> > > > > security implications, your proposal has demonstrated that you
> > > > > need a workaround for
> > > > There are no security implications in passthrough.
> > >
> > > How can you prove this or is it even possible for you to prove this?
> > Huh, when you claim that it is not secure, please point out exactly what is
> > not
> secure.
> > Please take it up with PCI-SIG and file a CVE with PCI-SIG.
> 
> I am saying it has security implications. That is why you need to explain why 
> you
> think it doesn't. What's more, the implications are obviously nothing related 
> to
> PCI SIG but a vendor virtio hardware implementation.
> 
PCI passthrough for virtio member devices and non-virtio devices with P2P, and 
their interaction, is already there in the VM.
Device migration is not adding/removing anything, nor touching any security 
aspect of it.
Because it does not need to, either.
Device migration is making sure that it continues to exist.

> >
> > > You expose all device details to guests (especially the transport
> > > specific details), the attack surface is increased in this way.
> > One can say it is the opposite.
> > Attack surface is increased in the hypervisor due to mediation poking at
> > everything controlled by the guest.

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-13 Thread Parav Pandit

> From: Jason Wang 
> Sent: Wednesday, September 13, 2023 10:15 AM

[..]
> > > > [1]
> > > > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
> > >
> > > The series works for stateless devices. Before we introduce device
> > > states in the spec, we can't migrate stateful devices. So the device
> > > context doesn't make much sense right now.
> > The series works for stateful devices too. The device context covers it.
> 
> How? Can it be used for migrating any existing stateful devices? Don't we need
> to define what context means for a specific stateful device before you can
> introduce things like device context? Please go through the archives for the
> relevant discussions (e.g virtio-FS), it's not as simple as introducing a 
> device
> context API.
> 
A device will have its own context, for example the RSS definition, or flow 
filters tomorrow.
The device context will be extended post the first series.

> And what's more, how can it handle the migration compatibility?
It will be taken care of in a follow-on, as we all know this needs to be checked.
I will include notes on the future follow-up work items in v1, which will be 
taken care of post this series.

> > > Dirty page tracking in virtio is not a must for live migration to
> > > work. It can be done via platform facilities or even software. And
> > > to make it more efficient, it needs to utilize transport facilities 
> > > instead of a
> general one.
> > >
> > It is also optional in the spec proposal.
> > Most of the claimed platforms are not able to do it efficiently either,
> 
> Most platforms are working towards an efficient way. But we are talking about
> different things, hardware based dirty page logging is not a must, that is 
> what
> I'm saying. For example, KVM doesn't use hardware to log dirty pages.
> 
I also said the same, that hw based dirty page logging is not a must. :)
One day the hw MMU will be able to track everything efficiently. I have not seen 
it happening yet.

> > hence the vfio subsystem added the support for it.
> 
> As an open standard, if it is designed for a specific software subsystem on a
> specific OS, it's a failure.
> 
It is not.
One needs to accept that, in certain areas, virtio is following the trail of 
advancements already made in the sw stack,
so that virtio spec advancements fit in to supply such use cases.
And blocking such advancement of the virtio spec to promote an only-mediation 
approach is not good either.

BTW: one can say the mediation approach is also designed for a specific software 
subsystem and is hence a failure.
I will stay away from quoting it, as I don't see it this way.

> >
> > > The FLR, P2P demonstrates the fragility of a simple passthrough
> > > method and how it conflicts with live migration and complicates the device
> implementation.
> > Huh, it shows the opposite.
> > It shows that both work seamlessly.
> 
> Have you even tried your proposal with a prototype device?
Of course; it was delivered to users 1.5 years ago, before bringing it to the 
spec, with virtio-net and virtio-blk devices.

> 
> >
> > > And it means you need to audit all PCI features and do workaround if
> > > there're any possible issues (or using a whitelist).
> > No need for any of this.
> 
> You need to prove this, otherwise it's fragile. It's the duty of the author to
> justify,
> not the reviewer.
> 
One cannot post patches, nor review a giant series, in one go.
Hence the work is split on a logical boundary.
Feature provisioning, PCI layout, etc. are secondary tasks to take care of.

> For example FLR is required to be done in 100ms. How could you achieve this
> during the live migration? How does it affect the downtime and FRS?
> 
A good technical question to discuss, instead of passthrough vs mediation. :)

Device administration work is separate from the device operational part.
The device context records the current device context; when an FLR 
occurs, the device stops all operations.
And on the next read of the device context, the FLRed context is returned.
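
A hedged device-side sketch of that interaction, under the assumption stated 
above: FLR stops device operation and resets the recorded context, so the next 
context read simply returns the FLRed state. The types and helpers are 
illustrative, not from the spec or any implementation:

#include <stdbool.h>

struct device_context {
    unsigned char blob[256]; /* placeholder: vq states, config, inflight... */
};

struct member_device {
    bool running;
    struct device_context ctx;
};

extern void device_context_clear(struct device_context *ctx); /* assumed */

static void on_flr(struct member_device *dev)
{
    dev->running = false;            /* FLR: the device stops all operations */
    device_context_clear(&dev->ctx); /* context now reflects the FLRed state */
}

/* Owner-side admin handler: always returns the member's current context,
 * FLRed or not, so migration reads and FLR do not conflict. */
static struct device_context read_device_context(const struct member_device *dev)
{
    return dev->ctx;
}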

> >
> > > This is tricky and we are migrating virtio not virtio-pci. If we
> > > don't use simple passthrough we don't need to care about this.
> > >
> > Exactly, we are migrating virtio device for the PCI transport.
> 
> No, the migration facility is a general requirement for all transport.
It is for all transports. One can extend it when doing MMIO.

> Starting from a PCI specific (actually your proposal does not even cover all 
> even
> for PCI) solution which may easily end up with issues in other transports.
> 
Like?

> Even if you want to migrate virtio for PCI,  please at least read Qemu 
> migration
> codes for virtio and PCI, then you will soon realize that a lot of things are
> missing in your proposal.
> 
Device context is something that will be extended.
VIRTIO_PCI_CAP_PCI_CFG will also be added as an optional item for the PCI transport.

> > As usual, if you have to keep arguing about not doing passthrough, we are
> surely past that point.
> 
> Who 

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-13 Thread Parav Pandit

> From: Jason Wang 
> Sent: Wednesday, September 13, 2023 10:14 AM
> To: Parav Pandit 

> > One can build infinite levels of nesting to avoid passthrough; at the end,
> > user
> applications remain slow.
> 
> We are talking about nested virtualization, not nested emulation. I won't repeat
> the definition of virtualization, but no matter how many levels of nesting, the
> the definition of virtualization but no matter how much level of nesting, the
> hypervisor will try hard to let the application run natively for most of the 
> time,
> otherwise it's not the nested virtualization at all.
> 
> Nested virtualization has been supported by all major cloud vendors, please
> read the relevant documentation for the performance implications. Virtio
> community is not the correct place to debate whether a nest is useful. We need
> to make sure the datapath could be assigned to any nest layers without losing
> any fundamental facilities like migration.
> 
I am not debating. You or Lingshan claim or imply that mediation is the only 
way to progress.
And for sure virtio does not need to live in the dark shadow of mediation always.
For the nesting use case, sure, one can do a mediation-related mode.

So mediation alone is not the direction.

> > So for such N and M being > 1, one can use software based emulation anyway.
> 
> No, only the control path is trapped, the datapath is still passthrough.
> 
Again, it depends on the use case.

> >
> > >
> > > And exposing the whole device to the guest drivers will have
> > > security implications, your proposal has demonstrated that you need
> > > a workaround for
> > There are no security implications in passthrough.
> 
> How can you prove this or is it even possible for you to prove this?
Huh, when you claim that it is not secure, please point out exactly what is not 
secure.
Please take it up with PCI-SIG and file a CVE with PCI-SIG.

> You expose all device details to guests (especially the transport specific 
> details),
> the attack surface is increased in this way.
One can say it is the opposite.
Attack surface is increased in the hypervisor due to mediation poking at 
everything controlled by the guest.


> 
> What's more, a simple passthrough may lose the chance to work around
> hardware errata and you will finally get back to trap and emulation.
Hardware errata are not the starting point for building the software stack and 
spec.
What you imply is that one must never use the vfio stack, one must not use vcpu 
acceleration, and everything must be emulated.

The same hardware errata argument applies to the data path too:
one should not implement it in hw...

I disagree with such an argument.

You can say nesting is a requirement for some use cases, so the spec should support 
it without blocking the passthrough mode.
Then it is a fair discussion.

I will not debate further on passthrough vs control path mediation as an either/or 
approach.

> 
> >
> > > FLR at least.
> > It is actually the opposite.
> > FLR is supported with the proposal without any workarounds and mediation.
> 
> It's an obvious drawback but not an advantage. And it's not a must for live
> migration to work. You need to prove the FLR doesn't conflict with the live
> migration, and it's not only FLR but also all the other PCI facilities. 
I don't know what you mean by "prove". It is already clear from the proposal that 
FLR is not messing with the rest of the device migration infrastructure.
You should read [1].

> one other
> example is P2P, and what's next? As more features are added to the PCI
> spec, you will have endless work auditing the possible conflicts with
> passthrough based live migration.
> 
This drawback equally applies to the mediation route, where one needs to do more 
than audit: the mediation layer must be extended.
So each method has its pros and cons. One suits one use case, the other suits 
another use case.
Therefore, again, claiming that the mediation approach is the only way to 
progress is incorrect.

In fact, an audit is still better than mediation, because most audits are read-only 
work, as opposed to endlessly extending trapping and adding support in the core 
stack.
Again, it is a choice that the user makes, with the tradeoff.

> >
> > >
> > > For non standard device we don't have choices other than
> > > passthrough, but for standard devices we have other choices.
> >
> > Passthrough is a basic requirement that we will be fulfilling.
> 
> It has several drawbacks that I would not like to repeat. We all know even for
> VFIO, it requires a trap instead of a complete passthrough.
> 
Sure. Both have pros and cons.
And both can co-exist.

> > If one wants to do special nesting, maybe, there.
> 
> Nesting is not special. Go and see how it is supported by major cloud vendors
> and you will get the answer. Introducing an interface in virtio tha

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit
> From: Jason Wang 
> Sent: Wednesday, September 13, 2023 10:14 AM

> It's not about how many states in a single state machine, it's about how many
> state machines that exist for device status. Having more than one creates big
> obstacles and complexity in the device. You need to define the interaction of
> each state otherwise you leave undefined behaviours.
The device mode has zero relation to the device status. It does not mess with 
it at all.
In fact, the new bits in the device status are making it more complex for the 
device to handle.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:52 AM

> > It should be named SUSPEND_CFG_SPACE!
> > All of this frankly seems intrusive enough, as Michael pointed out.
> > Good luck.
> it also SUSPENDs the data-path
Ok, so it works like "suspend" in the English dictionary: after that, any other VQ 
related commands don't progress.
Because it is suspended.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:51 AM

> we plan to implement a self-contained solution
Make sure that works with device reset and FLR.
And if not, explain that it is for mediation mode related tricks.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:51 AM


> > VQ depth defines the VQ's limit.
> still sounds like limitless and I will stop arguing this as you can see if 
> there is
> REALLY a queue can be limitless, we even don't need Multi-queue or RSS.

If you see some value in limitless queue, please add one.
I have not seen such construct until now and don’t see the need for it.

> >
>  If you say that requires multiple AQs, then how many should a vendor
> >> provide?
> >>> I didn’t say multiple AQs must be used.
> >>> It is the same as NIC RQs.
> >> don't you agree a single vq has its own performance limitations?
> > For LM I don’t see the limitation.
> > The finite limit an AQ has, such limitation is no different than some 
> > register
> write poll with one entry at a time per device.
> see above, and we are implementing per device facilities.
> >
> >> In this series, it says:
> >> +When setting SUSPEND, the driver MUST re-read \field{device status}
> >> +to
> >> ensure the SUSPEND bit is set.
> >>
> >> And this is nothing to do with scale.
> > Hence, it is bringing same scale QOS limitation on register too that you 
> > claim
> may be present in the AQ.
> >
> > And hence, I responded earlier that when most things are not done through
> BAR, so there is no need to do suspend/resume via BAR either.
> > And hence the mode setting command of [1] is just fine.
> The bar registers are almost "triggers"
> >
> >>> On top of that once the device is SUSPENDED, it cannot accept some
> >>> other
> >> RESET_VQ command.
> >> so as SiWei suggested, there will be a new feature bit introduced in
> >> V2 for vq reset.
> > VQ cannot be RESET after the device reset as you wrote.
> It is device SUSPEND, not reset.
> >
Suspend means suspend, in the plain English sense.
It cannot accept more synchronous commands after that and is not supposed to 
respond.

> >> It does not reside on the PF to migrate the VFs.
> > Hence it does not scale and cannot do parallel operation within
> > the VF,
> >> unless
>  each register is replicated.
>  Why its not scale? It is a per device facility.
> >>> Because the device needs to answer per device through some large
> >>> scale
> >> memory to fit in a response time.
> >> Again, it is a per-device facility, and it is register based serve
> >> the only one device itself.
> >> And we do not plan to log the dirty pages in bar.
> > Hence, there is no reason to wrap suspend resume on the BAR either.
> > The mode setting admin command is just fine.
> They are device status bits.
And they don't have to be.

> >
>  Why do you need parallel operation against the LM facility?
> >>> Because your downtime was 300msec for 1000 VMs.
> >> the LM facility in this series is per-device, it only severs itself.
> > And that single threading and single threading per VQ reset via single 
> > register
> wont scale.
> it is per-device facility, for example, on the VF, not the owner PF.
And I have repeatedly explained, and you never answered, how such a queue can 
work after device suspend.
A weird device bifurcation is not supported by PCI and is not to be done in virtio.

> see above and please feel free to reuse the basic facilities if you like in 
> your AQ
> LM
The whole attitude of "we ..." versus "your" LM is simply wrong.
Please work towards a collaborative design in the technical committee.
What you want to repeat was already posted, so take some time to review and 
utilize it. If not, describe why it is not useful.



[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit


> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:44 AM
> 
> 
> On 9/12/2023 9:35 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Tuesday, September 12, 2023 6:39 PM
> >>
> >> On 9/12/2023 6:41 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan 
> >>>> Sent: Tuesday, September 12, 2023 4:05 PM I mean, why do you think
> >>>> my series can not work with P2P
> >>> Because it misses the intermediate mode STOP that we have in series [1].
> >>>
> >>> [1]
> >>> https://lists.oasis-open.org/archives/virtio-comment/202309/msg00071
> >>> .h
> >>> tml
> >> Again, when SUSPEND:
> >> 1) the device freezes, meaning it stops operation on both the data-path and
> >> control-path, except for the device status
> > Exactly, and the RESET_VQ command also cannot be served because
> the device is frozen.
> see below
> >
> >> 2) a new feature bit will be introduced in V2, to allow RESET_VQ
> >> after SUSPEND
> > RESET_VQ after suspend is simply wrong, because the device is already
> suspended and will not respond to some extra RESET_VQ command.
> No, when the device presents SUSPEND, that means the device config space is
> stabilized at that moment; from the SW perspective the device will not make
> changes to the config space until !SUSPEND.
> 
> However, at that moment the driver can still make modifications to the config
> space, and the driver handles the synchronization (checks, re-reads, etc.), so the
> driver is responsible for what it reads.
>
It should be named SUSPEND_CFG_SPACE!
All of this frankly seems intrusive enough, as Michael pointed out.
Good luck.
 
> As you can see, this is not perfect, so SiWei suggested implementing a new
> feature
> bit to control this, and it will be implemented in V2.
> >
> >> 3) if there is a device doing P2P against the device.
> >> They should be passed through to the same guest and should be
> >> suspended as well for LM, or it is a security problem.
> > There is no security problem. Multiple passthrough devices and P2P have already
> been there in PCI, using ACS, for probably a decade now.
> As you are aware of ACS, that means you have to trust them all; for example, P2P
> devices have to be placed in one IOMMU group, and all devices in the group
> should be passed through to one guest
> should be pass-through-ed to a guest
> >
Such things are done by the hypervisor already. There is nothing virtio 
specific here.
There is no security problem.
If there is one, please file a CVE for generic P2P with PCI-SIG and we will 
handle it in this Thursday's meeting.



[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:33 AM
> To: Parav Pandit ; Jason Wang 
> Cc: Michael S. Tsirkin ; epere...@redhat.com;
> coh...@redhat.com; stefa...@redhat.com; virtio-comment@lists.oasis-
> open.org; virtio-dev@lists.oasis-open.org
> Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement
> VIRTIO_F_QUEUE_STATE
> 
> 
> 
> On 9/13/2023 10:23 AM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Tuesday, September 12, 2023 3:45 PM
> >>> Do you find the administration commands we proposed in [1] useful
> >>> for
> >> nested case?
> >>> If not, both will likely diverge.
> >> Not till now.
> > I don't think you reviewed [1] enough.
> > The following functionality that you want to post in v1 is already covered.
> > Why can't you use it from [1]?
> >
> > a. dirty page tracking (write recording in [1]), b. device
> > suspend/resume (mode setting), c. inflight descriptors (device context)
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.h
> > tml
> you cut off the message, I don't know which conversation you are replying to.
> 
> But anyway, as pointed out many times, we are implementing basic facilities.

I asked you which parts of the series [1] can be used by you for inflight 
tracking, dirty tracking, and suspend/resume.
You replied that none is useful.
And after that you said you plan to send a v2 that does dirty page tracking and 
inflight tracking.

So I asked why you cannot use [1], which covers the things you plan to send in 
the future.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html

Hope this clarifies.


[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit
> From: Zhu, Lingshan 
> Sent: Wednesday, September 13, 2023 9:31 AM
> 
> On 9/12/2023 9:43 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Tuesday, September 12, 2023 6:33 PM
> >>
> >> On 9/12/2023 5:21 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan 
> >>>> Sent: Tuesday, September 12, 2023 2:33 PM admin vq requires fixed
> >>>> and dedicated resources to serve the VMs; the question still
> >>>> remains, does it scale to serve a big number of device migrations?
> >>>> how many admin vqs do you need to serve 10 VMs, how many for 100?
> >>>> and so on? How does it scale?
> >>>>
> >>> Yes, it scales within the AQ and across multiple AQs.
> >>> Please consult your board designers to know such limits for your device.
> >> scaling requires multiple AQs, then how many should a vendor provide
> >> for the worst case?
> >>
> >> I am getting bored of the same repeating questions.
> > I said it scales, within the AQ. (and across AQs).
> > I have answered enough times, so I will stop on the same repeated question.
> > Your repeated question is not helping anyone, as it is not in the scope of
> > virtio.
> >
> > If you think it is, please first get it written for RSS and MQ in the net
> > section and
> post it for review.
> You missed the point of the question and I agree no need to discuss this
> anymore.
Ok. thanks.

> >
> >>>> If one admin vq can serve 100 VMs, can it migrate 1000VMs in
> >>>> reasonable
> >> time?
> >>>> If not, how many exactly.
> >>>>
> >>> Yes, it can serve both 100 and 1000 VMs in a reasonable time.
> >> I am not sure; is the AQ limitless? Can it serve thousands of VMs in a
> >> reasonable time? Like in 300ms?
> >>
> > Yes.
> really? limitless?
> >
I answered yes to "Can it serve thousands of VMs in a reasonable time? Like in 
300ms?".
The VQ depth defines the VQ's limit.

> >> If you say, that require multiple AQ, then how many should a vendor
> provide?
> >>
> > I didn’t say multiple AQs must be used.
> > It is the same as NIC RQs.
> don't you agree a single vq has its own performance limitations?
For LM I don't see the limitation.
The finite limit an AQ has is no different from some register 
write poll with one entry at a time per device.

> In this series, it says:
> +When setting SUSPEND, the driver MUST re-read \field{device status} to
> ensure the SUSPEND bit is set.
> 
> And this is nothing to do with scale.
Hence, it brings the same scale and QoS limitation to the register too, which you 
claim may be present in the AQ.

And hence I responded earlier that, when most things are not done through the BAR, 
there is no need to do suspend/resume via the BAR either.
And hence the mode setting command of [1] is just fine.
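
For reference, this is roughly the register flow the quoted requirement implies. 
The SUSPEND bit value, accessors, and delay are assumptions, since the proposal 
does not fix them:

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_SUSPEND 0x10 /* assumed bit value, not final */

extern uint8_t read_device_status(void);
extern void write_device_status(uint8_t status);
extern void delay_us(unsigned long usec); /* assumed wait helper */

static bool suspend_device(unsigned int max_tries)
{
    write_device_status(read_device_status() | VIRTIO_STATUS_SUSPEND);

    /* Per the quoted text: re-read device status until the device
     * acknowledges SUSPEND, one polled register per device. */
    while (max_tries--) {
        if (read_device_status() & VIRTIO_STATUS_SUSPEND)
            return true;
        delay_us(100);
    }
    return false;
}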

> > On top of that once the device is SUSPENDED, it cannot accept some other
> RESET_VQ command.
> so as SiWei suggested, there will be a new feature bit introduced in V2
> for vq reset.
A VQ cannot be RESET after the device reset, as you wrote.

> >
> >>>> It does not reside on the PF to migrate the VFs.
> >>> Hence it does not scale and cannot do parallel operation within the VF,
> unless
> >> each register is replicated.
> Why does it not scale? It is a per-device facility.
> > Because the device needs to answer per device through some large scale
> memory to fit within a response time.
> Again, it is a per-device facility, and the register based one serves
> only the device itself.
> And we do not plan to log the dirty pages in the BAR.
Hence, there is no reason to wrap suspend/resume in the BAR either.
The mode setting admin command is just fine.

> >
> >> Why do you need parallel operation against the LM facility?
> > Because your downtime was 300msec for 1000 VMs.
> the LM facility in this series is per-device; it only serves itself.
And that single threading, with single threading per VQ reset via a single 
register, won't scale.

> >
> >> That doesn't make a lot of sense.
> >>> Using a register or a queue for bulk data transfer is a question that was solved
> >> when the virtio spec was born.
> >>> I don't see a point in discussing it.
> >>> I don’t see a point to discuss it.
> >>> Snippet from spec: " As a device can have zero or more virtqueues for bulk
> >> data transport"
> >> Where do you see the series intends to transfer bulk data through 
> >> registers?
> >>>> The VFs' config space can use the device's dedicated resources, like the
> bandwidth.

[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Tuesday, September 12, 2023 3:45 PM

> > Do you find the administration commands we proposed in [1] useful for
> nested case?
> > If not, both will likely diverge.
> Not till now.

I don't think you reviewed [1] enough.
The following functionality that you want to post in v1 is already covered.
Why can't you use it from [1]?

a. dirty page tracking (write recording in [1]),
b. device suspend/resume (mode setting),
c. inflight descriptors (device context)

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
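
To summarize the mapping claimed above, a hedged sketch of the three facilities 
as admin command classes. The opcode names and values here are illustrative only; 
the real definitions live in series [1]:

/* a/b/c above as admin commands; names and values are illustrative. */
enum lm_admin_cmd {
    LM_CMD_WRITE_RECORD_READ = 0x1, /* a: dirty page tracking (write recording) */
    LM_CMD_MODE_SET          = 0x2, /* b: device suspend/resume via mode setting */
    LM_CMD_DEV_CTX_READ      = 0x3, /* c: device context, incl. inflight descriptors */
};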




[virtio-dev] RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-09-12 Thread Parav Pandit

> From: Zhu, Lingshan 
> Sent: Tuesday, September 12, 2023 6:33 PM
> 
> On 9/12/2023 5:21 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan 
> >> Sent: Tuesday, September 12, 2023 2:33 PM admin vq requires fixed and
> >> dedicated resources to serve the VMs; the question still remains, does
> >> it scale to serve a big number of device migrations? how many admin
> >> vqs do you need to serve 10 VMs, how many for 100? and so on? How does it
> >> scale?
> >>
> > Yes, it scales within the AQ and across multiple AQs.
> > Please consult your board designers to know such limits for your device.
> scaling requires multiple AQs, then how many should a vendor provide for the
> worst case?
> 
> I am getting bored of the same repeating questions.
I said it scales, within the AQ. (and across AQs).
I have answered enough times, so I will stop on the same repeated question.
Your repeated question is not helping anyone, as it is not in the scope of 
virtio.

If you think it is, please first get it written for RSS and MQ in the net section 
and post it for review.

> >
> >> If one admin vq can serve 100 VMs, can it migrate 1000 VMs in a reasonable
> time?
> >> If not, how many exactly?
> >>
> > Yes, it can serve both 100 and 1000 VMs in a reasonable time.
> I am not sure; is the AQ limitless? Can it serve thousands of VMs in a reasonable
> time? Like in 300ms?
> 
Yes.

> If you say that requires multiple AQs, then how many should a vendor provide?
> 
I didn't say multiple AQs must be used.
It is the same as NIC RQs.

> Don't say the board designers own the risks.

> >
> >> And the register does not need to scale; it resides on the VF and only
> >> serves the VF.
> >>
> > Since it is per VF, by nature it is a linearly growing entity that the board
> > design
> needs to support for reads and writes with guaranteed timing.
> > It clearly scaled poor than queue.
> Please read my series. For example, we introduce a new bit SUSPEND in the
> \field{device status}, any scalability issues here?
That must behave like queue_reset (it must get acknowledged by the device that it 
is suspended).
And that brings the scale issue.
On top of that, once the device is SUSPENDED, it cannot accept some other 
RESET_VQ command.
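
For comparison, a minimal sketch of the queue_reset acknowledgment pattern 
referred to here, loosely modeled on the VIRTIO_F_RING_RESET flow in virtio 1.2 
(write 1, then poll until the device reflects 1 back). The MMIO accessors and 
delay helper are assumptions:

#include <stdbool.h>
#include <stdint.h>

extern void write_queue_select(uint16_t q); /* assumed common-config accessors */
extern void write_queue_reset(uint16_t val);
extern uint16_t read_queue_reset(void);
extern void delay_us(unsigned long usec);

static bool reset_vq(uint16_t q, unsigned int max_tries)
{
    write_queue_select(q);
    write_queue_reset(1); /* request the reset */

    /* The reset is complete only once the device reflects 1 back. */
    while (max_tries--) {
        if (read_queue_reset() == 1)
            return true;
        delay_us(100);
    }
    return false;
}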

> >
> >> It does not reside on the PF to migrate the VFs.
> > Hence it does not scale and cannot do parallel operation within the VF, 
> > unless
> each register is replicated.
> Why its not scale? It is a per device facility.
Because the device needs to answer per device through some large scale memory 
to fit in a response time.

> Why do you need parallel operation against the LM facility?
Because your downtime was 300msec for 1000 VMs.

> That doesn't make a lot of sense.
> >
> > Using a register or a queue for bulk data transfer is a question that was solved
> when the virtio spec was born.
> > I don't see a point in discussing it.
> > Snippet from spec: " As a device can have zero or more virtqueues for bulk
> data transport"
> Where do you see the series intends to transfer bulk data through registers?
> >
> >> The VFs' config space can use the device's dedicated resources, like the bandwidth.
> >>
> >> for the AQ, you still need to reserve resources, and how much?
> > It depends on your board; please consult your board designer, as it depends
> on the implementation.
> >  From the spec point of view, it should not be the same as any other virtqueue.
> so the vendor owns the risk of implementing AQ LM? Why do they have to?
> >>> No. I do not agree. It can fail and is very hard for board designers.
> >>> AQs are a more reliable way to transport bulk data in a scalable manner
> >>> for tens
> >> of member devices.
> >> Really? How often do you observe virtio config space fail?
> > On an Intel Icelake server we have seen it failing with 128 VFs.
> > And the device needs to do very weird things to support 1000+ VFs with a forever
> expanding config space, which is not the topic of this discussion anyway.
> That is your setup problem.
> >
> >
> >>>> Please allow me to provide an extreme example: is one single admin
> >>>> vq limitless, such that it can serve hundreds to thousands of VM migrations?
> >>> It is left to the device implementation. Just like RSS and multi-queue
> support?
> >>> Is one Q enough for an 800Gbps to 10Mbps link?
> >>> The answer is: not in the scope of the specification; the spec provides the
> >>> framework to scale
> >> this way, but does not impose it on the device.
> >> Even without RSS or MQ support, the device can still work with
> >> performance overhead, not fail.
> >>
> >

  1   2   3   4   5   6   7   8   9   10   >