Re: vdpa legacy guest support (was Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero)

2021-12-14 Thread Jason Wang
On Wed, Dec 15, 2021 at 9:05 AM Si-Wei Liu  wrote:
>
>
>
> On 12/13/2021 9:06 PM, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2021 at 05:59:45PM -0800, Si-Wei Liu wrote:
> >>
> >> On 12/12/2021 1:26 AM, Michael S. Tsirkin wrote:
> >>> On Fri, Dec 10, 2021 at 05:44:15PM -0800, Si-Wei Liu wrote:
>  Sorry for reviving this ancient thread. I was kind of lost about the
>  conclusion it ended up with. I have the following questions,
> 
>  1. legacy guest support: from the past conversations it doesn't seem the
>  support will be completely dropped off the table, is my understanding
>  correct? Actually we're interested in supporting virtio v0.95 guests for
>  x86, which is backed by the spec at
>  https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf . Though I'm not
>  sure if there's a request/need to support older legacy virtio versions
>  beyond that.
> >>> I personally feel it's less work to add in kernel than try to
> >>> work around it in userspace. Jason feels differently.
> >>> Maybe post the patches and this will prove to Jason it's not
> >>> too terrible?
> >> I suppose if the vdpa vendor does support 0.95 at the datapath and ring
> >> layout level and is limited to x86 only, there should be an easy way out.
> > Note a subtle difference: what matters is that guest, not host is x86.
> > Matters for emulators which might reorder memory accesses.
> > I guess this enforcement belongs in QEMU then?
> Right, I mean to get started, the initial guest driver support and the
> corresponding QEMU support for transitional vdpa backend can be limited
> to x86 guest/host only. Since the config space is emulated in QEMU, I
> suppose it's not hard to enforce in QEMU.

It's more than just the config space; most devices have headers before the buffer.

> QEMU can drive GET_LEGACY,
> GET_ENDIAN et al ioctls in advance to get the capability from the
> individual vendor driver. For that, we need another negotiation protocol
> similar to vhost_user's protocol_features between the vdpa kernel and
> QEMU, way before the guest driver is ever probed and its feature
> negotiation kicks in. Not sure we need a GET_MEMORY_ORDER ioctl call
> from the device, but we can assume weak ordering for legacy at this
> point (x86 only)?

I'm lost here, we have get_features() so:

1) VERSION_1 means the device uses LE if provided, otherwise native
2) ORDER_PLATFORM means device requires platform ordering

Any reason for having a new API for this?
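
To illustrate the point (this sketch is not from the thread; the helper name and
the struct are made up, and only the two standard feature bits from
linux/virtio_config.h are assumed), a userspace backend could derive both
properties from the device features alone:

#include <stdbool.h>
#include <stdint.h>
#include <linux/virtio_config.h>   /* VIRTIO_F_VERSION_1, VIRTIO_F_ORDER_PLATFORM */

struct legacy_caps {
	bool device_is_le;    /* ring/config endianness the device expects */
	bool platform_order;  /* device needs real platform memory barriers */
};

/* Hypothetical helper: derive datapath properties from device features. */
static struct legacy_caps probe_legacy_caps(uint64_t dev_features,
					    bool guest_is_le)
{
	struct legacy_caps caps;

	/* VERSION_1 => the device uses little-endian; otherwise it follows
	 * guest-native endianness (the legacy rule referred to above). */
	caps.device_is_le = (dev_features & (1ULL << VIRTIO_F_VERSION_1)) ?
			    true : guest_is_le;

	/* ORDER_PLATFORM => the device requires platform ordering; if absent,
	 * weak ordering can be assumed (enough for a legacy x86 guest). */
	caps.platform_order = !!(dev_features & (1ULL << VIRTIO_F_ORDER_PLATFORM));

	return caps;
}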

>
> >
> >> I
> >> checked with Eli and other Mellanox/NVIDIA folks for hardware/firmware level
> >> 0.95 support, it seems all the ingredients have been there already, dating
> >> back to the DPDK days. The only major limitation is in the vDPA software:
> >> the current vdpa core has the assumption around VIRTIO_F_ACCESS_PLATFORM for
> >> a few DMA setup ops, which is virtio 1.0 only.
> >>
>  2. suppose some form of legacy guest support needs to be there, how do we
>  deal with the bogus assumption below in vdpa_get_config() in the short
>  term? It looks like one intuitive fix is to move the vdpa_set_features
>  call out of vdpa_get_config() to vdpa_set_config().
> 
>          /*
>           * Config accesses aren't supposed to trigger before features are
>           * set.
>           * If it does happen we assume a legacy guest.
>           */
>          if (!vdev->features_valid)
>                  vdpa_set_features(vdev, 0);
>          ops->get_config(vdev, offset, buf, len);
> 
>  I can post a patch to fix 2) if there's consensus already reached.
> 
>  Thanks,
>  -Siwei
> >>> I'm not sure how important it is to change that.
> >>> In any case it only affects transitional devices, right?
> >>> Legacy only should not care ...
> >> Yes, I'd like to distinguish the legacy driver (suppose it is 0.95) from the
> >> modern one in a transitional device model rather than being legacy only.
> >> That way a v0.95 and v1.0 supporting vdpa parent can support both types of
> >> guests without having to reconfigure. Or are you suggesting that limiting to
> >> legacy only at the time of vdpa creation would simplify the implementation a
> >> lot?
> >>
> >> Thanks,
> >> -Siwei
> >
> > I don't know for sure. Take a look at the work Halil was doing
> > to try and support transitional devices with BE guests.
> Hmmm, we can have those endianness ioctls defined but the initial QEMU
> implementation can start by supporting x86 guest/host with little
> endian and weak memory ordering first. The real trick is to detect a
> legacy guest - I am not sure if it's feasible to shift all the legacy
> detection work to QEMU, or whether the kernel has to be part of the
> detection (e.g. the kick before DRIVER_OK thing we have to duplicate the
> tracking effort in QEMU) as well. Let me take a further look and get back.

Re: vdpa legacy guest support (was Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero)

2021-12-14 Thread Si-Wei Liu



On 12/13/2021 9:06 PM, Michael S. Tsirkin wrote:

On Mon, Dec 13, 2021 at 05:59:45PM -0800, Si-Wei Liu wrote:


On 12/12/2021 1:26 AM, Michael S. Tsirkin wrote:

On Fri, Dec 10, 2021 at 05:44:15PM -0800, Si-Wei Liu wrote:

Sorry for reviving this ancient thread. I was kind of lost about the conclusion
it ended up with. I have the following questions,

1. legacy guest support: from the past conversations it doesn't seem the
support will be completely dropped off the table, is my understanding
correct? Actually we're interested in supporting virtio v0.95 guests for x86,
which is backed by the spec at
https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf . Though I'm not sure
if there's a request/need to support older legacy virtio versions beyond that.

I personally feel it's less work to add in kernel than try to
work around it in userspace. Jason feels differently.
Maybe post the patches and this will prove to Jason it's not
too terrible?

I suppose if the vdpa vendor does support 0.95 at the datapath and ring
layout level and is limited to x86 only, there should be an easy way out.

Note a subtle difference: what matters is that guest, not host is x86.
Matters for emulators which might reorder memory accesses.
I guess this enforcement belongs in QEMU then?
Right, I mean to get started, the initial guest driver support and the 
corresponding QEMU support for transitional vdpa backend can be limited 
to x86 guest/host only. Since the config space is emulated in QEMU, I 
suppose it's not hard to enforce in QEMU. QEMU can drive GET_LEGACY, 
GET_ENDIAN et al ioctls in advance to get the capability from the 
individual vendor driver. For that, we need another negotiation protocol 
similar to vhost_user's protocol_features between the vdpa kernel and 
QEMU, way before the guest driver is ever probed and its feature 
negotiation kicks in. Not sure we need a GET_MEMORY_ORDER ioctl call 
from the device, but we can assume weak ordering for legacy at this 
point (x86 only)?





I
checked with Eli and other Mellanox/NVIDIA folks for hardware/firmware level
0.95 support, it seems all the ingredients have been there already, dating
back to the DPDK days. The only major limitation is in the vDPA software:
the current vdpa core has the assumption around VIRTIO_F_ACCESS_PLATFORM for
a few DMA setup ops, which is virtio 1.0 only.


2. suppose some form of legacy guest support needs to be there, how do we
deal with the bogus assumption below in vdpa_get_config() in the short term?
It looks like one intuitive fix is to move the vdpa_set_features call out
of vdpa_get_config() to vdpa_set_config().

        /*
         * Config accesses aren't supposed to trigger before features are
         * set.
         * If it does happen we assume a legacy guest.
         */
        if (!vdev->features_valid)
                vdpa_set_features(vdev, 0);
        ops->get_config(vdev, offset, buf, len);

I can post a patch to fix 2) if there's consensus already reached.
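
For reference only (this is not the actual patch, just a rough sketch of one
possible reading of the fix above, assuming helper wrappers along these lines
exist around the config ops; names and placement may not match the real code):

/* Sketch: drop the implicit legacy fallback from the read path ... */
static inline void vdpa_get_config(struct vdpa_device *vdev,
				   unsigned int offset,
				   void *buf, unsigned int len)
{
	const struct vdpa_config_ops *ops = vdev->config;

	/* No vdpa_set_features(vdev, 0) here any more: a config read no
	 * longer silently locks the device into "legacy, features == 0". */
	ops->get_config(vdev, offset, buf, len);
}

/* ... and, if desired, apply the same "assume legacy" fallback on the
 * write path instead. */
static inline void vdpa_set_config(struct vdpa_device *vdev,
				   unsigned int offset,
				   const void *buf, unsigned int len)
{
	const struct vdpa_config_ops *ops = vdev->config;

	if (!vdev->features_valid)
		vdpa_set_features(vdev, 0);
	ops->set_config(vdev, offset, buf, len);
}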

Thanks,
-Siwei

I'm not sure how important it is to change that.
In any case it only affects transitional devices, right?
Legacy only should not care ...

Yes, I'd like to distinguish the legacy driver (suppose it is 0.95) from the
modern one in a transitional device model rather than being legacy only.
That way a v0.95 and v1.0 supporting vdpa parent can support both types of
guests without having to reconfigure. Or are you suggesting that limiting to
legacy only at the time of vdpa creation would simplify the implementation a lot?

Thanks,
-Siwei


I don't know for sure. Take a look at the work Halil was doing
to try and support transitional devices with BE guests.
Hmmm, we can have those endianness ioctls defined but the initial QEMU
implementation can start by supporting x86 guest/host with little
endian and weak memory ordering first. The real trick is to detect a
legacy guest - I am not sure if it's feasible to shift all the legacy
detection work to QEMU, or whether the kernel has to be part of the
detection (e.g. the kick before DRIVER_OK thing we have to duplicate the
tracking effort in QEMU) as well. Let me take a further look and get back.


Meanwhile, I'll check internally to see if a legacy only model would 
work. Thanks.


Thanks,
-Siwei






On 3/2/2021 2:53 AM, Jason Wang wrote:

On 2021/3/2 5:47 下午, Michael S. Tsirkin wrote:

On Mon, Mar 01, 2021 at 11:56:50AM +0800, Jason Wang wrote:

On 2021/3/1 5:34 上午, Michael S. Tsirkin wrote:

On Wed, Feb 24, 2021 at 10:24:41AM -0800, Si-Wei Liu wrote:

Detecting it isn't enough though, we will need a new ioctl to notify
the kernel that it's a legacy guest. Ugh :(

Well, although I think adding an ioctl is doable, may I know what the
use case there will be for the kernel to leverage such info directly?
Is there a case QEMU can't do with dedicated ioctls later if there's
indeed differentiation (legacy v.s. 

Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Dan Williams
On Tue, Dec 14, 2021 at 12:33 PM Vivek Goyal  wrote:
>
> On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
> > On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
> > >
> > > On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > > > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > > > > Going forward, I am wondering should virtiofs use flushcache 
> > > > > > version as
> > > > > > well. What if host filesystem is using DAX and mapping persistent 
> > > > > > memory
> > > > > > pfn directly into qemu address space. I have never tested that.
> > > > > >
> > > > > > Right now we are relying on applications to do fsync/msync on 
> > > > > > virtiofs
> > > > > > for data persistence.
> > > > >
> > > > > This sounds like it would need coordination with a paravirtualized
> > > > > driver that can indicate whether the host side is pmem or not, like
> > > > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > > > you would still need to go explicitly cache flush any dirty page
> > > > > because you can't necessarily trust that the guest did that already.
> > > >
> > > > Do we?  The application can't really know what backend it is on, so
> > > > it sounds like the current virtiofs implementation doesn't really, does 
> > > > it?
> > >
> > > Agreed that application does not know what backend it is on. So virtiofs
> > > just offers regular posix API where applications have to do fsync/msync
> > > for data persistence. No support for mmap(MAP_SYNC). We don't offer 
> > > persistent
> > > memory programming model on virtiofs. That's not the expectation. DAX
> > > is used only to bypass guest page cache.
> > >
> > > With this assumption, I think we might not have to use flushcache version
> > > at all even if shared filesystem is on persistent memory on host.
> > >
> > > - We mmap() host files into qemu address space. So any dax store in 
> > > virtiofs
> > >   should make corresponding pages dirty in page cache on host and when
> > >   and fsync()/msync() comes later, it should flush all the data to PMEM.
> > >
> > > - In case of file extending writes, virtiofs falls back to regular
> > >   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
> > >   should make sure writes are flushed to pmem immediately.
> > >
> > > Is there any other path I am missing? If not, it looks like we might not
> > > have to use flushcache version in virtiofs at all as long as we are not
> > > offering guest applications user space flushes and MAP_SYNC support.
> > >
> > > We still might have to use machine check safe variant though as loads
> > > might generate synchronous machine check. What's not clear to me is
> > > that if this MC safe variant should be used only in case of PMEM or
> > > should it be used in case of non-PMEM as well.
> >
> > It should be used on any memory address that can throw exception on
> > load, which is any physical address, in paths that can tolerate
> > memcpy() returning an error code, most I/O paths, and can tolerate
> > slower copy performance on older platforms that do not support MC
> > recovery with fast string operations, to date that's only PMEM users.
>
> OK, so basically the latest CPUs can do fast string operations with MC
> recovery, so using the MC-safe variant is not a problem.
>
> Then there is a range of CPUs which can do MC recovery but do slower
> versions of memcpy, and that's where the issue is.
>
> So if we knew that virtiofs dax window is backed by a pmem device
> then we should always use MC safe variant. Even if it means paying
> the price of slow version for the sake of correctness.
>
> But if we are not using pmem on host, then there is no point in
> using MC safe variant.
>
> IOW.
>
> if (virtiofs_backed_by_pmem) {

No, PMEM should not be considered at all relative to whether to use MC
or not, it is 100% a decision of whether you expect virtiofs users
will balk more at unhandled machine checks or performance regressions
on the platforms that set "enable_copy_mc_fragile()". See
quirk_intel_brickland_xeon_ras_cap() and
quirk_intel_purley_xeon_ras_cap() in arch/x86/kernel/quirks.c.

> use_mc_safe_version
> else
> use_non_mc_safe_version
> }
>
> Now the question is, how do we know if the virtiofs dax window is backed by
> a pmem device or not. I checked the virtio_pmem driver and that does not seem
> to communicate anything like that. It just communicates the start of the
> range and the size of the range, nothing else.
>
> I don't have a full handle on the stack of modules of virtio_pmem, but my guess
> is it probably uses the MC safe version always (because it does not
> know anything about the backing storage).
>
> /me will definitely like to pay penalty of slower memcpy if virtiofs
> device is not backed by a pmem.

I assume you meant "not like", but again PMEM has no bearing on
whether using that device will throw machine checks.

Re: [RFC PATCH] virtio: do not reset stateful devices on resume

2021-12-14 Thread Michael S. Tsirkin
On Tue, Dec 14, 2021 at 05:33:05PM +0100, Mikhail Golubev wrote:
> From: Anton Yakovlev 
> 
> We assume that stateful devices can maintain their state while
> suspended. And for this reason they don't have a freeze callback. If
> such a device is reset during resume, the device state/context will be
> lost on the device side. And the virtual device will stop working.
> 
> Signed-off-by: Anton Yakovlev 
> Signed-off-by: Mikhail Golubev 

A bit more specific? Which configs does this patch fix?

> ---
>  drivers/virtio/virtio.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 236081afe9a2..defa15b56eb8 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -472,6 +472,13 @@ int virtio_device_restore(struct virtio_device *dev)
>   struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
>   int ret;
>  
> + /* Short path for stateful devices. Here we assume that if the device
> +  * does not have a freeze callback, its state was not changed when
> +  * suspended.
> +  */
> + if (drv && !drv->freeze)
> + goto on_config_enable;
> +
>   /* We always start by resetting the device, in case a previous
>* driver messed it up. */
>   dev->config->reset(dev);
> @@ -503,6 +510,7 @@ int virtio_device_restore(struct virtio_device *dev)
>   /* Finally, tell the device we're all set */
>   virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>  
> +on_config_enable:
>   virtio_config_enable(dev);
>  
>   return 0;
> -- 
> 2.34.1
> 
> 
> -- 



Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Vivek Goyal
On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
> On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
> >
> > On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > > > Going forward, I am wondering should virtiofs use flushcache version 
> > > > > as
> > > > > well. What if host filesystem is using DAX and mapping persistent 
> > > > > memory
> > > > > pfn directly into qemu address space. I have never tested that.
> > > > >
> > > > > Right now we are relying on applications to do fsync/msync on virtiofs
> > > > > for data persistence.
> > > >
> > > > This sounds like it would need coordination with a paravirtualized
> > > > driver that can indicate whether the host side is pmem or not, like
> > > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > > you would still need to go explicitly cache flush any dirty page
> > > > because you can't necessarily trust that the guest did that already.
> > >
> > > Do we?  The application can't really know what backend it is on, so
> > > it sounds like the current virtiofs implementation doesn't really, does 
> > > it?
> >
> > Agreed that application does not know what backend it is on. So virtiofs
> > just offers regular posix API where applications have to do fsync/msync
> > for data persistence. No support for mmap(MAP_SYNC). We don't offer 
> > persistent
> > memory programming model on virtiofs. That's not the expectation. DAX
> > is used only to bypass guest page cache.
> >
> > With this assumption, I think we might not have to use flushcache version
> > at all even if shared filesystem is on persistent memory on host.
> >
> > - We mmap() host files into qemu address space. So any dax store in virtiofs
> >   should make corresponding pages dirty in page cache on host and when
> >   and fsync()/msync() comes later, it should flush all the data to PMEM.
> >
> > - In case of file extending writes, virtiofs falls back to regular
> >   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
> >   should make sure writes are flushed to pmem immediately.
> >
> > Is there any other path I am missing? If not, it looks like we might not
> > have to use flushcache version in virtiofs at all as long as we are not
> > offering guest applications user space flushes and MAP_SYNC support.
> >
> > We still might have to use machine check safe variant though as loads
> > might generate synchronous machine check. What's not clear to me is
> > that if this MC safe variant should be used only in case of PMEM or
> > should it be used in case of non-PMEM as well.
> 
> It should be used on any memory address that can throw exception on
> load, which is any physical address, in paths that can tolerate
> memcpy() returning an error code, most I/O paths, and can tolerate
> slower copy performance on older platforms that do not support MC
> recovery with fast string operations, to date that's only PMEM users.

OK, so basically the latest CPUs can do fast string operations with MC
recovery, so using the MC-safe variant is not a problem.

Then there is a range of CPUs which can do MC recovery but do slower
versions of memcpy, and that's where the issue is.

So if we knew that virtiofs dax window is backed by a pmem device
then we should always use MC safe variant. Even if it means paying
the price of slow version for the sake of correctness. 

But if we are not using pmem on host, then there is no point in
using MC safe variant.

IOW.

if (virtiofs_backed_by_pmem) {
use_mc_safe_version
else
use_non_mc_safe_version
}

Now the question is, how do we know if the virtiofs dax window is backed by
a pmem device or not. I checked the virtio_pmem driver and that does not seem
to communicate anything like that. It just communicates the start of the
range and the size of the range, nothing else.

I don't have a full handle on the stack of modules of virtio_pmem, but my guess
is it probably uses the MC safe version always (because it does not
know anything about the backing storage).

/me will definitely like to pay penalty of slower memcpy if virtiofs
device is not backed by a pmem.

Vivek



Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Dan Williams
On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
>
> On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > > Going forward, I am wondering should virtiofs use flushcache version as
> > > > well. What if host filesystem is using DAX and mapping persistent memory
> > > > pfn directly into qemu address space. I have never tested that.
> > > >
> > > > Right now we are relying on applications to do fsync/msync on virtiofs
> > > > for data persistence.
> > >
> > > This sounds like it would need coordination with a paravirtualized
> > > driver that can indicate whether the host side is pmem or not, like
> > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > you would still need to go explicitly cache flush any dirty page
> > > because you can't necessarily trust that the guest did that already.
> >
> > Do we?  The application can't really know what backend it is on, so
> > it sounds like the current virtiofs implementation doesn't really, does it?
>
> Agreed that application does not know what backend it is on. So virtiofs
> just offers regular posix API where applications have to do fsync/msync
> for data persistence. No support for mmap(MAP_SYNC). We don't offer persistent
> memory programming model on virtiofs. That's not the expectation. DAX
> is used only to bypass guest page cache.
>
> With this assumption, I think we might not have to use flushcache version
> at all even if shared filesystem is on persistent memory on host.
>
> - We mmap() host files into qemu address space. So any dax store in virtiofs
>   should make corresponding pages dirty in page cache on host and when
>   and fsync()/msync() comes later, it should flush all the data to PMEM.
>
> - In case of file extending writes, virtiofs falls back to regular
>   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
>   should make sure writes are flushed to pmem immediately.
>
> Is there any other path I am missing? If not, it looks like we might not
> have to use flushcache version in virtiofs at all as long as we are not
> offering guest applications user space flushes and MAP_SYNC support.
>
> We still might have to use machine check safe variant though as loads
> might generate synchronous machine check. What's not clear to me is
> that if this MC safe variant should be used only in case of PMEM or
> should it be used in case of non-PMEM as well.

It should be used on any memory address that can throw exception on
load, which is any physical address, in paths that can tolerate
memcpy() returning an error code, most I/O paths, and can tolerate
slower copy performance on older platforms that do not support MC
recovery with fast string operations, to date that's only PMEM users.
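
As a purely illustrative sketch (not virtiofs code; the wrapper name and error
policy are assumptions), this is the calling pattern being described: use the
kernel's copy_mc_to_kernel(), which returns the number of bytes left uncopied,
so a poisoned load becomes an error code instead of a crash:

#include <linux/uaccess.h>	/* copy_mc_to_kernel() */
#include <linux/errno.h>

/*
 * Copy from a mapping that may raise a machine check on load (poisoned
 * pmem, or DRAM with an uncorrectable ECC error).  A non-zero return
 * from copy_mc_to_kernel() means some bytes were not copied because we
 * hit poison, and the I/O path can fail gracefully instead of crashing.
 */
static int copy_from_dax_window(void *dst, const void *src, size_t len)
{
	unsigned long uncopied;

	uncopied = copy_mc_to_kernel(dst, src, len);
	if (uncopied)
		return -EIO;	/* caller sees an error, not a machine check panic */

	return 0;
}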


Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Vivek Goyal
On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > Going forward, I am wondering should virtiofs use flushcache version as
> > > well. What if host filesystem is using DAX and mapping persistent memory
> > > pfn directly into qemu address space. I have never tested that.
> > >
> > > Right now we are relying on applications to do fsync/msync on virtiofs
> > > for data persistence.
> > 
> > This sounds like it would need coordination with a paravirtualized
> > driver that can indicate whether the host side is pmem or not, like
> > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > you would still need to go explicitly cache flush any dirty page
> > because you can't necessarily trust that the guest did that already.
> 
> Do we?  The application can't really know what backend it is on, so
> it sounds like the current virtiofs implementation doesn't really, does it?

Agreed that application does not know what backend it is on. So virtiofs
just offers regular posix API where applications have to do fsync/msync
for data persistence. No support for mmap(MAP_SYNC). We don't offer persistent
memory programming model on virtiofs. That's not the expectation. DAX 
is used only to bypass guest page cache.

With this assumption, I think we might not have to use flushcache version
at all even if shared filesystem is on persistent memory on host. 

- We mmap() host files into qemu address space. So any dax store in virtiofs
  should make corresponding pages dirty in page cache on host and when
  and fsync()/msync() comes later, it should flush all the data to PMEM.

- In case of file extending writes, virtiofs falls back to regular
  FUSE_WRITE path (and not use DAX), and in that case host pmem driver
  should make sure writes are flushed to pmem immediately.

Is there any other path I am missing? If not, it looks like we might not
have to use flushcache version in virtiofs at all as long as we are not
offering guest applications user space flushes and MAP_SYNC support.

We still might have to use machine check safe variant though as loads
might generate synchronous machine check. What's not clear to me is
that if this MC safe variant should be used only in case of PMEM or
should it be used in case of non-PMEM as well.

Vivek



Re: [PATCH 5/5] dax: always use _copy_mc_to_iter in dax_copy_to_iter

2021-12-14 Thread Vivek Goyal
On Sun, Dec 12, 2021 at 06:48:05AM -0800, Dan Williams wrote:
> On Fri, Dec 10, 2021 at 6:05 AM Vivek Goyal  wrote:
> >
> > On Thu, Dec 09, 2021 at 07:38:28AM +0100, Christoph Hellwig wrote:
> > > While using the MC-safe copy routines is rather pointless on a virtual 
> > > device
> > > like virtiofs,
> >
> > I was wondering about that. Is it completely pointless.
> >
> > Typically we are just mapping host page cache into qemu address space.
> > That shows as virtiofs device pfn in guest and that pfn is mapped into
> > guest application address space in mmap() call.
> >
> > Given that on the host it's DRAM, I would not expect a machine check on the
> > load side, so there was no need to use the machine check safe variant.
> 
> That's a broken assumption, DRAM experiences multi-bit ECC errors.
> Machine checks, data aborts, etc existed before PMEM.

So we should use the MC safe variant when loading from DRAM as well?
(If the needed platform support is there.)

Vivek



Re: [PATCH v2 00/10] Allow for configuring max number of virtqueue pairs

2021-12-14 Thread Stefano Garzarella

On Tue, Dec 14, 2021 at 11:53:03AM +0200, Eli Cohen wrote:

On Tue, Dec 14, 2021 at 10:42:35AM +0100, Stefano Garzarella wrote:

Hi Eli,
I don't know what's wrong, but I've only received replies through the
virtualization@lists.linux-foundation.org mailing list.

Even in the archive I can't find your original series.

Adding virtualization-ow...@lists.linux-foundation.org to double check
what's going wrong.


OK, let me know if you want me to send you the patches. I am going to
send a new series and can put you on CC.


I'll look at the new series.

Thanks,
Stefano



Re: [PATCH v2 00/10] Allow for configuring max number of virtqueue pairs

2021-12-14 Thread Stefano Garzarella

Hi Eli,
I don't know what's wrong, but I've only received replies through the 
virtualization@lists.linux-foundation.org mailing list.


Even in the archive I can't find your original series.

Adding virtualization-ow...@lists.linux-foundation.org to double check 
what's going wrong.


Thanks,
Stefano

On Tue, Dec 14, 2021 at 11:17:00AM +0800, Jason Wang wrote:

On Mon, Dec 13, 2021 at 10:43 PM Eli Cohen  wrote:


Allow the user to configure the max number of virtqueue pairs for a vdpa
instance. The user can then control the actual number of virtqueue pairs
using ethtool.

Example, set number of VQPs to 2:
$ ethtool -L ens1 combined 2

A user can check the max supported virtqueues for a management device by
running:

vdpa dev show
vdpa-a: type network mgmtdev auxiliary/mlx5_core.sf.1 vendor_id  \
max_vqp 3 max_vq_size 256 max_supported_vqs 256


I think the max_supported_vqs should be an odd number since it should
contain the control vq.



and refer to this value when adding a device.

To create a device with a max of 5 VQPs:
vdpa dev add name vdpa-a mgmtdev auxiliary/mlx5_core.sf.1 max_vqp 5


A question: consider a parent that supports both net and block; if the user uses

vdpa dev add name vdpa-a mgmtdev auxiliary/mlx5_core.sf.1

How do we know whether it's a net or a block device? Or do you expect some
block-specific attributes to be added?

Thanks



V1 -> V2:
1. Do not return the index of the control VQ. Instead return an
   indication if Ctrl VQ was negotiated.
2. Do not return config information if FEATURES_OK is not set to avoid
   reporting out of sync information.
3. Minor fixes as described by the individual patches
4. Add patches to return the max virtqueues a device is capable of
   supporting.

Eli Cohen (10):
  vdpa: Provide interface to read driver features
  vdpa/mlx5: Distribute RX virtqueues in RQT object
  vdpa: Read device configuration only if FEATURES_OK
  vdpa: Allow to configure max data virtqueues
  vdpa/mlx5: Fix config_attr_mask assignment
  vdpa/mlx5: Support configuring max data virtqueue pairs
  vdpa: Add support for returning device configuration information
  vdpa/mlx5: Restore cur_num_vqs in case of failure in change_num_qps()
  vdpa: Support reporting max device virtqueues
  vdpa/mlx5: Configure max supported virtqueues

 drivers/vdpa/alibaba/eni_vdpa.c| 16 +++--
 drivers/vdpa/ifcvf/ifcvf_main.c| 16 +++--
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 94 +-
 drivers/vdpa/vdpa.c| 42 -
 drivers/vdpa/vdpa_sim/vdpa_sim.c   | 21 +--
 drivers/vdpa/vdpa_user/vduse_dev.c | 16 +++--
 drivers/vdpa/virtio_pci/vp_vdpa.c  | 16 +++--
 drivers/vhost/vdpa.c   |  2 +-
 drivers/virtio/virtio_vdpa.c   |  2 +-
 include/linux/vdpa.h   | 16 +++--
 include/uapi/linux/vdpa.h  |  9 +++
 11 files changed, 178 insertions(+), 72 deletions(-)

--
2.33.1





Re: [PATCH v2 07/10] vdpa: Add support for returning device configuration information

2021-12-14 Thread Jason Wang
On Tue, Dec 14, 2021 at 3:17 PM Eli Cohen  wrote:
>
> On Tue, Dec 14, 2021 at 12:05:38PM +0800, Jason Wang wrote:
> > On Mon, Dec 13, 2021 at 10:43 PM Eli Cohen  wrote:
> > >
> > > Add netlink attribute to store flags indicating current state of the
> > > device.
> > > In addition, introduce a flag to indicate whether control virtqueue is
> > > used.
> > >
> > > This indication can be retrieved by:
> > >
> > > vdpa dev config show vdpa-a
> > > vdpa-a: mac 00:00:00:00:88:88 link up link_announce false max_vq_pairs 1
> > > mtu 1500 ctrl_vq yes
> >
> > I think the cvq is kind of duplicated with the driver features?
>
> We don't pass the driver features to userspace. The availability of CVQ
> is passed through a new bit mask field that encodes (currently) just the
> availability of CVQ.
>
> Are you suggesting to return the entire set of negotiated features and
> let user space decode that?

Yes, I think that can simplify things and save a lot of effort.

Thanks
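
Purely as a sketch of this suggestion (not part of the series;
VDPA_ATTR_DEV_NEGOTIATED_FEATURES is a made-up attribute name and the function
body is trimmed), the fill helper could hand the raw negotiated bits to
userspace and let the vdpa tool decode CVQ/MQ itself:

static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
{
	u64 features = vdev->config->get_driver_features(vdev);

	/* Export the raw negotiated feature bits; userspace checks
	 * VIRTIO_NET_F_CTRL_VQ, VIRTIO_NET_F_MQ, ... on its own. */
	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES,
			      features, VDPA_ATTR_PAD))
		return -EMSGSIZE;

	/* ... remaining config attributes as in the existing code ... */
	return 0;
}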

>
> >
> > Thanks
> >
> > >
> > > Signed-off-by: Eli Cohen 
> > >
> > > 
> > > V1 -> V2
> > > Patch was changed to return only an indication of ctrl VQ
> > > ---
> > >  drivers/vdpa/vdpa.c   | 17 +
> > >  include/uapi/linux/vdpa.h |  8 
> > >  2 files changed, 25 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> > > index 7b7bef7673b4..130a8d4aeaed 100644
> > > --- a/drivers/vdpa/vdpa.c
> > > +++ b/drivers/vdpa/vdpa.c
> > > @@ -787,6 +787,19 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callback *cb)
> > > return msg->len;
> > >  }
> > >
> > > +static int vdpa_dev_net_ctrl_vq_fill(struct vdpa_device *vdev,
> > > +                                     struct sk_buff *msg,
> > > +                                     struct virtio_net_config *config,
> > > +                                     u64 features)
> > > +{
> > > +   if (!(features & BIT_ULL(VIRTIO_NET_F_CTRL_VQ)))
> > > +   return 0;
> > > +
> > > +   /* currently the only flag can be returned */
> > > +   return nla_put_u64_64bit(msg, VDPA_ATTR_DEV_FLAGS,
> > > +                            BIT_ULL(VDPA_DEV_ATTR_CVQ), VDPA_ATTR_PAD);
> > > +}
> > > +
> > >  static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
> > >                                         struct sk_buff *msg, u64 features,
> > >                                         const struct virtio_net_config *config)
> > > @@ -805,6 +818,7 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
> > > struct virtio_net_config config = {};
> > > u64 features;
> > > u16 val_u16;
> > > +   int err;
> > >
> > > vdpa_get_config(vdev, 0, &config, sizeof(config));
> > >
> > > @@ -821,6 +835,9 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
> > > return -EMSGSIZE;
> > >
> > > features = vdev->config->get_driver_features(vdev);
> > > +   err = vdpa_dev_net_ctrl_vq_fill(vdev, msg, &config, features);
> > > +   if (err)
> > > +   return err;
> > >
> > > return vdpa_dev_net_mq_config_fill(vdev, msg, features, &config);
> > >  }
> > > diff --git a/include/uapi/linux/vdpa.h b/include/uapi/linux/vdpa.h
> > > index a252f06f9dfd..23b854e3e5e2 100644
> > > --- a/include/uapi/linux/vdpa.h
> > > +++ b/include/uapi/linux/vdpa.h
> > > @@ -20,9 +20,16 @@ enum vdpa_command {
> > > VDPA_CMD_DEV_CONFIG_GET,/* can dump */
> > >  };
> > >
> > > +enum {
> > > +   VDPA_DEV_ATTR_CVQ,
> > > +};
> > > +
> > >  enum vdpa_attr {
> > > VDPA_ATTR_UNSPEC,
> > >
> > > +   /* Pad attribute for 64b alignment */
> > > +   VDPA_ATTR_PAD = VDPA_ATTR_UNSPEC,
> > > +
> > > /* bus name (optional) + dev name together make the parent device handle */
> > > VDPA_ATTR_MGMTDEV_BUS_NAME, /* string */
> > > VDPA_ATTR_MGMTDEV_DEV_NAME, /* string */
> > > @@ -34,6 +41,7 @@ enum vdpa_attr {
> > > VDPA_ATTR_DEV_MAX_VQS,  /* u32 */
> > > VDPA_ATTR_DEV_MAX_VQ_SIZE,  /* u16 */
> > > VDPA_ATTR_DEV_MIN_VQ_SIZE,  /* u16 */
> > > +   VDPA_ATTR_DEV_FLAGS,/* u64 */
> > >
> > > VDPA_ATTR_DEV_NET_CFG_MACADDR,  /* binary */
> > > VDPA_ATTR_DEV_NET_STATUS,   /* u8 */
> > > --
> > > 2.33.1
> > >
> >
>



Re: [PATCH v2 00/10] Allow for configuring max number of virtqueue pairs

2021-12-14 Thread Jason Wang
On Tue, Dec 14, 2021 at 3:32 PM Eli Cohen  wrote:
>
> On Tue, Dec 14, 2021 at 11:17:00AM +0800, Jason Wang wrote:
> > On Mon, Dec 13, 2021 at 10:43 PM Eli Cohen  wrote:
> > >
> > > Allow the user to configure the max number of virtqueue pairs for a vdpa
> > > instance. The user can then control the actual number of virtqueue pairs
> > > using ethtool.
> > >
> > > Example, set number of VQPs to 2:
> > > $ ethtool -L ens1 combined 2
> > >
> > > A user can check the max supported virtqueues for a management device by
> > > running:
> > >
> > > vdpa dev show
> > > vdpa-a: type network mgmtdev auxiliary/mlx5_core.sf.1 vendor_id  \
> > > max_vqp 3 max_vq_size 256 max_supported_vqs 256
> >
> > I think the max_supported_vqs should be an odd number since it should
> > contain the control vq.
>
> I was thinking we should report data virtqueues. If some upstream driver
> or device does not support CVQ, this should still be an even number.
>
> Moreover, if we want to include the CVQ, we might want to indicate that
> explicitly.

Can we expose driver features in this case?

Thanks

>
> >
> > >
> > > and refer to this value when adding a device.
> > >
> > > To create a device with a max of 5 VQPs:
> > > vdpa dev add name vdpa-a mgmtdev auxiliary/mlx5_core.sf.1 max_vqp 5
> >
> > A question: consider a parent that supports both net and block; if the user uses
> >
> > vdpa dev add name vdpa-a mgmtdev auxiliary/mlx5_core.sf.1
> >
> > How do we know whether it's a net or a block device? Or do you expect some
> > block-specific attributes to be added?
> >
> > Thanks
> >
> > >
> > > V1 -> V2:
> > > 1. Do not return the index of the control VQ. Instead return an
> > >indication if Ctrl VQ was negotiated.
> > > 2. Do not return config information if FEATURES_OK is not set to avoid
> > >reporting out of sync information.
> > > 3. Minor fixes as described by the individual patches
> > > 4. Add patches to return the max virtqueues a device is capable of
> > >supporting.
> > >
> > > Eli Cohen (10):
> > >   vdpa: Provide interface to read driver features
> > >   vdpa/mlx5: Distribute RX virtqueues in RQT object
> > >   vdpa: Read device configuration only if FEATURES_OK
> > >   vdpa: Allow to configure max data virtqueues
> > >   vdpa/mlx5: Fix config_attr_mask assignment
> > >   vdpa/mlx5: Support configuring max data virtqueue pairs
> > >   vdpa: Add support for returning device configuration information
> > >   vdpa/mlx5: Restore cur_num_vqs in case of failure in change_num_qps()
> > >   vdpa: Support reporting max device virtqueues
> > >   vdpa/mlx5: Configure max supported virtqueues
> > >
> > >  drivers/vdpa/alibaba/eni_vdpa.c| 16 +++--
> > >  drivers/vdpa/ifcvf/ifcvf_main.c| 16 +++--
> > >  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 94 +-
> > >  drivers/vdpa/vdpa.c| 42 -
> > >  drivers/vdpa/vdpa_sim/vdpa_sim.c   | 21 +--
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 16 +++--
> > >  drivers/vdpa/virtio_pci/vp_vdpa.c  | 16 +++--
> > >  drivers/vhost/vdpa.c   |  2 +-
> > >  drivers/virtio/virtio_vdpa.c   |  2 +-
> > >  include/linux/vdpa.h   | 16 +++--
> > >  include/uapi/linux/vdpa.h  |  9 +++
> > >  11 files changed, 178 insertions(+), 72 deletions(-)
> > >
> > > --
> > > 2.33.1
> > >
> >
>



Re: [PATCH v2 03/10] vdpa: Read device configuration only if FEATURES_OK

2021-12-14 Thread Jason Wang
On Tue, Dec 14, 2021 at 3:14 PM Eli Cohen  wrote:
>
> On Tue, Dec 14, 2021 at 11:44:06AM +0800, Jason Wang wrote:
> > On Mon, Dec 13, 2021 at 10:43 PM Eli Cohen  wrote:
> > >
> > > Avoid reading device configuration during feature negotiation. Read
> > > device status and verify that VIRTIO_CONFIG_S_FEATURES_OK is set.
> > > Otherwise, return -EAGAIN.
> > >
> > > Signed-off-by: Eli Cohen 
> >
> >
> >
> > > ---
> > >  drivers/vdpa/vdpa.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> > > index 42d71d60d5dc..5749cf0a1500 100644
> > > --- a/drivers/vdpa/vdpa.c
> > > +++ b/drivers/vdpa/vdpa.c
> > > @@ -819,8 +819,15 @@ vdpa_dev_config_fill(struct vdpa_device *vdev, struct sk_buff *msg, u32 portid,
> > >  {
> > > u32 device_id;
> > > void *hdr;
> > > +   u8 status;
> > > int err;
> > >
> > > +   status = vdev->config->get_status(vdev);
> > > +   if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
> > > +   NL_SET_ERR_MSG_MOD(extack, "Features negotiation not completed");
> > > +   return -EAGAIN;
> > > +   }
> > > +
> > > hdr = genlmsg_put(msg, portid, seq, &vdpa_nl_family, flags,
> > >   VDPA_CMD_DEV_CONFIG_GET);
> > > if (!hdr)
> >
> > It looks to me we need to synchronize this with set_status() and
> > cf_mutex (set_config).
>
> Makes sense.
>
> >
> > Or simply return all the config space (the maximum set of features)
>
> Not sure I follow you on this. Are you suggesting to return in the
> netlink message all the fields of struct virtio_net_config as individual 
> fields?
>
> How is this related to the need to sync with cf_mutex?

I meant that the reason for the synchronization is that some config
fields depend on the feature negotiation. I wonder if it would be
easier to just return all the possible config fields; then there's
probably no need for the synchronization since we don't need to check
the driver features but the device features.

Thanks
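
If the check is kept, a rough sketch of the synchronization mentioned above
(assuming the cf_mutex that set_status()/set_config() take, and the fill
helper signature from the hunk in the patch; this is not an actual follow-up
patch, and the message-building body is elided):

static int vdpa_dev_config_fill(struct vdpa_device *vdev, struct sk_buff *msg,
				u32 portid, u32 seq, int flags,
				struct netlink_ext_ack *extack)
{
	u8 status;
	int err;

	/* Take cf_mutex so the FEATURES_OK check and the config read below
	 * cannot race with set_status()/set_config(). */
	mutex_lock(&vdev->cf_mutex);
	status = vdev->config->get_status(vdev);
	if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
		NL_SET_ERR_MSG_MOD(extack, "Features negotiation not completed");
		err = -EAGAIN;
		goto out_unlock;
	}

	/* ... existing genlmsg_put(portid, seq, flags) and device config
	 * fill go here, still under cf_mutex ... */
	err = 0;

out_unlock:
	mutex_unlock(&vdev->cf_mutex);
	return err;
}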

>
> >
> > Thanks
> >
> > > --
> > > 2.33.1
> > >
> >
>
