Re: [RFC PATCH 04/12] s390/cio: introduce cio DMA pool

2019-04-19 Thread Sebastian Ott
On Fri, 12 Apr 2019, Halil Pasic wrote:
> On Fri, 12 Apr 2019 14:12:31 +0200 (CEST)
> Sebastian Ott  wrote:
> > On Fri, 12 Apr 2019, Halil Pasic wrote:
> > > On Thu, 11 Apr 2019 20:25:01 +0200 (CEST)
> > > Sebastian Ott  wrote:
> > > > I don't think we should use this global DMA pool. I guess it's OK for
> > > > stuff like airq (where we don't have a struct device at hand) but for
> > > > CCW we should use the device we have. Yes, this way we waste some memory
> > > > but all dma memory a device uses should fit in a page - so the wastage
> > > > is not too much.
> 
> Regarding the wastage. Let us do the math together in search for an
> upper (wastage) limit.
[...]
> Currently we need at least 224 bytes per device that is ~ 6%
> of a PAGE_SIZE.

Yes, we basically waste the whole page. I'm ok with that if the benefit is
to play nice with the kernel APIs.
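
To put numbers on the wastage figure quoted above, here is a minimal userspace sketch. It assumes a 4 KiB PAGE_SIZE, and the function names are made up for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes left unused when one device's DMA data gets a whole page to itself. */
size_t wasted_bytes(size_t used, size_t page_size)
{
	assert(used <= page_size);
	return page_size - used;
}

/* Share of the page that is actually used, in tenths of a percent. */
unsigned int used_permille(size_t used, size_t page_size)
{
	assert(page_size > 0);
	return (unsigned int)(used * 1000 / page_size);
}
```

With used = 224 and page_size = 4096 this gives 3872 unused bytes and a usage of roughly 5.4%, which is where the "~ 6%" above comes from.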

> > For practical
> > matters: DMA debugging will complain about misuse of a specific device or
> > driver.
> > 
> 
> Do you mean CONFIG_DMA_API_DEBUG and CONFIG_DMA_API_DEBUG_SG? I've been
> running with those and did not see any complaints. Maybe we should
> clarify this one offline...

I didn't mean to imply that there are bugs already - just that when used
as intended, DMA API debugging (CONFIG_DMA_API_DEBUG) can complain about
stuff like "your device is gone but you still have DMA memory set up for
it", which will not work if you don't use the correct device...

Sebastian

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 09/26] compat_ioctl: move drivers to compat_ptr_ioctl

2019-04-19 Thread Jiri Kosina
On Tue, 16 Apr 2019, Arnd Bergmann wrote:

> Each of these drivers has a copy of the same trivial helper function to
> convert the pointer argument and then call the native ioctl handler.
> 
> We now have a generic implementation of that, so use it.
> 
> Acked-by: Greg Kroah-Hartman 
> Reviewed-by: Jarkko Sakkinen 
> Reviewed-by: Jason Gunthorpe 
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/char/ppdev.c  | 12 +-
>  drivers/char/tpm/tpm_vtpm_proxy.c | 12 +-
>  drivers/firewire/core-cdev.c  | 12 +-
>  drivers/hid/usbhid/hiddev.c   | 11 +

For hiddev.c:

Reviewed-by: Jiri Kosina  
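
For readers who have not seen the helper, the idea behind the generic function can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct file is reduced to a single function pointer, model_compat_ptr() simply widens a 32-bit value (real architectures may do more), and echo_ioctl() is a hypothetical native handler. None of this is the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENOIOCTLCMD 515 /* kernel-internal "no such ioctl" code */

struct file;
typedef long (*ioctl_fn)(struct file *file, unsigned int cmd, unsigned long arg);

/* Hypothetical stand-in for the kernel's struct file. */
struct file {
	ioctl_fn unlocked_ioctl; /* native handler, may be NULL */
};

/* Model of compat_ptr(): widen a 32-bit user pointer to a native one. */
void *model_compat_ptr(uint32_t uptr)
{
	return (void *)(uintptr_t)uptr;
}

/* Convert the pointer argument, then call the native handler. */
long model_compat_ptr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	assert(file != NULL);
	if (!file->unlocked_ioctl)
		return -ENOIOCTLCMD;
	return file->unlocked_ioctl(file, cmd,
			(unsigned long)(uintptr_t)model_compat_ptr((uint32_t)arg));
}

/* Hypothetical native handler: returns the argument it was given. */
long echo_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	(void)file;
	(void)cmd;
	return (long)arg;
}
```

Each driver touched by the patch used to carry its own copy of this two-step wrapper; the sketch only shows why one shared copy is enough.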

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 3/3] virtio-gpu api: VIRTIO_GPU_F_RESSOURCE_V2

2019-04-19 Thread Chia-I Wu
On Wed, Apr 17, 2019 at 2:57 AM Gerd Hoffmann  wrote:
>
> On Fri, Apr 12, 2019 at 04:34:20PM -0700, Chia-I Wu wrote:
> > Hi,
> >
> > I am still new to virgl, and missed the last round of discussion about
> > resource_create_v2.
> >
> > From the discussion below, semantically resource_create_v2 creates a host
> > resource object _without_ any storage; memory_create creates a host memory
> > object which provides the storage.  Is that correct?
>
> Right now all resource_create_* variants create a resource object with
> host storage.  memory_create creates guest storage, and
> resource_attach_memory binds things together.  Then you have to transfer
> the data.
In Gurchetan's Vulkan example,  the host storage allocation happens in
(some variant of) memory_create, not in resource_create_v2.  Maybe
that's what got me confused.

>
> Hmm, maybe we need a flag indicating that host storage is not needed,
> for resources where we want to establish some kind of shared mapping later
> on.
This makes sense, to support both Vulkan and non-Vulkan models.

This differs from this patch, but I think a full-fledged resource
should logically have three components

 - a RESOURCE component that has no storage
 - a MEMORY component that provides the storage
 - a BACKING component that is for transfers

resource_attach_backing sets the BACKING component.  BACKING always
uses guest pages and supports only transfers into or out of MEMORY.

resource_attach_memory sets the MEMORY component.  MEMORY can use host
or guest pages, and must always support GPU operations.  When a MEMORY
is mappable in the guest, we can skip BACKING and achieve zero-copy.

resource_create_* can then get a flag to indicate whether only
RESOURCE is created or RESOURCE+MEMORY is created.
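
A rough sketch of the three-component model, in plain C. All type names, the RES_CREATE_NO_MEMORY flag and the helper functions are hypothetical and exist only to illustrate the proposal; they are not part of the virtio-gpu API or of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum page_owner { PAGES_HOST, PAGES_GUEST };

/* MEMORY: provides the storage, host or guest pages, usable by the GPU. */
struct gpu_memory {
	enum page_owner owner;
	bool guest_mappable;
};

/* BACKING: always guest pages, only used for transfers to/from MEMORY. */
struct gpu_backing {
	size_t nr_pages;
};

/* RESOURCE: has no storage of its own. */
struct gpu_resource {
	unsigned int id;
	struct gpu_memory *memory;   /* set by resource_attach_memory */
	struct gpu_backing *backing; /* set by resource_attach_backing */
};

#define RES_CREATE_NO_MEMORY 0x1u /* hypothetical: create RESOURCE only */

void resource_create(struct gpu_resource *res, unsigned int id,
		     unsigned int flags, struct gpu_memory *implicit_mem)
{
	assert(res != NULL);
	res->id = id;
	res->backing = NULL;
	res->memory = (flags & RES_CREATE_NO_MEMORY) ? NULL : implicit_mem;
}

/* Zero-copy: MEMORY is mappable in the guest and no BACKING is needed. */
bool resource_is_zero_copy(const struct gpu_resource *res)
{
	return res->memory && res->memory->guest_mappable && !res->backing;
}
```

The flag covers both models: Vulkan-style users pass it and attach MEMORY later, while existing users get RESOURCE+MEMORY in one step.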


>
> > Do we expect these new commands to be supported by OpenGL, which does not
> > separate resources and memories?
>
> Well, for opengl you need a 1:1 relationship between memory region and
> resource.
>
> > > Yes, even though it is not clear yet how we are going to handle
> > > host-allocated buffers in the vhost-user case ...
> >
> > This might be another dumb question, but is this only an issue for
> > vhost-user(-gpu) case?  What mechanisms are used to map host dma-buf into
> > the guest address space?
>
> qemu can change the address space, that includes mmap()ing stuff there.
> An external vhost-user process can't do this, it can only read the
> address space layout, and read/write from/to guest memory.
I thought a vhost-user process could work with the host-allocated dmabuf
directly.  That is,

  qemu: injects dmabuf pages into guest address space
  vhost-user: work with the dmabuf
  guest: can read/write those pages

>
> > But one needs to create the resource first to know which memory types can
> > be attached to it.  I think the metadata needs to be returned with
> > resource_create_v2.
>
> There is a resource_info reply for that.
>
> > That should be good enough.  But by returning alignments, we can minimize
> > the gaps when attaching multiple resources, especially when the resources
> > are only used by GPU.
>
> We can add alignments to the resource_info reply.
>
> cheers,
>   Gerd
>
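
On the alignment point above: if resource_info returned a size and an alignment per resource, packing several resources into one memory object could look like the sketch below. The struct and function names are hypothetical, and alignments are assumed to be powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round offset up to the next multiple of align (a power of two). */
uint64_t align_up(uint64_t offset, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical per-resource requirements as a resource_info reply might give. */
struct res_req {
	uint64_t size;
	uint64_t align;
};

/*
 * Place n resources back to back in one memory object, honouring each
 * alignment. Fills offsets[] and returns the total number of bytes needed.
 */
uint64_t pack_resources(const struct res_req *req, uint64_t *offsets, int n)
{
	uint64_t cur = 0;

	for (int i = 0; i < n; i++) {
		cur = align_up(cur, req[i].align);
		offsets[i] = cur;
		cur += req[i].size;
	}
	return cur;
}
```

Without the alignments the guest would have to assume a worst-case value (for example page alignment) for every resource, which is where the gaps come from.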


Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> On Thu, 11 Apr 2019 14:01:54 +0300
> Yuval Shaia  wrote:
> 
> > Data center backends use more and more RDMA or RoCE devices and more and
> > more software runs in virtualized environments.
> > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > 
> > Virtio is the optimal solution since it is the de facto para-virtualization
> > technology and also because the Virtio specification
> > allows Hardware Vendors to support the Virtio protocol natively in order to
> > achieve bare metal performance.
> > 
> > This RFC is an effort to address challenges in defining the RDMA/RoCE
> > Virtio Specification and a look forward at possible implementation
> > techniques.
> > 
> > Open issues/Todo list:
> > The list is huge; this is only the starting point of the project.
> > Anyway, here is one example of an item in the list:
> > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> >   in order to support for example 32K QPs we will need 64K VirtQs. Not sure
> >   that this is reasonable, so one option is to have one for all and
> >   multiplex the traffic on it. This is not a good approach as by design it
> >   introduces potential starvation. Another approach would be multiple
> >   queues and round-robin (for example) between them.
> > 
> > Expectations from this posting:
> > In general, any comment is welcome, starting from "hey, drop this as it is a
> > very bad idea" to "yeah, go ahead, we really want it".
> > The idea here is that since it is not a minor effort I first want to know if
> > there is some sort of interest in the community for such a device.
> 
> My first reaction is: Sounds sensible, but it would be good to have a
> spec for this :)
> 
> You'll need a spec if you want this to go forward anyway, so at least a
> sketch would be good to answer questions such as how many virtqueues
> you use for which purpose, what is actually put on the virtqueues,
> whether there are negotiable features, and what the expectations for
> the device and the driver are. It also makes it easier to understand
> how this is supposed to work in practice.
> 
> If folks agree that this sounds useful, the next step would be to
> reserve an id for the device type.

Thanks for the tips, I will certainly do that. I first wanted to make
sure there is a use case here.

Waiting for any feedback from the community.
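
To make the "Multi VirtQ" numbers concrete, here is a small sketch. RINGS_PER_QP follows the cover letter (one send and one receive ring per QP); shared_vq_index() is a hypothetical static modulo mapping onto a smaller pool of virtqueues. It is only one simple way to spread rings over shared queues and is not the round-robin scheduling mentioned in the cover letter.

```c
#include <assert.h>
#include <stdint.h>

#define RINGS_PER_QP 2u /* send + receive ring */

/* Virtqueues needed if every QP ring gets its own virtqueue. */
uint32_t vqs_for_qps(uint32_t nr_qps)
{
	return nr_qps * RINGS_PER_QP;
}

/* Hypothetical: map ring 0 (send) or 1 (recv) of a QP onto a shared pool. */
uint32_t shared_vq_index(uint32_t qpn, uint32_t ring, uint32_t nr_vqs)
{
	assert(ring < RINGS_PER_QP && nr_vqs > 0);
	return (qpn * RINGS_PER_QP + ring) % nr_vqs;
}
```

For 32K QPs (32 * 1024) the first function returns 65536, i.e. the 64K figure, before any CQ rings are counted.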

> 
> > 
> > The scope of the implementation is limited to probing the device and doing
> > some basic ibverbs commands. Data-path is not yet implemented. So with this
> > one can expect only that the driver is (partially) loaded and that basic
> > queries and resource allocation are done.
> > 
> > One note regarding the patchset.
> > I know it is not standard to collapse patches from several repos as I did
> > here (qemu and linux) but I decided to do it anyway so the whole picture can
> > be seen.
> > 
> > patch 1: virtio-net: Move some virtio-net-pci decl to include/hw/virtio
> > This is a preliminary patch, just a hack so I will not need to
> > implement a new netdev
> > patch 2: hw/virtio-rdma: VirtIO rdma device
> > The implementation of the device
> > patch 3: RDMA/virtio-rdma: VirtIO rdma driver
> > The device driver
> > 
> 


Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
On Fri, Apr 12, 2019 at 03:21:56PM +0530, Devesh Sharma wrote:
> On Thu, Apr 11, 2019 at 11:11 PM Yuval Shaia  wrote:
> >
> > On Thu, Apr 11, 2019 at 08:34:20PM +0300, Yuval Shaia wrote:
> > > On Thu, Apr 11, 2019 at 05:24:08PM +, Jason Gunthorpe wrote:
> > > > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > > > Yuval Shaia  wrote:
> > > > >
> > > > > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > > > > more software runs in virtualized environments.
> > > > > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > > > >
> > > > > > > Virtio is the optimal solution since it is the de facto para-virtualization
> > > > > > > technology and also because the Virtio specification
> > > > > > > allows Hardware Vendors to support the Virtio protocol natively in order to
> > > > > > > achieve bare metal performance.
> > > > > > >
> > > > > > > This RFC is an effort to address challenges in defining the RDMA/RoCE
> > > > > > > Virtio Specification and a look forward at possible implementation
> > > > > > > techniques.
> > > > > > >
> > > > > > > Open issues/Todo list:
> > > > > > > The list is huge; this is only the starting point of the project.
> > > > > > > Anyway, here is one example of an item in the list:
> > > > > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> > > > > > >   in order to support for example 32K QPs we will need 64K VirtQs. Not sure
> > > > > > >   that this is reasonable, so one option is to have one for all and
> > > > > > >   multiplex the traffic on it. This is not a good approach as by design it
> > > > > > >   introduces potential starvation. Another approach would be multiple
> > > > > > >   queues and round-robin (for example) between them.
> > > > > > >
> > > > > > > Expectations from this posting:
> > > > > > > In general, any comment is welcome, starting from "hey, drop this as it is a
> > > > > > > very bad idea" to "yeah, go ahead, we really want it".
> > > > > > > The idea here is that since it is not a minor effort I first want to know if
> > > > > > > there is some sort of interest in the community for such a device.
> > > > >
> > > > > My first reaction is: Sounds sensible, but it would be good to have a
> > > > > spec for this :)
> > > >
> > > > I'm unclear why you'd want to have a virtio queue for anything other
> > > > that some kind of command channel.
> > > >
> > > > I'm not sure a QP or CQ benefits from this??
> > >
> > > Virtqueue is a standard mechanism to pass data from guest to host. By
> >
> > And vice versa (CQ?)
> >
> > > saying that - it really sounds like QP send and recv rings. So my thought
> > > is to use a standard way for rings. As i've learned this is how it is used
> > > by other virtio devices ex virtio-net.
> > >
> > > >
> > > > Jason
> > >
> I would like to ask a more basic question: how will a virtio queue glue
> with actual h/w QPs? I may be too naive though.

I have to admit that I have no idea.
This work is based on an emulated device, so in my case the emulated device
creates the virtqueue. I guess that a HW device will create a QP and
expose a virtqueue interface to it.
The same driver should serve both the SW and HW devices.

One of the objectives of this RFC is to collaborate on this effort and to
gather implementation notes/ideas from HW vendors.

> 
> -Regards
> Devesh


Re: [Qemu-devel] [RFC 3/3] RDMA/virtio-rdma: VirtIO rdma driver

2019-04-19 Thread Yuval Shaia
On Mon, Apr 15, 2019 at 06:07:52PM -0700, Bart Van Assche wrote:
> On 4/11/19 4:01 AM, Yuval Shaia wrote:
> > +++ b/drivers/infiniband/hw/virtio/Kconfig
> > @@ -0,0 +1,6 @@
> > +config INFINIBAND_VIRTIO_RDMA
> > +   tristate "VirtIO Paravirtualized RDMA Driver"
> > +   depends on NETDEVICES && ETHERNET && PCI && INET
> > +   ---help---
> > + This driver provides low-level support for VirtIO Paravirtual
> > + RDMA adapter.
> 
> Does this driver really depend on Ethernet, or does it also work with
> Ethernet support disabled?

The device should eventually expose an Ethernet interface as well as IB.

> 
> > +static inline struct virtio_rdma_info *to_vdev(struct ib_device *ibdev)
> > +{
> > +   return container_of(ibdev, struct virtio_rdma_info, ib_dev);
> > +}
> 
> Is it really worth to introduce this function? Have you considered to
> use container_of(ibdev, struct virtio_rdma_info, ib_dev) directly instead
> of to_vdev()?

Agreed, I am not sure it is really needed; I just saw that some drivers use
this pattern.
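
For context, the pattern under discussion can be reproduced in userspace as follows. The container_of() macro here is a simplified version written with offsetof() (the kernel's macro adds type checking), and both structs are reduced stand-ins; an extra member is placed before ib_dev only to show that the offset matters.

```c
#include <stddef.h>

/* Simplified userspace container_of(): member pointer -> enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced stand-ins for the real structures. */
struct ib_device {
	int index;
};

struct virtio_rdma_info {
	int other_state; /* hypothetical member, so ib_dev is not at offset 0 */
	struct ib_device ib_dev;
};

/* The wrapper from the patch: hides the type and member names from callers. */
struct virtio_rdma_info *to_vdev(struct ib_device *ibdev)
{
	return container_of(ibdev, struct virtio_rdma_info, ib_dev);
}
```

The wrapper adds no behaviour; the trade-off is purely between a shorter call site and one more name to look up.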

> 
> > +static void rdma_ctrl_ack(struct virtqueue *vq)
> > +{
> > +   struct virtio_rdma_info *dev = vq->vdev->priv;
> > +
> > +   wake_up(&dev->acked);
> > +
> > +   printk("%s\n", __func__);
> > +}
> 
> Should that printk() be changed into pr_debug()? The same comment holds for
> all other printk() calls.

All prints will be removed, this is still WIP.

> 
> > +#define VIRTIO_RDMA_BOARD_ID   1
> > +#define VIRTIO_RDMA_HW_NAME    "virtio-rdma"
> > +#define VIRTIO_RDMA_HW_REV     1
> > +#define VIRTIO_RDMA_DRIVER_VER "1.0"
> 
> Is a driver version number useful in an upstream driver?

I've noticed that other drivers expose this in sysfs.

> 
> > +struct ib_cq *virtio_rdma_create_cq(struct ib_device *ibdev,
> > +   const struct ib_cq_init_attr *attr,
> > +   struct ib_ucontext *context,
> > +   struct ib_udata *udata)
> > +{
> > +   struct scatterlist in, out;
> > +   struct virtio_rdma_ib_cq *vcq;
> > +   struct cmd_create_cq *cmd;
> > +   struct rsp_create_cq *rsp;
> > +   struct ib_cq *cq = NULL;
> > +   int rc;
> > +
> > +   /* TODO: Check MAX_CQ */
> > +
> > +   cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
> > +   if (!cmd)
> > +           return ERR_PTR(-ENOMEM);
> > +
> > +   rsp = kmalloc(sizeof(*rsp), GFP_ATOMIC);
> > +   if (!rsp) {
> > +           kfree(cmd);
> > +           return ERR_PTR(-ENOMEM);
> > +   }
> > +
> > +   vcq = kzalloc(sizeof(*vcq), GFP_KERNEL);
> > +   if (!vcq)
> > +           goto out;
> 
> Are you sure that you want to mix GFP_ATOMIC and GFP_KERNEL in a single
> function?

Right, a mistake.
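
A userspace sketch of how the three allocations could be handled with one allocation policy and a single unwind path. Assumptions: malloc()/free() stand in for kmalloc()/kfree(), so there are no GFP flags here at all; the alloc_fn hook and flaky_alloc() are hypothetical and exist only so that a failure can be injected. In the kernel the equivalent would use the same GFP flag for all three calls, chosen by whether the caller may sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook so that failures can be injected. */
typedef void *(*alloc_fn)(size_t size);

struct cq_bufs {
	void *cmd;
	void *rsp;
	void *vcq;
};

/* Allocate all three buffers or none; returns 0 or -ENOMEM. */
int cq_bufs_alloc(struct cq_bufs *b, alloc_fn alloc,
		  size_t cmd_sz, size_t rsp_sz, size_t vcq_sz)
{
	assert(b != NULL && alloc != NULL);

	b->cmd = alloc(cmd_sz);
	if (!b->cmd)
		goto err;
	b->rsp = alloc(rsp_sz);
	if (!b->rsp)
		goto err_free_cmd;
	b->vcq = alloc(vcq_sz);
	if (!b->vcq)
		goto err_free_rsp;
	return 0;

err_free_rsp:
	free(b->rsp);
err_free_cmd:
	free(b->cmd);
err:
	b->cmd = b->rsp = b->vcq = NULL;
	return -ENOMEM;
}

void cq_bufs_free(struct cq_bufs *b)
{
	free(b->vcq);
	free(b->rsp);
	free(b->cmd);
}

/* Test helper: succeeds flaky_allocs_left times, then fails. */
int flaky_allocs_left = 2;

void *flaky_alloc(size_t size)
{
	if (flaky_allocs_left <= 0)
		return NULL;
	flaky_allocs_left--;
	return malloc(size);
}
```

With the labels in reverse order of allocation, each failure point frees exactly what was allocated before it, so no path can leak cmd or rsp.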

> 
> Thanks,
> 
> Bart.

Thanks.

> 


Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
On Thu, Apr 11, 2019 at 05:40:26PM +, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 08:34:20PM +0300, Yuval Shaia wrote:
> > On Thu, Apr 11, 2019 at 05:24:08PM +, Jason Gunthorpe wrote:
> > > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > > Yuval Shaia  wrote:
> > > > 
> > > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > > more software runs in virtualized environments.
> > > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > > 
> > > > > Virtio is the optimal solution since it is the de facto para-virtualization
> > > > > technology and also because the Virtio specification
> > > > > allows Hardware Vendors to support the Virtio protocol natively in order to
> > > > > achieve bare metal performance.
> > > > > 
> > > > > This RFC is an effort to address challenges in defining the RDMA/RoCE
> > > > > Virtio Specification and a look forward at possible implementation
> > > > > techniques.
> > > > > 
> > > > > Open issues/Todo list:
> > > > > The list is huge; this is only the starting point of the project.
> > > > > Anyway, here is one example of an item in the list:
> > > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> > > > >   in order to support for example 32K QPs we will need 64K VirtQs. Not sure
> > > > >   that this is reasonable, so one option is to have one for all and
> > > > >   multiplex the traffic on it. This is not a good approach as by design it
> > > > >   introduces potential starvation. Another approach would be multiple
> > > > >   queues and round-robin (for example) between them.
> > > > > 
> > > > > Expectations from this posting:
> > > > > In general, any comment is welcome, starting from "hey, drop this as it is a
> > > > > very bad idea" to "yeah, go ahead, we really want it".
> > > > > The idea here is that since it is not a minor effort I first want to know if
> > > > > there is some sort of interest in the community for such a device.
> > > > 
> > > > My first reaction is: Sounds sensible, but it would be good to have a
> > > > spec for this :)
> > > 
> > > I'm unclear why you'd want to have a virtio queue for anything other
> > > that some kind of command channel.
> > > 
> > > I'm not sure a QP or CQ benefits from this??
> > 
> > Virtqueue is a standard mechanism to pass data from guest to host. By
> > saying that - it really sounds like QP send and recv rings. So my thought
> > is to use a standard way for rings. As i've learned this is how it is used
> > by other virtio devices ex virtio-net.
> 
> I doubt you can use virtio queues from userspace securely? Usually
> needs a dedicated page for each user space process

I have not yet started any work on the datapath. I guess you are right, but
this area needs further work.
Thanks for raising the concern.

As I said, there are many open issues at this stage.

> 
> Jason


Re: [RFC 3/3] RDMA/virtio-rdma: VirtIO rdma driver

2019-04-19 Thread Yuval Shaia
> > +
> > +   wake_up(&dev->acked);
> > +
> > +   printk("%s\n", __func__);
> 
> Cool:-)
> 
> this line should be for debug?

Yes

> 
> Zhu Yanjun
> 


Re: [PATCH 3/3] virtio-gpu api: VIRTIO_GPU_F_RESSOURCE_V2

2019-04-19 Thread Chia-I Wu
Hi,

I am still new to virgl, and missed the last round of discussion about
resource_create_v2.

From the discussion below, semantically resource_create_v2 creates a host
resource object _without_ any storage; memory_create creates a host memory
object which provides the storage.  Is that correct?

And this version of memory_create is probably the most special one among
its other potential variants, because it is the only(?) one that imports the
pre-allocated guest pages.

Do we expect these new commands to be supported by OpenGL, which does not
separate resources and memories?

On Thu, Apr 11, 2019 at 10:49 PM Gerd Hoffmann  wrote:

> On Thu, Apr 11, 2019 at 06:36:15PM -0700, Gurchetan Singh wrote:
> > On Wed, Apr 10, 2019 at 10:03 PM Gerd Hoffmann wrote:
> > >
> > > > > +/* VIRTIO_GPU_CMD_RESOURCE_CREATE_V2 */
> > > > > +struct virtio_gpu_cmd_resource_create_v2 {
> > > > > +   struct virtio_gpu_ctrl_hdr hdr;
> > > > > +   __le32 resource_id;
> > > > > +   __le32 format;
> > > > > +   __le32 width;
> > > > > +   __le32 height;
> > > > > +   /* 3d only */
> > > > > +   __le32 target;
> > > > > +   __le32 bind;
> > > > > +   __le32 depth;
> > > > > +   __le32 array_size;
> > > > > +   __le32 last_level;
> > > > > +   __le32 nr_samples;
> > > > > +   __le32 flags;
> > > > > +};
> > > >
> > > >
> > > > I assume this is always backed by some host side allocation, without any
> > > > guest side pages associated with it?
> > >
> > > No.  It is not backed at all yet.  Workflow would be like this:
> > >
> > >   (1) VIRTIO_GPU_CMD_RESOURCE_CREATE_V2
> > >   (2) VIRTIO_GPU_CMD_MEMORY_CREATE (see patch 2)
> > >   (3) VIRTIO_GPU_CMD_RESOURCE_MEMORY_ATTACH (see patch 2)
> >
> > Thanks for the clarification.
> >
> > >
> > > You could also create a larger pool with VIRTIO_GPU_CMD_MEMORY_CREATE,
> > > then go attach multiple resources to it.
> > >
> > > > If so, do we want the option for the guest to allocate?
> > >
> > > Allocation options are handled by VIRTIO_GPU_CMD_MEMORY_CREATE
> > > (initially guest allocated only, i.e. what virtio-gpu supports today,
> > > the plan is to add other allocation types later on).
> >
> > You want to cover Vulkan, host-allocated dma-bufs, and guest-allocated
> > dma-bufs with this, correct?  Let me know if it's a non-goal :-)
>
> Yes, even though it is not clear yet how we are going to handle
> host-allocated buffers in the vhost-user case ...

This might be another dumb question, but is this only an issue for
vhost-user(-gpu) case?  What mechanisms are used to map host dma-buf into
the guest address space?



>
> > If so, we might want to distinguish between memory types (kind of like
> > memoryTypeIndex in Vulkan). [Assuming memory_id is like resource_id]
>
> For the host-allocated buffers we surely want that, yes.
> For guest-allocated memory regions it isn't useful I think ...
>
Guest-allocated memory regions can be just another memory type.

But one needs to create the resource first to know which memory types can
be attached to it.  I think the metadata needs to be returned with
resource_create_v2.
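
As a sketch of what such metadata could enable: if resource_create_v2 returned a size and a bitmask of acceptable memory types (similar in spirit to memoryTypeBits in Vulkan's VkMemoryRequirements), the guest could validate an attach before issuing it. The enum values, struct layouts and can_attach() are hypothetical and not part of the proposed virtio-gpu commands.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory types; guest-allocated memory is just one of them. */
enum mem_type {
	MEM_GUEST = 0,
	MEM_HOST_DMABUF = 1,
	MEM_HOST_VULKAN = 2,
};

/* Hypothetical metadata returned when the resource is created. */
struct resource_info {
	uint64_t size;          /* bytes of storage the resource needs */
	uint32_t allowed_types; /* bit n set: mem_type n may be attached */
};

struct memory_obj {
	enum mem_type type;
	uint64_t size;
};

/* Can the resource be bound to mem at the given byte offset? */
bool can_attach(const struct resource_info *info,
		const struct memory_obj *mem, uint64_t offset)
{
	if (!(info->allowed_types & (1u << mem->type)))
		return false;
	if (offset > mem->size)
		return false;
	return info->size <= mem->size - offset;
}
```

This is also why the resource has to exist first: the mask and size are properties of the resource, not of the memory object.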

>
> > 1) Vulkan seems the most straightforward
> >
> > virtio_gpu_cmd_memory_create --> create kernel data structure,
> > vkAllocateMemory on the host or import guest memory into Vulkan,
> > depending on the memory type
> > virtio_gpu_cmd_resource_create_v2 -->  vkCreateImage +
> > vkGetImageMemoryRequirements on host
> > virtio_gpu_cmd_resource_attach_memory -->  vkBindImageMemory on host
>
> Yes.
>
> Note 1: virtio_gpu_cmd_memory_create + virtio_gpu_cmd_resource_create_v2
> ordering doesn't matter, so you can virtio_gpu_cmd_resource_create_v2
> first to figure stride and size, then adjust memory size accordingly.
>
> Note 2: The old virtio_gpu_cmd_resource_create variants can be used
> too if you don't need the _v2 features.
>
> Note 3: If I understand things correctly it would be valid to create a
> memory pool (allocate one big chunk of memory) with vkAllocateMemory,
> then bind multiple images at different offsets to it.
>
> > 2) With a guest allocated dma-buf using some new allocation library,
> >
> > virtio_gpu_cmd_resource_create_v2 --> host returns metadata describing
> > optimal allocation
> > virtio_gpu_cmd_memory_create --> allocate guest memory pages since
> > it's guest memory type
> > virtio_gpu_cmd_resource_attach_memory --> associate guest pages with
> > resource in kernel, send iovecs to host for bookkeeping
>
> virtio_gpu_cmd_memory_create sends the iovecs.  Otherwise correct.
>
> > 3) With gbm it's a little trickier,
> >
> > virtio_gpu_cmd_resource_create_v2 --> gbm_bo_create_with_modifiers,
> > get metadata in return
>
> Only get metadata in return.
>
> > virtio_gpu_cmd_memory_create --> create kernel data structure, but
> > don't allocate pages, nothing on the host
>
> Memory allocation happens here.  Probably makes sense to have a
> virtio_gpu_cmd_memory_create_host command here, because the parameters
> we 

Re: [RFC 3/3] RDMA/virtio-rdma: VirtIO rdma driver

2019-04-19 Thread Yanjun Zhu



On 2019/4/11 19:01, Yuval Shaia wrote:

Signed-off-by: Yuval Shaia 
---
  drivers/infiniband/Kconfig|   1 +
  drivers/infiniband/hw/Makefile|   1 +
  drivers/infiniband/hw/virtio/Kconfig  |   6 +
  drivers/infiniband/hw/virtio/Makefile |   4 +
  drivers/infiniband/hw/virtio/virtio_rdma.h|  40 +
  .../infiniband/hw/virtio/virtio_rdma_device.c |  59 ++
  .../infiniband/hw/virtio/virtio_rdma_device.h |  32 +
  drivers/infiniband/hw/virtio/virtio_rdma_ib.c | 711 ++
  drivers/infiniband/hw/virtio/virtio_rdma_ib.h |  48 ++
  .../infiniband/hw/virtio/virtio_rdma_main.c   | 149 
  .../infiniband/hw/virtio/virtio_rdma_netdev.c |  44 ++
  .../infiniband/hw/virtio/virtio_rdma_netdev.h |  33 +
  include/uapi/linux/virtio_ids.h   |   1 +
  13 files changed, 1129 insertions(+)
  create mode 100644 drivers/infiniband/hw/virtio/Kconfig
  create mode 100644 drivers/infiniband/hw/virtio/Makefile
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma.h
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.c
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.h
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.c
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.h
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_main.c
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.c
  create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.h

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..218a47d4cecf 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -107,6 +107,7 @@ source "drivers/infiniband/hw/hfi1/Kconfig"
  source "drivers/infiniband/hw/qedr/Kconfig"
  source "drivers/infiniband/sw/rdmavt/Kconfig"
  source "drivers/infiniband/sw/rxe/Kconfig"
+source "drivers/infiniband/hw/virtio/Kconfig"
  endif
  
  source "drivers/infiniband/ulp/ipoib/Kconfig"

diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index e4f31c1be8f7..10ffb2c421e4 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1/
  obj-$(CONFIG_INFINIBAND_HNS)  += hns/
  obj-$(CONFIG_INFINIBAND_QEDR) += qedr/
  obj-$(CONFIG_INFINIBAND_BNXT_RE)  += bnxt_re/
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA)   += virtio/
diff --git a/drivers/infiniband/hw/virtio/Kconfig b/drivers/infiniband/hw/virtio/Kconfig
new file mode 100644
index 000000000000..92e41691cf5d
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Kconfig
@@ -0,0 +1,6 @@
+config INFINIBAND_VIRTIO_RDMA
+   tristate "VirtIO Paravirtualized RDMA Driver"
+   depends on NETDEVICES && ETHERNET && PCI && INET
+   ---help---
+ This driver provides low-level support for VirtIO Paravirtual
+ RDMA adapter.
diff --git a/drivers/infiniband/hw/virtio/Makefile b/drivers/infiniband/hw/virtio/Makefile
new file mode 100644
index 000000000000..fb637e467167
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += virtio_rdma.o
+
+virtio_rdma-y := virtio_rdma_main.o virtio_rdma_device.o virtio_rdma_ib.o \
+virtio_rdma_netdev.o
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma.h b/drivers/infiniband/hw/virtio/virtio_rdma.h
new file mode 100644
index 000000000000..7896a2dfb812
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma.h
@@ -0,0 +1,40 @@
+/*
+ * Virtio RDMA device: Driver main data types
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA__
+#define __VIRTIO_RDMA__
+
+#include 
+#include 
+
+struct virtio_rdma_info {
+   struct ib_device ib_dev;
+   struct virtio_device *vdev;
+   struct virtqueue *ctrl_vq;
+   wait_queue_head_t acked; /* arm on send to host, release on recv */
+   struct net_device *netdev;
+};
+
+static inline struct virtio_rdma_info *to_vdev(struct ib_device *ibdev)
+{
+   return container_of(ibdev, struct virtio_rdma_info, ib_dev);
+}
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_device.c 

Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
On Thu, Apr 11, 2019 at 08:34:20PM +0300, Yuval Shaia wrote:
> On Thu, Apr 11, 2019 at 05:24:08PM +, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > Yuval Shaia  wrote:
> > > 
> > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > more software runs in virtualized environments.
> > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > 
> > > > Virtio is the optimal solution since it is the de facto para-virtualization
> > > > technology and also because the Virtio specification
> > > > allows Hardware Vendors to support the Virtio protocol natively in order to
> > > > achieve bare metal performance.
> > > > 
> > > > This RFC is an effort to address challenges in defining the RDMA/RoCE
> > > > Virtio Specification and a look forward at possible implementation
> > > > techniques.
> > > > 
> > > > Open issues/Todo list:
> > > > The list is huge; this is only the starting point of the project.
> > > > Anyway, here is one example of an item in the list:
> > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> > > >   in order to support for example 32K QPs we will need 64K VirtQs. Not sure
> > > >   that this is reasonable, so one option is to have one for all and
> > > >   multiplex the traffic on it. This is not a good approach as by design it
> > > >   introduces potential starvation. Another approach would be multiple
> > > >   queues and round-robin (for example) between them.
> > > > 
> > > > Expectations from this posting:
> > > > In general, any comment is welcome, starting from "hey, drop this as it is a
> > > > very bad idea" to "yeah, go ahead, we really want it".
> > > > The idea here is that since it is not a minor effort I first want to know if
> > > > there is some sort of interest in the community for such a device.
> > > 
> > > My first reaction is: Sounds sensible, but it would be good to have a
> > > spec for this :)
> > 
> > I'm unclear why you'd want to have a virtio queue for anything other
> > that some kind of command channel.
> > 
> > I'm not sure a QP or CQ benefits from this??
> 
> Virtqueue is a standard mechanism to pass data from guest to host. By

And vice versa (CQ?)

> saying that - it really sounds like QP send and recv rings. So my thought
> is to use a standard way for rings. As i've learned this is how it is used
> by other virtio devices ex virtio-net.
> 
> > 
> > Jason
> 


Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Jason Gunthorpe
On Thu, Apr 11, 2019 at 08:34:20PM +0300, Yuval Shaia wrote:
> On Thu, Apr 11, 2019 at 05:24:08PM +, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > > On Thu, 11 Apr 2019 14:01:54 +0300
> > > Yuval Shaia  wrote:
> > > 
> > > > Data center backends use more and more RDMA or RoCE devices and more and
> > > > more software runs in virtualized environment.
> > > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > > 
> > > > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > > > technology and also because the Virtio specification
> > > > allows Hardware Vendors to support Virtio protocol natively in order to
> > > > achieve bare metal performance.
> > > > 
> > > > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > > > Virtio Specification and a look forward on possible implementation
> > > > techniques.
> > > > 
> > > > Open issues/Todo list:
> > > > List is huge, this is only start point of the project.
> > > > Anyway, here is one example of item in the list:
> > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> > > >   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
> > > >   that this is reasonable so one option is to have one for all and
> > > >   multiplex the traffic on it. This is not good approach as by design it
> > > >   introducing an optional starvation. Another approach would be multi
> > > >   queues and round-robin (for example) between them.
> > > > 
> > > > Expectations from this posting:
> > > > In general, any comment is welcome, starting from hey, drop this as it is a
> > > > very bad idea, to yeah, go ahead, we really want it.
> > > > Idea here is that since it is not a minor effort i first want to know if
> > > > there is some sort interest in the community for such device.
> > > 
> > > My first reaction is: Sounds sensible, but it would be good to have a
> > > spec for this :)
> > 
> > I'm unclear why you'd want to have a virtio queue for anything other
> > that some kind of command channel.
> > 
> > I'm not sure a QP or CQ benefits from this??
> 
> Virtqueue is a standard mechanism to pass data from guest to host. By
> saying that - it really sounds like QP send and recv rings. So my thought
> is to use a standard way for rings. As i've learned this is how it is used
> by other virtio devices ex virtio-net.

I doubt you can use virtio queues from userspace securely? Usually
needs a dedicated page for each user space process

Jason


Re: [RFC PATCH 04/12] s390/cio: introduce cio DMA pool

2019-04-19 Thread Sebastian Ott
On Fri, 12 Apr 2019, Halil Pasic wrote:
> On Thu, 11 Apr 2019 20:25:01 +0200 (CEST)
> Sebastian Ott  wrote:
> > I don't think we should use this global DMA pool. I guess it's OK for
> > stuff like airq (where we don't have a struct device at hand) but for
> > CCW we should use the device we have. Yes, this way we waste some memory
> > but all dma memory a device uses should fit in a page - so the wastage
> > is not too much.
> > 
> 
> Is what you envision an own gen_pool on for each subchannel (e.g. a
> struct io_subchannel member)?

Either that or if that's too much overhead simply map a page and create
a struct containing the few dma areas for that device.

> I'm struggling with understanding the expected benefits of a
> per-subchannel pool/allocator. Can you please tell me what benefits do
> you expect (over the current approach)?

Logically DMA is a capability of a device and the whole DMA API is built
around devices. Working around that just feels wrong. For practical
matters: DMA debugging will complain about misuse of a specific device or
driver.

> I understand you idea is to keep the CIO global pool for stuff that can
> not be tied to a single device (i.e. ariq). So the per device stuff would
> also mean more code. Would you be OK with postponing this alleged
> enhancement (i.e. implement it as a patch on top of this series)?

I don't like it but it's just in-kernel usage which we can change at any
time. So if it helps you to do it that way, why not..



Re: [RFC PATCH 04/12] s390/cio: introduce cio DMA pool

2019-04-19 Thread Sebastian Ott
On Fri, 5 Apr 2019, Halil Pasic wrote:
> To support protected virtualization cio will need to make sure the
> memory used for communication with the hypervisor is DMA memory.
> 
> Let us introduce a DMA pool to cio that will help us in allocating
> deallocating those chunks of memory.
> 
> We use a gen_pool backed with DMA pages to avoid each allocation
> effectively wasting a page, as we typically allocate much less
> than PAGE_SIZE.

I don't think we should use this global DMA pool. I guess it's OK for
stuff like airq (where we don't have a struct device at hand) but for
CCW we should use the device we have. Yes, this way we waste some memory
but all dma memory a device uses should fit in a page - so the wastage
is not too much.

Sebastian
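
For reference, the trade-off discussed here can be put into numbers. The
following is only a sketch, assuming 4 KiB pages and the figure of 224 bytes
of DMA memory per device mentioned elsewhere in this thread. The function
names are made up for illustration and are not part of cio.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 4 KiB pages, 224 bytes of DMA memory per device. */
#define EXAMPLE_PAGE_SIZE            4096u
#define EXAMPLE_DMA_BYTES_PER_DEVICE 224u

/* Bytes left unused when each device maps one whole page for itself. */
static size_t wasted_per_device_page(size_t page_size, size_t used)
{
	assert(used <= page_size);
	return page_size - used;
}

/* Number of devices whose DMA areas fit into one page of a shared pool. */
static size_t devices_per_pooled_page(size_t page_size, size_t used)
{
	assert(used > 0 && used <= page_size);
	return page_size / used;
}
```

With the assumed values, a page per device leaves 3872 of 4096 bytes unused,
while a shared pool could place 18 devices in one page. Whether that saving
is worth bypassing the per-device DMA API is the question of this thread.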



Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Jason Gunthorpe
On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> On Thu, 11 Apr 2019 14:01:54 +0300
> Yuval Shaia  wrote:
> 
> > Data center backends use more and more RDMA or RoCE devices and more and
> > more software runs in virtualized environment.
> > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > 
> > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > technology and also because the Virtio specification
> > allows Hardware Vendors to support Virtio protocol natively in order to
> > achieve bare metal performance.
> > 
> > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > Virtio Specification and a look forward on possible implementation
> > techniques.
> > 
> > Open issues/Todo list:
> > List is huge, this is only start point of the project.
> > Anyway, here is one example of item in the list:
> > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> >   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
> >   that this is reasonable so one option is to have one for all and
> >   multiplex the traffic on it. This is not good approach as by design it
> >   introducing an optional starvation. Another approach would be multi
> >   queues and round-robin (for example) between them.
> > 
> > Expectations from this posting:
> > In general, any comment is welcome, starting from hey, drop this as it is a
> > very bad idea, to yeah, go ahead, we really want it.
> > Idea here is that since it is not a minor effort i first want to know if
> > there is some sort interest in the community for such device.
> 
> My first reaction is: Sounds sensible, but it would be good to have a
> spec for this :)

I'm unclear why you'd want to have a virtio queue for anything other
than some kind of command channel.

I'm not sure a QP or CQ benefits from this??

Jason


Re: [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
On Thu, Apr 11, 2019 at 05:24:08PM +, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:
> > On Thu, 11 Apr 2019 14:01:54 +0300
> > Yuval Shaia  wrote:
> > 
> > > Data center backends use more and more RDMA or RoCE devices and more and
> > > more software runs in virtualized environment.
> > > There is a need for a standard to enable RDMA/RoCE on Virtual Machines.
> > > 
> > > Virtio is the optimal solution since is the de-facto para-virtualizaton
> > > technology and also because the Virtio specification
> > > allows Hardware Vendors to support Virtio protocol natively in order to
> > > achieve bare metal performance.
> > > 
> > > This RFC is an effort to addresses challenges in defining the RDMA/RoCE
> > > Virtio Specification and a look forward on possible implementation
> > > techniques.
> > > 
> > > Open issues/Todo list:
> > > List is huge, this is only start point of the project.
> > > Anyway, here is one example of item in the list:
> > > > - Multi VirtQ: Every QP has two rings and every CQ has one. This means that
> > >   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
> > >   that this is reasonable so one option is to have one for all and
> > >   multiplex the traffic on it. This is not good approach as by design it
> > >   introducing an optional starvation. Another approach would be multi
> > >   queues and round-robin (for example) between them.
> > > 
> > > Expectations from this posting:
> > > > In general, any comment is welcome, starting from hey, drop this as it is a
> > > very bad idea, to yeah, go ahead, we really want it.
> > > Idea here is that since it is not a minor effort i first want to know if
> > > there is some sort interest in the community for such device.
> > 
> > My first reaction is: Sounds sensible, but it would be good to have a
> > spec for this :)
> 
> I'm unclear why you'd want to have a virtio queue for anything other
> that some kind of command channel.
> 
> I'm not sure a QP or CQ benefits from this??

Virtqueue is a standard mechanism to pass data from guest to host. That
sounds a lot like QP send and recv rings, so my thought is to use a standard
way for rings. As I've learned, this is how it is used by other virtio
devices, e.g. virtio-net.

> 
> Jason


[RFC 3/3] RDMA/virtio-rdma: VirtIO rdma driver

2019-04-19 Thread Yuval Shaia
Signed-off-by: Yuval Shaia 
---
 drivers/infiniband/Kconfig|   1 +
 drivers/infiniband/hw/Makefile|   1 +
 drivers/infiniband/hw/virtio/Kconfig  |   6 +
 drivers/infiniband/hw/virtio/Makefile |   4 +
 drivers/infiniband/hw/virtio/virtio_rdma.h|  40 +
 .../infiniband/hw/virtio/virtio_rdma_device.c |  59 ++
 .../infiniband/hw/virtio/virtio_rdma_device.h |  32 +
 drivers/infiniband/hw/virtio/virtio_rdma_ib.c | 711 ++
 drivers/infiniband/hw/virtio/virtio_rdma_ib.h |  48 ++
 .../infiniband/hw/virtio/virtio_rdma_main.c   | 149 
 .../infiniband/hw/virtio/virtio_rdma_netdev.c |  44 ++
 .../infiniband/hw/virtio/virtio_rdma_netdev.h |  33 +
 include/uapi/linux/virtio_ids.h   |   1 +
 13 files changed, 1129 insertions(+)
 create mode 100644 drivers/infiniband/hw/virtio/Kconfig
 create mode 100644 drivers/infiniband/hw/virtio/Makefile
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_device.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_ib.h
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_main.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.c
 create mode 100644 drivers/infiniband/hw/virtio/virtio_rdma_netdev.h

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..218a47d4cecf 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -107,6 +107,7 @@ source "drivers/infiniband/hw/hfi1/Kconfig"
 source "drivers/infiniband/hw/qedr/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
+source "drivers/infiniband/hw/virtio/Kconfig"
 endif
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index e4f31c1be8f7..10ffb2c421e4 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1/
 obj-$(CONFIG_INFINIBAND_HNS)   += hns/
 obj-$(CONFIG_INFINIBAND_QEDR)  += qedr/
 obj-$(CONFIG_INFINIBAND_BNXT_RE)   += bnxt_re/
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA)   += virtio/
diff --git a/drivers/infiniband/hw/virtio/Kconfig b/drivers/infiniband/hw/virtio/Kconfig
new file mode 100644
index ..92e41691cf5d
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Kconfig
@@ -0,0 +1,6 @@
+config INFINIBAND_VIRTIO_RDMA
+   tristate "VirtIO Paravirtualized RDMA Driver"
+   depends on NETDEVICES && ETHERNET && PCI && INET
+   ---help---
+ This driver provides low-level support for VirtIO Paravirtual
+ RDMA adapter.
diff --git a/drivers/infiniband/hw/virtio/Makefile b/drivers/infiniband/hw/virtio/Makefile
new file mode 100644
index ..fb637e467167
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += virtio_rdma.o
+
+virtio_rdma-y := virtio_rdma_main.o virtio_rdma_device.o virtio_rdma_ib.o \
+virtio_rdma_netdev.o
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma.h b/drivers/infiniband/hw/virtio/virtio_rdma.h
new file mode 100644
index ..7896a2dfb812
--- /dev/null
+++ b/drivers/infiniband/hw/virtio/virtio_rdma.h
@@ -0,0 +1,40 @@
+/*
+ * Virtio RDMA device: Driver main data types
+ *
+ * Copyright (C) 2019 Yuval Shaia Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef __VIRTIO_RDMA__
+#define __VIRTIO_RDMA__
+
+#include 
+#include 
+
+struct virtio_rdma_info {
+   struct ib_device ib_dev;
+   struct virtio_device *vdev;
+   struct virtqueue *ctrl_vq;
+   wait_queue_head_t acked; /* arm on send to host, release on recv */
+   struct net_device *netdev;
+};
+
+static inline struct virtio_rdma_info *to_vdev(struct ib_device *ibdev)
+{
+   return container_of(ibdev, struct virtio_rdma_info, ib_dev);
+}
+
+#endif
diff --git a/drivers/infiniband/hw/virtio/virtio_rdma_device.c b/drivers/infiniband/hw/virtio/virtio_rdma_device.c
new file mode 100644
index 

Re: [RFC PATCH 05/12] s390/cio: add protected virtualization support to cio

2019-04-19 Thread Sebastian Ott
On Fri, 5 Apr 2019, Halil Pasic wrote:
> @@ -1593,20 +1609,29 @@ struct ccw_device * __init ccw_device_create_console(struct ccw_driver *drv)
>   return ERR_CAST(sch);
>  
>   io_priv = kzalloc(sizeof(*io_priv), GFP_KERNEL | GFP_DMA);
> - if (!io_priv) {
> - put_device(&sch->dev);
> - return ERR_PTR(-ENOMEM);
> - }
> + if (!io_priv)
> + goto err_priv;
> + io_priv->dma_area = cio_dma_zalloc(sizeof(*io_priv->dma_area));

This is called very early - the dma pool is not yet initialized.



[RFC 2/3] hw/virtio-rdma: VirtIO rdma device

2019-04-19 Thread Yuval Shaia
Signed-off-by: Yuval Shaia 
---
 hw/Kconfig  |   1 +
 hw/rdma/Kconfig |   4 +
 hw/rdma/Makefile.objs   |   2 +
 hw/rdma/virtio/virtio-rdma-ib.c | 287 
 hw/rdma/virtio/virtio-rdma-ib.h |  93 +++
 hw/rdma/virtio/virtio-rdma-main.c   | 185 +
 hw/virtio/Makefile.objs |   1 +
 hw/virtio/virtio-rdma-pci.c | 108 
 include/hw/pci/pci.h|   1 +
 include/hw/virtio/virtio-rdma.h |  44 +++
 include/standard-headers/linux/virtio_ids.h |   1 +
 11 files changed, 727 insertions(+)
 create mode 100644 hw/rdma/Kconfig
 create mode 100644 hw/rdma/virtio/virtio-rdma-ib.c
 create mode 100644 hw/rdma/virtio/virtio-rdma-ib.h
 create mode 100644 hw/rdma/virtio/virtio-rdma-main.c
 create mode 100644 hw/virtio/virtio-rdma-pci.c
 create mode 100644 include/hw/virtio/virtio-rdma.h

diff --git a/hw/Kconfig b/hw/Kconfig
index d5ecd02070..88b9f15007 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -26,6 +26,7 @@ source pci-bridge/Kconfig
 source pci-host/Kconfig
 source pcmcia/Kconfig
 source pci/Kconfig
+source rdma/Kconfig
 source scsi/Kconfig
 source sd/Kconfig
 source smbios/Kconfig
diff --git a/hw/rdma/Kconfig b/hw/rdma/Kconfig
new file mode 100644
index 00..b10bd7182b
--- /dev/null
+++ b/hw/rdma/Kconfig
@@ -0,0 +1,4 @@
+config VIRTIO_RDMA
+bool
+default y
+depends on VIRTIO
diff --git a/hw/rdma/Makefile.objs b/hw/rdma/Makefile.objs
index c354e60e5b..ed640882be 100644
--- a/hw/rdma/Makefile.objs
+++ b/hw/rdma/Makefile.objs
@@ -3,3 +3,5 @@ obj-$(CONFIG_PCI) += rdma_utils.o rdma_backend.o rdma_rm.o rdma.o
 obj-$(CONFIG_PCI) += vmw/pvrdma_dev_ring.o vmw/pvrdma_cmd.o \
  vmw/pvrdma_qp_ops.o vmw/pvrdma_main.o
 endif
+obj-$(CONFIG_VIRTIO_RDMA) += virtio/virtio-rdma-main.o \
+ virtio/virtio-rdma-ib.o
diff --git a/hw/rdma/virtio/virtio-rdma-ib.c b/hw/rdma/virtio/virtio-rdma-ib.c
new file mode 100644
index 00..2590a831a2
--- /dev/null
+++ b/hw/rdma/virtio/virtio-rdma-ib.c
@@ -0,0 +1,287 @@
+/*
+ * Virtio RDMA Device - IB verbs
+ *
+ * Copyright (C) 2019 Oracle
+ *
+ * Authors:
+ *  Yuval Shaia 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include 
+
+#include "qemu/osdep.h"
+
+#include "virtio-rdma-ib.h"
+#include "../rdma_utils.h"
+#include "../rdma_rm.h"
+#include "../rdma_backend.h"
+
+int virtio_rdma_query_device(VirtIORdma *rdev, struct iovec *in,
+ struct iovec *out)
+{
+struct ibv_device_attr attr = {};
+int offs;
+size_t s;
+
+addrconf_addr_eui48((unsigned char *)&attr.sys_image_guid,
+(const char *)&rdev->netdev->mac);
+
+attr.max_mr_size = 4096;
+attr.page_size_cap = 4096;
+attr.vendor_id = 1;
+attr.vendor_part_id = 1;
+attr.hw_ver = VIRTIO_RDMA_HW_VER;
+attr.max_qp = 1024;
+attr.max_qp_wr = 1024;
+attr.device_cap_flags = 0;
+attr.max_sge = 64;
+attr.max_sge_rd = 64;
+attr.max_cq = 1024;
+attr.max_cqe = 64;
+attr.max_mr = 1024;
+attr.max_pd = 1024;
+attr.max_qp_rd_atom = 0;
+attr.max_ee_rd_atom = 0;
+attr.max_res_rd_atom = 0;
+attr.max_qp_init_rd_atom = 0;
+attr.max_ee_init_rd_atom = 0;
+attr.atomic_cap = IBV_ATOMIC_NONE;
+attr.max_ee = 0;
+attr.max_rdd = 0;
+attr.max_mw = 0;
+attr.max_raw_ipv6_qp = 0;
+attr.max_raw_ethy_qp = 0;
+attr.max_mcast_grp = 0;
+attr.max_mcast_qp_attach = 0;
+attr.max_total_mcast_qp_attach = 0;
+attr.max_ah = 1024;
+attr.max_fmr = 0;
+attr.max_map_per_fmr = 0;
+attr.max_srq = 0;
+attr.max_srq_wr = 0;
+attr.max_srq_sge = 0;
+attr.max_pkeys = 1;
+attr.local_ca_ack_delay = 0;
+attr.phys_port_cnt = VIRTIO_RDMA_PORT_CNT;
+
+offs = offsetof(struct ibv_device_attr, sys_image_guid);
+s = iov_from_buf(out, 1, 0, (void *)&attr + offs, sizeof(attr) - offs);
+
+return s == sizeof(attr) - offs ? VIRTIO_RDMA_CTRL_OK :
+  VIRTIO_RDMA_CTRL_ERR;
+}
+
+int virtio_rdma_query_port(VirtIORdma *rdev, struct iovec *in,
+   struct iovec *out)
+{
+struct ibv_port_attr attr = {};
+struct cmd_query_port cmd = {};
+int offs;
+size_t s;
+
+s = iov_to_buf(in, 1, 0, &cmd, sizeof(cmd));
+if (s != sizeof(cmd)) {
+return VIRTIO_RDMA_CTRL_ERR;
+}
+
+if (cmd.port != 1) {
+return VIRTIO_RDMA_CTRL_ERR;
+}
+
+attr.state = IBV_PORT_ACTIVE;
+attr.max_mtu = attr.active_mtu = IBV_MTU_1024;
+attr.gid_tbl_len = 256;
+attr.port_cap_flags = 0;
+attr.max_msg_sz = 1024;
+attr.bad_pkey_cntr = 0;
+attr.qkey_viol_cntr = 0;
+attr.pkey_tbl_len = 1;
+attr.lid = 0;
+attr.sm_lid = 0;
+attr.lmc = 0;
+attr.max_vl_num 

Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver

2019-04-19 Thread Yuval Shaia
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc = {};
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end   = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> + nd_region->provider_data =  dev_to_virtio

Please delete the extra space.

> + (nd_region->dev.parent->parent);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
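
The error handling in the quoted probe function follows the usual pattern of
unwinding with goto labels in reverse order of acquisition. Below is a
minimal user-space sketch of that pattern. The struct, the function and the
fail_at parameter are hypothetical stand-ins for illustration, not the
driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for driver state; not the real struct virtio_pmem. */
struct example_dev {
	void *vq;	/* stands in for the virtqueue */
	void *bus;	/* stands in for the nvdimm bus */
};

/*
 * Acquire two resources in order. fail_at selects which step fails
 * (0 = none) so that each unwind path can be exercised.
 */
static int example_probe(struct example_dev *dev, int fail_at)
{
	int err;

	assert(dev != NULL);

	dev->vq = (fail_at == 1) ? NULL : malloc(16);
	if (!dev->vq) {
		err = -ENOMEM;
		goto out_err;
	}

	dev->bus = (fail_at == 2) ? NULL : malloc(16);
	if (!dev->bus) {
		err = -ENXIO;
		goto out_vq;
	}

	return 0;

out_vq:
	/* Undo only what was set up before the failing step. */
	free(dev->vq);
	dev->vq = NULL;
out_err:
	return err;
}
```

In this sketch every failing step assigns err before jumping, so no label
can be reached with a stale return value.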


[RFC 0/3] VirtIO RDMA

2019-04-19 Thread Yuval Shaia
Data center backends use more and more RDMA or RoCE devices, and more and
more software runs in virtualized environments.
There is a need for a standard to enable RDMA/RoCE on Virtual Machines.

Virtio is the optimal solution since it is the de-facto para-virtualization
technology and also because the Virtio specification
allows Hardware Vendors to support the Virtio protocol natively in order to
achieve bare metal performance.

This RFC is an effort to address challenges in defining the RDMA/RoCE
Virtio Specification and a look forward on possible implementation
techniques.

Open issues/Todo list:
The list is huge; this is only the starting point of the project.
Anyway, here is one example of an item in the list:
- Multi VirtQ: Every QP has two rings and every CQ has one. This means that
  in order to support, for example, 32K QPs we will need 64K VirtQs. Not sure
  that this is reasonable, so one option is to have one for all and
  multiplex the traffic on it. This is not a good approach, as by design it
  introduces potential starvation. Another approach would be multiple
  queues and round-robin (for example) between them.
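
To put the numbers above into code, here is a minimal sketch of the two
options: one virtqueue per QP ring versus a fixed set of shared virtqueues
with QPs spread across them. The function names and the modulo mapping are
assumptions for illustration only; nothing here is part of the proposed
device or driver.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_RINGS_PER_QP 2u	/* send ring and receive ring */

/* Virtqueues needed when every QP ring gets its own virtqueue. */
static uint32_t vqs_for_dedicated_rings(uint32_t num_qps)
{
	return num_qps * EXAMPLE_RINGS_PER_QP;
}

/*
 * Shared option: map a QP number onto one of num_shared_vqs virtqueues.
 * A plain modulo spreads consecutive QPs across the queues in turn.
 */
static uint32_t shared_vq_for_qp(uint32_t qpn, uint32_t num_shared_vqs)
{
	assert(num_shared_vqs > 0);
	return qpn % num_shared_vqs;
}
```

With 32K QPs the first function yields 64K virtqueues, which is the number
in question. The second keeps the virtqueue count fixed, at the price of
several QPs competing for the same queue.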

Expectations from this posting:
In general, any comment is welcome, from "hey, drop this as it is a
very bad idea" to "yeah, go ahead, we really want it".
The idea here is that since this is not a minor effort, I first want to know
if there is some sort of interest in the community for such a device.

The scope of the implementation is limited to probing the device and doing
some basic ibverbs commands. The data path is not yet implemented. So with
this one can expect only that the driver is (partially) loaded and that basic
queries and resource allocation are done.

One note regarding the patchset.
I know it is not standard to collapse patches from several repos as I did
here (qemu and linux), but I decided to do it anyway so the whole picture can
be seen.

patch 1: virtio-net: Move some virtio-net-pci decl to include/hw/virtio
This is a preliminary patch, just a hack so I will not need to
implement a new netdev
patch 2: hw/virtio-rdma: VirtIO rdma device
The implementation of the device
patch 3: RDMA/virtio-rdma: VirtIO rdma driver
The device driver

-- 
2.20.1



[RFC 1/3] virtio-net: Move some virtio-net-pci decl to include/hw/virtio

2019-04-19 Thread Yuval Shaia
Signed-off-by: Yuval Shaia 
---
 hw/virtio/virtio-net-pci.c | 18 ++-
 include/hw/virtio/virtio-net-pci.h | 35 ++
 2 files changed, 37 insertions(+), 16 deletions(-)
 create mode 100644 include/hw/virtio/virtio-net-pci.h

diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c
index db07ab9e21..63617d5550 100644
--- a/hw/virtio/virtio-net-pci.c
+++ b/hw/virtio/virtio-net-pci.c
@@ -17,24 +17,10 @@
 
 #include "qemu/osdep.h"
 
-#include "hw/virtio/virtio-net.h"
+#include "hw/virtio/virtio-net-pci.h"
 #include "virtio-pci.h"
 #include "qapi/error.h"
 
-typedef struct VirtIONetPCI VirtIONetPCI;
-
-/*
- * virtio-net-pci: This extends VirtioPCIProxy.
- */
-#define TYPE_VIRTIO_NET_PCI "virtio-net-pci-base"
-#define VIRTIO_NET_PCI(obj) \
-OBJECT_CHECK(VirtIONetPCI, (obj), TYPE_VIRTIO_NET_PCI)
-
-struct VirtIONetPCI {
-VirtIOPCIProxy parent_obj;
-VirtIONet vdev;
-};
-
 static Property virtio_net_properties[] = {
 DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
 VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
@@ -82,7 +68,7 @@ static void virtio_net_pci_instance_init(Object *obj)
 
 static const VirtioPCIDeviceTypeInfo virtio_net_pci_info = {
 .base_name = TYPE_VIRTIO_NET_PCI,
-.generic_name  = "virtio-net-pci",
+.generic_name  = TYPE_VIRTIO_NET_PCI_GENERIC,
 .transitional_name = "virtio-net-pci-transitional",
 .non_transitional_name = "virtio-net-pci-non-transitional",
 .instance_size = sizeof(VirtIONetPCI),
diff --git a/include/hw/virtio/virtio-net-pci.h b/include/hw/virtio/virtio-net-pci.h
new file mode 100644
index 00..f14e6ed992
--- /dev/null
+++ b/include/hw/virtio/virtio-net-pci.h
@@ -0,0 +1,35 @@
+/*
+ * PCI Virtio Network Device
+ *
+ * Copyright IBM, Corp. 2007
+ *
+ * Authors:
+ *  Anthony Liguori   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_VIRTIO_NET_PCI_H
+#define QEMU_VIRTIO_NET_PCI_H
+
+#include "hw/virtio/virtio-net.h"
+#include "virtio-pci.h"
+
+typedef struct VirtIONetPCI VirtIONetPCI;
+
+/*
+ * virtio-net-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_NET_PCI_GENERIC "virtio-net-pci"
+#define TYPE_VIRTIO_NET_PCI "virtio-net-pci-base"
+#define VIRTIO_NET_PCI(obj) \
+OBJECT_CHECK(VirtIONetPCI, (obj), TYPE_VIRTIO_NET_PCI)
+
+struct VirtIONetPCI {
+VirtIOPCIProxy parent_obj;
+VirtIONet vdev;
+};
+
+#endif
-- 
2.20.1



Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver

2019-04-19 Thread Yuval Shaia
On Wed, Apr 10, 2019 at 02:24:26PM +0200, Cornelia Huck wrote:
> On Wed, 10 Apr 2019 09:38:22 +0530
> Pankaj Gupta  wrote:
> 
> > This patch adds virtio-pmem driver for KVM guest.
> > 
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> > 
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/nvdimm/virtio_pmem.c |  88 ++
> >  drivers/virtio/Kconfig   |  10 +++
> >  drivers/virtio/Makefile  |   1 +
> >  drivers/virtio/pmem.c| 124 +++
> >  include/linux/virtio_pmem.h  |  60 +++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  10 +++
> >  7 files changed, 294 insertions(+)
> >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> >  create mode 100644 drivers/virtio/pmem.c
> >  create mode 100644 include/linux/virtio_pmem.h
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> (...)
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index ..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include 
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > +   { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +   { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> 
> IMHO, you don't gain much by splitting off this function...

It makes sense to have all the vq-init-related stuff in one function, since
pmem_lock and req_list are used only for the vq.
That said, maybe it would be better to have the three in one struct.

> 
> > +{
> > +   struct virtqueue *vq;
> > +
> > +   /* single vq */
> > +   vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +   host_ack, "flush_queue");
> > +   if (IS_ERR(vq))
> > +   return PTR_ERR(vq);
> 
> I'm personally not a fan of chained assignments... I think I'd just
> drop the 'vq' variable and operate on vpmem->req_vq directly.

+1

> 
> > +
> > > +   spin_lock_init(&vpmem->pmem_lock);
> > > +   INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +   return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +   int err = 0;
> > +   struct resource res;
> > +   struct virtio_pmem *vpmem;
> > +   struct nvdimm_bus *nvdimm_bus;
> > +   struct nd_region_desc ndr_desc = {};
> > > +   int nid = dev_to_node(&vdev->dev);
> > +   struct nd_region *nd_region;
> > +
> > +   if (!vdev->config->get) {
> > > +   dev_err(&vdev->dev, "%s failure: config disabled\n",
> 
> Maybe s/config disabled/config access disabled/ ? That seems to be the
> more common message.
> 
> > +   __func__);
> > +   return -EINVAL;
> > +   }
> > +
> > > +   vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +   GFP_KERNEL);
> 
> Here, the vpmem variable makes sense for convenience, but I'm again not
> a fan of the chaining :)

+1

> 
> > +   if (!vpmem) {
> > +   err = -ENOMEM;
> > +   goto out_err;
> > +   }
> > +
> > +   vpmem->vdev = vdev;
> > +   err = init_vq(vpmem);
> > +   if (err)
> > +   goto out_err;
> > +
> > > +   virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +   start, &vpmem->start);
> > > +   virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +   size, &vpmem->size);
> > +
> > +   res.start = vpmem->start;
> > +   res.end   = vpmem->start + vpmem->size-1;
> > +   vpmem->nd_desc.provider_name = "virtio-pmem";
> > +   vpmem->nd_desc.module = THIS_MODULE;
> > +
> > > +   vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > +   &vpmem->nd_desc);
> 
> And here :)
> 
> > +   if (!nvdimm_bus)
> > +   goto out_vq;
> > +
> > > +   dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > +
> > > +   ndr_desc.res = &res;
> > > +   ndr_desc.numa_node = nid;
> > > +   ndr_desc.flush = virtio_pmem_flush;
> > > +   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > +   set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > > +   nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +   nd_region->provider_data =  dev_to_virtio
> > +  

Re: [PATCH v5 0/6] virtio pmem driver

2019-04-19 Thread Arkadiusz Miśkiewicz
On 10/04/2019 06:08, Pankaj Gupta wrote:
>  This patch series has implementation for "virtio pmem". 
>  "virtio pmem" is fake persistent memory(nvdimm) in guest 
>  which allows to bypass the guest page cache. This also

Will kernel pstore be able to use this persistent memory for storing
crash dumps?

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )

Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

2019-04-19 Thread Stefano Garzarella
On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
> 
> On 2019/4/4 下午6:58, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >   - patch 1/4: reduces the number of credit update messages sent to the
> >transmitter
> >   - patch 2/4: allows the host to split packets on multiple buffers,
> >in this way, we can remove the packet size limit to
> >VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >   - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >allowed
> >   - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> > 
> > RFC:
> >   - maybe patch 4 can be replaced with multiple queues with different
> > buffer sizes or using EWMA to adapt the buffer size to the traffic
> 
> 
> Or EWMA + mergeable rx buffer, but if we decide to unify the datapath with
> virtio-net, we can reuse their codes.
> 
> 
> > 
> >   - as Jason suggested in a previous thread [1] I'll evaluate to use
> > virtio-net as transport, but I need to understand better how to
> > interface with it, maybe introducing sk_buff in virtio-vsock.
> > 
> > Any suggestions?
> 
> 
> My understanding is this is not a must, but if it makes things easier, we
> can do this.

Hopefully it would simplify maintenance and avoid duplicated code.

> 
> Another thing that may help is to implement sendpage(), which will greatly
> improve the performance.

Thanks for your suggestions!
I'll try to implement sendpage() in VSOCK to measure the improvement.

Cheers,
Stefano

Re: [PATCH RFC 3/4] vsock/virtio: change the maximum packet size allowed

2019-04-19 Thread Stefano Garzarella
On Mon, Apr 08, 2019 at 10:57:44AM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 08, 2019 at 04:55:31PM +0200, Stefano Garzarella wrote:
> > > Anyway, any change to this behavior requires compatibility so new guest
> > > drivers work with old vhost_vsock.ko.  Therefore we should probably just
> > > leave the limit for now.
> > 
> > I understood your point of view and I completely agree with you.
> > But, until we don't have a way to expose features/versions between guest
> > and host,
> 
> Why not use the standard virtio feature negotiation mechanism for this?
> 

Yes, I have this in mind :), but I want to understand better whether we can
also use virtio-net for this mechanism.
For now, I don't think limiting the packets to 64 KiB is a big issue.

What do you think about postponing this until it is clearer whether we can
use virtio-net or not (in order to avoid duplicated work)?
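
As a rough illustration of what negotiating the packet limit through feature
bits could look like, here is a minimal user-space sketch in C. The feature
bit VSOCK_F_BIG_PKT, its number, and the helper names are hypothetical (no
such bit is proposed in this thread); only the 64 KiB limit comes from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit number, used only for this sketch. */
#define VSOCK_F_BIG_PKT 0

/* Same value as VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in the patches of this series. */
#define MAX_PKT_BUF_SIZE ((size_t)1024 * 64)

/* A feature is usable only if both the device and the driver offer it. */
static uint64_t negotiate_features(uint64_t device_features,
				   uint64_t driver_features)
{
	return device_features & driver_features;
}

static bool has_feature(uint64_t negotiated, unsigned int bit)
{
	assert(bit < 64);
	return (negotiated >> bit) & 1;
}

/*
 * Pick the per-packet limit: keep the 64 KiB cap unless both sides agreed
 * that larger packets are acceptable, in which case only the credit limits
 * the packet length.
 */
static size_t max_pkt_len(uint64_t negotiated, size_t credit)
{
	if (has_feature(negotiated, VSOCK_F_BIG_PKT))
		return credit;
	return credit < MAX_PKT_BUF_SIZE ? credit : MAX_PKT_BUF_SIZE;
}
```

With an old host that does not offer the bit, the negotiated set is empty and
the sender keeps the 64 KiB cap, which is the compatibility property asked for
above.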

Thanks,
Stefano


Re: [PATCH RFC 3/4] vsock/virtio: change the maximum packet size allowed

2019-04-19 Thread Stefano Garzarella
On Fri, Apr 05, 2019 at 09:24:47AM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:37PM +0200, Stefano Garzarella wrote:
> > Since now we are able to split packets, we can avoid limiting
> > their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
> > Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max
> > packet size.
> > 
> > Signed-off-by: Stefano Garzarella 
> > ---
> >  net/vmw_vsock/virtio_transport_common.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/vmw_vsock/virtio_transport_common.c 
> > b/net/vmw_vsock/virtio_transport_common.c
> > index f32301d823f5..822e5d07a4ec 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -167,8 +167,8 @@ static int virtio_transport_send_pkt_info(struct 
> > vsock_sock *vsk,
> > vvs = vsk->trans;
> >  
> > /* we can send less than pkt_len bytes */
> > -   if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
> > -   pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> > +   if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > +   pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> 
> The next line limits pkt_len based on available credits:
> 
>   /* virtio_transport_get_credit might return less than pkt_len credit */
>   pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> 
> I think drivers/vhost/vsock.c:vhost_transport_do_send_pkt() now works
> correctly even with pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.

Correct.

> 
> The other ->send_pkt() callback is
> net/vmw_vsock/virtio_transport.c:virtio_transport_send_pkt_work() and it
> can already send any size packet.
> 
> Do you remember why VIRTIO_VSOCK_MAX_PKT_BUF_SIZE still needs to be the
> limit?  I'm wondering if we can get rid of it now and just limit packets
> to the available credits.

There are 2 reasons why I left this limit:
1. When the host receives a packet, it must be <=
   VIRTIO_VSOCK_MAX_PKT_BUF_SIZE [drivers/vhost/vsock.c:vhost_vsock_alloc_pkt()].
   So in this way we can limit the packets sent from the guest.

2. When the host sends packets, it helps us to increase the parallelism
   (especially if the guest has 64 KB RX buffers) because the user thread
   will split packets, calling transport->stream_enqueue() multiple times
   in net/vmw_vsock/af_vsock.c:vsock_stream_sendmsg() while
   vhost_transport_send_pkt_work() sends them to the guest.


Do you think this makes sense?
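
The splitting described in point 2 can be sketched as a simple loop. This is
a user-space model under stated assumptions: next_pkt_len() and
count_packets() are hypothetical names, and the credit is modelled as a fixed
per-call value rather than the result of virtio_transport_get_credit().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in this series. */
#define MAX_PKT_BUF_SIZE ((size_t)1024 * 64)

/* Length of the next packet: capped first by the constant, then by credit. */
static size_t next_pkt_len(size_t remaining, size_t credit)
{
	size_t len = remaining;

	if (len > MAX_PKT_BUF_SIZE)
		len = MAX_PKT_BUF_SIZE;
	if (len > credit)
		len = credit;
	return len;
}

/*
 * Count how many packets a sender would enqueue for 'total' bytes when each
 * call sees 'credit' bytes of credit. The loop stops if no credit is
 * available; a real sender would wait for a credit update instead.
 */
static size_t count_packets(size_t total, size_t credit)
{
	size_t sent = 0, pkts = 0;

	while (sent < total) {
		size_t len = next_pkt_len(total - sent, credit);

		if (len == 0)
			break;
		assert(len <= total - sent);
		sent += len;
		pkts++;
	}
	return pkts;
}
```

For example, a 256 KiB write with ample credit becomes four 64 KiB packets,
each of which can be handed to the worker as soon as it is enqueued.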

Thanks,
Stefano


Re: [PATCH RFC 3/4] vsock/virtio: change the maximum packet size allowed

2019-04-19 Thread Stefano Garzarella
On Mon, Apr 08, 2019 at 10:37:23AM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 05, 2019 at 12:07:47PM +0200, Stefano Garzarella wrote:
> > On Fri, Apr 05, 2019 at 09:24:47AM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Apr 04, 2019 at 12:58:37PM +0200, Stefano Garzarella wrote:
> > > > Since now we are able to split packets, we can avoid limiting
> > > > their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
> > > > Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max
> > > > packet size.
> > > > 
> > > > Signed-off-by: Stefano Garzarella 
> > > > ---
> > > >  net/vmw_vsock/virtio_transport_common.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/net/vmw_vsock/virtio_transport_common.c 
> > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > index f32301d823f5..822e5d07a4ec 100644
> > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > @@ -167,8 +167,8 @@ static int virtio_transport_send_pkt_info(struct 
> > > > vsock_sock *vsk,
> > > > vvs = vsk->trans;
> > > >  
> > > > /* we can send less than pkt_len bytes */
> > > > -   if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
> > > > -   pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> > > > +   if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > +   pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > 
> > > The next line limits pkt_len based on available credits:
> > > 
> > >   /* virtio_transport_get_credit might return less than pkt_len credit */
> > >   pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > 
> > > I think drivers/vhost/vsock.c:vhost_transport_do_send_pkt() now works
> > > correctly even with pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
> > 
> > Correct.
> > 
> > > 
> > > The other ->send_pkt() callback is
> > > net/vmw_vsock/virtio_transport.c:virtio_transport_send_pkt_work() and it
> > > can already send any size packet.
> > > 
> > > Do you remember why VIRTIO_VSOCK_MAX_PKT_BUF_SIZE still needs to be the
> > > limit?  I'm wondering if we can get rid of it now and just limit packets
> > > to the available credits.
> > 
> > There are 2 reasons why I left this limit:
> > 1. When the host receives a packets, it must be <=
> >VIRTIO_VSOCK_MAX_PKT_BUF_SIZE 
> > [drivers/vhost/vsock.c:vhost_vsock_alloc_pkt()]
> >So in this way we can limit the packets sent from the guest.
> 
> The general intent is to prevent the guest from sending huge buffers.
> This is good.
> 
> However, the guest must already obey the credit limit advertized by the
> host.  Therefore I think we should be checking against that instead of
> an arbitrary constant limit.
> 
> So I think the limit should be the receive buffer size, not
> VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.  But at this point the code doesn't know
> which connection the packet is associated with and cannot check the
> receive buffer size. :(
> 
> Anyway, any change to this behavior requires compatibility so new guest
> drivers work with old vhost_vsock.ko.  Therefore we should probably just
> leave the limit for now.

I understand your point of view and I completely agree with you.
But until we have a way to expose features/versions between guest
and host, it is probably better to leave the limit in order to stay
compatible with old vhost_vsock.

> 
> > 2. When the host send packets, it help us to increase the parallelism
> >(especially if the guest has 64 KB RX buffers) because the user thread
> >will split packets, calling multiple times transport->stream_enqueue()
> >in net/vmw_vsock/af_vsock.c:vsock_stream_sendmsg() while the
> >vhost_transport_send_pkt_work() send them to the guest.
> 
> Sorry, I don't understand the reasoning.  Overall this creates more
> work.  Are you saying the benefit is that
> vhost_transport_send_pkt_work() can run "early" and notify the guest of
> partial rx data before all of it has been enqueued?

Something like that. Your reasoning is more accurate.
Anyway, I'll do some tests in order to better understand the behaviour!

Thanks,
Stefano


Re: [PATCH] drm/qxl: drop prime import/export callbacks

2019-04-19 Thread David Airlie
On Tue, Apr 9, 2019 at 4:03 PM Gerd Hoffmann  wrote:
>
> On Tue, Apr 09, 2019 at 02:01:33PM +1000, Dave Airlie wrote:
> > On Sat, 12 Jan 2019 at 07:13, Dave Airlie  wrote:
> > >
> > > On Thu, 10 Jan 2019 at 18:17, Gerd Hoffmann  wrote:
> > > >
> > > > Also set prime_handle_to_fd and prime_fd_to_handle to NULL,
> > > > > so drm will not advertise DRM_PRIME_CAP_{IMPORT,EXPORT} to
> > > > userspace.
> >
> > It's been pointed out to me that disables DRI3 for these devices, I'm
> > not sure that is the solution we actually wanted.
> >
> > any ideas?
>
> Well.  Lets have a look at where we stand:
>
>  * drm_gem_prime_export() works with qxl, you'll get a dma-buf handle.
>
>  * Other drivers trying to map that dma-buf (drm_gem_map_dma_buf()
>callback) will not work, due to the ->gem_prime_get_sg_table()
>callback not being there.
>
>  * drm_gem_prime_import() will work with buffers from the same qxl
>device, there is a shortcut for this special case.  Otherwise it
>will not work, due to the ->gem_prime_import_sg_table() callback
>not being there.
>
> Bottom line: you can use prime to pass qxl object handles from one
> application to another.  But you can't actually export/import qxl
> buffer objects from/to other devices.
>
> Problem is that we have no way to signal to userspace that prime can
> be used that way.
>
> Setting DRM_PRIME_CAP_{IMPORT,EXPORT} even though the driver can't
> do that leads to other problems.  Userspace thinks it can have other
> devices (intel vgpu for example) handle the rendering, then import
> the rendered buffer into qxl for scanout.
>
> Should we add something like DRM_PRIME_CAP_SAME_DEVICE?

Yeah I expect we need some sort of same device only capability, so
that dri3 userspace can work.

If we just fail importing in these cases what happens? userspace just
gets confused, I know we used to print a backtrace if we hit the mmap
path, but if we didn't do that what happens?

Dave.


Re: [PATCH RFC 2/4] vhost/vsock: split packets to send using multiple buffers

2019-04-19 Thread Stefano Garzarella
On Fri, Apr 05, 2019 at 09:13:56AM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:36PM +0200, Stefano Garzarella wrote:
> > @@ -139,8 +139,18 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> > break;
> > }
> >  
> > -   len = iov_length(&vq->iov[out], in);
> > -   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
> > +   payload_len = pkt->len - pkt->off;
> > +   iov_len = iov_length(&vq->iov[out], in);
> > +   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
> > +
> > +   /* If the packet is greater than the space available in the
> > +* buffer, we split it using multiple buffers.
> > +*/
> > +   if (payload_len > iov_len - sizeof(pkt->hdr))
> 
> Integer underflow.  iov_len is controlled by the guest and therefore
> untrusted.  Please validate iov_len before assuming it's larger than
> sizeof(pkt->hdr).
> 

Okay, I'll do it!
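
The underflow is easy to reproduce outside the kernel, because both operands
are unsigned. Below is a minimal user-space sketch of one possible check,
assuming a hypothetical helper clamp_payload_len() and a stand-in HDR_SIZE
constant in place of sizeof(pkt->hdr); it is not the fix that was eventually
applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sizeof(pkt->hdr); the value is illustrative only. */
#define HDR_SIZE ((size_t)44)

/*
 * Hypothetical helper: compute how many payload bytes fit in a guest buffer
 * of iov_len bytes. Returns false if the buffer cannot even hold the header,
 * so iov_len - HDR_SIZE is never evaluated when it would wrap around.
 */
static bool clamp_payload_len(size_t iov_len, size_t remaining,
			      size_t *payload_len)
{
	assert(payload_len != NULL);

	if (iov_len < HDR_SIZE)
		return false;	/* guest-supplied buffer is too small */

	size_t room = iov_len - HDR_SIZE;	/* safe: iov_len >= HDR_SIZE */

	*payload_len = remaining < room ? remaining : room;
	return true;
}
```

Without the first check, a 10-byte buffer would make iov_len - HDR_SIZE wrap
to a huge size_t value and the payload would never be clamped.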

> > -   vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
> > +   vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
> > added = true;
> >  
> > +   pkt->off += payload_len;
> > +
> > +   /* If we didn't send all the payload we can requeue the packet
> > +* to send it with the next available buffer.
> > +*/
> > +   if (pkt->off < pkt->len) {
> > +   spin_lock_bh(&vsock->send_pkt_list_lock);
> > +   list_add(&pkt->list, &vsock->send_pkt_list);
> > +   spin_unlock_bh(&vsock->send_pkt_list_lock);
> > +   continue;
> 
> The virtio_transport_deliver_tap_pkt() call is skipped.  Packet capture
> should see the exact packets that are delivered.  I think this patch
> will present one large packet instead of several smaller packets that
> were actually delivered.

I'll modify virtio_transport_build_skb() to take care of pkt->off
and to read the payload size from the virtio_vsock_hdr.
Otherwise, should I introduce another field in virtio_vsock_pkt to store
the payload size?

Thanks,
Stefano


Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

2019-04-19 Thread Stefano Garzarella
On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > I simply love it that you have analysed the individual impact of
> > > each patch! Great job!
> > 
> > Thanks! I followed Stefan's suggestions!
> > 
> > > 
> > > For comparison's sake, it could be IMHO beneficial to add a column
> > > with virtio-net+vhost-net performance.
> > > 
> > > This will both give us an idea about whether the vsock layer introduces
> > > inefficiencies, and whether the virtio-net idea has merit.
> > > 
> > 
> > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > this way:
> >   $ qemu-system-x86_64 ... \
> >   -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> >   -device virtio-net-pci,netdev=net0
> > 
> > I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> > doesn't implement something like this.
> 
> Why not?
> 

I think that is because VSOCK was originally designed to be simple and
low-latency, but of course we can introduce something like that.

The current implementation directly copies the buffer from user space into a
virtio_vsock_pkt and enqueues it to be transmitted.

Maybe we can introduce a per-socket buffer where we accumulate bytes and
send them when it is full or when a timer fires. We can also introduce
a VSOCK_NODELAY (maybe using the same value as TCP_NODELAY for
compatibility) to send the buffer immediately for low-latency use cases.
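
A minimal user-space sketch of that idea follows. Everything in it is
hypothetical: struct tx_coalesce, TX_BUF_SIZE, tx_write() and tx_flush() are
names made up for illustration, the "transport" is reduced to two counters,
and the timer is modelled by the caller invoking tx_flush() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TX_BUF_SIZE 4096	/* illustrative per-socket buffer size */

struct tx_coalesce {
	unsigned char buf[TX_BUF_SIZE];
	size_t used;		/* bytes currently accumulated */
	bool nodelay;		/* models a VSOCK_NODELAY-like option */
	size_t flushes;		/* how many packets were handed over */
	size_t flushed_bytes;	/* total bytes handed over */
};

/* Hand the accumulated bytes to the transport (here: just count them). */
static void tx_flush(struct tx_coalesce *tx)
{
	if (tx->used == 0)
		return;
	tx->flushes++;
	tx->flushed_bytes += tx->used;
	tx->used = 0;
}

/*
 * Accumulate data; flush whenever the buffer fills up, and flush at the end
 * of every write if nodelay is set. A timer would call tx_flush() as well.
 */
static void tx_write(struct tx_coalesce *tx, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len > 0) {
		size_t room = TX_BUF_SIZE - tx->used;
		size_t n = len < room ? len : room;

		memcpy(tx->buf + tx->used, p, n);
		tx->used += n;
		p += n;
		len -= n;
		assert(tx->used <= TX_BUF_SIZE);
		if (tx->used == TX_BUF_SIZE)
			tx_flush(tx);
	}
	if (tx->nodelay)
		tx_flush(tx);
}
```

Small writes are merged until the buffer is full, which is the batching that
helps the small-packet rows in the benchmark, while nodelay keeps the current
send-immediately behaviour available.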

What do you think?

Thanks,
Stefano


Re: [PATCH] drm/virtio: move drm_connector_update_edid_property() call

2019-04-19 Thread Max Filippov
On Thu, Apr 4, 2019 at 9:46 PM Gerd Hoffmann  wrote:
>
> drm_connector_update_edid_property can sleep, we must not
> call it while holding a spinlock.  Move the call site.
>
> Reported-by: Max Filippov 
> Signed-off-by: Gerd Hoffmann 
> ---
>  drivers/gpu/drm/virtio/virtgpu_vq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This change fixes the warning for me, thanks!
Tested-by: Max Filippov 

-- 
Thanks.
-- Max


Re: [PATCH RFC 1/4] vsock/virtio: reduce credit update messages

2019-04-19 Thread Stefano Garzarella
On Thu, Apr 04, 2019 at 08:15:39PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:35PM +0200, Stefano Garzarella wrote:
> > @@ -256,6 +257,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> > *vsk,
> > struct virtio_vsock_sock *vvs = vsk->trans;
> > struct virtio_vsock_pkt *pkt;
> > size_t bytes, total = 0;
> > +   s64 free_space;
> 
> Why s64?  buf_alloc, fwd_cnt, and last_fwd_cnt are all u32.  fwd_cnt -
> last_fwd_cnt <= buf_alloc is always true.
> 

Right, I'll use a u32 for free_space!
It is a leftover, because initially I implemented something like
virtio_transport_has_space().
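
To see why u32 is enough, here is a small user-space sketch of the same
computation. The function names are hypothetical; the formula and the
threshold are the ones from the patch, and the invariant asserted below is
the one Stefan states (fwd_cnt - last_fwd_cnt <= buf_alloc).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in this series. */
#define MAX_PKT_BUF_SIZE (1024u * 64u)

/*
 * Free space as seen by the transmitter. All operands are u32, so the
 * difference fwd_cnt - last_fwd_cnt stays correct even after fwd_cnt has
 * wrapped around, as long as the invariant below holds.
 */
static uint32_t free_space_seen_by_peer(uint32_t buf_alloc, uint32_t fwd_cnt,
					uint32_t last_fwd_cnt)
{
	uint32_t consumed = fwd_cnt - last_fwd_cnt;

	assert(consumed <= buf_alloc);
	return buf_alloc - consumed;
}

/* Send a credit update only when the peer is about to run out of space. */
static bool need_credit_update(uint32_t buf_alloc, uint32_t fwd_cnt,
			       uint32_t last_fwd_cnt)
{
	return free_space_seen_by_peer(buf_alloc, fwd_cnt, last_fwd_cnt) <
	       MAX_PKT_BUF_SIZE;
}
```

The sketch ignores locking entirely; it only shows that the arithmetic does
not need a signed or 64-bit type.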

> > int err = -EFAULT;
> >  
> > spin_lock_bh(&vvs->rx_lock);
> > @@ -288,9 +290,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> > *vsk,
> > }
> > spin_unlock_bh(&vvs->rx_lock);
> >  
> > -   /* Send a credit pkt to peer */
> > -   virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
> > -   NULL);
> > +   /* We send a credit update only when the space available seen
> > +* by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
> > +*/
> > +   free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> 
> Locking?  These fields should be accessed under tx_lock.
> 

Yes, we need a lock, but looking at the code, vvs->fwd_cnt is written
while holding rx_lock (virtio_transport_dec_rx_pkt) and read while holding
tx_lock (virtio_transport_inc_tx_pkt).

Maybe we should use another spinlock shared between RX and TX for those
fields, or use atomic variables.

What do you suggest?

Thanks,
Stefano



Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

2019-04-19 Thread Stefano Garzarella
On Thu, Apr 04, 2019 at 03:14:10PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >  - patch 1/4: reduces the number of credit update messages sent to the
> >   transmitter
> >  - patch 2/4: allows the host to split packets on multiple buffers,
> >   in this way, we can remove the packet size limit to
> >   VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >   allowed
> >  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> > 
> > RFC:
> >  - maybe patch 4 can be replaced with multiple queues with different
> >buffer sizes or using EWMA to adapt the buffer size to the traffic
> > 
> >  - as Jason suggested in a previous thread [1] I'll evaluate to use
> >virtio-net as transport, but I need to understand better how to
> >interface with it, maybe introducing sk_buff in virtio-vsock.
> > 
> > Any suggestions?
> 
> Great performance results, nice job!

:)

> 
> Please include efficiency numbers (bandwidth / CPU utilization) in the
> future.  Due to the nature of these optimizations it's unlikely that
> efficiency has decreased, so I'm not too worried about it this time.

Thanks for the suggestion! I'll also measure the efficiency for future
optimizations.

Cheers,
Stefano


Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

2019-04-19 Thread Stefano Garzarella
On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> I simply love it that you have analysed the individual impact of
> each patch! Great job!

Thanks! I followed Stefan's suggestions!

> 
> For comparison's sake, it could be IMHO beneficial to add a column
> with virtio-net+vhost-net performance.
> 
> This will both give us an idea about whether the vsock layer introduces
> inefficiencies, and whether the virtio-net idea has merit.
> 

Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
this way:
  $ qemu-system-x86_64 ... \
  -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
  -device virtio-net-pci,netdev=net0

I also ran a test using TCP_NODELAY, just to be fair, because VSOCK
doesn't implement anything like this.
In both cases I set the MTU to the maximum allowed (65520).

                         VSOCK                       TCP + virtio-net + vhost
                  host -> guest [Gbps]                 host -> guest [Gbps]
pkt_size  before opt.  patch 1  patches 2+3  patch 4        TCP_NODELAY
  64         0.060      0.102      0.102      0.096       0.16       0.15
  256        0.22       0.40       0.40       0.36        0.32       0.57
  512        0.42       0.82       0.85       0.74        1.2        1.2
  1K         0.7        1.6        1.6        1.5         2.1        2.1
  2K         1.5        3.0        3.1        2.9         3.5        3.4
  4K         2.5        5.2        5.3        5.3         5.5        5.3
  8K         3.9        8.4        8.6        8.8         8.0        7.9
  16K        6.6       11.1       11.3       12.8         9.8       10.2
  32K        9.9       15.8       15.8       18.1        11.8       10.7
  64K       13.5       17.4       17.7       21.4        11.4       11.3
  128K      17.9       19.0       19.0       23.6        11.2       11.0
  256K      18.0       19.4       19.8       24.4        11.1       11.0
  512K      18.4       19.6       20.1       25.3        10.1       10.7

For small packet sizes (< 4K) I think we should implement some kind of
batching/merging, which could come for free if we use virtio-net as a transport.

Note: maybe I have something misconfigured, because TCP on virtio-net
in the host -> guest case tops out at around 11 Gbps.

                      VSOCK                    TCP + virtio-net + vhost
               guest -> host [Gbps]              guest -> host [Gbps]
pkt_size  before opt.  patch 1  patches 2+3        TCP_NODELAY
  64         0.088      0.100      0.101       0.24       0.24
  256        0.35       0.36       0.41        0.36       1.03
  512        0.70       0.74       0.73        0.69       1.6
  1K         1.1        1.3        1.3         1.1        3.0
  2K         2.4        2.4        2.6         2.1        5.5
  4K         4.3        4.3        4.5         3.8        8.8
  8K         7.3        7.4        7.6         6.6       20.0
  16K        9.2        9.6       11.1        12.3       29.4
  32K        8.3        8.9       18.1        19.3       28.2
  64K        8.3        8.9       25.4        20.6       28.7
  128K       7.2        8.7       26.7        23.1       27.9
  256K       7.7        8.4       24.9        28.5       29.4
  512K       7.7        8.5       25.0        28.3       29.3

For guest -> host I think the TCP_NODELAY test is important, because TCP
buffering increases the throughput a lot.

> One other comment: it makes sense to test with disabling smap
> mitigations (boot host and guest with nosmap).  No problem with also
> testing the default smap path, but I think you will discover that the
> performance impact of smap hardening being enabled is often severe for
> such benchmarks.

Thanks for this valuable suggestion, I'll redo all the tests with nosmap!

Cheers,
Stefano


[PATCH RFC 4/4] vsock/virtio: increase RX buffer size to 64 KiB

2019-04-19 Thread Stefano Garzarella
In order to increase host -> guest throughput with large packets,
we can use 64 KiB RX buffers.

Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 6d7a22cc20bf..43cce304408e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -10,7 +10,7 @@
 #define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE  128
 #define VIRTIO_VSOCK_DEFAULT_BUF_SIZE  (1024 * 256)
 #define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE  (1024 * 256)
-#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 4)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 64)
 #define VIRTIO_VSOCK_MAX_BUF_SIZE  0xFFFFFFFFUL
 #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
 
-- 
2.20.1



[PATCH RFC 3/4] vsock/virtio: change the maximum packet size allowed

2019-04-19 Thread Stefano Garzarella
Since now we are able to split packets, we can avoid limiting
their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max
packet size.

Signed-off-by: Stefano Garzarella 
---
 net/vmw_vsock/virtio_transport_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index f32301d823f5..822e5d07a4ec 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -167,8 +167,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock 
*vsk,
vvs = vsk->trans;
 
/* we can send less than pkt_len bytes */
-   if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
-   pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+   if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+   pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
 
/* virtio_transport_get_credit might return less than pkt_len credit */
pkt_len = virtio_transport_get_credit(vvs, pkt_len);
-- 
2.20.1



[PATCH RFC 2/4] vhost/vsock: split packets to send using multiple buffers

2019-04-19 Thread Stefano Garzarella
If the packets to be sent to the guest are bigger than the available
buffer, we can split them across multiple buffers and fix up
the length in the packet header.
This is safe since virtio-vsock supports only stream sockets.

Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vsock.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index bb5fc0e9fbc2..9951b7e661f6 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -94,7 +94,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
-   size_t len;
+   size_t iov_len, payload_len;
int head;
 
spin_lock_bh(&vsock->send_pkt_list_lock);
@@ -139,8 +139,18 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
 
-   len = iov_length(&vq->iov[out], in);
-   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
+   payload_len = pkt->len - pkt->off;
+   iov_len = iov_length(&vq->iov[out], in);
+   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
+
+   /* If the packet is greater than the space available in the
+* buffer, we split it using multiple buffers.
+*/
+   if (payload_len > iov_len - sizeof(pkt->hdr))
+   payload_len = iov_len - sizeof(pkt->hdr);
+
+   /* Set the correct length in the header */
+   pkt->hdr.len = cpu_to_le32(payload_len);
 
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
if (nbytes != sizeof(pkt->hdr)) {
@@ -149,16 +159,29 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
 
-   nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
-   if (nbytes != pkt->len) {
+   nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
+ &iov_iter);
+   if (nbytes != payload_len) {
virtio_transport_free_pkt(pkt);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
}
 
-   vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+   vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
added = true;
 
+   pkt->off += payload_len;
+
+   /* If we didn't send all the payload we can requeue the packet
+* to send it with the next available buffer.
+*/
+   if (pkt->off < pkt->len) {
+   spin_lock_bh(&vsock->send_pkt_list_lock);
+   list_add(&pkt->list, &vsock->send_pkt_list);
+   spin_unlock_bh(&vsock->send_pkt_list_lock);
+   continue;
+   }
+
if (pkt->reply) {
int val;
 
-- 
2.20.1



Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-19 Thread Darrick J. Wong
On Thu, Apr 04, 2019 at 06:08:44AM -0400, Pankaj Gupta wrote:
> 
> > On Thu 04-04-19 05:09:10, Pankaj Gupta wrote:
> > > 
> > > > > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > > > > Virtio pmem provides asynchronous host page cache flush
> > > > > > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > > > > > and xfs.
> > > > > > > 
> > > > > > > Signed-off-by: Pankaj Gupta 
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_file.c | 8 
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > > > > --- a/fs/xfs/xfs_file.c
> > > > > > > +++ b/fs/xfs/xfs_file.c
> > > > > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > > > >   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > > > >   return -EOPNOTSUPP;
> > > > > > >  
> > > > > > > + /* We don't support synchronous mappings with DAX files if
> > > > > > > +  * dax_device is not synchronous.
> > > > > > > +  */
> > > > > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > > > > + (vma->vm_flags & VM_SYNC))
> > > > > > > + return -EOPNOTSUPP;
> > > > > > > +
> > > > > > >   file_accessed(filp);
> > > > > > > >   vma->vm_ops = &xfs_file_vm_ops;
> > > > > > >   if (IS_DAX(file_inode(filp)))
> > > > > > 
> > > > > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > > > > > 
> > > > > > xfs_file_mmap(
> > > > > > 
> > > > > > {
> > > > > > struct inode*inode = file_inode(filp);
> > > > > > 
> > > > > > if (vma->vm_flags & VM_SYNC) {
> > > > > > if (!IS_DAX(inode))
> > > > > > return -EOPNOTSUPP;
> > > > > > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
> > > > > > return -EOPNOTSUPP;
> > > > > > }
> > > > > > 
> > > > > > file_accessed(filp);
> > > > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > > if (IS_DAX(inode))
> > > > > > vma->vm_flags |= VM_HUGEPAGE;
> > > > > > return 0;
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > > > > helper so that the filesystem code just doesn't have to care about
> > > > > > the details of checking for DAX+MAP_SYNC support
> > > > > 
> > > > > Seconded, since ext4 has nearly the same flag validation logic.
> > > > 
> > > 
> > > Only issue with this I see is we need the helper function only for
> > > supported
> > > filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
> > > will be compiled for every filesystem, even for those don't need it.
> > > 
> > > Sample patch below, does below patch is near to what you have in mind?
> > 
> > So I would put the helper in include/linux/dax.h and have it like:
> > 
> > bool daxdev_mapping_supported(struct vm_area_struct *vma,

Should this be static inline if you're putting it in the header file?

A comment ought to be added to describe what this predicate function
does.

> >   struct dax_device *dax_dev)
> > {
> > if (!(vma->vm_flags & VM_SYNC))
> > return true;
> > if (!IS_DAX(file_inode(vma->vm_file)))
> > return false;
> > return dax_synchronous(dax_dev);
> > }
> 
> Sure. This is much better. I was also not sure what to name the helper 
> function.
> I will go ahead with this unless 'Dave' & 'Darrick' have anything to add.

Jan's approach (modulo that one comment) looks good to me.

--D

> Thank you very much.
> 
> Best regards,
> Pankaj 
> 
> > 
> > Honza
> > > 
> > > =
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 1f2e2845eb76..614995170cac 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1196,12 +1196,17 @@ xfs_file_mmap(
> > > struct file *filp,
> > > struct vm_area_struct *vma)
> > >  {
> > > +   struct dax_device *dax_dev =
> > > xfs_find_daxdev_for_inode(file_inode(filp));
> > > +
> > > /*
> > > -* We don't support synchronous mappings for non-DAX files. At
> > > least
> > > -* until someone comes with a sensible use case.
> > > +* We don't support synchronous mappings for non-DAX files and
> > > +* for DAX files if underneath dax_device is not synchronous.
> > >  */
> > > -   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > -   return -EOPNOTSUPP;
> > > +   if (vma->vm_flags & VM_SYNC) {
> > > +   int err = is_synchronous(filp, dax_dev);
> > > +   if (err)
> > > +   return err;
> > > +   }
> > >  
> > > 
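
For reference, the predicate Jan proposes can be modelled in user space. The sketch below is only an illustration under stated assumptions, not kernel code: `vma_model`, `dax_device_model`, `VM_SYNC_MODEL` and the `_model` suffix are hypothetical stand-ins for `struct vm_area_struct`, `struct dax_device`, `VM_SYNC` and the real helpers. It picks up Darrick's two remarks by making the function `static inline` and giving it a describing comment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for VM_SYNC; the value is arbitrary. */
#define VM_SYNC_MODEL 0x1u

/* Hypothetical stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned int vm_flags;	/* flags requested for the mapping */
	bool file_is_dax;	/* models IS_DAX(file_inode(vma->vm_file)) */
};

/* Hypothetical stand-in for struct dax_device. */
struct dax_device_model {
	bool synchronous;	/* models dax_synchronous(dax_dev) */
};

/*
 * Return true if the mapping described by @vma may be set up on @dax_dev.
 * Mappings that do not ask for synchronous semantics are always allowed;
 * synchronous mappings need both a DAX file and a synchronous dax device.
 */
static inline bool
daxdev_mapping_supported_model(const struct vma_model *vma,
			       const struct dax_device_model *dax_dev)
{
	if (!(vma->vm_flags & VM_SYNC_MODEL))
		return true;
	if (!vma->file_is_dax)
		return false;
	return dax_dev->synchronous;
}
```

With a predicate of this shape, a filesystem's mmap handler would only need one call and one -EOPNOTSUPP return for the unsupported case.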

[PATCH RFC 1/4] vsock/virtio: reduce credit update messages

2019-04-19 Thread Stefano Garzarella
In order to reduce the number of credit update messages,
we send them only when the space available seen by the
transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.

Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h|  1 +
 net/vmw_vsock/virtio_transport_common.c | 14 +++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index e223e2632edd..6d7a22cc20bf 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -37,6 +37,7 @@ struct virtio_vsock_sock {
 	u32 tx_cnt;
 	u32 buf_alloc;
 	u32 peer_fwd_cnt;
+	u32 last_fwd_cnt;
 	u32 peer_buf_alloc;
 
 	/* Protected by rx_lock */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 602715fc9a75..f32301d823f5 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -206,6 +206,7 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
 void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
 {
 	spin_lock_bh(&vvs->tx_lock);
+	vvs->last_fwd_cnt = vvs->fwd_cnt;
 	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
 	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
 	spin_unlock_bh(&vvs->tx_lock);
@@ -256,6 +257,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 	struct virtio_vsock_sock *vvs = vsk->trans;
 	struct virtio_vsock_pkt *pkt;
 	size_t bytes, total = 0;
+	s64 free_space;
 	int err = -EFAULT;
 
 	spin_lock_bh(&vvs->rx_lock);
@@ -288,9 +290,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 	}
 	spin_unlock_bh(&vvs->rx_lock);
 
-	/* Send a credit pkt to peer */
-	virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
-					    NULL);
+	/* We send a credit update only when the space available seen
+	 * by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
+	 */
+	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+	if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		virtio_transport_send_credit_update(vsk,
+						    VIRTIO_VSOCK_TYPE_STREAM,
+						    NULL);
+	}
 
 	return total;
 
-- 
2.20.1
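
The condition this patch adds to virtio_transport_stream_do_dequeue() can be looked at in isolation. What follows is a minimal user-space sketch, not the kernel code: `credit_model`, `credit_update_needed` and `MAX_PKT_BUF_SIZE` are hypothetical names, and the 64 KiB value is an assumption made for the example. The field names mirror the ones in the patch. One deliberate difference: the sketch widens to 64 bits before subtracting from `buf_alloc`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold for this sketch (64 KiB). */
#define MAX_PKT_BUF_SIZE (64 * 1024)

/* Hypothetical model of the receiver-side credit fields. */
struct credit_model {
	uint32_t buf_alloc;	/* receive buffer space advertised to the peer */
	uint32_t fwd_cnt;	/* bytes consumed by the receiver so far */
	uint32_t last_fwd_cnt;	/* fwd_cnt value last sent to the peer */
};

/*
 * Return true when the free space as seen by the transmitter has dropped
 * below MAX_PKT_BUF_SIZE, i.e. when a credit update should be sent.
 */
static bool credit_update_needed(const struct credit_model *c)
{
	/* Bytes consumed since the peer last heard from us. Unsigned
	 * subtraction stays correct when the counters wrap around. */
	uint32_t unreported = c->fwd_cnt - c->last_fwd_cnt;
	int64_t free_space = (int64_t)c->buf_alloc - (int64_t)unreported;

	return free_space < MAX_PKT_BUF_SIZE;
}
```

In this model a receiver with a 256 KiB buffer stays silent until roughly 192 KiB have been consumed without the peer being told.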



[PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

2019-04-19 Thread Stefano Garzarella
This series tries to increase the throughput of virtio-vsock with slight
changes:
 - patch 1/4: reduces the number of credit update messages sent to the
  transmitter
 - patch 2/4: allows the host to split packets across multiple buffers;
              this way we can remove the packet size limit of
              VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
 - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
  allowed
 - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
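
The arithmetic behind splitting one packet across several fixed-size receive buffers (the idea of patch 2/4) can be sketched as below. The patch itself is not shown in this thread, so this is only an illustration of the splitting step and not the vhost code; `buffers_needed` and `chunk_len` are hypothetical helper names.

```c
#include <stddef.h>

/*
 * Number of receive buffers needed to carry @len payload bytes when each
 * buffer holds at most @buf_size bytes. @buf_size must be non-zero.
 */
static size_t buffers_needed(size_t len, size_t buf_size)
{
	return len / buf_size + (len % buf_size != 0);
}

/*
 * Length of the chunk that starts at byte offset @off of a @len-byte
 * payload. The caller guarantees off <= len.
 */
static size_t chunk_len(size_t len, size_t off, size_t buf_size)
{
	size_t remaining = len - off;

	return remaining < buf_size ? remaining : buf_size;
}
```

A sender would loop with `off` advancing by `chunk_len()` each round until `off == len`, emitting one buffer per iteration.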

RFC:
 - maybe patch 4 can be replaced with multiple queues with different
   buffer sizes or using EWMA to adapt the buffer size to the traffic

 - as Jason suggested in a previous thread [1] I'll evaluate using
   virtio-net as transport, but I need to understand better how to
   interface with it, maybe by introducing sk_buff in virtio-vsock.

Any suggestions?

Here are some benchmarks, step by step. I used iperf3 [2] modified with VSOCK
support:

                    host -> guest [Gbps]
pkt_size   before opt.   patch 1   patches 2+3   patch 4
  64          0.060       0.102       0.102       0.096
  256         0.22        0.40        0.40        0.36
  512         0.42        0.82        0.85        0.74
  1K          0.7         1.6         1.6         1.5
  2K          1.5         3.0         3.1         2.9
  4K          2.5         5.2         5.3         5.3
  8K          3.9         8.4         8.6         8.8
  16K         6.6        11.1        11.3        12.8
  32K         9.9        15.8        15.8        18.1
  64K        13.5        17.4        17.7        21.4
  128K       17.9        19.0        19.0        23.6
  256K       18.0        19.4        19.8        24.4
  512K       18.4        19.6        20.1        25.3

                    guest -> host [Gbps]
pkt_size   before opt.   patch 1   patches 2+3
  64          0.088       0.100       0.101
  256         0.35        0.36        0.41
  512         0.70        0.74        0.73
  1K          1.1         1.3         1.3
  2K          2.4         2.4         2.6
  4K          4.3         4.3         4.5
  8K          7.3         7.4         7.6
  16K         9.2         9.6        11.1
  32K         8.3         8.9        18.1
  64K         8.3         8.9        25.4
  128K        7.2         8.7        26.7
  256K        7.7         8.4        24.9
  512K        7.7         8.5        25.0

Thanks,
Stefano

[1] https://www.spinics.net/lists/netdev/msg531783.html
[2] https://github.com/stefano-garzarella/iperf/

Stefano Garzarella (4):
  vsock/virtio: reduce credit update messages
  vhost/vsock: split packets to send using multiple buffers
  vsock/virtio: change the maximum packet size allowed
  vsock/virtio: increase RX buffer size to 64 KiB

 drivers/vhost/vsock.c   | 35 -
 include/linux/virtio_vsock.h|  3 ++-
 net/vmw_vsock/virtio_transport_common.c | 18 +
 3 files changed, 44 insertions(+), 12 deletions(-)

-- 
2.20.1



Re: [PATCH] drm/cirrus: rewrite and modernize driver.

2019-04-19 Thread Stéphane Marchesin
On Wed, Apr 3, 2019 at 7:58 PM David Airlie  wrote:

> On Wed, Apr 3, 2019 at 5:23 PM Gerd Hoffmann  wrote:
> >
> > Time to kill some bad sample code people are copying from ;)
> >
> > This is a complete rewrite of the cirrus driver.  The cirrus_mode_set()
> > function is pretty much the only function which is carried over largely
> > unmodified.  Everything else is upside down.
> >
> > It is a single monster patch.  But given that it does some pretty
> > fundamental changes to the driver's workflow and also reduces the code
> > size by roughly 70% I think it'll still be a lot easier to review than a
> > longish baby-step patch series.
> >
> > Changes summary:
> >  - Given the small amount of video memory (4 MB) the cirrus device has
> >    the rewritten driver doesn't try to manage buffers there.  Instead
> >    it will blit (memcpy) the active framebuffer to video memory.
>
> Does it get any slower? With TTM I wrote it to migrate just the
> frontbuffer in/out of VRAM on modeset; won't we end up with more
> copies now?
>
> >  - All gem objects are stored in main memory and are managed using the
> >    new shmem helpers.  ttm is out.
> >  - Only DRM_FORMAT_RGB565 (depth 16) is supported.  The old driver does
> >    that too by default.  There was a module parameter which enables 24/32
> >    bpp support and disables higher resolutions (due to cirrus hardware
> >    constraints).  That parameter wasn't reimplemented.
> This might be the big sticking point, this is a userspace regression
> for a feature that was explicitly added a few years ago, can we really
> get away without it?
>

Chrome OS testing in VMs was one of the consumers of 32bpp on cirrus, and
we have gotten rid of cirrus in favor of virtio gpu, so we'd be fine. Of
course I can't speak for other consumers :)

Stéphane



>
> The rest looks good though!
> Dave.
>
> >  - The simple display pipeline is used.
> >  - The generic fbdev emulation is used.
> >  - It's an atomic driver now.
> >
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-19 Thread Adam Borowski
On Thu, Apr 04, 2019 at 02:12:30AM -0400, Pankaj Gupta wrote:
> > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > 
> > xfs_file_mmap(
> > 
> > {
> > 	struct inode	*inode = file_inode(filp);
> > 
> > 	if (vma->vm_flags & VM_SYNC) {
> > 		if (!IS_DAX(inode))
> > 			return -EOPNOTSUPP;
> > 		if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
> > 			return -EOPNOTSUPP;
> > 	}
> > 
> > 	file_accessed(filp);
> > 	vma->vm_ops = &xfs_file_vm_ops;
> > 	if (IS_DAX(inode))
> > 		vma->vm_flags |= VM_HUGEPAGE;
> > 	return 0;
> > }
> 
> Sure, this is better.

> > Even better, factor out all the "MAP_SYNC supported" checks into a
> > helper so that the filesystem code just doesn't have to care about
> > the details of checking for DAX+MAP_SYNC support
> 
> o.k. Will add one common helper function for both ext4 & xfs filesystems.

Note this pending patch for Goldwyn Rodrigues' patchset for btrfs:

https://lore.kernel.org/linux-btrfs/20190328102418.5466-1-kilob...@angband.pl/

We might want to coordinate.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄

Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver

2019-04-19 Thread Yuval Shaia
On Wed, Apr 03, 2019 at 08:40:13AM -0400, Pankaj Gupta wrote:
> 
> > Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver
> > 
> > On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> > > This patch adds virtio-pmem driver for KVM guest.
> > > 
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > > 
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > > 
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c |  84 +
> > >  drivers/virtio/Kconfig   |  10 +++
> > >  drivers/virtio/Makefile  |   1 +
> > >  drivers/virtio/pmem.c| 125 +++
> > >  include/linux/virtio_pmem.h  |  60 +++
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > >  7 files changed, 291 insertions(+)
> > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > >  create mode 100644 drivers/virtio/pmem.c
> > >  create mode 100644 include/linux/virtio_pmem.h
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > 
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > new file mode 100644
> > > index ..2a1b1ba2c1ff
> > > --- /dev/null
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -0,0 +1,84 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > 
> > > Is this comment style (//) acceptable?
> 
> In existing code, i can see same comment
> pattern for license at some places.

Is it preferred for new code?

> 
> > 
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include 
> > > +#include "nd.h"
> > > +
> > > > + /* The interrupt handler */
> > > > +void host_ack(struct virtqueue *vq)
> > > > +{
> > > > +	unsigned int len;
> > > > +	unsigned long flags;
> > > > +	struct virtio_pmem_request *req, *req_buf;
> > > > +	struct virtio_pmem *vpmem = vq->vdev->priv;
> > > > +
> > > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > > +		req->done = true;
> > > > +		wake_up(&req->host_acked);
> > > > +
> > > > +		if (!list_empty(&vpmem->req_list)) {
> > > > +			req_buf = list_first_entry(&vpmem->req_list,
> > > > +					struct virtio_pmem_request, list);
> > > > +			list_del(&vpmem->req_list);
> > > > +			req_buf->wq_buf_avail = true;
> > > > +			wake_up(&req_buf->wq_buf);
> > > > +		}
> > > > +	}
> > > > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(host_ack);
> > > > +
> > > > + /* The request submission function */
> > > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > > +{
> > > > +	int err;
> > > > +	unsigned long flags;
> > > > +	struct scatterlist *sgs[2], sg, ret;
> > > > +	struct virtio_device *vdev = nd_region->provider_data;
> > > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > > +	struct virtio_pmem_request *req;
> > > > +
> > > > +	might_sleep();
> > 
> > [1]
> > 
> > > > +	req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > > +	if (!req)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	req->done = req->wq_buf_avail = false;
> > > > +	strcpy(req->name, "FLUSH");
> > > > +	init_waitqueue_head(&req->host_acked);
> > > > +	init_waitqueue_head(&req->wq_buf);
> > > > +	sg_init_one(&sg, req->name, strlen(req->name));
> > > > +	sgs[0] = &sg;
> > > > +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > > +	sgs[1] = &ret;
> > > > +
> > > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > 
> > Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?
> 
> might sleep will give us a warning if we try to sleep from non-sleepable
> context. 
> 
> We are doing it other way, i.e might_sleep is not inside GFP_ATOMIC. 
> 
> > 
> > > > +	if (err) {
> > > > +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > > +
> > > > +		list_add_tail(&vpmem->req_list, &req->list);
> > > > +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > +		/* When host has read buffer, this completes via host_ack */
> > > > +		wait_event(req->wq_buf, req->wq_buf_avail);
> > > > +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +	}
> > > > +	virtqueue_kick(vpmem->req_vq);
> > 
> > You probably want to check return value here.
> 
> Don't think it will matter in this case?

Have no idea, if it fails 

Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver

2019-04-19 Thread Yuval Shaia
On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> This patch adds virtio-pmem driver for KVM guest.
> 
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
> 
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/virtio_pmem.c |  84 +
>  drivers/virtio/Kconfig   |  10 +++
>  drivers/virtio/Makefile  |   1 +
>  drivers/virtio/pmem.c| 125 +++
>  include/linux/virtio_pmem.h  |  60 +++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  10 +++
>  7 files changed, 291 insertions(+)
>  create mode 100644 drivers/nvdimm/virtio_pmem.c
>  create mode 100644 drivers/virtio/pmem.c
>  create mode 100644 include/linux/virtio_pmem.h
>  create mode 100644 include/uapi/linux/virtio_pmem.h
> 
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index ..2a1b1ba2c1ff
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0

Is this comment style (//) acceptable?

> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include 
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> +	unsigned int len;
> +	unsigned long flags;
> +	struct virtio_pmem_request *req, *req_buf;
> +	struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +		req->done = true;
> +		wake_up(&req->host_acked);
> +
> +		if (!list_empty(&vpmem->req_list)) {
> +			req_buf = list_first_entry(&vpmem->req_list,
> +					struct virtio_pmem_request, list);
> +			list_del(&vpmem->req_list);
> +			req_buf->wq_buf_avail = true;
> +			wake_up(&req_buf->wq_buf);
> +		}
> +	}
> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> +	int err;
> +	unsigned long flags;
> +	struct scatterlist *sgs[2], sg, ret;
> +	struct virtio_device *vdev = nd_region->provider_data;
> +	struct virtio_pmem *vpmem = vdev->priv;
> +	struct virtio_pmem_request *req;
> +
> +	might_sleep();

[1]

> +	req = kmalloc(sizeof(*req), GFP_KERNEL);
> +	if (!req)
> +		return -ENOMEM;
> +
> +	req->done = req->wq_buf_avail = false;
> +	strcpy(req->name, "FLUSH");
> +	init_waitqueue_head(&req->host_acked);
> +	init_waitqueue_head(&req->wq_buf);
> +	sg_init_one(&sg, req->name, strlen(req->name));
> +	sgs[0] = &sg;
> +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> +	sgs[1] = &ret;
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);

Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?

> +	if (err) {
> +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> +		list_add_tail(&vpmem->req_list, &req->list);
> +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +		/* When host has read buffer, this completes via host_ack */
> +		wait_event(req->wq_buf, req->wq_buf_avail);
> +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	}
> +	virtqueue_kick(vpmem->req_vq);

You probably want to check return value here.

> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
> + err = req->ret;
> + kfree(req);
> +
> + return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..9f634a2ed638 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
>  
> If unsure, say Y.
>  
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + depends on LIBNVDIMM
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> 

Re: [PATCH] drm/cirrus: rewrite and modernize driver.

2019-04-19 Thread David Airlie
On Wed, Apr 3, 2019 at 5:23 PM Gerd Hoffmann  wrote:
>
> Time to kill some bad sample code people are copying from ;)
>
> This is a complete rewrite of the cirrus driver.  The cirrus_mode_set()
> function is pretty much the only function which is carried over largely
> unmodified.  Everything else is upside down.
>
> It is a single monster patch.  But given that it does some pretty
> fundamental changes to the driver's workflow and also reduces the code
> size by roughly 70% I think it'll still be a lot easier to review than a
> longish baby-step patch series.
>
> Changes summary:
>  - Given the small amount of video memory (4 MB) the cirrus device has
>    the rewritten driver doesn't try to manage buffers there.  Instead
>    it will blit (memcpy) the active framebuffer to video memory.

Does it get any slower? With TTM I wrote it to migrate just the
frontbuffer in/out of VRAM on modeset; won't we end up with more
copies now?

>  - All gem objects are stored in main memory and are managed using the
>    new shmem helpers.  ttm is out.
>  - Only DRM_FORMAT_RGB565 (depth 16) is supported.  The old driver does
>    that too by default.  There was a module parameter which enables 24/32
>    bpp support and disables higher resolutions (due to cirrus hardware
>    constraints).  That parameter wasn't reimplemented.
This might be the big sticking point, this is a userspace regression
for a feature that was explicitly added a few years ago, can we really
get away without it?

The rest looks good though!
Dave.

>  - The simple display pipeline is used.
>  - The generic fbdev emulation is used.
>  - It's an atomic driver now.
>
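
To make the "blit (memcpy) the active framebuffer to video memory" step concrete, here is a small user-space sketch of a row-by-row copy for an RGB565 image. It is an illustration under assumptions, not the driver's code: `blit_rgb565` and its parameters are hypothetical names, and the separate pitches are only there to show that source and destination rows need not have the same stride.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a @width x @height RGB565 image from a shadow framebuffer in main
 * memory to video memory. Pitches are in bytes and may differ between the
 * two buffers; each must be at least width * 2.
 */
static void blit_rgb565(uint8_t *vram, size_t vram_pitch,
			const uint8_t *shadow, size_t shadow_pitch,
			size_t width, size_t height)
{
	const size_t row_bytes = width * 2;	/* RGB565: 2 bytes per pixel */
	size_t y;

	for (y = 0; y < height; y++)
		memcpy(vram + y * vram_pitch,
		       shadow + y * shadow_pitch, row_bytes);
}
```

Every call copies the whole visible area, which is the cost Dave's question about extra copies refers to.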


Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-19 Thread Darrick J. Wong
On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > Virtio pmem provides asynchronous host page cache flush
> > mechanism. we don't support 'MAP_SYNC' with virtio pmem 
> > and xfs.
> > 
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  fs/xfs/xfs_file.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..dced2eb8c91a 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > return -EOPNOTSUPP;
> >  
> > +   /* We don't support synchronous mappings with DAX files if
> > +* dax_device is not synchronous.
> > +*/
> > +   if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > +   xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > +   (vma->vm_flags & VM_SYNC))
> > +   return -EOPNOTSUPP;
> > +
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(file_inode(filp)))
> 
> All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> 
> xfs_file_mmap(
> 
> {
> 	struct inode	*inode = file_inode(filp);
> 
> 	if (vma->vm_flags & VM_SYNC) {
> 		if (!IS_DAX(inode))
> 			return -EOPNOTSUPP;
> 		if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
> 			return -EOPNOTSUPP;
> 	}
> 
> 	file_accessed(filp);
> 	vma->vm_ops = &xfs_file_vm_ops;
> 	if (IS_DAX(inode))
> 		vma->vm_flags |= VM_HUGEPAGE;
> 	return 0;
> }
> 
> 
> Even better, factor out all the "MAP_SYNC supported" checks into a
> helper so that the filesystem code just doesn't have to care about
> the details of checking for DAX+MAP_SYNC support

Seconded, since ext4 has nearly the same flag validation logic.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


[PATCH] virtio: Fix indentation of VIRTIO_MMIO

2019-04-19 Thread Fabrizio Castro
VIRTIO_MMIO config option block starts with a space, fix that.

Signed-off-by: Fabrizio Castro 
---
 drivers/virtio/Kconfig | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3589764..1b5c9f0 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -62,12 +62,12 @@ config VIRTIO_INPUT
 
 If unsure, say M.
 
- config VIRTIO_MMIO
+config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
depends on HAS_IOMEM && HAS_DMA
-   select VIRTIO
-   ---help---
-This drivers provides support for memory mapped virtio
+   select VIRTIO
+   ---help---
+This drivers provides support for memory mapped virtio
 platform device driver.
 
 If unsure, say N.
-- 
2.7.4



[PATCH] virtio-net: Remove inclusion of pci.h

2019-04-19 Thread Yuval Shaia
This header is not in use - remove it.

Signed-off-by: Yuval Shaia 
---
 drivers/net/virtio_net.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7eb38ea9ba56..07c1e81087b2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.19.1



[PATCH] virtio-net: Fix some minor formatting errors

2019-04-19 Thread Yuval Shaia
Signed-off-by: Yuval Shaia 
---
 drivers/net/virtio_net.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 07c1e81087b2..be1188815c72 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1587,7 +1587,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		dev->stats.tx_fifo_errors++;
 		if (net_ratelimit())
 			dev_warn(&dev->dev,
-				 "Unexpected TXQ (%d) queue failure: %d\n", qnum, err);
+				 "Unexpected TXQ (%d) queue failure: %d\n",
+				 qnum, err);
 		dev->stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
@@ -2383,7 +2384,7 @@ static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
 				  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg)) {
-		dev_warn(&vi->dev->dev, "Fail to set guest offload. \n");
+		dev_warn(&vi->dev->dev, "Fail to set guest offload.\n");
 		return -EINVAL;
 	}
 
@@ -3114,8 +3115,9 @@ static int virtnet_probe(struct virtio_device *vdev)
 			/* Should never trigger: MTU was previously validated
 			 * in virtnet_validate.
 			 */
-			dev_err(&vdev->dev, "device MTU appears to have changed "
-				"it is now %d < %d", mtu, dev->min_mtu);
+			dev_err(&vdev->dev,
+				"device MTU appears to have changed it is now %d < %d",
+				mtu, dev->min_mtu);
 			goto free;
 		}
 
-- 
2.19.1



Re: [PATCH v2 -next] drm/virtio: remove set but not used variable 'vgdev'

2019-04-19 Thread Mukesh Ojha



On 3/25/2019 2:56 PM, YueHaibing wrote:

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/gpu/drm/virtio/virtgpu_ttm.c: In function 'virtio_gpu_init_mem_type':
drivers/gpu/drm/virtio/virtgpu_ttm.c:117:28: warning:
  variable 'vgdev' set but not used [-Wunused-but-set-variable]

drivers/gpu/drm/virtio/virtgpu_ttm.c: In function 'virtio_gpu_bo_swap_notify':
drivers/gpu/drm/virtio/virtgpu_ttm.c:300:28: warning:
  variable 'vgdev' set but not used [-Wunused-but-set-variable]

It is never used since introduction in dc5698e80cf7 ("Add virtio gpu driver.")

Signed-off-by: YueHaibing 



Reviewed-by: Mukesh Ojha 

-Mukesh


---
v2: fix patch prefix
---
  drivers/gpu/drm/virtio/virtgpu_ttm.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_ttm.c 
b/drivers/gpu/drm/virtio/virtgpu_ttm.c
index d6225ba20b30..eb007c2569d8 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ttm.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ttm.c
@@ -114,10 +114,6 @@ static const struct ttm_mem_type_manager_func 
virtio_gpu_bo_manager_func = {
  static int virtio_gpu_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
struct ttm_mem_type_manager *man)
  {
-   struct virtio_gpu_device *vgdev;
-
-   vgdev = virtio_gpu_get_vgdev(bdev);
-
switch (type) {
case TTM_PL_SYSTEM:
/* System memory */
@@ -297,10 +293,8 @@ static void virtio_gpu_bo_move_notify(struct 
ttm_buffer_object *tbo,
  static void virtio_gpu_bo_swap_notify(struct ttm_buffer_object *tbo)
  {
struct virtio_gpu_object *bo;
-   struct virtio_gpu_device *vgdev;
  
  	bo = container_of(tbo, struct virtio_gpu_object, tbo);

-   vgdev = (struct virtio_gpu_device *)bo->gem_base.dev->dev_private;
  
  	if (bo->pages)

virtio_gpu_object_free_sg_table(bo);






Re: [PATCH v2 -next] drm/virtio: remove set but not used variable 'vgdev'

2019-04-19 Thread Mukesh Ojha


On 3/25/2019 2:56 PM, YueHaibing wrote:

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/gpu/drm/virtio/virtgpu_ttm.c: In function 'virtio_gpu_init_mem_type':
drivers/gpu/drm/virtio/virtgpu_ttm.c:117:28: warning:
  variable 'vgdev' set but not used [-Wunused-but-set-variable]

drivers/gpu/drm/virtio/virtgpu_ttm.c: In function 'virtio_gpu_bo_swap_notify':
drivers/gpu/drm/virtio/virtgpu_ttm.c:300:28: warning:
  variable 'vgdev' set but not used [-Wunused-but-set-variable]

It is never used since introduction in dc5698e80cf7 ("Add virtio gpu driver.")

Signed-off-by: YueHaibing 



Reviewed-by: Mukesh Ojha 


Thanks.
Mukesh


---
v2: fix patch prefix
---
  drivers/gpu/drm/virtio/virtgpu_ttm.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_ttm.c 
b/drivers/gpu/drm/virtio/virtgpu_ttm.c
index d6225ba20b30..eb007c2569d8 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ttm.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ttm.c
@@ -114,10 +114,6 @@ static const struct ttm_mem_type_manager_func 
virtio_gpu_bo_manager_func = {
  static int virtio_gpu_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
struct ttm_mem_type_manager *man)
  {
-   struct virtio_gpu_device *vgdev;
-
-   vgdev = virtio_gpu_get_vgdev(bdev);
-
switch (type) {
case TTM_PL_SYSTEM:
/* System memory */
@@ -297,10 +293,8 @@ static void virtio_gpu_bo_move_notify(struct 
ttm_buffer_object *tbo,
  static void virtio_gpu_bo_swap_notify(struct ttm_buffer_object *tbo)
  {
struct virtio_gpu_object *bo;
-   struct virtio_gpu_device *vgdev;
  
  	bo = container_of(tbo, struct virtio_gpu_object, tbo);

-   vgdev = (struct virtio_gpu_device *)bo->gem_base.dev->dev_private;
  
  	if (bo->pages)

virtio_gpu_object_free_sg_table(bo);




Re: [PATCH 1/3] drm/virtio: add missing drm_atomic_helper_shutdown() call.

2019-04-19 Thread Mukesh Ojha

Please at least mention here why it is required.

-Mukesh

On 4/1/2019 7:33 PM, Gerd Hoffmann wrote:

Signed-off-by: Gerd Hoffmann 
---
  drivers/gpu/drm/virtio/virtgpu_display.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c 
b/drivers/gpu/drm/virtio/virtgpu_display.c
index 653ec7d0bf4d..86843a4d6102 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -385,5 +385,6 @@ void virtio_gpu_modeset_fini(struct virtio_gpu_device 
*vgdev)
  
  	for (i = 0 ; i < vgdev->num_scanouts; ++i)

kfree(vgdev->outputs[i].edid);
+   drm_atomic_helper_shutdown(vgdev->ddev);
drm_mode_config_cleanup(vgdev->ddev);
  }



Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 19:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>>>> On Thu, 21 Mar 2019 15:04:37 +0200
>>>> Liran Alon  wrote:
>>>> 
>>>>>> 
>>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>>> to move the slaves too?  
>>>>> 
>>>>> No. Why would we move the slaves? The whole point is to make most 
>>>>> customer ignore the net-failover slaves and remain them “hidden” in their 
>>>>> dedicated netns.
>>>>> We won’t prevent customer from explicitly moving the net-failover slaves 
>>>>> out of this netns, but we will not move them out of there automatically.
>>>> 
>>>> 
>>>> The 2-device netvsc already handles case where master changes namespace.
>>> 
>>> Is it by moving slave with it?
>> 
>> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device").
>> It seems that when NetVSC master netdev changes netns, the VF is moved to 
>> the same netns by the NetVSC driver.
>> Kinda the opposite of what we are suggesting here to make sure that the 
>> net-failover master netdev is on a separate
>> netns than its slaves...
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST
> 
> Not exactly opposite I'd say.
> 
> If failover is in host ns, slaves in /primary and /standby, then moving
> failover to /container should move slaves to /container/primary and
> /container/standby.

Yes I agree.
I meant that they tried to keep the VF on the same netns as the NetVSC.
But of course what you just described is exactly the functionality I would have 
wanted in our net-failover mechanism.
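
To make the naming in that example concrete, here is a minimal userspace sketch
in C. It is not kernel code: slave_ns_path() and the idea of addressing a hidden
netns by a path such as /container/primary are assumptions made purely for
illustration, following the /primary and /standby notation used above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive the path of a slave's hidden netns from the
 * path of the netns holding the failover master. "/" stands for the host
 * namespace, as in the example above.
 */
static int slave_ns_path(const char *master_ns, const char *role,
			 char *out, size_t len)
{
	int n;

	assert(master_ns && role && out);

	/* Host ns "/" maps to "/<role>"; "/container" maps to "/container/<role>". */
	if (strcmp(master_ns, "/") == 0)
		n = snprintf(out, len, "/%s", role);
	else
		n = snprintf(out, len, "%s/%s", master_ns, role);

	/* Report encoding errors and truncation to the caller. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With this model, moving the master from "/" to "/container" simply means
recomputing the slave paths from the new master path, so the slaves follow the
master while staying in their own hidden netns.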

-Liran

> 
> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:51, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
 
 
> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
 2) It brings non-intuitive customer experience. For example, a 
 customer may attempt to analyse connectivity issue by checking the 
 connectivity
 on a net-failover slave (e.g. the VF) but will see no connectivity 
 when in-fact checking the connectivity on the net-failover master 
 netdev shows correct connectivity.
 
 The set of changes I vision to fix our issues are:
 1) Hide net-failover slaves in a different netns created and 
 managed by the kernel. But that user can enter to it and manage 
 the netdevs there if wishes to do so explicitly.
 (E.g. Configure the net-failover VF slave in some special way).
 2) Match the virtio-net and the VF based on a PV attribute instead 
 of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
 interface to get PCI slot where the matching VF will be 
 hot-plugged by hypervisor.
 3) Have an explicit virtio-net control message to command 
 hypervisor to switch data-path from virtio-net to VF and 
 vice-versa. Instead of relying on intercepting the PCI master 
 enable-bit
 as an indicator on when VF is about to be set up. (Similar to as 
 done in NetVSC).
 
 Is there any clear issue we see regarding the above suggestion?
 
 -Liran
>>> 
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>> 
>> This is kinda controversial, but maybe separate netns names into 2 
>> groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly. 
>> Hidden and normal netns names can collide as they will be maintained 
>> in different namespaces (Yes I’m overloading the term namespace 
>> here…).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> name?
 
 This is also a good idea that will solve the issue. Yes.
 
> 
>> Does this seems reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.
 
 BTW, from a practical point of view, I think that even until we figure 
 out a solution on how to implement this,
 it was better to create an kernel auto-generated name (e.g. 
 “kernel_net_failover_slaves")
 that will break only userspace workloads that by a very rare-chance 
 have a netns that collides with this then
 the breakage we have today for the various userspace components.
 
 -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we 
>> determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the 
>> VF to be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both 
>> seem reasonable to me and your suggestion is faster to implement from 
>> current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
 
 No. Why would we move the slaves?
>>> 
>>> 
>>> The reason we have 3 device model at all is so users can fine tune the
>>> slaves.
>> 
>> I Agree.
>> 
>>> I don't see why this applies to the root namespace but not
>>> a container. If it has access to failover it should have access
>>> to slaves.
>> 
>> Oh now I see your point. I haven’t thought about the containers usage.
>> My thinking was that customer can always just enter to the “hidden” netns 
>> and configure there whatever he wants.
>> 
>> Do you have a suggestion how to handle this?
>> 
>> One option can be that every "visible" netns on system will have a “hidden” 
>> unnamed netns where the net-failover slaves reside in.
>> If customer wishes to 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>> On Thu, 21 Mar 2019 15:04:37 +0200
>> Liran Alon  wrote:
>> 
 
>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>> to move the slaves too?
>>> 
>>> No. Why would we move the slaves? The whole point is to make most customer 
>>> ignore the net-failover slaves and remain them “hidden” in their dedicated 
>>> netns.
>>> We won’t prevent customer from explicitly moving the net-failover slaves 
>>> out of this netns, but we will not move them out of there automatically.
>> 
>> 
>> The 2-device netvsc already handles case where master changes namespace.
> 
> Is it by moving slave with it?

See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device").
It seems that when the NetVSC master netdev changes netns, the NetVSC driver
moves the VF to the same netns.
That is kind of the opposite of what we are suggesting here, which is to make
sure that the net-failover master netdev is in a separate netns from its
slaves...

-Liran

> 
> -- 
> MST


Re: [PATCH net v2] failover: allow name change on IFF_UP slave interfaces

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 16:04, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 06, 2019 at 10:08:32PM -0500, Si-Wei Liu wrote:
>> When a netdev appears through hot plug then gets enslaved by a failover
>> master that is already up and running, the slave will be opened
>> right away after getting enslaved. Today there's a race that userspace
>> (udev) may fail to rename the slave if the kernel (net_failover)
>> opens the slave earlier than when the userspace rename happens.
>> Unlike bond or team, the primary slave of failover can't be renamed by
>> userspace ahead of time, since the kernel initiated auto-enslavement is
>> unable to, or rather, is never meant to be synchronized with the rename
>> request from userspace.
>> 
>> As the failover slave interfaces are not designed to be operated
>> directly by userspace apps: IP configuration, filter rules with
>> regard to network traffic passing and etc., should all be done on master
>> interface. In general, userspace apps only care about the
>> name of master interface, while slave names are less important as long
>> as admin users can see reliable names that may carry
>> other information describing the netdev. For e.g., they can infer that
>> "ens3nsby" is a standby slave of "ens3", while for a
>> name like "eth0" they can't tell which master it belongs to.
>> 
>> Historically the name of IFF_UP interface can't be changed because
>> there might be admin script or management software that is already
>> relying on such behavior and assumes that the slave name can't be
>> changed once UP. But failover is special: with the in-kernel
>> auto-enslavement mechanism, the userspace expectation for device
>> enumeration and bring-up order is already broken. Previously initramfs
>> and various userspace config tools were modified to bypass failover
>> slaves because of auto-enslavement and duplicate MAC address. Similarly,
>> in case that users care about seeing reliable slave name, the new type
>> of failover slaves needs to be taken care of specifically in userspace
>> anyway.
>> 
>> It's less risky to lift up the rename restriction on failover slave
>> which is already UP. Although it's possible this change may potentially
>> break userspace component (most likely configuration scripts or
>> management software) that assumes slave name can't be changed while
>> UP, it's relatively a limited and controllable set among all userspace
>> components, which can be fixed specifically to work with the new naming
>> behavior of failover slaves. Userspace component interacting with
>> slaves should be changed to operate on failover master instead, as the
>> failover slave is dynamic in nature which may come and go at any point.
>> The goal is to make the role of failover slaves less relevant, and
>> all userspace should only deal with master in the long run.
>> 
>> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
>> Signed-off-by: Si-Wei Liu 
>> Reviewed-by: Liran Alon 
>> Acked-by: Michael S. Tsirkin 
> 
> I worry that userspace might have made a bunch of assumptions
> that names never change as long as interface is up.
> So listening for up events ensures that interface
> is not renamed.

That’s true. This is exactly what is described in the 3rd paragraph of the
commit message.
However, as the commit message claims, net-failover slaves can be treated
specially because userspace already mishandles them and needs to be modified
to treat those slaves specially anyway. Therefore, it’s less risky to lift
the rename restriction on a failover slave which is already UP.
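
The check being argued about can be modelled roughly as follows. This is a
self-contained userspace sketch, not the code in net/core/dev.c: struct
netdev_model and rename_allowed() are hypothetical names, and the flag values
only stand in for the kernel's IFF_UP and for the IFF_SLAVE_RENAME_OK priv flag
that the patch introduces.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in values for illustration; the kernel headers define the real ones. */
#define IFF_UP			(1u << 0)
#define IFF_SLAVE_RENAME_OK	(1u << 29)

struct netdev_model {
	unsigned int flags;		/* models dev->flags */
	unsigned int priv_flags;	/* models dev->priv_flags */
};

/* Returns 0 if a rename may proceed, -EBUSY otherwise. */
static int rename_allowed(const struct netdev_model *dev)
{
	assert(dev);

	/* A running interface may be renamed only if it is explicitly exempted. */
	if ((dev->flags & IFF_UP) && !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
		return -EBUSY;

	return 0;
}
```

In this model an ordinary interface that is UP still gets -EBUSY, so only
devices that the failover code marks with the new flag see a behaviour change.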

> 
> How about sending down and up events around such renames?

You mean that dev_change_name() will behave as proposed in this patch, but will
additionally send fake DOWN and UP uevents to userspace?

-Liran

> 
> 
> 
>> ---
>> v1 -> v2:
>> - Drop configurable module parameter (Sridhar)
>> 
>> 
>> include/linux/netdevice.h | 3 +++
>> net/core/dev.c| 3 ++-
>> net/core/failover.c   | 6 +++---
>> 3 files changed, 8 insertions(+), 4 deletions(-)
>> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 857f8ab..6d9e4e0 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1487,6 +1487,7 @@ struct net_device_ops {
>>  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>>  * @IFF_FAILOVER: device is a failover master device
>>  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>> + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
>>  */
>> enum netdev_priv_flags {
>>  IFF_802_1Q_VLAN = 1<<0,
>> @@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
>>  IFF_NO_RX_HANDLER   = 1<<26,
>>  IFF_FAILOVER= 1<<27,
>>  IFF_FAILOVER_SLAVE  = 1<<28,
>> +IFF_SLAVE_RENAME_OK = 1<<29,
>> };
>> 
>> #define IFF_802_1Q_VLAN  IFF_802_1Q_VLAN
>> @@ -1548,6 +1550,7 @@ enum 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
 
 
> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>> 2) It brings non-intuitive customer experience. For example, a 
>> customer may attempt to analyse connectivity issue by checking the 
>> connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity 
>> when in-fact checking the connectivity on the net-failover master 
>> netdev shows correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed 
>> by the kernel. But that user can enter to it and manage the netdevs 
>> there if wishes to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead 
>> of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
>> interface to get PCI slot where the matching VF will be hot-plugged 
>> by hypervisor.
>> 3) Have an explicit virtio-net control message to command hypervisor 
>> to switch data-path from virtio-net to VF and vice-versa. Instead of 
>> relying on intercepting the PCI master enable-bit
>> as an indicator on when VF is about to be set up. (Similar to as 
>> done in NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?
 
 This is kinda controversial, but maybe separate netns names into 2 
 groups: hidden and normal.
 To reference a hidden netns, you need to do it explicitly. 
 Hidden and normal netns names can collide as they will be maintained 
 in different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
 Does this seems reasonable?
 
 -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure 
>> out a solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. 
>> “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have 
>> a netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?
 
 That’s one reasonable option.
 Another one is that we will indeed change the mechanism by which we 
 determine a VF should be bonded with a virtio-net device.
 i.e. Expose a new virtio-net property that specify the PCI slot of the VF 
 to be bonded with.
 
 The second seems cleaner but I don’t have a strong opinion on this. Both 
 seem reasonable to me and your suggestion is faster to implement from 
 current state of things.
 
 -Liran
>>> 
>>> OK. Now what happens if master is moved to another namespace? Do we need
>>> to move the slaves too?
>> 
>> No. Why would we move the slaves?
> 
> 
> The reason we have 3 device model at all is so users can fine tune the
> slaves.

I agree.

> I don't see why this applies to the root namespace but not
> a container. If it has access to failover it should have access
> to slaves.

Oh, now I see your point. I hadn’t thought about the container use case.
My thinking was that the customer can always just enter the “hidden” netns and
configure whatever he wants there.

Do you have a suggestion for how to handle this?

One option is that every “visible” netns on the system will have a “hidden”
unnamed netns in which the net-failover slaves reside.
If a customer wishes to enter that netns and manage the net-failover slaves
explicitly, he will need an updated iproute2 that knows how to enter that
hidden netns. Most customers will never need to enter that netns, so it is OK
if they don’t have this updated iproute2.

> 
>> The whole point is to make most customer ignore the net-failover slaves and 
>> remain them “hidden” in 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>> 2) It brings non-intuitive customer experience. For example, a
>>>>>>>> customer may attempt to analyse connectivity issue by checking the
>>>>>>>> connectivity
>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity
>>>>>>>> when in-fact checking the connectivity on the net-failover master
>>>>>>>> netdev shows correct connectivity.
>>>>>>>> 
>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed
>>>>>>>> by the kernel. But that user can enter to it and manage the netdevs
>>>>>>>> there if wishes to do so explicitly.
>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of
>>>>>>>> MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net
>>>>>>>> interface to get PCI slot where the matching VF will be hot-plugged by
>>>>>>>> hypervisor.
>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor
>>>>>>>> to switch data-path from virtio-net to VF and vice-versa. Instead of
>>>>>>>> relying on intercepting the PCI master enable-bit
>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done
>>>>>>>> in NetVSC).
>>>>>>>> 
>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>> created by users?
>>>>>> 
>>>>>> This is kinda controversial, but maybe separate netns names into 2
>>>>>> groups: hidden and normal.
>>>>>> To reference a hidden netns, you need to do it explicitly.
>>>>>> Hidden and normal netns names can collide as they will be maintained in
>>>>>> different namespaces (Yes I’m overloading the term namespace here…).
>>>>> 
>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>> 
>>>> This is also a good idea that will solve the issue. Yes.
>>>> 
>>>>> 
>>>>>> Does this seems reasonable?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>> missed a trick or two.
>>>> 
>>>> BTW, from a practical point of view, I think that even until we figure out
>>>> a solution on how to implement this,
>>>> it was better to create an kernel auto-generated name (e.g.
>>>> “kernel_net_failover_slaves")
>>>> that will break only userspace workloads that by a very rare-chance have a
>>>> netns that collides with this then
>>>> the breakage we have today for the various userspace components.
>>>> 
>>>> -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we 
>> determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to 
>> be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both 
>> seem reasonable to me and your suggestion is faster to implement from 
>> current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?

No. Why would we move the slaves? The whole point is to make most customers
ignore the net-failover slaves and keep them “hidden” in their dedicated
netns.
We won’t prevent a customer from explicitly moving the net-failover slaves out
of this netns, but we will not move them out of there automatically.

> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace…

I’m not sure actually. Isn't udev/systemd netns-aware?
I would expect it to be able to name netdevs in a netns other than the default
netns as well.
If that’s the case, Si-Wei's patch, which allows renaming a net-failover slave
when it is already open, is still required, as the race condition still exists.

-Liran

> 
>>> 
>>> -- 
>>> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>> 2) It brings non-intuitive customer experience. For example, a customer
>>>>>> may attempt to analyse connectivity issue by checking the connectivity
>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when
>>>>>> in-fact checking the connectivity on the net-failover master netdev
>>>>>> shows correct connectivity.
>>>>>> 
>>>>>> The set of changes I vision to fix our issues are:
>>>>>> 1) Hide net-failover slaves in a different netns created and managed by
>>>>>> the kernel. But that user can enter to it and manage the netdevs there
>>>>>> if wishes to do so explicitly.
>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of
>>>>>> MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface
>>>>>> to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to
>>>>>> switch data-path from virtio-net to VF and vice-versa. Instead of
>>>>>> relying on intercepting the PCI master enable-bit
>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in
>>>>>> NetVSC).
>>>>>> 
>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>> created by users?
>>>> 
>>>> This is kinda controversial, but maybe separate netns names into 2 groups:
>>>> hidden and normal.
>>>> To reference a hidden netns, you need to do it explicitly.
>>>> Hidden and normal netns names can collide as they will be maintained in
>>>> different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
>>>> Does this seems reasonable?
>>>> 
>>>> -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure out a 
>> solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. 
>> “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have a 
>> netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?

That’s one reasonable option.
Another one is that we will indeed change the mechanism by which we determine
that a VF should be bonded with a virtio-net device,
i.e. expose a new virtio-net property that specifies the PCI slot of the VF to
be bonded with.

The second seems cleaner, but I don’t have a strong opinion on this. Both seem
reasonable to me, and your suggestion is faster to implement from the current
state of things.
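
As a rough illustration of the difference between the two pairing rules, here
is a small userspace model in C. Nothing here is an existing virtio-net or
kernel interface: the structs, the vf_pci_slot field and both match functions
are hypothetical and only show how slot-based pairing would stop depending on
the MAC address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct vf_model {
	uint8_t mac[6];
	uint16_t pci_slot;	/* hypothetical: slot the VF was hot-plugged into */
};

struct virtio_net_model {
	uint8_t mac[6];
	uint16_t vf_pci_slot;	/* hypothetical PV attribute exposed by the device */
};

/* Current rule: pair the devices when their MAC addresses are equal. */
static bool match_by_mac(const struct virtio_net_model *vn,
			 const struct vf_model *vf)
{
	assert(vn && vf);
	return memcmp(vn->mac, vf->mac, sizeof(vn->mac)) == 0;
}

/* Proposed rule: pair on the advertised PCI slot and ignore the MAC. */
static bool match_by_slot(const struct virtio_net_model *vn,
			  const struct vf_model *vf)
{
	assert(vn && vf);
	return vn->vf_pci_slot == vf->pci_slot;
}
```

Under the slot rule, a VF no longer needs to share the virtio-net MAC in order
to be paired, and an unrelated device that happens to carry the same MAC is not
picked up by mistake.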

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
 
 
>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>> Liran Alon  wrote:
>>>>>>>> 
>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it
>>>>>>>>> attempts to configure all available netdevs. It should avoid however
>>>>>>>>> doing so on net-failover slaves.
>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to
>>>>>>>>> blacklist a netdev from being configured in case it is owned by a
>>>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver.
>>>>>>>>> However, this technique doesn’t work for the net-failover mechanism
>>>>>>>>> because both the net-failover netdev and the virtio-net netdev are
>>>>>>>>> owned by the virtio-net PCI driver).
>>>>>>>> 
>>>>>>>> Cloud-init should really just ignore all devices that have a master
>>>>>>>> device.
>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>> 
>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>> down? And add a special attribute for the actual link state.
>>>>>> 
>>>>>> I think this may be problematic as it would also break legit use case
>>>>>> of userspace attempt to set various config on VF slave.
>>>>>> In general, lying to userspace usually leads to problems.
>>>>> 
>>>>> I hear you on this. So how about instead of lying,
>>>>> we basically just fail some accesses to slaves
>>>>> unless a flag is set e.g. in ethtool.
>>>>> 
>>>>> Some userspace will need to change to set it but in a minor way.
>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>> failure.
>>>> 
>>>> Once userspace will set this new flag by ethtool, all operations done by
>>>> other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each 
>>> ethtool operation.
>> 
>> Oh. I have indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>> 
>>>> E.g. Running dhclient without parameters, after this flag was set, will
>>>> still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish that somehow kernel will prevent Tx on net-failover 
>> slave netdev
>> unless skb is marked with some flag to indicate it has been sent via the 
>> net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues,
and it will still allow checking connectivity. So it is better.
However, I still think it provides a non-intuitive customer experience.
In addition, I also want to take into account that most customers expect a
1:1 mapping between a vNIC and a netdev,
i.e. a cloud instance should show one netdev if it has one vNIC attached to it.
Customers usually don’t care how they get accelerated networking. They just
care that they do.

> 
>> This indeed resolves the group of userspace issues around performing DHCP on 
>> net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>> 
>> However, I see a couple of down-sides to it:
>> 1) It doesn’t resolve all userspace issues listed in this email thread. For 
>> example, cloud-init will still attempt to perform network config on 
>> net-failover slaves.
>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev 
>> rules that match only by MAC.
> 
> 
> How about we fail to retrieve mac from the slave?

That would work but I think it is cleaner to just not bind PV and VF based on 
having the same MAC.

> 
>> 2) It brings non-intuitive customer experience. For example, a customer may 
>> attempt to analyse connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity when 
>> in-fact checking the connectivity on the net-failover master netdev shows 
>> correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed by the 
>> kernel. But that user can enter to it and manage the netdevs there if wishes 
>> to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. 
>> (Similar to as done in NetVSC). E.g. 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 10:58, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
 
 
> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
 
 
> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid 
>>> however doing so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a 
>>> specific PCI driver. Specifically, they blacklist Mellanox VF 
>>> driver. However, this technique doesn’t work for the net-failover 
>>> mechanism because both the net-failover netdev and the virtio-net 
>>> netdev are owned by the virtio-net PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master 
>> device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.
 
 I think this may be problematic as it would also break legit use case
 of userspace attempt to set various config on VF slave.
 In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by 
>> other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each 
> ethtool operation.
 
 Oh. I have indeed misunderstood your previous email then. :)
 Thanks for clarifying.
 
> 
>> E.g. Running dhclient without parameters, after this flag was set, will 
>> still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.
 
 You mean that you wish that somehow kernel will prevent Tx on net-failover 
 slave netdev
 unless skb is marked with some flag to indicate it has been sent via the 
 net-failover master?
>>> 
>>> We can maybe avoid binding a protocol socket to the device?
>> 
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And will still allow checking connectivity. So it is better.
>> However, I still think it provides an non-intuitive customer experience.
>> In addition, I also want to take into account that most customers are 
>> expected a 1:1 mapping between a vNIC and a netdev.
>> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it 
>> defined.
>> Customers usually don’t care how they get accelerated networking. They just 
>> care they do.
>> 
>>> 
>>>> This indeed resolves the group of userspace issues around performing DHCP 
>>>> on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>>>> 
>>>> However, I see a couple of down-sides to it:
>>>> 1) It doesn’t resolve all userspace issues listed in this email thread. 
>>>> For example, cloud-init will still attempt to perform network config on 
>>>> net-failover slaves.
>>>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates 
>>>> udev rules that match only by MAC.
>>> 
>>> 
>>> How about we fail to retrieve mac from the slave?
>> 
>> That would work but I think it is cleaner to just not bind PV and VF based 
>> on having the same MAC.
> 
> There's a reference to that under "Non-MAC based pairing".
> 
> I'll look into making it more explicit.

Yes I know. I was referring to what you described in that section.

> 
>>> 
>>>> 2) It brings non-intuitive customer experience. For example, a customer 
>>>> may attempt to analyse connectivity issue by checking the connectivity
>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>> Liran Alon  wrote:
>>>> 
>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>> attempts to configure all available netdevs. It should avoid however 
>>>>> doing so on net-failover slaves.
>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>>>> However, this technique doesn’t work for the net-failover mechanism 
>>>>> because both the net-failover netdev and the virtio-net netdev are owned 
>>>>> by the virtio-net PCI driver).
>>>> 
>>>> Cloud-init should really just ignore all devices that have a master device.
>>>> That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break legit use case
>> of userspace attempt to set various config on VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.

Once userspace sets this new flag via ethtool, all operations done by other 
userspace components will still work.
E.g. running dhclient without parameters after this flag was set will still 
attempt to perform DHCP on the slave, and will now succeed.

Therefore, this proposal effectively just delays the point at which the 
net-failover slave can be operated on by userspace.
But what we actually want is to never allow a net-failover slave to be operated 
on by userspace unless userspace explicitly states that it wishes to perform 
a set of actions on the net-failover slave.

That would be achieved if, for example, the net-failover slaves were in a 
netns other than the default netns.
This also aligns with the expected customer experience: most customers just 
want to see a 1:1 mapping between a vNIC and a visible netdev.
But of course there may be other ideas that achieve similar behaviour.

-Liran

> 
> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> More?
> 
>> If we reach
>> to a scenario where we try to avoid userspace issues generically and
>> not on a userspace component basis, I believe the right path should be
>> to hide the net-failover slaves such that explicit action is required
>> to actually manipulate them (As described in blog-post). E.g.
>> Automatically move net-failover slaves by kernel to a different netns.
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid however doing 
>>> so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a specific 
>>> PCI driver. Specifically, they blacklist Mellanox VF driver. However, this 
>>> technique doesn’t work for the net-failover mechanism because both the 
>>> net-failover netdev and the virtio-net netdev are owned by the virtio-net 
>>> PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.

I think this may be problematic, as it would also break the legitimate use case 
of userspace attempting to set various config on the VF slave.
In general, lying to userspace usually leads to problems. If we reach a 
scenario where we try to avoid userspace issues generically rather than 
per userspace component, I believe the right path is to hide the 
net-failover slaves such that explicit action is required 
to actually manipulate them (as described in the blog post). E.g. have the 
kernel automatically move net-failover slaves to a different netns.

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
 
 
>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>> Liran Alon  wrote:
>>>>>> 
>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>>>> attempts to configure all available netdevs. It should avoid however 
>>>>>>> doing so on net-failover slaves.
>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>>>>>> However, this technique doesn’t work for the net-failover mechanism 
>>>>>>> because both the net-failover netdev and the virtio-net netdev are 
>>>>>>> owned by the virtio-net PCI driver).
>>>>>> 
>>>>>> Cloud-init should really just ignore all devices that have a master 
>>>>>> device.
>>>>>> That would have been more general, and safer for other use cases.
>>>>> 
>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>> down? And add a special attribute for the actual link state.
>>>> 
>>>> I think this may be problematic as it would also break legit use case
>>>> of userspace attempt to set various config on VF slave.
>>>> In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by 
>> other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each 
> ethtool operation.

Oh. I have indeed misunderstood your previous email then. :)
Thanks for clarifying.

> 
>> E.g. Running dhclient without parameters, after this flag was set, will 
>> still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.

You mean that you want the kernel to somehow prevent Tx on a net-failover 
slave netdev unless the skb is marked with some flag indicating it was sent 
via the net-failover master?

This indeed resolves the group of userspace issues around performing DHCP 
directly on net-failover slaves (by dracut/initramfs, dhclient, etc.).

However, I see a couple of downsides to it:
1) It doesn’t resolve all the userspace issues listed in this email thread. For 
example, cloud-init will still attempt to perform network config on 
net-failover slaves.
It also doesn’t help with Ubuntu’s netplan issue, where netplan creates udev 
rules that match only by MAC.
2) It brings a non-intuitive customer experience. For example, a customer may 
attempt to analyse a connectivity issue by checking the connectivity 
on a net-failover slave (e.g. the VF) and see no connectivity, when in fact 
checking the connectivity on the net-failover master netdev shows correct 
connectivity.

The set of changes I envision to fix our issues is:
1) Hide net-failover slaves in a different netns created and managed by the 
kernel. A user can still enter it and manage the netdevs there if they wish to 
do so explicitly 
(e.g. to configure the net-failover VF slave in some special way).
2) Match the virtio-net and the VF based on a PV attribute instead of MAC 
(similar to what is done in NetVSC). E.g. provide a virtio-net interface to 
get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
3) Have an explicit virtio-net control message to command the hypervisor to 
switch the data-path from virtio-net to VF and vice-versa, instead of relying 
on intercepting the PCI master enable-bit 
as an indicator of when the VF is about to be set up (similar to what is done 
in NetVSC).
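
To make point 2 a bit more concrete, here is a minimal sketch of why pairing 
by MAC is ambiguous and how a slot-based attribute would single out the VF. 
Everything in it is hypothetical: struct model_netdev, count_by_mac() and 
find_vf_by_slot() are illustration-only names and do not correspond to any 
existing kernel or virtio-net interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical guest-side view of a netdev; not a real kernel structure. */
struct model_netdev {
    const char *name;
    unsigned char mac[6];
    int pci_slot;   /* slot the device is plugged into, -1 if not PCI */
    int is_vf;      /* non-zero for the passthrough VF */
};

/* Count how many netdevs carry the given MAC address. */
size_t count_by_mac(const struct model_netdev *devs, size_t n,
                    const unsigned char mac[6])
{
    size_t i, hits = 0;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (memcmp(devs[i].mac, mac, 6) == 0)
            hits++;
    return hits;
}

/*
 * Return the VF sitting in the slot advertised by the (hypothetical)
 * virtio-net attribute, or NULL if no VF is plugged into that slot.
 */
const struct model_netdev *find_vf_by_slot(const struct model_netdev *devs,
                                           size_t n, int slot)
{
    size_t i;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (devs[i].is_vf && devs[i].pci_slot == slot)
            return &devs[i];
    return NULL;
}
```

With the failover master, the standby virtio-net and the VF all sharing one 
MAC, a MAC match returns three candidates, while the slot lookup returns at 
most one device.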

Is there any clear issue we see regarding the above suggestion?

-Liran

> 
>> Therefore, this proposal just effectively delays when the net-failover slave 
>> can be operated on by userspace.
>> But what we actually want is to never allow a net-failover slave to be 
>> operated by userspace unless it is explicitly stated
>> by userspace that it wishes to perform a set of actions on the net-failover 
>> slave.
>> 
>> Something that was achieved if, for example, the net-failover slaves were in 
>> a different netns than default netns.
>> This also aligns with expected customer experience that most customers just 
>> want to see a 1:1 mapping between a vNIC and a visible netdev.
>> But of 

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> 
> On 2019/3/8 上午3:16, Andrea Arcangeli wrote:
> > On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > > > + .invalidate_range = vhost_invalidate_range,
> > > > > +};
> > > > > +
> > > > >   void vhost_dev_init(struct vhost_dev *dev,
> > > > >   struct vhost_virtqueue **vqs, int nvqs, int 
> > > > > iov_limit)
> > > > >   {
> > > > I also wonder here: when page is write protected then
> > > > it does not look like .invalidate_range is invoked.
> > > > 
> > > > E.g. mm/ksm.c calls
> > > > 
> > > > mmu_notifier_invalidate_range_start and
> > > > mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.
> > > > 
> > > > Similarly, rmap in page_mkclean_one will not call
> > > > mmu_notifier_invalidate_range.
> > > > 
> > > > If I'm right vhost won't get notified when page is write-protected 
> > > > since you
> > > > didn't install start/end notifiers. Note that end notifier can be called
> > > > with page locked, so it's not as straight-forward as just adding a call.
> > > > Writing into a write-protected page isn't a good idea.
> > > > 
> > > > Note that documentation says:
> > > > it is fine to delay the mmu_notifier_invalidate_range
> > > > call to mmu_notifier_invalidate_range_end() outside the page 
> > > > table lock.
> > > > implying it's called just later.
> > > OK I missed the fact that _end actually calls
> > > mmu_notifier_invalidate_range internally. So that part is fine but the
> > > fact that you are trying to take page lock under VQ mutex and take same
> > > mutex within notifier probably means it's broken for ksm and rmap at
> > > least since these call invalidate with lock taken.
> > Yes this lock inversion needs more thoughts.
> > 
> > > And generally, Andrea told me offline one can not take mutex under
> > > the notifier callback. I CC'd Andrea for why.
> > Yes, the problem then is the ->invalidate_page is called then under PT
> > lock so it cannot take mutex, you also cannot take the page_lock, it
> > can at most take a spinlock or trylock_page.
> > 
> > So it must switch back to the _start/_end methods unless you rewrite
> > the locking.
> > 
> > The difference with _start/_end, is that ->invalidate_range avoids the
> > _start callback basically, but to avoid the _start callback safely, it
> > has to be called in between the ptep_clear_flush and the set_pte_at
> > whenever the pfn changes like during a COW. So it cannot be coalesced
> > in a single TLB flush that invalidates all sptes in a range like we
> > prefer for performance reasons for example in KVM. It also cannot
> > sleep.
> > 
> > In short ->invalidate_range must be really fast (it shouldn't require
> > to send IPI to all other CPUs like KVM may require during an
> > invalidate_range_start) and it must not sleep, in order to prefer it
> > to _start/_end.
> > 
> > I.e. the invalidate of the secondary MMU that walks the linux
> > pagetables in hardware (in vhost case with GUP in software) has to
> > happen while the linux pagetable is zero, otherwise a concurrent
> > hardware pagetable lookup could re-instantiate a mapping to the old
> > page in between the set_pte_at and the invalidate_range_end (which
> > internally calls ->invalidate_range). Jerome documented it nicely in
> > Documentation/vm/mmu_notifier.rst .
> 
> 
> Right, I've actually gone through this several times but some details were
> missed by me obviously.
> 
> 
> > 
> > Now you don't really walk the pagetable in hardware in vhost, but if
> > you use gup_fast after usemm() it's similar.
> > 
> > For vhost the invalidate would be really fast, there are no IPI to
> > deliver at all, the problem is just the mutex.
> 
> 
> Yes. A possible solution is to introduce a valid flag for VA. Vhost may only
> try to access kernel VA when it was valid. Invalidate_range_start() will
> clear this under the protection of the vq mutex when it can block. Then
> invalidate_range_end() then can clear this flag. An issue is blockable is 
> always false for range_end().
> 

Note that there can be multiple asynchronous concurrent invalidate_range
callbacks. So a flag does not work, but a counter of the number of active
invalidations would. See how KVM is doing it, for instance, in kvm_main.c

The pattern for this kind of thing is:
my_invalidate_range_start(start,end) {
    ...
    if (mystruct_overlap(mystruct, start, end)) {
        mystruct_lock();
        mystruct->invalidate_count++;
        ...
        mystruct_unlock();
    }
}

my_invalidate_range_end(start,end) {
    ...
    if (mystruct_overlap(mystruct, start, end)) {
        mystruct_lock();
        mystruct->invalidate_count--;
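
For reference, the counting pattern above can be modelled in plain userspace 
C. This is only a sketch under stated assumptions: struct vq_map and all the 
function names are hypothetical, an atomic_flag spin loop stands in for 
whatever lock the driver would really use, and no actual mmu notifier API is 
involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver structure ("mystruct" above). */
struct vq_map {
    unsigned long start, end;   /* tracked address range, [start, end) */
    atomic_flag lock;           /* stand-in for a spinlock */
    int invalidate_count;       /* invalidations currently in flight */
};

void vq_map_init(struct vq_map *m, unsigned long start, unsigned long end)
{
    m->start = start;
    m->end = end;
    atomic_flag_clear(&m->lock);
    m->invalidate_count = 0;
}

void vq_map_lock(struct vq_map *m)
{
    while (atomic_flag_test_and_set(&m->lock))
        ;   /* spin until the flag is released */
}

void vq_map_unlock(struct vq_map *m)
{
    atomic_flag_clear(&m->lock);
}

bool vq_map_overlap(const struct vq_map *m, unsigned long start,
                    unsigned long end)
{
    return start < m->end && end > m->start;
}

/* Model of the _start callback: one more invalidation is in flight. */
void model_invalidate_range_start(struct vq_map *m, unsigned long start,
                                  unsigned long end)
{
    if (!vq_map_overlap(m, start, end))
        return;
    vq_map_lock(m);
    m->invalidate_count++;
    vq_map_unlock(m);
}

/* Model of the _end callback: one invalidation has finished. */
void model_invalidate_range_end(struct vq_map *m, unsigned long start,
                                unsigned long end)
{
    if (!vq_map_overlap(m, start, end))
        return;
    vq_map_lock(m);
    assert(m->invalidate_count > 0);
    m->invalidate_count--;
    vq_map_unlock(m);
}

/* The mapping may only be used while no invalidation is in flight. */
bool vq_map_usable(struct vq_map *m)
{
    bool ok;

    vq_map_lock(m);
    ok = (m->invalidate_count == 0);
    vq_map_unlock(m);
    return ok;
}
```

With two overlapping invalidations in flight, the mapping stays unusable 
after the first _end and only becomes usable after the second one, which is 
the case a single boolean flag gets wrong.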
  

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon
Hi Michael,

Great blog post which summarises everything very well!

Some comments I have:

1) I think that when we use the term “1-netdev model” in community 
discussions, we tend to refer to what you have defined in the blog post as 
"3-device model with hidden slaves”.
Therefore, I would suggest just removing the “1-netdev model” section and 
renaming the "3-device model with hidden slaves” section to “1-netdev model”.

2) The userspace issues result both from using the “2-netdev model” and the 
“3-netdev model”. However, the blog post describes them as if they only exist 
in the “3-netdev model”.
The reason these issues are not seen in the Azure environment is that they 
were partially handled by Microsoft for their specific 2-netdev model.
Which leads me to the next comment.

3) I suggest that the blog post also elaborate on what exactly the 
userspace issues are that arise in models other than the “1-netdev model”.
The issues that I’m aware of are (please tell me if you are aware of others!):
(a) udev rename race condition: When the net-failover device is opened, it 
also opens its slaves. However, the order of KOBJ_ADD events to udev is first 
for the net-failover netdev and only then for the virtio-net netdev. This 
means that if userspace responds to the first event by opening the 
net-failover netdev, then any attempt by userspace to rename the virtio-net 
netdev in response to the second event will fail because the virtio-net 
netdev is already opened. Also note that this udev rename rule is useful 
because we would like to add rules that rename the virtio-net netdev to 
clearly signal that it’s used as the standby interface of another 
net-failover netdev.
Microsoft worked around this problem in NetVSC by delaying the open of the 
slave VF relative to the open of the NetVSC netdev. However, this is still a 
race and thus a hacky solution. It was accepted by the community only because 
it’s internal to the NetVSC driver. A similar solution was rejected by the 
community for the net-failover driver.
The solution that we currently propose to address this (patch by Si-Wei) is 
to change the kernel rename handling to allow a net-failover slave to be 
renamed even if it is already opened. The patch has not been accepted yet.
(b) Issues caused by various userspace components performing DHCP on the 
net-failover slaves: DHCP should of course only be done on the net-failover 
netdev. Attempting DHCP on the net-failover slaves as well will cause 
networking issues. Therefore, userspace components should be taught to avoid 
doing DHCP on the net-failover slaves. The various userspace components 
include:
b.1) dhclient: If run without parameters, by default it just enumerates all 
netdevs and attempts DHCP on them all.
(I don’t think Microsoft has handled this.)
b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, 
these components need networking and therefore perform DHCP on all netdevs.
(Microsoft hasn’t handled (b.2) because they don’t have images which perform 
iSCSI boot in their Azure setup. Still an open issue.)
b.3) cloud-init: If configured to perform network configuration, it attempts 
to configure all available netdevs. However, it should avoid doing so on 
net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a 
netdev from being configured if it is owned by a specific PCI driver. 
Specifically, they blacklist the Mellanox VF driver. However, this technique 
doesn’t work for the net-failover mechanism because both the net-failover 
netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)
b.4) Various distros’ network managers may need to be updated to avoid DHCP 
on net-failover slaves? (Not sure. Asking...)

4) Another interesting use case where the net-failover mechanism is useful is 
handling NIC firmware failures or NIC firmware live upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC, which 
loses all the NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from the 
guests running on the host (which makes all guests use the virtio-net netdev, 
which is backed by a netdev that eventually sits on top of the PF). Networking 
will then be restored to the guests once the PCIe reset is completed and the 
PF is functional again. To re-accelerate the guests’ network, the hypervisor 
can just hot-plug new VFs to the guests.

P.S.:
I would very much appreciate this forum’s help in closing the pending items 
listed in (3), which currently prevent using this net-failover mechanism in 
real production use cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin  wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 19 Mar 2019, at 23:06, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>> 
>> Great blog-post which summarise everything very well!
>> 
>> Some comments I have:
> 
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
> 
>> 1) I think that when we are using the term “1-netdev model” on community 
>> discussion, we tend to refer to what you have defined in blog-post as 
>> "3-device model with hidden slaves”.
>> Therefore, I would suggest to just remove the “1-netdev model” section and 
>> rename the "3-device model with hidden slaves” section to “1-netdev model”.
>> 
>> 2) The userspace issues result both from using “2-netdev model” and 
>> “3-netdev model”. However, they are described in blog-post as they only 
>> exist on “3-netdev model”.
>> The reason these issues are not seen in Azure environment is because these 
>> issues were partially handled by Microsoft for their specific 2-netdev model.
>> Which leads me to the next comment.
>> 
>> 3) I suggest that blog-post will also elaborate on what exactly are the 
>> userspace issues which results in models different than “1-netdev model”.
>> The issues that I’m aware of are (Please tell me if you are aware of 
>> others!):
>> (a) udev rename race-condition: When net-failover device is opened, it also 
>> opens it's slaves. However, the order of events to udev on KOBJ_ADD is first 
>> for the net-failover netdev and only then for the virtio-net netdev. This 
>> means that if userspace will respond to first event by open the 
>> net-failover, then any attempt of userspace to rename virtio-net netdev as a 
>> response to the second event will fail because the virtio-net netdev is 
>> already opened. Also note that this udev rename rule is useful because we 
>> would like to add rules that renames virtio-net netdev to clearly signal 
>> that it’s used as the standby interface of another net-failover netdev.
>> The way this problem was workaround by Microsoft in NetVSC is to delay the 
>> open done on slave-VF from the open of the NetVSC netdev. However, this is 
>> still a race and thus a hacky solution. It was accepted by community only 
>> because it’s internal to the NetVSC driver. However, similar solution was 
>> rejected by community for the net-failover driver.
>> The solution that we currently proposed to address this (Patch by Si-Wei) 
>> was to change the rename kernel handling to allow a net-failover slave to be 
>> renamed even if it is already opened. Patch is still not accepted.
>> (b) Issues caused because of various userspace components DHCP the 
>> net-failover slaves: DHCP of course should only be done on the net-failover 
>> netdev. Attempting to DHCP on net-failover slaves as-well will cause 
>> networking issues. Therefore, userspace components should be taught to avoid 
>> doing DHCP on the net-failover slaves. The various userspace components 
>> include:
>> b.1) dhclient: If run without parameters, it by default just enum all 
>> netdevs and attempt to DHCP them all.
>> (I don’t think Microsoft has handled this)
>> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, 
>> these components needs networking and therefore DHCP on all netdevs.
>> (Microsoft haven’t handled (b.2) because they don’t have images which 
>> perform iSCSI boot in their Azure setup. Still an open issue)
>> b.3) cloud-init: If configured to perform network-configuration, it attempts 
>> to configure all available netdevs. It should avoid however doing so on 
>> net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist 
>> a netdev from being configured in case it is owned by a specific PCI driver. 
>> Specifically, they blacklist Mellanox VF driver. However, this technique 
>> doesn’t work for the net-failover mechanism because both the net-failover 
>> netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> b.4) Various distros network-manager need to be updated to avoid DHCP on 
>> net-failover slaves? (Not sure. Asking...)
>> 
>> 4) Another interesting use-case where the net-failover mechanism is useful 
>> is for handling NIC firmware failures or NIC firmware Live-Upgrade.
>> In both cases, there is a need to perform a full PCIe reset of the NIC. 
>> Which lose all the NIC eSwitch configuration of the various VFs.
> 
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?

Let me attempt to clarify.

First, let’s analyse what a cloud provider can do when it wishes to upgrade 
the NIC firmware while there are running guests utilising SR-IOV.
It can perform the following operations in order:
1) Hot-unplug all VFs from all running guests.
2) Upgrade the NIC firmware. This results in a PCIe reset, which causes 
momentary network downtime on the PF, but immediately afterwards the PF is set 
up again and guests have network connectivity.
3) 

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Fri, Mar 08, 2019 at 07:56:04AM -0500, Michael S. Tsirkin wrote:
> On Fri, Mar 08, 2019 at 04:58:44PM +0800, Jason Wang wrote:
> > 
> > On 2019/3/8 上午3:17, Jerome Glisse wrote:
> > > On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > > > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > > > > +   .invalidate_range = vhost_invalidate_range,
> > > > > > +};
> > > > > > +
> > > > > >   void vhost_dev_init(struct vhost_dev *dev,
> > > > > > struct vhost_virtqueue **vqs, int nvqs, int 
> > > > > > iov_limit)
> > > > > >   {
> > > > > I also wonder here: when page is write protected then
> > > > > it does not look like .invalidate_range is invoked.
> > > > > 
> > > > > E.g. mm/ksm.c calls
> > > > > 
> > > > > mmu_notifier_invalidate_range_start and
> > > > > mmu_notifier_invalidate_range_end but not 
> > > > > mmu_notifier_invalidate_range.
> > > > > 
> > > > > Similarly, rmap in page_mkclean_one will not call
> > > > > mmu_notifier_invalidate_range.
> > > > > 
> > > > > If I'm right vhost won't get notified when page is write-protected 
> > > > > since you
> > > > > didn't install start/end notifiers. Note that end notifier can be 
> > > > > called
> > > > > with page locked, so it's not as straight-forward as just adding a 
> > > > > call.
> > > > > Writing into a write-protected page isn't a good idea.
> > > > > 
> > > > > Note that documentation says:
> > > > >   it is fine to delay the mmu_notifier_invalidate_range
> > > > >   call to mmu_notifier_invalidate_range_end() outside the page 
> > > > > table lock.
> > > > > implying it's called just later.
> > > > OK I missed the fact that _end actually calls
> > > > mmu_notifier_invalidate_range internally. So that part is fine but the
> > > > fact that you are trying to take page lock under VQ mutex and take same
> > > > mutex within notifier probably means it's broken for ksm and rmap at
> > > > least since these call invalidate with lock taken.
> > > > 
> > > > And generally, Andrea told me offline one can not take mutex under
> > > > the notifier callback. I CC'd Andrea for why.
> > > Correct, you _can not_ take mutex or any sleeping lock from within the
> > > invalidate_range callback as those callback happens under the page table
> > > spinlock. You can however do so under the invalidate_range_start call-
> > > back only if it is a blocking allow callback (there is a flag passdown
> > > with the invalidate_range_start callback if you are not allow to block
> > > then return EBUSY and the invalidation will be aborted).
> > > 
> > > 
> > > > That's a separate issue from set_page_dirty when memory is file backed.
> > > If you can access file back page then i suggest using set_page_dirty
> > > from within a special version of vunmap() so that when you vunmap you
> > > set the page dirty without taking page lock. It is safe to do so
> > > always from within an mmu notifier callback if you had the page map
> > > with write permission which means that the page had write permission
> > > in the userspace pte too and thus it having dirty pte is expected
> > > and calling set_page_dirty on the page is allowed without any lock.
> > > Locking will happen once the userspace pte are tear down through the
> > > page table lock.
> > 
> > 
> > Can I simply can set_page_dirty() before vunmap() in the mmu notifier
> > callback, or is there any reason that it must be called within vumap()?
> > 
> > Thanks
> 
> 
> I think this is what Jerome is saying, yes.
> Maybe add a patch to mmu notifier doc file, documenting this?
> 

Better to do it in vunmap, as you can look at the kernel vmap pte to see if
the dirty bit is set and only call set_page_dirty in that case. But
yes, you can do it outside vunmap, in which case you have to call
set_page_dirty for all pages unless you have some other way to know whether
a page was written to or not.
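
As a rough illustration of the two options, here is a small userspace model. 
It is a sketch only: MODEL_PTE_DIRTY, struct model_page and 
model_unmap_mark_dirty() are hypothetical names, the flag bit is not the real 
architecture bit, and setting page->dirty stands in for set_page_dirty().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTE_DIRTY 0x1u    /* hypothetical flag, not the real arch bit */

/* Stand-in for struct page; "dirty" models the effect of set_page_dirty(). */
struct model_page {
    bool dirty;
};

/*
 * Mark pages dirty while tearing down a mapping of n pages.
 * If have_pte_info is true, pte_flags[i] is consulted and only pages whose
 * mapping has the dirty bit set are marked. Otherwise every page has to be
 * marked, since nothing tells us which ones were written.
 * Returns the number of pages marked by this call.
 */
size_t model_unmap_mark_dirty(const unsigned int *pte_flags,
                              struct model_page *pages, size_t n,
                              bool have_pte_info)
{
    size_t i, marked = 0;

    assert(pages != NULL || n == 0);
    assert(!have_pte_info || pte_flags != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!have_pte_info || (pte_flags[i] & MODEL_PTE_DIRTY)) {
            pages[i].dirty = true;
            marked++;
        }
    }
    return marked;
}
```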

Note that you also need to do that when you tear down the vmap
through the regular path, but with exclusion against the mmu notifier.
So if an mmu notifier is running then you can skip the set_page_dirty;
if none are running and you hold the lock then you can safely call
set_page_dirty.

Cheers,
Jérôme

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Fri, Mar 08, 2019 at 02:48:45PM -0500, Andrea Arcangeli wrote:
> Hello Jason,
> 
> On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> > Just to make sure I understand here. For boosting through huge TLB, do 
> > you mean we can do that in the future (e.g by mapping more userspace 
> > pages to kernel) or it can be done by this series (only about three 4K 
> > pages were vmapped per virtqueue)?
> 
> When I answered about the advantages of mmu notifier and I mentioned
> guaranteed 2m/gigapages where available, I overlooked the detail you
> were using vmap instead of kmap. So with vmap you're actually doing
> the opposite, it slows down the access because it will always use a 4k
> TLB even if QEMU runs on THP or gigapages hugetlbfs.
> 
> If there's just one page (or a few pages) in each vmap there's no need
> of vmap, the linearity vmap provides doesn't pay off in such
> case.
> 
> So likely there's further room for improvement here that you can
> achieve in the current series by just dropping vmap/vunmap.
> 
> You can just use kmap (or kmap_atomic if you're in preemptible
> section, should work from bh/irq).
> 
> In short the mmu notifier to invalidate only sets a "struct page *
> userringpage" pointer to NULL without calls to vunmap.
> 
> In all cases immediately after gup_fast returns you can always call
> put_page immediately (which explains why I'd like an option to drop
> FOLL_GET from gup_fast to speed it up).

By the way, this is on my todo list: I want to merge HMM page snapshotting
with the gup code, which mostly means allowing gup_fast without taking a
reference on the page (so without FOLL_GET). I hope to get to that
sometime before summer.

> 
> Then you can check the sequence_counter and inc/dec counter increased
> by _start/_end. That will tell you if the page you got and you called
> put_page to immediately unpin it or even to free it, cannot go away
> under you until the invalidate is called.
> 
> If sequence counters and counter tells that gup_fast raced with any
> mmu notifier invalidate you can just repeat gup_fast. Otherwise you're
> done, the page cannot go away under you, the host virtual to host
> physical mapping cannot change either. And the page is not pinned
> either. So you can just set the "struct page * userringpage = page"
> where "page" was the one setup by gup_fast.
> 
> When later the invalidate runs, you can just call set_page_dirty if
> gup_fast was called with "write = 1" and then you clear the pointer
> "userringpage = NULL".
> 
> When you need to read/write to the memory
> kmap/kmap_atomic(userringpage) should work.
> 
> In short because there's no hardware involvement here, the established
> mapping is just the pointer to the page, there is no need of setting
> up any pagetables or to do any TLB flushes (except on 32bit archs if
> the page is above the direct mapping but it never happens on 64bit
> archs).

Agreed. vmap is probably overkill if you only have a handful of
pages; kmap will be faster.
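
The lookup-and-recheck scheme quoted above can be sketched as a small single-threaded userspace model. Everything here is hypothetical (the struct and function names are invented for illustration); real code would need proper locking and memory barriers, and model_gup_fast() merely stands in for gup_fast followed by an immediate put_page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded model; real code needs locking and barriers. */
struct model_page { bool dirty; };

struct model_ring {
	unsigned long seq;		/* bumped by every invalidate start */
	int active;			/* +1 on _start, -1 on _end */
	bool write;			/* page was looked up for writing */
	struct model_page *userringpage;
};

void model_invalidate_start(struct model_ring *r)
{
	r->active++;
	r->seq++;
	if (r->userringpage && r->write)
		r->userringpage->dirty = true;	/* set_page_dirty equivalent */
	r->userringpage = NULL;
}

void model_invalidate_end(struct model_ring *r)
{
	r->active--;
}

/* Stand-in for gup_fast followed by an immediate put_page. */
struct model_lookup {
	struct model_page *page;
	bool race_once;	/* simulate one invalidate racing with the lookup */
};

struct model_page *model_gup_fast(struct model_ring *r, struct model_lookup *l)
{
	if (l->race_once) {
		l->race_once = false;
		model_invalidate_start(r);
		model_invalidate_end(r);
	}
	return l->page;
}

/* One attempt; returns false if the caller has to repeat the lookup. */
bool model_try_map(struct model_ring *r, struct model_lookup *l)
{
	unsigned long seq = r->seq;
	struct model_page *page;

	if (r->active)
		return false;
	page = model_gup_fast(r, l);
	if (!page || r->active || r->seq != seq)
		return false;	/* raced with an invalidate */
	r->userringpage = page;
	return true;
}
```

A caller would loop on model_try_map() until it succeeds; after that the only established "mapping" is the pointer itself, which the invalidate path clears.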

Cheers,
Jérôme


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 10:43:12PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 10:40:53PM -0500, Jerome Glisse wrote:
> > On Thu, Mar 07, 2019 at 10:16:00PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 09:55:39PM -0500, Jerome Glisse wrote:
> > > > On Thu, Mar 07, 2019 at 09:21:03PM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Mar 07, 2019 at 02:17:20PM -0500, Jerome Glisse wrote:
> > > > > > > It's because of all these issues that I preferred just accessing
> > > > > > > userspace memory and handling faults. Unfortunately there does not
> > > > > > > appear to exist an API that whitelists a specific driver along 
> > > > > > > the lines
> > > > > > > of "I checked this code for speculative info leaks, don't add 
> > > > > > > barriers
> > > > > > > on data path please".
> > > > > > 
> > > > > > Maybe it would be better to explore adding such helper then 
> > > > > > remapping
> > > > > > page into kernel address space ?
> > > > > 
> > > > > I explored it a bit (see e.g. thread around: "__get_user slower than
> > > > > get_user") and I can tell you it's not trivial given the issue is 
> > > > > around
> > > > > security.  So in practice it does not seem fair to keep a significant
> > > > > optimization out of kernel because *maybe* we can do it differently 
> > > > > even
> > > > > better :)
> > > > 
> > > > Maybe a slightly different approach between this patchset and other
> > > > copy user API would work here. What you want really is something like
> > > > a temporary mlock on a range of memory so that it is safe for the
> > > > kernel to access range of userspace virtual address ie page are
> > > > present and with proper permission hence there can be no page fault
> > > > while you are accessing thing from kernel context.
> > > > 
> > > > So you can have like a range structure and mmu notifier. When you
> > > > lock the range you block mmu notifier to allow your code to work on
> > > > the userspace VA safely. Once you are done you unlock and let the
> > > > mmu notifier go on. It is pretty much exactly this patchset except
> > > > that you remove all the kernel vmap code. A nice thing about that
> > > > is that you do not need to worry about calling set page dirty it
> > > > will already be handle by the userspace VA pte. It also use less
> > > > memory than when you have kernel vmap.
> > > > 
> > > > This idea might be defeated by security feature where the kernel is
> > > > running in its own address space without the userspace address
> > > > space present.
> > > 
> > > Like smap?
> > 
> > Yes like smap but also other newer changes, with similar effect, since
> > the spectre drama.
> > 
> > Cheers,
> > Jérôme
> 
> Sorry do you mean meltdown and kpti?

Yes, all that and similar things. I do not have the full list in my head.

Cheers,
Jérôme


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 09:21:03PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 02:17:20PM -0500, Jerome Glisse wrote:
> > > It's because of all these issues that I preferred just accessing
> > > userspace memory and handling faults. Unfortunately there does not
> > > appear to exist an API that whitelists a specific driver along the lines
> > > of "I checked this code for speculative info leaks, don't add barriers
> > > on data path please".
> > 
> > Maybe it would be better to explore adding such helper then remapping
> > page into kernel address space ?
> 
> I explored it a bit (see e.g. thread around: "__get_user slower than
> get_user") and I can tell you it's not trivial given the issue is around
> security.  So in practice it does not seem fair to keep a significant
> optimization out of kernel because *maybe* we can do it differently even
> better :)

Maybe a slightly different approach, somewhere between this patchset and
the other copy user APIs, would work here. What you really want is something
like a temporary mlock on a range of memory, so that it is safe for the
kernel to access a range of userspace virtual addresses, i.e. the pages
are present and have the proper permissions, hence there can be no page
fault while you are accessing them from kernel context.

So you can have something like a range structure and an mmu notifier. When
you lock the range you block the mmu notifier to allow your code to work on
the userspace VA safely. Once you are done you unlock and let the
mmu notifier go on. It is pretty much exactly this patchset except
that you remove all the kernel vmap code. A nice thing about that
is that you do not need to worry about calling set_page_dirty; it
will already be handled by the userspace VA pte. It also uses less
memory than when you have a kernel vmap.

This idea might be defeated by security features where the kernel is
running in its own address space without the userspace address
space present.

Anyway just wanted to put the idea forward.
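
Roughly, the range lock idea could look like the userspace model below. This is only a sketch of the bookkeeping under my own assumptions: struct model_range and its helpers are invented names, and where the model returns false a real notifier would block until the range is unlocked.

```c
#include <stdbool.h>

/* Hypothetical model of a locked userspace VA range; not kernel API. */
struct model_range {
	unsigned long start, end;	/* covers [start, end) */
	bool locked;
	unsigned int deferred;		/* invalidations held back by the lock */
};

bool model_overlaps(const struct model_range *r,
		    unsigned long start, unsigned long end)
{
	return start < r->end && r->start < end;
}

void model_range_lock(struct model_range *r)
{
	r->locked = true;
}

/* Returns true if the invalidation may go ahead now, false if it must wait. */
bool model_notifier_may_proceed(struct model_range *r,
				unsigned long start, unsigned long end)
{
	if (r->locked && model_overlaps(r, start, end)) {
		r->deferred++;
		return false;
	}
	return true;
}

/* Returns how many held-back invalidations can now be let through. */
unsigned int model_range_unlock(struct model_range *r)
{
	unsigned int woken = r->deferred;

	r->locked = false;
	r->deferred = 0;
	return woken;
}
```

Invalidations that do not overlap the locked range are never held back, so the lock only costs something when the kernel really is working on that VA range.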

Cheers,
Jérôme


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 10:16:00PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 09:55:39PM -0500, Jerome Glisse wrote:
> > On Thu, Mar 07, 2019 at 09:21:03PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 02:17:20PM -0500, Jerome Glisse wrote:
> > > > > It's because of all these issues that I preferred just accessing
> > > > > userspace memory and handling faults. Unfortunately there does not
> > > > > appear to exist an API that whitelists a specific driver along the 
> > > > > lines
> > > > > of "I checked this code for speculative info leaks, don't add barriers
> > > > > on data path please".
> > > > 
> > > > Maybe it would be better to explore adding such helper then remapping
> > > > page into kernel address space ?
> > > 
> > > I explored it a bit (see e.g. thread around: "__get_user slower than
> > > get_user") and I can tell you it's not trivial given the issue is around
> > > security.  So in practice it does not seem fair to keep a significant
> > > optimization out of kernel because *maybe* we can do it differently even
> > > better :)
> > 
> > Maybe a slightly different approach between this patchset and other
> > copy user API would work here. What you want really is something like
> > a temporary mlock on a range of memory so that it is safe for the
> > kernel to access range of userspace virtual address ie page are
> > present and with proper permission hence there can be no page fault
> > while you are accessing thing from kernel context.
> > 
> > So you can have like a range structure and mmu notifier. When you
> > lock the range you block mmu notifier to allow your code to work on
> > the userspace VA safely. Once you are done you unlock and let the
> > mmu notifier go on. It is pretty much exactly this patchset except
> > that you remove all the kernel vmap code. A nice thing about that
> > is that you do not need to worry about calling set page dirty it
> > will already be handle by the userspace VA pte. It also use less
> > memory than when you have kernel vmap.
> > 
> > This idea might be defeated by security feature where the kernel is
> > running in its own address space without the userspace address
> > space present.
> 
> Like smap?

Yes like smap but also other newer changes, with similar effect, since
the spectre drama.

Cheers,
Jérôme


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > + .invalidate_range = vhost_invalidate_range,
> > > +};
> > > +
> > >  void vhost_dev_init(struct vhost_dev *dev,
> > >   struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> > >  {
> > 
> > I also wonder here: when page is write protected then
> > it does not look like .invalidate_range is invoked.
> > 
> > E.g. mm/ksm.c calls
> > 
> > mmu_notifier_invalidate_range_start and
> > mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.
> > 
> > Similarly, rmap in page_mkclean_one will not call
> > mmu_notifier_invalidate_range.
> > 
> > If I'm right vhost won't get notified when page is write-protected since you
> > didn't install start/end notifiers. Note that end notifier can be called
> > with page locked, so it's not as straight-forward as just adding a call.
> > Writing into a write-protected page isn't a good idea.
> > 
> > Note that documentation says:
> > it is fine to delay the mmu_notifier_invalidate_range
> > call to mmu_notifier_invalidate_range_end() outside the page table lock.
> > implying it's called just later.
> 
> OK I missed the fact that _end actually calls
> mmu_notifier_invalidate_range internally. So that part is fine but the
> fact that you are trying to take page lock under VQ mutex and take same
> mutex within notifier probably means it's broken for ksm and rmap at
> least since these call invalidate with lock taken.
> 
> And generally, Andrea told me offline one can not take mutex under
> the notifier callback. I CC'd Andrea for why.

Correct, you _can not_ take a mutex or any sleeping lock from within the
invalidate_range callback, as that callback happens under the page table
spinlock. You can however do so in the invalidate_range_start callback,
but only if it is invoked in a context where blocking is allowed (there is
a flag passed down with the invalidate_range_start callback; if you are not
allowed to block then return EBUSY and the invalidation will be aborted).
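
As a rough illustration of that rule, here is a userspace model of a start callback that respects the blocking flag. The names (struct model_vq, model_invalidate_range_start, the blockable parameter) are assumptions made for this sketch, and the error code simply follows what I wrote above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a driver's notifier callback; not kernel API. */
struct model_vq {
	bool mutex_held;	/* someone else currently holds the vq mutex */
	unsigned int waits;	/* times the callback had to sleep */
	unsigned int unmaps;	/* invalidations carried out */
};

int model_invalidate_range_start(struct model_vq *vq, bool blockable)
{
	if (vq->mutex_held) {
		if (!blockable)
			return -EBUSY;	/* error code as named in the mail above */
		vq->waits++;		/* real code would sleep on the mutex here */
		vq->mutex_held = false;	/* model: holder released it meanwhile */
	}
	vq->unmaps++;			/* tear down the mapping under the mutex */
	return 0;
}
```

The invalidate_range callback has no such escape hatch, which is why the sleeping work has to live in the start callback.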


> 
> That's a separate issue from set_page_dirty when memory is file backed.

If you can access file-backed pages then I suggest using set_page_dirty
from within a special version of vunmap(), so that when you vunmap you
set the page dirty without taking the page lock. It is always safe to do
so from within an mmu notifier callback if you had the page mapped
with write permission, which means that the page had write permission
in the userspace pte too, and thus it having a dirty pte is expected
and calling set_page_dirty on the page is allowed without any lock.
Locking will happen once the userspace ptes are torn down, through the
page table lock.

> It's because of all these issues that I preferred just accessing
> userspace memory and handling faults. Unfortunately there does not
> appear to exist an API that whitelists a specific driver along the lines
> of "I checked this code for speculative info leaks, don't add barriers
> on data path please".

Maybe it would be better to explore adding such a helper than remapping
pages into the kernel address space?

Cheers,
Jérôme


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 02:38:38PM -0500, Andrea Arcangeli wrote:
> On Thu, Mar 07, 2019 at 02:09:10PM -0500, Jerome Glisse wrote:
> > I thought this patch was only for anonymous memory ie not file back ?
> 
> Yes, the other common usages are on hugetlbfs/tmpfs that also don't
> need to implement writeback and are obviously safe too.
> 
> > If so then set dirty is mostly useless it would only be use for swap
> > but for this you can use an unlock version to set the page dirty.
> 
> It's not a practical issue but a security issue perhaps: you can
> change the KVM userland to run on VM_SHARED ext4 as guest physical
> memory, you could do that with the qemu command line that is used to
> place it on tmpfs or hugetlbfs for example and some proprietary KVM
> userland may do for other reasons. In general it shouldn't be possible
> to crash the kernel with this, and it wouldn't be nice to fail if
> somebody decides to put VM_SHARED ext4 (we could easily allow vhost
> ring only backed by anon or tmpfs or hugetlbfs to solve this of
> course).
> 
> It sounds like we should at least optimize away the _lock from
> set_page_dirty if it's anon/hugetlbfs/tmpfs, would be nice if there
> was a clean way to do that.
> 
> Now assuming we don't nak the use on ext4 VM_SHARED and we stick to
> set_page_dirty_lock for such case: could you recap how that
> __writepage ext4 crash was solved if try_to_free_buffers() run on a
> pinned GUP page (in our vhost case try_to_unmap would have gotten rid
> of the pins through the mmu notifier and the page would have been
> freed just fine).

So for the above the easiest thing is to call set_page_dirty() from
the mmu notifier callback. It is always safe to use the non-locking
variant from such a callback. Well, it is safe only if the page was
mapped with write permission prior to the callback, so here I assume
nothing stupid is going on and that you only vmap a page with write
if it has a CPU pte with write, and if not then you force a write
page fault.

Basically, from the mmu notifier callback you have the same rights as
zap pte has.
> 

> The first two things that come to mind is that we can easily forbid
> the try_to_free_buffers() if the page might be pinned by GUP, it has
> false positives with the speculative pagecache lookups but it cannot
> give false negatives. We use those checks to know when a page is
> pinned by GUP, for example, where we cannot merge KSM pages with gup
> pins etc... However what if the elevated refcount wasn't there when
> try_to_free_buffers run and is there when __remove_mapping runs?
> 
> What I mean is that it sounds easy to forbid try_to_free_buffers for
> the long term pins, but that still won't prevent the same exact issue
> for a transient pin (except the window to trigger it will be much smaller).

I think here you do not want to go down the same path as what is being
planned for GUP. GUP is being fixed for "broken" hardware. Myself, I am
converting proper hardware to no longer use GUP but rely on mmu notifiers.

So I would not do any dance with blocking try_to_free_buffers; just
do everything from the mmu notifier callback and you are fine.

> 
> I basically don't see how long term GUP pins breaks stuff in ext4
> while transient short term GUP pins like O_DIRECT don't. The VM code
> isn't able to disambiguate if the pin is short or long term and it
> won't even be able to tell the difference between a GUP pin (long or
> short term) and a speculative get_page_unless_zero run by the
> pagecache speculative pagecache lookup. Even a random speculative
> pagecache lookup that runs just before __remove_mapping, can cause
> __remove_mapping to fail despite try_to_free_buffers() succeeded
> before it (like if there was a transient or long term GUP
> pin). speculative lookup that can happen across all page struct at all
> times and they will cause page_ref_freeze in __remove_mapping to
> fail.
> 
> I'm sure I'm missing details on the ext4 __writepage problem and how
> set_page_dirty_lock broke stuff with long term GUP pins, so I'm
> asking...

O_DIRECT can suffer from the same issue, but the race window for that
is small enough that it is unlikely it ever happened. But for device
drivers that GUP pages for hours/days/weeks/months ... obviously the
race window is big enough here. It affects many filesystems (ext4, xfs, ...)
in different ways. I think ext4 is the most obvious because of the
kernel log trace it leaves behind.

Bottom line is for set_page_dirty to be safe you need the following:
lock_page()
page_mkwrite()
set_pte_with_write()
unlock_page()

Now when losing the write permission on the pte you will first get
an mmu notifier callback, so anyone that abides by mmu notifiers is fine
as long as they only write to the page if they found a pte with
write, as it means the above sequence did happen and the page is
writable until the mmu notifier callback happens.

When you look up a page in the page cache you still need to call
page_mkwrite() before installing a writable pte.
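
The ordering above can be modelled in userspace as follows. This is a sketch under stated assumptions: the structs and function names are invented, and fs_ready is just a flag standing in for whatever the filesystem prepares in page_mkwrite().

```c
#include <stdbool.h>

/* Hypothetical model; the fields only track the ordering described above. */
struct model_page { bool locked; bool fs_ready; bool dirty; };
struct model_pte { struct model_page *page; bool write; };

/* Write fault path: the four steps listed above, in order. */
void model_fault_make_writable(struct model_pte *pte)
{
	pte->page->locked = true;	/* lock_page() */
	pte->page->fs_ready = true;	/* page_mkwrite() */
	pte->write = true;		/* set_pte_with_write() */
	pte->page->locked = false;	/* unlock_page() */
}

/* Write permission is lost; a notifier user is told before this happens. */
void model_write_protect(struct model_pte *pte)
{
	pte->write = false;
	pte->page->fs_ready = false;
}

/* Dirtying is only legitimate while a writable pte exists. */
bool model_dirty_if_writable(struct model_pte *pte)
{
	if (!pte->write)
		return false;
	pte->page->dirty = true;
	return true;
}
```

The point of the model is that a writable pte implies the filesystem side was prepared, so a notifier user that checks for write before dirtying never surprises the filesystem.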


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-04-19 Thread Jerome Glisse
On Thu, Mar 07, 2019 at 10:34:39AM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 10:45:57AM +0800, Jason Wang wrote:
> > 
> > On 2019/3/7 上午12:31, Michael S. Tsirkin wrote:
> > > > +static void vhost_set_vmap_dirty(struct vhost_vmap *used)
> > > > +{
> > > > +   int i;
> > > > +
> > > > +   for (i = 0; i < used->npages; i++)
> > > > +   set_page_dirty_lock(used->pages[i]);
> > > This seems to rely on page lock to mark page dirty.
> > > 
> > > Could it happen that page writeback will check the
> > > page, find it clean, and then you mark it dirty and then
> > > invalidate callback is called?
> > > 
> > > 
> > 
> > Yes. But does this break anything?
> > The page is still there, we just remove a
> > kernel mapping to it.
> > 
> > Thanks
> 
> Yes it's the same problem as e.g. RDMA:
>   we've just marked the page as dirty without having buffers.
>   Eventually writeback will find it and filesystem will complain...
>   So if the pages are backed by a non-RAM-based filesystem, it’s all just 
> broken.
> 
> one can hope that RDMA guys will fix it in some way eventually.
> For now, maybe add a flag in e.g. VMA that says that there's no
> writeback so it's safe to mark page dirty at any point?

I thought this patch was only for anonymous memory, i.e. not file-backed?
If so then setting the page dirty is mostly useless; it would only be used
for swap, and for that you can use the non-locking variant to set the page dirty.

Cheers,
Jérôme

Re: [PATCH v2] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Stefano Garzarella
On Wed, Mar 6, 2019 at 11:13 AM Adalbert Lazăr  wrote:
>
> Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic
> after device hot-unplug"), vsock_core_init() was called from
> virtio_vsock_probe(). Now, virtio_transport_reset_no_sock() can be called
> before vsock_core_init() has the chance to run.
>
> [Wed Feb 27 14:17:09 2019] BUG: unable to handle kernel NULL pointer 
> dereference at 0110
> [Wed Feb 27 14:17:09 2019] #PF error: [normal kernel read fault]
> [Wed Feb 27 14:17:09 2019] PGD 0 P4D 0
> [Wed Feb 27 14:17:09 2019] Oops:  [#1] SMP PTI
> [Wed Feb 27 14:17:09 2019] CPU: 3 PID: 59 Comm: kworker/3:1 Not tainted 
> 5.0.0-rc7-390-generic-hvi #390
> [Wed Feb 27 14:17:09 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 
> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [Wed Feb 27 14:17:09 2019] Workqueue: virtio_vsock virtio_transport_rx_work 
> [vmw_vsock_virtio_transport]
> [Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
> [vmw_vsock_virtio_transport_common]
> [Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
> 8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
> df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
> [Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
> [Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
> 0003
> [Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
> 9d79637ee080
> [Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
> 9d796f403500
> [Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
> 9d7969d09240
> [Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
> 9d796d48ff80
> [Wed Feb 27 14:17:09 2019] FS:  () 
> GS:9d796fac() knlGS:
> [Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
> [Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
> 06e0
> [Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 
> 
> [Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [Wed Feb 27 14:17:09 2019] Call Trace:
> [Wed Feb 27 14:17:09 2019]  virtio_transport_recv_pkt+0x63/0x820 
> [vmw_vsock_virtio_transport_common]
> [Wed Feb 27 14:17:09 2019]  ? kfree+0x17e/0x190
> [Wed Feb 27 14:17:09 2019]  ? detach_buf_split+0x145/0x160
> [Wed Feb 27 14:17:09 2019]  ? __switch_to_asm+0x40/0x70
> [Wed Feb 27 14:17:09 2019]  virtio_transport_rx_work+0xa0/0x106 
> [vmw_vsock_virtio_transport]
> [Wed Feb 27 14:17:09 2019] NET: Registered protocol family 40
> [Wed Feb 27 14:17:09 2019]  process_one_work+0x167/0x410
> [Wed Feb 27 14:17:09 2019]  worker_thread+0x4d/0x460
> [Wed Feb 27 14:17:09 2019]  kthread+0x105/0x140
> [Wed Feb 27 14:17:09 2019]  ? rescuer_thread+0x360/0x360
> [Wed Feb 27 14:17:09 2019]  ? kthread_destroy_worker+0x50/0x50
> [Wed Feb 27 14:17:09 2019]  ret_from_fork+0x35/0x40
> [Wed Feb 27 14:17:09 2019] Modules linked in: vmw_vsock_virtio_transport 
> vmw_vsock_virtio_transport_common input_leds vsock serio_raw i2c_piix4 
> mac_hid qemu_fw_cfg autofs4 cirrus ttm drm_kms_helper syscopyarea sysfillrect 
> sysimgblt fb_sys_fops virtio_net psmouse drm net_failover pata_acpi 
> virtio_blk failover floppy
>
> Fixes: 22b5c0b63f32 ("vsock/virtio: fix kernel panic after device hot-unplug")
> Reported-by: Alexandru Herghelegiu 
> Signed-off-by: Adalbert Lazăr 
> Co-developed-by: Stefan Hajnoczi 
> ---
>  net/vmw_vsock/virtio_transport_common.c | 22 +++---
>  1 file changed, 15 insertions(+), 7 deletions(-)
>

Thanks, LGTM!

Reviewed-by: Stefano Garzarella 


> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index 3ae3a33da70b..602715fc9a75 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -662,6 +662,8 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
>   */
>  static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
>  {
> +   const struct virtio_transport *t;
> +   struct virtio_vsock_pkt *reply;
> struct virtio_vsock_pkt_info info = {
> .op = VIRTIO_VSOCK_OP_RST,
> .type = le16_to_cpu(pkt->hdr.type),
> @@ -672,15 +674,21 @@ static int virtio_transport_reset_no_sock(struct 
> virtio_vsock_pkt *pkt)
> if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> return 0;
>
> -   pkt = virtio_transport_alloc_pkt(&info, 0,
> -le64_to_cpu(pkt->hdr.dst_cid),
> -le32_to_cpu(pkt->hdr.dst_port),
> -le64_to_cpu(pkt->hdr.src_cid),
> -

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-04-19 Thread Liran Alon



> On 6 Mar 2019, at 23:42, si-wei liu  wrote:
> 
> 
> 
> On 3/6/2019 1:36 PM, Samudrala, Sridhar wrote:
>> 
>> On 3/6/2019 1:26 PM, si-wei liu wrote:
>>> 
>>> 
>>> On 3/6/2019 4:04 AM, Jiri Pirko wrote:
> --- a/net/core/failover.c
> +++ b/net/core/failover.c
> @@ -16,6 +16,11 @@
> 
> static LIST_HEAD(failover_list);
> static DEFINE_SPINLOCK(failover_lock);
> +static bool slave_rename_ok = true;
> +
> +module_param(slave_rename_ok, bool, (S_IRUGO | S_IWUSR));
> +MODULE_PARM_DESC(slave_rename_ok,
> +  "If set allow renaming the slave when failover master is up");
> 
 No module parameters please. If you need to set something do it using
 rtnl_link_ops. Thanks.
 
 
>>> I understand what you ask for, but without module parameters userspace 
>>> don't work. During boot (dracut) the virtio netdev gets enslaved earlier 
>>> than when userspace comes up, so failover has to determine the setting 
>>> during initialization/creation. This config is not dynamic, at least for 
>>> the life cycle of a particular failover link it shouldn't be changed. 
>>> Without module parameter, how does the userspace specify this value during 
>>> kernel initialization? 
>>> 
>> Can we enable this by default and not make it configurable via module 
>> parameter?
>> Is there any  usecase where someone expects rename to fail with failover 
>> slaves?
> Probably just cater for those application that assumes fixed name on UP 
> interface?
> 
> It's already the default for the configurable. I myself don't think that's a 
> big problem for failover users. So far there's not even QEMU support I think 
> everything can be changed. I don't feel strong to just fix it without 
> introducing configurable. But maybe Michael or others think it differently...
> 
> If no one objects, I don't feel strong to make it fixed behavior.
> 
> -Siwei
> 

I agree we should just remove the module parameter.

-Liran




Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Adalbert Lazăr
On Wed, 6 Mar 2019 17:02:16 +, Stefan Hajnoczi  wrote:
> On Wed, Mar 06, 2019 at 11:10:41AM +0200, Adalbert Lazăr wrote:
> > On Wed, 6 Mar 2019 08:41:04 +, Stefan Hajnoczi  
> > wrote:
> > > On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:
> > > The pkt argument is the received packet that we must reply to.
> > > The reply packet is allocated just before line 680 and must be free
> > > explicitly for return -ENOTCONN.
> > > 
> > > You can avoid the leak and make the code easier to read like this:
> > > 
> > >   struct virtio_vsock_pkt *reply;
> > > 
> > >   ...
> > > 
> > >  -- avoid reusing 'pkt'
> > > v
> > >   reply = virtio_transport_alloc_pkt(&info, 0, ...);
> > >   if (!reply)
> > >   return -ENOMEM;
> > > 
> > >   t = virtio_transport_get_ops();
> > >   if (!t) {
> > >   virtio_transport_free_pkt(reply); <-- prevent memory leak
> > >   return -ENOTCONN;
> > >   }
> > >   return t->send_pkt(reply);
> > 
> > What do you think about Stefano's suggestion, to move the check above
> > the line where the reply is allocated?
> 
> That's fine too.
> 
> However a follow up patch to eliminate the confusing way that 'pkt' is
> reused is still warranted.  If you are busy I'd be happy to send that
> cleanup.
> 
> Stefan

I've got it, a couple of minutes after I've replied :)
The second version[1] should be in your mailbox.

Thank you,
Adalbert

[1]: https://patchwork.kernel.org/patch/10840787/
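
For reference, the ordering Stefano suggested (check for the transport before allocating the reply) can be sketched in userspace as below. This is a model with invented names and malloc/free in place of the kernel helpers, not the code from the patch itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the packet and transport types. */
enum { MODEL_OP_REQUEST = 1, MODEL_OP_RST = 3 };
struct model_pkt { int op; };
struct model_transport { int (*send_pkt)(struct model_pkt *pkt); };

int model_live_replies;	/* allocated but not yet freed replies */

/* Stand-in for a transport that consumes (and frees) the reply. */
int model_send_pkt(struct model_pkt *pkt)
{
	free(pkt);
	model_live_replies--;
	return 0;
}

int model_reset_no_sock(const struct model_transport *t,
			const struct model_pkt *pkt)
{
	struct model_pkt *reply;	/* separate name, 'pkt' is not reused */

	if (pkt->op == MODEL_OP_RST)
		return 0;		/* never answer a reset with a reset */
	if (!t)
		return -ENOTCONN;	/* nothing allocated yet, nothing to free */

	reply = malloc(sizeof(*reply));
	if (!reply)
		return -ENOMEM;
	model_live_replies++;
	reply->op = MODEL_OP_RST;
	return t->send_pkt(reply);
}
```

With the check first there is no error path that has to free the reply, and keeping 'reply' distinct from 'pkt' covers the readability point as well.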

Re: [RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

2019-04-19 Thread Christophe de Dinechin



> On 6 Mar 2019, at 08:18, Jason Wang  wrote:
> 
> Signed-off-by: Jason Wang 
> ---
> drivers/vhost/vhost.c | 46 --
> 1 file changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2025543..1015464 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
>   vhost_vq_free_iovecs(dev->vqs[i]);
> }
> 
> +static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)

Nit: Any reason not to make `num` unsigned or size_t?

> +{
> + size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> + return sizeof(*vq->avail) +
> +sizeof(*vq->avail->ring) * num + event;
> +}
> +
> +static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
> +{
> + size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> + return sizeof(*vq->used) +
> +sizeof(*vq->used->ring) * num + event;
> +}
> +
> +static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
> +{
> + return sizeof(*vq->desc) * num;
> +}
> +
> void vhost_dev_init(struct vhost_dev *dev,
>   struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> {
> @@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, 
> unsigned int num,
>struct vring_used __user *used)
> 
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> - return access_ok(desc, num * sizeof *desc) &&
> -access_ok(avail,
> -  sizeof *avail + num * sizeof *avail->ring + s) &&
> -access_ok(used,
> - sizeof *used + num * sizeof *used->ring + s);
> + return access_ok(desc, vhost_get_desc_size(vq, num)) &&
> +access_ok(avail, vhost_get_avail_size(vq, num)) &&
> +access_ok(used, vhost_get_used_size(vq, num));
> }
> 
> static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> @@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue 
> *vq,
> 
> int vq_meta_prefetch(struct vhost_virtqueue *vq)
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>   unsigned int num = vq->num;
> 
>   if (!vq->iotlb)
>   return 1;
> 
>   return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
> -num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
> +vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
>  iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
> -sizeof *vq->avail +
> -num * sizeof(*vq->avail->ring) + s,
> +vhost_get_avail_size(vq, num),
>  VHOST_ADDR_AVAIL) &&
>  iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
> -sizeof *vq->used +
> -num * sizeof(*vq->used->ring) + s,
> -VHOST_ADDR_USED);
> +vhost_get_used_size(vq, num), VHOST_ADDR_USED);
> }
> EXPORT_SYMBOL_GPL(vq_meta_prefetch);
> 
> @@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
> static bool vq_log_access_ok(struct vhost_virtqueue *vq,
>void __user *log_base)
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
>   return vq_memory_access_ok(log_base, vq->umem,
>  vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
>   (!vq->log_used || log_access_ok(log_base, vq->log_addr,
> - sizeof *vq->used +
> - vq->num * sizeof *vq->used->ring + s));
> +   vhost_get_used_size(vq, vq->num)));
> }
> 
> /* Can we start vq? */
> -- 
> 1.8.3.1
> 
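
For readers following the size helpers in the patch above, here is a minimal standalone sketch of the same arithmetic. It assumes the split-ring layout (16-byte descriptors, 2-byte avail entries, 8-byte used elements, and an optional 2-byte event field when EVENT_IDX is negotiated). The demo_* names are hypothetical and are not part of vhost.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copies of the split-ring layouts; not the kernel definitions. */
struct demo_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct demo_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct demo_used_elem { uint32_t id; uint32_t len; };
struct demo_used { uint16_t flags; uint16_t idx; struct demo_used_elem ring[]; };

/* Bytes covered by the descriptor table for a queue of num entries. */
static size_t demo_desc_size(unsigned int num)
{
	return sizeof(struct demo_desc) * num;
}

/* Header + num ring entries + optional trailing used_event field. */
static size_t demo_avail_size(unsigned int num, bool event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct demo_avail) + sizeof(uint16_t) * num + event;
}

/* Header + num used elements + optional trailing avail_event field. */
static size_t demo_used_size(unsigned int num, bool event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct demo_used) +
	       sizeof(struct demo_used_elem) * num + event;
}
```

Keeping the three computations in one place means every access check uses the same byte counts, which is the point of the refactoring in the quoted patch.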



[PATCH v2] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Adalbert Lazăr
Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic
after device hot-unplug"), vsock_core_init() was called from
virtio_vsock_probe(). Now, virtio_transport_reset_no_sock() can be called
before vsock_core_init() has the chance to run.

[Wed Feb 27 14:17:09 2019] BUG: unable to handle kernel NULL pointer 
dereference at 0110
[Wed Feb 27 14:17:09 2019] #PF error: [normal kernel read fault]
[Wed Feb 27 14:17:09 2019] PGD 0 P4D 0
[Wed Feb 27 14:17:09 2019] Oops:  [#1] SMP PTI
[Wed Feb 27 14:17:09 2019] CPU: 3 PID: 59 Comm: kworker/3:1 Not tainted 
5.0.0-rc7-390-generic-hvi #390
[Wed Feb 27 14:17:09 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[Wed Feb 27 14:17:09 2019] Workqueue: virtio_vsock virtio_transport_rx_work 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
[Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
[Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
0003
[Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
9d79637ee080
[Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
9d796f403500
[Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
9d7969d09240
[Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
9d796d48ff80
[Wed Feb 27 14:17:09 2019] FS:  () 
GS:9d796fac() knlGS:
[Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
[Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
06e0
[Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 

[Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
0400
[Wed Feb 27 14:17:09 2019] Call Trace:
[Wed Feb 27 14:17:09 2019]  virtio_transport_recv_pkt+0x63/0x820 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019]  ? kfree+0x17e/0x190
[Wed Feb 27 14:17:09 2019]  ? detach_buf_split+0x145/0x160
[Wed Feb 27 14:17:09 2019]  ? __switch_to_asm+0x40/0x70
[Wed Feb 27 14:17:09 2019]  virtio_transport_rx_work+0xa0/0x106 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] NET: Registered protocol family 40
[Wed Feb 27 14:17:09 2019]  process_one_work+0x167/0x410
[Wed Feb 27 14:17:09 2019]  worker_thread+0x4d/0x460
[Wed Feb 27 14:17:09 2019]  kthread+0x105/0x140
[Wed Feb 27 14:17:09 2019]  ? rescuer_thread+0x360/0x360
[Wed Feb 27 14:17:09 2019]  ? kthread_destroy_worker+0x50/0x50
[Wed Feb 27 14:17:09 2019]  ret_from_fork+0x35/0x40
[Wed Feb 27 14:17:09 2019] Modules linked in: vmw_vsock_virtio_transport 
vmw_vsock_virtio_transport_common input_leds vsock serio_raw i2c_piix4 mac_hid 
qemu_fw_cfg autofs4 cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops virtio_net psmouse drm net_failover pata_acpi virtio_blk failover 
floppy

Fixes: 22b5c0b63f32 ("vsock/virtio: fix kernel panic after device hot-unplug")
Reported-by: Alexandru Herghelegiu 
Signed-off-by: Adalbert Lazăr 
Co-developed-by: Stefan Hajnoczi 
---
 net/vmw_vsock/virtio_transport_common.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 3ae3a33da70b..602715fc9a75 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -662,6 +662,8 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
  */
 static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
 {
+   const struct virtio_transport *t;
+   struct virtio_vsock_pkt *reply;
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RST,
.type = le16_to_cpu(pkt->hdr.type),
@@ -672,15 +674,21 @@ static int virtio_transport_reset_no_sock(struct 
virtio_vsock_pkt *pkt)
if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
return 0;

-   pkt = virtio_transport_alloc_pkt(&info, 0,
-le64_to_cpu(pkt->hdr.dst_cid),
-le32_to_cpu(pkt->hdr.dst_port),
-le64_to_cpu(pkt->hdr.src_cid),
-le32_to_cpu(pkt->hdr.src_port));
-   if (!pkt)
+   reply = virtio_transport_alloc_pkt(&info, 0,
+  le64_to_cpu(pkt->hdr.dst_cid),
+  le32_to_cpu(pkt->hdr.dst_port),
+  

Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Adalbert Lazăr
On Wed, 6 Mar 2019 08:41:04 +, Stefan Hajnoczi  wrote:
> On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:
> 
> Thanks for the patch, Adalbert!  Please add a Signed-off-by tag so your
> patch can be merged (see Documentation/process/submitting-patches.rst
> Chapter 11 for details on the Developer's Certificate of Origin).
> 
> >  static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
> >  {
> > +   const struct virtio_transport *t;
> > struct virtio_vsock_pkt_info info = {
> > .op = VIRTIO_VSOCK_OP_RST,
> > .type = le16_to_cpu(pkt->hdr.type),
> > @@ -680,7 +681,11 @@ static int virtio_transport_reset_no_sock(struct 
> > virtio_vsock_pkt *pkt)
> > if (!pkt)
> > return -ENOMEM;
> >  
> > -   return virtio_transport_get_ops()->send_pkt(pkt);
> > +   t = virtio_transport_get_ops();
> > +   if (!t)
> > +   return -ENOTCONN;
> 
> pkt is leaked here.  This is an easy mistake to make because the code is
> unclear. 

Thank you for your kind words :)

> The pkt argument is the received packet that we must reply to.
> The reply packet is allocated just before line 680 and must be freed
> explicitly before returning -ENOTCONN.
> 
> You can avoid the leak and make the code easier to read like this:
> 
>   struct virtio_vsock_pkt *reply;
> 
>   ...
> 
>  -- avoid reusing 'pkt'
> v
>   reply = virtio_transport_alloc_pkt(&info, 0, ...);
>   if (!reply)
>   return -ENOMEM;
> 
>   t = virtio_transport_get_ops();
>   if (!t) {
>   virtio_transport_free_pkt(reply); <-- prevent memory leak
>   return -ENOTCONN;
>   }
>   return t->send_pkt(reply);

What do you think about Stefano's suggestion to move the check above
the line where the reply is allocated?
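
To make the two options concrete, here is a minimal userspace sketch of the check-before-allocate ordering. It assumes a transport pointer that stays NULL until core init has run. All demo_* names are hypothetical stand-ins, not the vsock API, and the live-packet counter exists only to show that nothing leaks.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_pkt { int op; };

struct demo_transport {
	int (*send_pkt)(struct demo_pkt *pkt);
};

/* Stand-in for the registered transport; NULL until "core init" has run. */
static const struct demo_transport *demo_ops;

/* Number of packets allocated and not yet freed. */
static int demo_live_pkts;

static struct demo_pkt *demo_alloc_pkt(int op)
{
	struct demo_pkt *pkt = malloc(sizeof(*pkt));

	if (pkt) {
		pkt->op = op;
		demo_live_pkts++;
	}
	return pkt;
}

static void demo_free_pkt(struct demo_pkt *pkt)
{
	if (!pkt)
		return;
	free(pkt);
	demo_live_pkts--;
}

/* Example send callback: takes ownership of the packet and releases it. */
static int demo_send(struct demo_pkt *pkt)
{
	demo_free_pkt(pkt);
	return 0;
}

/* Look up the transport first, so the early return has nothing to free. */
static int demo_reset_no_sock(void)
{
	const struct demo_transport *t = demo_ops;
	struct demo_pkt *reply;

	if (!t)
		return -ENOTCONN;

	reply = demo_alloc_pkt(3 /* arbitrary op code for the demo */);
	if (!reply)
		return -ENOMEM;

	return t->send_pkt(reply);
}
```

With this ordering the -ENOTCONN path has no cleanup to forget. The alternative from the quoted review (allocate, then free the reply on the error path) is equally valid but needs the explicit free.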

Re: [RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors

2019-04-19 Thread Christophe de Dinechin


> On 6 Mar 2019, at 08:18, Jason Wang  wrote:
> 
> This is used to hide the metadata address from virtqueue helpers. This
> will allow implementing vmap-based fast access to metadata.
> 
> Signed-off-by: Jason Wang 
> ---
> drivers/vhost/vhost.c | 94 +--
> 1 file changed, 77 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 400aa78..29709e7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -869,6 +869,34 @@ static inline void __user *__vhost_get_user(struct 
> vhost_virtqueue *vq,
>   ret; \
> })
> 
> +static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> +   vhost_avail_event(vq));
> +}
> +
> +static inline int vhost_put_used(struct vhost_virtqueue *vq,
> +  struct vring_used_elem *head, int idx,
> +  int count)
> +{
> + return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> +   count * sizeof(*head));
> +}
> +
> +static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> +
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> +   &vq->used->flags);
> +}
> +
> +static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> +
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> +   &vq->used->idx);
> +}
> +
> #define vhost_get_user(vq, x, ptr, type)  \
> ({ \
>   int ret; \
> @@ -907,6 +935,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
> }
> 
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> +   __virtio16 *idx)
> +{
> + return vhost_get_avail(vq, *idx, &vq->avail->idx);
> +}
> +
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> +__virtio16 *head, int idx)
> +{
> + return vhost_get_avail(vq, *head,
> +   &vq->avail->ring[idx & (vq->num - 1)]);
> +}
> +
> +static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> + __virtio16 *flags)
> +{
> + return vhost_get_avail(vq, *flags, &vq->avail->flags);
> +}
> +
> +static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> +__virtio16 *event)
> +{
> + return vhost_get_avail(vq, *event, vhost_used_event(vq));
> +}
> +
> +static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> +  __virtio16 *idx)
> +{
> + return vhost_get_used(vq, *idx, &vq->used->idx);
> +}
> +
> +static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> +  struct vring_desc *desc, int idx)
> +{
> + return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> +}
> +
> static int vhost_new_umem_range(struct vhost_umem *umem,
>   u64 start, u64 size, u64 end,
>   u64 userspace_addr, int perm)
> @@ -1840,8 +1905,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct 
> vhost_log *log,
> static int vhost_update_used_flags(struct vhost_virtqueue *vq)
> {
>   void __user *used;
> - if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> -   &vq->used->flags) < 0)
> + if (vhost_put_used_flags(vq))
>   return -EFAULT;
>   if (unlikely(vq->log_used)) {
>   /* Make sure the flag is seen before log. */
> @@ -1858,8 +1922,7 @@ static int vhost_update_used_flags(struct 
> vhost_virtqueue *vq)
> 
> static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
> avail_event)
> {
> - if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> -vhost_avail_event(vq)))
> + if (vhost_put_avail_event(vq))
>   return -EFAULT;
>   if (unlikely(vq->log_used)) {
>   void __user *used;
> @@ -1895,7 +1958,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
>   r = -EFAULT;
>   goto err;
>   }
> - r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
> + r = vhost_get_used_idx(vq, &last_used_idx);
>   if (r) {
>   vq_err(vq, "Can't access used idx at %p\n",
>  &vq->used->idx);

From the error case, it looks like you are not entirely encapsulating
knowledge of what the accessor uses, i.e. it’s not:

vq_err(vq, "Can't access used idx at %p\n",
   _user_idx);

Maybe move the error message into the accessor?
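
A rough sketch of what that could look like, as a standalone model rather than vhost code: the accessor is the only place that knows which address it reads, so it reports the failure itself. The demo_* names are hypothetical, and a NULL pointer stands in for an address that cannot be accessed.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

struct demo_vq {
	const uint16_t *used_idx; /* NULL models an inaccessible address */
};

/*
 * Read the used index. On failure the accessor prints the diagnostic,
 * so callers never need to know where the index actually lives.
 */
static int demo_get_used_idx(const struct demo_vq *vq, uint16_t *idx)
{
	if (!vq->used_idx) {
		fprintf(stderr, "Can't access used idx at %p\n",
			(const void *)vq->used_idx);
		return -EFAULT;
	}

	*idx = *vq->used_idx;
	return 0;
}
```

The caller then only checks the return value, and a later switch to a different access method would not leave a stale address in the caller's error message.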

> @@ -2094,7 +2157,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   last_avail_idx = vq->last_avail_idx;
> 
>   if (vq->avail_idx == vq->last_avail_idx) {
> - if 

Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Stefano Garzarella
Hi Adalbert,
thanks for catching this issue, I have a comment below.

On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:
> Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic after device 
> hot-unplug"),
> vsock_core_init() was called from virtio_vsock_probe(). Now,
> virtio_transport_reset_no_sock() can be called before vsock_core_init()
> has the chance to run.
> 
> [...]

Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Adalbert Lazăr
On Wed, 6 Mar 2019 09:12:36 +0100, Stefano Garzarella  
wrote:
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -662,6 +662,7 @@ static int virtio_transport_reset(struct vsock_sock 
> > *vsk,
> >   */
> >  static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
> >  {
> > +   const struct virtio_transport *t;
> > struct virtio_vsock_pkt_info info = {
> > .op = VIRTIO_VSOCK_OP_RST,
> > .type = le16_to_cpu(pkt->hdr.type),
> > @@ -680,7 +681,11 @@ static int virtio_transport_reset_no_sock(struct 
> > virtio_vsock_pkt *pkt)
> > if (!pkt)
> > return -ENOMEM;
> >  
> > -   return virtio_transport_get_ops()->send_pkt(pkt);
> > +   t = virtio_transport_get_ops();
> > +   if (!t)
> > +   return -ENOTCONN;
> 
> Wouldn't it be better to do this check before calling
> virtio_transport_alloc_pkt()?
> 
> Otherwise, I think we should free that packet before returning -ENOTCONN.

Right! :D
I will send a second version.


[PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-04-19 Thread Adalbert Lazăr
Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic after device 
hot-unplug"),
vsock_core_init() was called from virtio_vsock_probe(). Now,
virtio_transport_reset_no_sock() can be called before vsock_core_init()
has the chance to run.

[Wed Feb 27 14:17:09 2019] BUG: unable to handle kernel NULL pointer 
dereference at 0110
[Wed Feb 27 14:17:09 2019] #PF error: [normal kernel read fault]
[Wed Feb 27 14:17:09 2019] PGD 0 P4D 0
[Wed Feb 27 14:17:09 2019] Oops:  [#1] SMP PTI
[Wed Feb 27 14:17:09 2019] CPU: 3 PID: 59 Comm: kworker/3:1 Not tainted 
5.0.0-rc7-390-generic-hvi #390
[Wed Feb 27 14:17:09 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[Wed Feb 27 14:17:09 2019] Workqueue: virtio_vsock virtio_transport_rx_work 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
[Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
[Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
0003
[Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
9d79637ee080
[Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
9d796f403500
[Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
9d7969d09240
[Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
9d796d48ff80
[Wed Feb 27 14:17:09 2019] FS:  () 
GS:9d796fac() knlGS:
[Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
[Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
06e0
[Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 

[Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
0400
[Wed Feb 27 14:17:09 2019] Call Trace:
[Wed Feb 27 14:17:09 2019]  virtio_transport_recv_pkt+0x63/0x820 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019]  ? kfree+0x17e/0x190
[Wed Feb 27 14:17:09 2019]  ? detach_buf_split+0x145/0x160
[Wed Feb 27 14:17:09 2019]  ? __switch_to_asm+0x40/0x70
[Wed Feb 27 14:17:09 2019]  virtio_transport_rx_work+0xa0/0x106 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] NET: Registered protocol family 40
[Wed Feb 27 14:17:09 2019]  process_one_work+0x167/0x410
[Wed Feb 27 14:17:09 2019]  worker_thread+0x4d/0x460
[Wed Feb 27 14:17:09 2019]  kthread+0x105/0x140
[Wed Feb 27 14:17:09 2019]  ? rescuer_thread+0x360/0x360
[Wed Feb 27 14:17:09 2019]  ? kthread_destroy_worker+0x50/0x50
[Wed Feb 27 14:17:09 2019]  ret_from_fork+0x35/0x40
[Wed Feb 27 14:17:09 2019] Modules linked in: vmw_vsock_virtio_transport 
vmw_vsock_virtio_transport_common input_leds vsock serio_raw i2c_piix4 mac_hid 
qemu_fw_cfg autofs4 cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops virtio_net psmouse drm net_failover pata_acpi virtio_blk failover 
floppy
[Wed Feb 27 14:17:09 2019] CR2: 0110
[Wed Feb 27 14:17:09 2019] ---[ end trace baa35abd2e040fe5 ]---
[Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
[Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
[Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
0003
[Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
9d79637ee080
[Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
9d796f403500
[Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
9d7969d09240
[Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
9d796d48ff80
[Wed Feb 27 14:17:09 2019] FS:  () 
GS:9d796fac() knlGS:
[Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
[Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
06e0
[Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 

[Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
0400
---
 net/vmw_vsock/virtio_transport_common.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 3ae3a33da70b..502201aaff2a 

Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-04-19 Thread Thiago Jung Bauermann

Hello Jean-Philippe,

Jean-Philippe Brucker  writes:
> Makes sense, though I think other virtio devices have been developed a
> little more organically: device and driver code got upstreamed first,
> and then the specification describing their interface got merged into
> the standard. For example I believe that code for crypto, input and GPU
> devices were upstreamed long before the specification was merged. Once
> an implementation is upstream, the interface is expected to be
> backward-compatible (all subsequent changes are introduced using feature
> bits).
>
> So I've been working with this process in mind, also described by Jens
> at KVM forum 2017 [3]:
> (1) Reserve a device ID, and get that merged into virtio (ID 23 for
> virtio-iommu was reserved last year)
> (2) Open-source an implementation (this driver and Eric's device)
> (3) Formalize and upstream the device specification
>
> But I get that some overlap between (2) and (3) would have been better.
> So far the spec document has been reviewed mainly from the IOMMU point
> of view, and might require more changes to be in line with the other
> virtio devices -- hopefully just wording changes. I'll kick off step
> (3), but I think the virtio folks are a bit busy with finalizing the 1.1
> spec so I expect it to take a while.

I read v0.9 of the spec and have some minor comments, hope this is a
good place to send them:

1. In section 2.6.2, one reads

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered and the range
described by fields virt_start and virt_end doesn’t fit in the range
described by input_range, the device MAY set status to
VIRTIO_IOMMU_S_RANGE and ignore the request.

Shouldn't it say "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is
negotiated" instead?

2. There's a typo at the end of section 2.6.5:

The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a
protection lag.

s/lag/flag/

3. In section 3.1.2.1.1, the viommu compatible field says "virtio,mmio".
Shouldn't it say "virtio,mmio-iommu" instead, to be consistent with
"virtio,pci-iommu"?

4. There's a typo in section 3.3:

A host bridge may limit the input address space – transaction
accessing some addresses won’t reach the physical IOMMU.

s/transaction/transactions/

I also have one last comment which you may freely ignore, considering
it's clearly just personal opinion and also considering that the
specification is mature at this point: it specifies memory ranges by
specifying start and end addresses. My experience has been that this is
error prone, leading to confusion and bugs regarding whether the end
address is inclusive or exclusive. I tend to prefer expressing memory
ranges by specifying a start address and a length, which eliminates
ambiguity.
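
To illustrate the difference, here is a small sketch that models both forms and converts between them. It assumes the start/end form uses an inclusive end address; that is an assumption made for the example. The demo_* names are hypothetical and are not taken from the spec or the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start/end form, assuming an inclusive end: covers [start, end]. */
struct demo_range_end { uint64_t start; uint64_t end; };

/* Start/length form: covers [start, start + len). */
struct demo_range_len { uint64_t start; uint64_t len; };

/*
 * Convert start/length to inclusive start/end.
 * Rejects empty ranges and ranges that wrap past the top of the
 * 64-bit address space.
 */
static bool demo_len_to_end(struct demo_range_len in,
			    struct demo_range_end *out)
{
	if (in.len == 0 || in.start + (in.len - 1) < in.start)
		return false;

	out->start = in.start;
	out->end = in.start + (in.len - 1);
	return true;
}

/* True if inner lies entirely within outer (both ends inclusive). */
static bool demo_range_fits(struct demo_range_end inner,
			    struct demo_range_end outer)
{
	return inner.start <= inner.end &&
	       inner.start >= outer.start &&
	       inner.end <= outer.end;
}
```

One trade-off worth keeping in mind: with 64-bit fields, a start/length pair cannot describe the entire 64-bit address space, because the length would be 2^64, whereas an inclusive start/end pair can.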

--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH] drm/bochs: Fix the ID mismatch error

2019-04-19 Thread Alistair Francis
On Thu, Feb 21, 2019 at 9:37 PM kra...@redhat.com  wrote:
>
> On Thu, Feb 21, 2019 at 10:44:06AM -0800, Alistair Francis wrote:
> > On Thu, Feb 21, 2019 at 3:52 AM kra...@redhat.com  wrote:
> > >
> > > On Thu, Feb 21, 2019 at 12:33:03AM +, Alistair Francis wrote:
> > > > When running RISC-V QEMU with the Bochs device attached via PCIe the
> > > > probe of the Bochs device fails with:
> > > > [drm:bochs_hw_init] *ERROR* ID mismatch
> > > >
> > > > This was introduced by this commit:
> > > > 7780eb9ce8 bochs: convert to drm_dev_register
> > > >
> > > > To fix the error we ensure that pci_enable_device() is called before
> > > > bochs_load().
> > > >
> > > > Signed-off-by: Alistair Francis 
> > > > Reported-by: David Abdurachmanov 
> > >
> > > Pushed to drm-misc-fixes.
> >
> > Thanks. Any chance this will make it into 5.0?
>
> Hmm, we are damn close to the release, not sure there will be one more
> drm-fixes pull req.  But I've added a proper Fixes: tag, so even if the
> patch misses the boat it should land in the stable branches shortly
> thereafter.

Landing in the stable branches is probably enough. If you do end up
sending another pull request it would be great if this gets in. It
would be nice to have this fixed in the official 5.0 tag.

Alistair

>
> cheers,
>   Gerd
>


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-04-19 Thread Liran Alon


> On 28 Feb 2019, at 1:50, Michael S. Tsirkin  wrote:
> 
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>> 
>> 
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
 
 On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
 On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> On 2/21/2019 7:33 PM, si-wei liu wrote:
>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
 Sorry for replying to this ancient thread. There was some remaining
 issue that I don't think the initial net_failover patch
 addressed cleanly, see:
 
 https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_linux_-2Bbug_1815268=DwIBAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=aL-QfUoSYx8r0XCOBkcDtF8f-cYxrJI3skYLFTb8XJE=yk6Nqv3a6_JMzyrXKY67h00FyNrDJyQ-PYMFffDSTXM=
 
 The renaming of 'eth0' to 'ens4' fails because the udev userspace was
 not specifically written for such kernel automatic enslavement.
 Specifically, if it is a bond or team, the slave would typically get
 renamed *before* the virtual device gets created; that's what udev can
 control (without getting the netdev opened early by the other part of
 the kernel), and other userspace components, e.g. initramfs and
 init-scripts, can coordinate well in between. The in-kernel
 auto-enslavement of net_failover breaks this userspace convention and
 doesn't provide a solution if users care about consistent naming
 on the slave netdevs specifically.
 
 Previously this issue had been specifically called out when IFF_HIDDEN
 and the 1-netdev model were proposed, but no one has offered a solution
 to this problem since. Please share your thoughts on how to proceed and
 solve this userspace issue if netdev does not welcome a 1-netdev model.
>>> Above says:
>>> 
>>>   there's no motivation in the systemd/udevd community at
>>>   this point to refactor the rename logic and make it work well
>>>   with 3-netdev.
>>> 
>>> What would the fix be? Skip slave devices?
>>> 
>> There's nothing the user gains from just skipping slave devices - the
>> name is still unchanged and unpredictable, e.g. eth0, or eth1 on the
>> next reboot, while the rest may conform to the naming scheme (ens3
>> and such). There's no way one can fix this in userspace alone - when
>> the failover is created, the enslaved netdev is opened by the kernel
>> before userspace is made aware of it, and there's no
>> negotiation protocol for the kernel to know when userspace has done
>> the initial renaming of the interface. I would expect the netdev list
>> to at least provide a general direction for how this can be
>> solved...
>>> I was just wondering what you meant when you said
>>> "refactor the rename logic and make it work well with 3-netdev" -
>>> was there a proposal udev rejected?
>> No. I never believed this particular issue could be fixed in userspace
>> alone. Previously someone had said it could be, but I never saw any work
>> or relevant discussion happen in the various userspace communities (e.g.
>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the
>> root of the issue derives from the kernel, so it makes more sense to
>> start from netdev, work out and decide on a solution: see what can be
>> done in the kernel in order to fix it, then engage the userspace
>> community on the feasibility...
>> 
>>> Anyway, can we write a time diagram for what happens in which order that
>>> leads to failure?  That would help look for triggers that we can tie
>>> into, or add new ones.
>>> 
>> See attached diagram.
>> 
>>> 
>>> 
> Is there an issue if slave device names are not predictable? The
> user/admin scripts are expected to only work with the master
> failover device.
 Where does this expectation come from?
 
 Admin users may have ethtool or tc configurations that need to deal 
 with
 predictable interface name. Third-party app which was built upon 
 specifying
 certain interface name can't be modified to chase 

Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-04-19 Thread Thiago Jung Bauermann


Michael S. Tsirkin  writes:

> On Mon, Jan 21, 2019 at 11:29:05AM +, Jean-Philippe Brucker wrote:
>> Hi,
>>
>> On 18/01/2019 15:51, Michael S. Tsirkin wrote:
>> >
>> > On Tue, Jan 15, 2019 at 12:19:52PM +, Jean-Philippe Brucker wrote:
>> >> Implement the virtio-iommu driver, following specification v0.9 [1].
>> >>
>> >> This is a simple rebase onto Linux v5.0-rc2. We now use the
>> >> dev_iommu_fwspec_get() helper introduced in v5.0 instead of accessing
>> >> dev->iommu_fwspec, but there isn't any functional change from v6 [2].
>> >>
>> >> Our current goal for virtio-iommu is to get a paravirtual IOMMU working
>> >> on Arm, and enable device assignment to guest userspace. In this
>> >> use-case the mappings are static, and don't require optimal performance,
>> >> so this series tries to keep things simple. However there is plenty more
>> >> to do for features and optimizations, and having this base in v5.1 would
>> >> be good. Given that most of the changes are to drivers/iommu, I believe
>> >> the driver and future changes should go via the IOMMU tree.
>> >>
>> >> You can find Linux driver and kvmtool device on v0.9.2 branches [3],
>> >> module and x86 support on virtio-iommu/devel. Also tested with Eric's
>> >> QEMU device [4]. Please note that the series depends on Robin's
>> >> probe-deferral fix [5], which will hopefully land in v5.0.
>> >>
>> >> [1] Virtio-iommu specification v0.9, sources and pdf
>> >> git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
>> >> http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
>> >>
>> >> [2] [PATCH v6 0/7] Add virtio-iommu driver
>> >> 
>> >> https://lists.linuxfoundation.org/pipermail/iommu/2018-December/032127.html
>> >>
>> >> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.2
>> >> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9.2
>> >>
>> >> [4] [RFC v9 00/17] VIRTIO-IOMMU device
>> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg575578.html
>> >>
>> >> [5] [PATCH] iommu/of: Fix probe-deferral
>> >> https://www.spinics.net/lists/arm-kernel/msg698371.html
>> >
>> > Thanks for the work!
>> > So really my only issue with this is that there's no
>> > way for the IOMMU to describe the devices that it
>> > covers.
>> >
>> > As a result that is then done in a platform-specific way.
>> >
>> > And this means that for example it does not solve the problem that e.g.
>> > some power people have in that their platform simply does not have a way
>> > to specify which devices are covered by the IOMMU.
>>
>> Isn't power using device tree? I haven't looked much at power because I
>> was told a while ago that they already paravirtualize their IOMMU and
>> don't need virtio-iommu, except perhaps for some legacy platforms. Or
>> something along those lines. But I would certainly be interested in
>> enabling the IOMMU for more architectures.
>
> I have CC'd the relevant ppc developers, let's see what do they think.

I'm far from being an expert, but what could be very useful for us is to
have a way for the guest to request IOMMU bypass for a device.

From what I understand, the pSeries platform used by POWER guests always
puts devices behind an IOMMU, so at least for current systems a
description of which devices are covered by the IOMMU would always say
"all of them".

>> As for the enumeration problem, I still don't think we can get much
>> better than DT and ACPI as solutions (and IMO they are necessary to make
>> this device portable). But I believe that getting DT and ACPI support is
>> just a one-off inconvenience. That is, once the required bindings are
>> accepted, any future extension can then be done at the virtio level with
>> feature bits and probe requests, without having to update ACPI or DT.

There is a device tree binding that can specify devices connected to a
given IOMMU in Documentation/devicetree/bindings/iommu/iommu.txt.
I don't believe POWER machines use it though.

>> Thanks,
>> Jean
>>
>> > Solving that problem would make me much more excited about
>> > this device.
>> >
>> > On the other hand I can see that while there have been some
>> > developments most of the code has been stable for quite a while now.
>> >
>> > So what I am trying to do right about now, is making a small module that
>> > loads early and pokes at the IOMMU sufficiently to get the data about
>> > which devices use the IOMMU out of it using standard virtio config
>> > space.  IIUC it's claimed to be impossible without messy changes to the
>> > boot sequence.
>> >
>> > If I succeed at least on some platforms I'll ask that this design is
>> > worked into this device, minimizing info that goes through DT/ACPI.  If
>> > I see I can't make it in time to meet the next merge window, I plan
>> > merging the existing patches using DT (barring surprises).
>> >
>> > As I only have a very small amount of time to spend on this attempt, If
>> > someone else wants to try doing that in parallel, that would be great!

--
Thiago 

Re: [PATCH] drm/bochs: Fix the ID mismatch error

2019-04-19 Thread Alistair Francis
On Thu, Feb 21, 2019 at 3:52 AM kra...@redhat.com  wrote:
>
> On Thu, Feb 21, 2019 at 12:33:03AM +, Alistair Francis wrote:
> > When running RISC-V QEMU with the Bochs device attached via PCIe the
> > probe of the Bochs device fails with:
> > [drm:bochs_hw_init] *ERROR* ID mismatch
> >
> > This was introduced by this commit:
> > 7780eb9ce8 bochs: convert to drm_dev_register
> >
> > To fix the error we ensure that pci_enable_device() is called before
> > bochs_load().
> >
> > Signed-off-by: Alistair Francis 
> > Reported-by: David Abdurachmanov 
>
> Pushed to drm-misc-fixes.

Thanks. Any chance this will make it into 5.0?

Alistair

>
> thanks,
>   Gerd
>


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-04-19 Thread Thiago Jung Bauermann


Hello Michael,

Michael S. Tsirkin  writes:

> On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
>>
>> Fixing address of powerpc mailing list.
>>
>> Thiago Jung Bauermann  writes:
>>
>> > Hello,
>> >
>> > With Christoph's rework of the DMA API that recently landed, the patch
>> > below is the only change needed in virtio to make it work in a POWER
>> > secure guest under the ultravisor.
>> >
>> > The other change we need (making sure the device's dma_map_ops is NULL
>> > so that the dma-direct/swiotlb code is used) can be made in
>> > powerpc-specific code.
>> >
>> > Of course, I also have patches (soon to be posted as RFC) which hook up
>> >  to the powerpc secure guest support code.
>> >
>> > What do you think?
>> >
>> > From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
>> > From: Thiago Jung Bauermann 
>> > Date: Thu, 24 Jan 2019 22:08:02 -0200
>> > Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
>> >
>> > The host can't access the guest memory when it's encrypted, so using
>> > regular memory pages for the ring isn't an option. Go through the DMA API.
>> >
>> > Signed-off-by: Thiago Jung Bauermann 
>
> Well I think this will come back to bite us (witness xen which is now
> reworking precisely this path - but at least they aren't to blame, xen
> came before ACCESS_PLATFORM).
>
> I also still think the right thing would have been to set
> ACCESS_PLATFORM for all systems where device can't access all memory.

I understand. The problem with that approach for us is that because we
don't know which guests will become secure guests and which will remain
regular guests, QEMU would need to offer ACCESS_PLATFORM to all guests.

And the problem with that is that for QEMU on POWER, having
ACCESS_PLATFORM turned off means that it can bypass the IOMMU for the
device (which makes sense considering that the name of the flag was
IOMMU_PLATFORM). And we need that for regular guests to avoid
performance degradation.

So while ACCESS_PLATFORM solves our problems for secure guests, we can't
turn it on by default because we can't affect legacy systems. Doing so
would penalize existing systems that can access all memory. They would
all have to unnecessarily go through address translations, and take a
performance hit.

The semantics of ACCESS_PLATFORM assume that the hypervisor/QEMU knows
in advance - right when the VM is instantiated - that it will not have
access to all guest memory. Unfortunately that assumption is subtly
broken on our secure-platform. The hypervisor/QEMU realizes that the
platform is going secure only *after the VM is instantiated*. It's the
kernel running in the VM that determines that it wants to switch the
platform to secure-mode.

Another way of looking at this issue which also explains our reluctance
is that the only difference between a secure guest and a regular guest
(at least regarding virtio) is that the former uses swiotlb while the
latter doesn't. And from the device's point of view they're
indistinguishable. It can't tell one guest that is using swiotlb from
one that isn't. And that implies that secure guest vs regular guest
isn't a virtio interface issue, it's "guest internal affairs". So
there's no reason to reflect that in the feature flags.

That said, we still would like to arrive at a proper design for this
rather than add yet another hack if we can avoid it. So here's another
proposal: considering that the dma-direct code (in kernel/dma/direct.c)
automatically uses swiotlb when necessary (thanks to Christoph's recent
DMA work), would it be ok to replace virtio's own direct-memory code
that is used in the !ACCESS_PLATFORM case with the dma-direct code? That
way we'll get swiotlb even with !ACCESS_PLATFORM, and virtio will get a
code cleanup (replace open-coded stuff with calls to existing
infrastructure).
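
To make the decision under discussion concrete, here is a minimal userspace
sketch of the policy the RFC patch implements. The struct and its field names
are hypothetical stand-ins, not kernel API: they model the ACCESS_PLATFORM
feature bit, the existing Xen check and the encrypted-memory check. It only
illustrates the logic and is not the actual vring_use_dma_api() code.

```c
#include <stdbool.h>

/* Hypothetical model of the inputs discussed in this thread (not kernel API). */
struct guest_model {
	bool access_platform; /* device offered the ACCESS_PLATFORM feature bit */
	bool xen_domain;      /* guest runs under Xen (existing quirk) */
	bool mem_encrypted;   /* guest memory is encrypted (secure guest) */
};

/*
 * Policy of the RFC patch, as a model: use the DMA API when the device asks
 * for it via ACCESS_PLATFORM, and otherwise only for Xen guests or guests
 * whose memory is encrypted.
 */
static bool model_use_dma_api(const struct guest_model *g)
{
	if (g->access_platform)
		return true;

	return g->xen_domain || g->mem_encrypted;
}
```

In this model a regular guest without ACCESS_PLATFORM keeps bypassing the DMA
API, while a guest that became secure after instantiation goes through it
even though the feature bit was never offered.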

> But I also think I don't have the energy to argue about power secure
> guest anymore.  So be it for power secure guest since the involved
> engineers disagree with me.  Hey I've been wrong in the past ;).

Yeah, it's been a difficult discussion. Thanks for still engaging!
I honestly thought that this patch was a good solution (if the guest has
encrypted memory it means that the DMA API needs to be used), but I can
see where you are coming from. As I said, we'd like to arrive at a good
solution if possible.

> But the name "sev_active" makes me scared because at least AMD guys who
> were doing the sensible thing and setting ACCESS_PLATFORM

My understanding is that the AMD guest platform knows in advance that its
guest will run in secure mode and hence sets the flag at the time of VM
instantiation. Unfortunately we don't have that luxury on our platforms.

> (unless I'm
> wrong? I remember distinctly that's so) will likely be affected too.
> We don't want that.
>
> So let's find a way to make sure it's just power secure guest for now
> pls.

Yes, my understanding is that they turn ACCESS_PLATFORM on. And because
of 

[PATCH] drm/bochs: Fix the ID mismatch error

2019-04-19 Thread Alistair Francis
When running RISC-V QEMU with the Bochs device attached via PCIe the
probe of the Bochs device fails with:
[drm:bochs_hw_init] *ERROR* ID mismatch

This was introduced by this commit:
7780eb9ce8 bochs: convert to drm_dev_register

To fix the error we ensure that pci_enable_device() is called before
bochs_load().

Signed-off-by: Alistair Francis 
Reported-by: David Abdurachmanov 
---
 drivers/gpu/drm/bochs/bochs_drv.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/bochs/bochs_drv.c b/drivers/gpu/drm/bochs/bochs_drv.c
index f3dd66ae990a..aa35007262cd 100644
--- a/drivers/gpu/drm/bochs/bochs_drv.c
+++ b/drivers/gpu/drm/bochs/bochs_drv.c
@@ -154,6 +154,10 @@ static int bochs_pci_probe(struct pci_dev *pdev,
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto err_free_dev;
+
 	dev->pdev = pdev;
 	pci_set_drvdata(pdev, dev);
 
-- 
2.20.1



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-04-19 Thread Thiago Jung Bauermann


Christoph Hellwig  writes:

> On Tue, Jan 29, 2019 at 09:36:08PM -0500, Michael S. Tsirkin wrote:
>> This has been discussed ad nauseam. virtio is all about compatibility.
>> Losing a couple of lines of code isn't worth breaking working setups.
>> People that want "just use DMA API no tricks" now have the option.
>> Setting a flag in a feature bit map is literally a single line
>> of code in the hypervisor. So stop pushing for breaking working
>> legacy setups and just fix it in the right place.
>
> I agree with the legacy aspect.  What I am missing is an extremely
> strong wording that says you SHOULD always set this flag for new
> hosts, including an explanation why.

My understanding of ACCESS_PLATFORM is that it means "this device will
behave in all aspects like a regular device attached to this bus". Is
that it? Therefore it should be set because it's the sane thing to do?

--
Thiago Jung Bauermann
IBM Linux Technology Center



[PATCH v3 2/2] vsock/virtio: reset connected sockets on device removal

2019-04-19 Thread Stefano Garzarella
When the virtio transport device disappears, we should reset all
connected sockets in order to inform the users.

Signed-off-by: Stefano Garzarella 
Reviewed-by: Stefan Hajnoczi 
---
 net/vmw_vsock/virtio_transport.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 9dae54698737..15eb5d3d4750 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -634,6 +634,9 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
 	flush_work(&vsock->event_work);
 	flush_work(&vsock->send_pkt_work);
 
+	/* Reset all connected sockets when the device disappear */
+	vsock_for_each_connected_socket(virtio_vsock_reset_sock);
+
 	vdev->config->reset(vdev);
 
 	mutex_lock(&vsock->rx_lock);
-- 
2.20.1



[PATCH v3 1/2] vsock/virtio: fix kernel panic after device hot-unplug

2019-04-19 Thread Stefano Garzarella
virtio_vsock_remove() invokes vsock_core_exit() even if there are
open sockets for the AF_VSOCK protocol family. As a result, the
vsock "transport" pointer is set to NULL, triggering a kernel
panic at the first socket activity.

This patch moves the vsock_core_init()/vsock_core_exit() calls to the
virtio_vsock module_init and module_exit functions, respectively,
which cannot be invoked while there are open sockets.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1609699
Reported-by: Yan Fu 
Signed-off-by: Stefano Garzarella 
Acked-by: Stefan Hajnoczi 
---
 net/vmw_vsock/virtio_transport.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 5d3cce9e8744..9dae54698737 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -75,6 +75,9 @@ static u32 virtio_transport_get_local_cid(void)
 {
 	struct virtio_vsock *vsock = virtio_vsock_get();
 
+	if (!vsock)
+		return VMADDR_CID_ANY;
+
 	return vsock->guest_cid;
 }
 
@@ -584,10 +587,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 
 	virtio_vsock_update_guest_cid(vsock);
 
-	ret = vsock_core_init(&virtio_transport.transport);
-	if (ret < 0)
-		goto out_vqs;
-
 	vsock->rx_buf_nr = 0;
 	vsock->rx_buf_max_nr = 0;
 	atomic_set(&vsock->queued_replies, 0);
@@ -618,8 +617,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	mutex_unlock(&the_virtio_vsock_mutex);
 	return 0;
 
-out_vqs:
-	vsock->vdev->config->del_vqs(vsock->vdev);
 out:
 	kfree(vsock);
 	mutex_unlock(&the_virtio_vsock_mutex);
@@ -669,7 +666,6 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
 
 	mutex_lock(&the_virtio_vsock_mutex);
 	the_virtio_vsock = NULL;
-	vsock_core_exit();
 	mutex_unlock(&the_virtio_vsock_mutex);
 
 	vdev->config->del_vqs(vdev);
@@ -702,14 +698,28 @@ static int __init virtio_vsock_init(void)
 	virtio_vsock_workqueue = alloc_workqueue("virtio_vsock", 0, 0);
 	if (!virtio_vsock_workqueue)
 		return -ENOMEM;
+
 	ret = register_virtio_driver(&virtio_vsock_driver);
 	if (ret)
-		destroy_workqueue(virtio_vsock_workqueue);
+		goto out_wq;
+
+	ret = vsock_core_init(&virtio_transport.transport);
+	if (ret)
+		goto out_vdr;
+
+	return 0;
+
+out_vdr:
+	unregister_virtio_driver(&virtio_vsock_driver);
+out_wq:
+	destroy_workqueue(virtio_vsock_workqueue);
 	return ret;
+
 }
 
 static void __exit virtio_vsock_exit(void)
 {
+	vsock_core_exit();
 	unregister_virtio_driver(&virtio_vsock_driver);
 	destroy_workqueue(virtio_vsock_workqueue);
 }
-- 
2.20.1



[PATCH v3 0/2] vsock/virtio: fix issues on device hot-unplug

2019-04-19 Thread Stefano Garzarella
These patches try to handle the hot-unplug of the vsock virtio transport device
properly.

Moving the vsock_core_init()/vsock_core_exit() calls into the module_init
and module_exit of the vsock_virtio_transport module may not be the best way,
but the architecture of vsock_core forces us to take this approach for now.

The vsock_core proto_ops expect a valid pointer to the transport device, so we
can't call vsock_core_exit() while there are open sockets.
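
As a rough illustration of that lifetime rule, the following userspace sketch
models the ordering: the core's transport pointer is tied to module load and
unload, while device probe and remove only touch the per-device pointer. All
names here are hypothetical (MODEL_CID_ANY merely stands in for
VMADDR_CID_ANY); this is a model of the idea, not the driver code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every name below is hypothetical, not kernel API. */
#define MODEL_CID_ANY 0xffffffffu /* stands in for VMADDR_CID_ANY */

struct model_transport { const char *name; };
struct model_device { unsigned int guest_cid; };

static const struct model_transport virtio_like_transport = { "virtio" };
static const struct model_transport *core_transport; /* used by socket ops */
static const struct model_device *the_device;        /* NULL while unplugged */
static bool driver_registered;

/* Module load: register the driver, then the core; unwind in reverse on error. */
static int model_module_init(bool fail_core_init)
{
	int ret;

	driver_registered = true;

	ret = fail_core_init ? -1 : 0;
	if (ret)
		goto out_driver;

	core_transport = &virtio_like_transport;
	return 0;

out_driver:
	driver_registered = false;
	return ret;
}

/* Module unload: only here does the core lose its transport pointer. */
static void model_module_exit(void)
{
	core_transport = NULL;
	driver_registered = false;
}

/* Hot-plug and hot-unplug only change the per-device pointer. */
static void model_probe(const struct model_device *dev) { the_device = dev; }
static void model_remove(void) { the_device = NULL; }

/* With no device present, report "any" instead of dereferencing NULL. */
static unsigned int model_local_cid(void)
{
	if (!the_device)
		return MODEL_CID_ANY;
	return the_device->guest_cid;
}

/* Socket operations are safe as long as the core still has a transport. */
static bool model_socket_ops_safe(void)
{
	return core_transport != NULL;
}
```

In this model, removing the device leaves model_socket_ops_safe() true, which
is the property patch 1 is after; only unloading the module clears it.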

v2 -> v3:
 - Rebased on master

v1 -> v2:
 - Fixed commit message of patch 1.
 - Added Reviewed-by, Acked-by tags by Stefan

Stefano Garzarella (2):
  vsock/virtio: fix kernel panic after device hot-unplug
  vsock/virtio: reset connected sockets on device removal

 net/vmw_vsock/virtio_transport.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

-- 
2.20.1



[RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-04-19 Thread Thiago Jung Bauermann


Hello,

With Christoph's rework of the DMA API that recently landed, the patch
below is the only change needed in virtio to make it work in a POWER
secure guest under the ultravisor.

The other change we need (making sure the device's dma_map_ops is NULL
so that the dma-direct/swiotlb code is used) can be made in
powerpc-specific code.

Of course, I also have patches (soon to be posted as RFC) which hook up
 to the powerpc secure guest support code.

What do you think?

From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann 
Date: Thu, 24 Jan 2019 22:08:02 -0200
Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

The host can't access the guest memory when it's encrypted, so using
regular memory pages for the ring isn't an option. Go through the DMA API.

Signed-off-by: Thiago Jung Bauermann 
---
 drivers/virtio/virtio_ring.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..321a27075380 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
 	 * not work without an even larger kludge.  Instead, enable
 	 * the DMA API if we're a Xen guest, which at least allows
 	 * all of the sensible Xen configurations to work correctly.
+	 *
+	 * Also, if guest memory is encrypted the host can't access
+	 * it directly. In this case, we'll need to use the DMA API.
 	 */
-	if (xen_domain())
+	if (xen_domain() || sev_active())
 		return true;
 
 	return false;



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-04-19 Thread Thiago Jung Bauermann


Fixing address of powerpc mailing list.

Thiago Jung Bauermann  writes:

> Hello,
>
> With Christoph's rework of the DMA API that recently landed, the patch
> below is the only change needed in virtio to make it work in a POWER
> secure guest under the ultravisor.
>
> The other change we need (making sure the device's dma_map_ops is NULL
> so that the dma-direct/swiotlb code is used) can be made in
> powerpc-specific code.
>
> Of course, I also have patches (soon to be posted as RFC) which hook up
>  to the powerpc secure guest support code.
>
> What do you think?
>
> From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
> From: Thiago Jung Bauermann 
> Date: Thu, 24 Jan 2019 22:08:02 -0200
> Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
>
> The host can't access the guest memory when it's encrypted, so using
> regular memory pages for the ring isn't an option. Go through the DMA API.
>
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  drivers/virtio/virtio_ring.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index cd7e755484e3..321a27075380 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
>* not work without an even larger kludge.  Instead, enable
>* the DMA API if we're a Xen guest, which at least allows
>* all of the sensible Xen configurations to work correctly.
> +  *
> +  * Also, if guest memory is encrypted the host can't access
> +  * it directly. In this case, we'll need to use the DMA API.
>*/
> - if (xen_domain())
> + if (xen_domain() || sev_active())
>   return true;
>
>   return false;


-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH v3 5/5] xfs: disable map_sync for virtio pmem

2019-04-19 Thread Darrick J. Wong
On Wed, Jan 09, 2019 at 08:17:36PM +0530, Pankaj Gupta wrote:
> Virtio pmem provides an asynchronous host page cache flush
> mechanism. We don't support 'MAP_SYNC' with virtio pmem
> and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e474250..eae4aa4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1190,6 +1190,14 @@ xfs_file_mmap(
>   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
>   return -EOPNOTSUPP;
>  
> + /* We don't support synchronous mappings with guest direct access
> +  * and virtio based host page cache mechanism.
> +  */
> + if (IS_DAX(file_inode(filp)) && virtio_pmem_host_cache_enabled(

Echoing what Jan said, this ought to be some sort of generic function
that tells us whether or not memory mapped from the dax device will
always still be accessible even after a crash (i.e. supports MAP_SYNC).

What if the underlying file on the host is itself on pmem and can be
MAP_SYNC'd?  Shouldn't the guest be able to use MAP_SYNC as well?
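
Roughly what I have in mind, as a userspace sketch only. The struct, the
"synchronous" flag and both helper names are hypothetical and do not exist in
the tree; the point is that the filesystem asks the dax device a generic
question instead of calling a virtio-specific helper.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dax device; not the kernel's struct dax_device. */
struct dax_dev_model {
	/* true if mapped memory stays accessible after a crash */
	bool synchronous;
};

/* Generic predicate: can mappings of this device honour MAP_SYNC? */
static bool model_dax_supports_sync(const struct dax_dev_model *dev)
{
	return dev != NULL && dev->synchronous;
}

/*
 * Model of the mmap-time check: a synchronous mapping is refused unless the
 * inode is DAX and its device reports that it supports MAP_SYNC.
 */
static int model_mmap_check(bool inode_is_dax,
			    const struct dax_dev_model *dev,
			    bool want_sync)
{
	if (!want_sync)
		return 0;

	if (!inode_is_dax || !model_dax_supports_sync(dev))
		return -EOPNOTSUPP;

	return 0;
}
```

A virtio pmem device whose host backing cannot guarantee persistence would
report synchronous == false, and one backed by host pmem could report true,
without the filesystem needing to know which driver is underneath.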

--D

> + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> + (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
>   file_accessed(filp);
>   vma->vm_ops = &xfs_file_vm_ops;
>   if (IS_DAX(file_inode(filp)))
> -- 
> 2.9.3
> 

