[virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-09 Thread Tiwei Bie
On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018-04-02 23:23, Tiwei Bie wrote:
> > This patch introduces an mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> > 
> > What's this
> > ===
> > 
> > The idea is that we can set up a virtio ring compatible device
> > with the messages available at the vhost-backend. Originally,
> > these messages were used to implement a software vhost backend,
> > but now we will use them to set up a virtio ring
> > compatible hardware device. Then the hardware device will be
> > able to work with the guest virtio driver in the VM just like
> > the software backend does. That is to say, we can implement
> > a hardware based vhost backend in QEMU, and any virtio ring
> > compatible device can potentially be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration.)
> > 
> > One problem is that different virtio ring compatible devices
> > may have different device interfaces. That is to say, we would
> > need different drivers in QEMU, which could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate devices
> > in the kernel.
> 
> So you just move the abstraction layer from qemu to kernel, and you still
> need different drivers in the kernel for different device interfaces of
> accelerators. This looks even more complex than leaving it in qemu. As you
> said, another idea is to implement a userspace vhost backend for accelerators,
> which seems easier and could co-work with other parts of qemu without
> inventing a new type of message.

I'm not quite sure. Do you think it's acceptable to
add various vendor specific hardware drivers in QEMU?

> 
> Need careful thought here to seek the best solution.

Yeah, definitely! :)
And your opinions would be very helpful!

> 
> > So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse and deliver them to accelerator
> > drivers. Drivers can use these messages to setup accelerators.
> > 
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> > 
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> > 
> > Difference between vDPA and PCI passthru
> > 
> > 
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA, only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM; the
> > device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> > 
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> > 
> > - consistent device interface for guest OS in the VM;
> > - maximum flexibility in the hardware design; in particular, the
> >   accelerator for each vhost backend doesn't have to be a
> >   full PCI device;
> > - leveraging the existing virtio live-migration framework;
> > 
> > The interface of this mdev based device
> > ===
> > 
> > 1. BAR0
> > 
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> > 
> > The message type is determined by the `request` field
> > in the message header. The message size is encoded in the
> > message header too. The message format looks like this:
> > 
> > struct vhost_vfio_op {
> > __u64 request;
> > __u32 flags;
> > /* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > __u32 size;
> > union {
> > __u64 u64;
> > struct vhost_vring_state state;
> > struct vhost_vring_addr addr;
> > struct vhost_memory memory;
> > } payload;
> > };
> > 
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in the above structure.
> > 
> > Each message will be written to or read from this
> > region at offset 0:
> > 
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > struct vhost_vfio *vfio = dev->opaque;
> > int ret;
> > 
> > ret = pwrite64(vfio->device_fd, op, count, 

[virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-09 Thread Jason Wang



On 2018-04-02 23:23, Tiwei Bie wrote:

This patch introduces a mdev (mediated device) based hardware
vhost backend. This backend is an abstraction of the various
hardware vhost accelerators (potentially any device that uses
virtio ring can be used as a vhost accelerator). Some generic
mdev parent ops are provided for accelerator drivers to support
generating mdev instances.

What's this
===

The idea is that we can set up a virtio ring compatible device
with the messages available at the vhost-backend. Originally,
these messages were used to implement a software vhost backend,
but now we will use them to set up a virtio ring
compatible hardware device. Then the hardware device will be
able to work with the guest virtio driver in the VM just like
the software backend does. That is to say, we can implement
a hardware based vhost backend in QEMU, and any virtio ring
compatible device can potentially be used with this backend.
(We also call it vDPA -- vhost Data Path Acceleration.)

One problem is that different virtio ring compatible devices
may have different device interfaces. That is to say, we would
need different drivers in QEMU, which could be troublesome. And
that's what this patch is trying to fix. The idea behind this
patch is very simple: mdev is a standard way to emulate devices
in the kernel.


So you just move the abstraction layer from qemu to kernel, and you
still need different drivers in the kernel for different device interfaces
of accelerators. This looks even more complex than leaving it in qemu.
As you said, another idea is to implement a userspace vhost backend for
accelerators, which seems easier and could co-work with other parts of
qemu without inventing a new type of message.


Need careful thought here to seek the best solution.


So we defined a standard device based on mdev, which
is able to accept vhost messages. When the mdev emulation code
(i.e. the generic mdev parent ops provided by this patch) gets
vhost messages, it will parse and deliver them to accelerator
drivers. Drivers can use these messages to setup accelerators.

That is to say, the generic mdev parent ops (e.g. read()/write()/
ioctl()/...) will be provided for accelerator drivers to register
accelerators as mdev parent devices. And each accelerator device
will support generating standard mdev instance(s).

With this standard device interface, we will be able to just
develop one userspace driver to implement the hardware based
vhost backend in QEMU.

Difference between vDPA and PCI passthru


The key difference between vDPA and PCI passthru is that, in
vDPA, only the data path of the device (e.g. DMA ring, notify
region and queue interrupt) is passed through to the VM; the
device control path (e.g. PCI configuration space and MMIO
regions) is still defined and emulated by QEMU.

The benefits of keeping virtio device emulation in QEMU compared
with virtio device PCI passthru include (but are not limited to):

- consistent device interface for guest OS in the VM;
- maximum flexibility in the hardware design; in particular, the
  accelerator for each vhost backend doesn't have to be a
  full PCI device;
- leveraging the existing virtio live-migration framework;

The interface of this mdev based device
===

1. BAR0

The MMIO region described by BAR0 is the main control
interface. Messages will be written to or read from
this region.

The message type is determined by the `request` field
in the message header. The message size is encoded in the
message header too. The message format looks like this:

struct vhost_vfio_op {
__u64 request;
__u32 flags;
/* Flag values: */
#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
__u32 size;
union {
__u64 u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
struct vhost_memory memory;
} payload;
};

The existing vhost-kernel ioctl cmds are reused as
the message requests in the above structure.

Each message will be written to or read from this
region at offset 0:

int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
int ret;

ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count)
return -1;

return 0;
}

int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
uint64_t request = op->request;
int ret;

ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count || request != op->request)
return -1;

return 0;
}
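As a sketch of the framing above: each write at BAR0 offset 0 carries the fixed header plus however much payload the request uses. VHOST_VFIO_OP_HDR_SIZE is not defined in this excerpt, so the sketch below assumes it equals the offset of the payload union; the helper name and the u64-only payload are illustrative, not taken from the patch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the message layout described above; the payload union is
 * abbreviated to the u64 case for illustration. */
#define VHOST_VFIO_NEED_REPLY 0x1

struct vhost_vfio_op {
    uint64_t request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
    } payload;
};

/* Assumption: the header is everything before the payload union. */
#define VHOST_VFIO_OP_HDR_SIZE offsetof(struct vhost_vfio_op, payload)

/* Build a u64-payload message (the request codes are the reused
 * vhost-kernel ioctl numbers) and return the byte count that would be
 * written at BAR0 offset 0. */
static size_t vhost_vfio_build_u64(struct vhost_vfio_op *op,
                                   uint64_t request, uint64_t val,
                                   int need_reply)
{
    memset(op, 0, sizeof(*op));
    op->request = request;
    op->flags = need_reply ? VHOST_VFIO_NEED_REPLY : 0;
    op->size = sizeof(op->payload.u64);
    op->payload.u64 = val;
    return VHOST_VFIO_OP_HDR_SIZE + op->size;
}
```

With an 8-byte request, 4-byte flags and 4-byte size, the header is 16 bytes, so a u64 message writes 24 bytes in total.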

It's quite straightforward to set things to the device.
Just need to write the message to 

[virtio-dev] [PATCH v32 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-04-09 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate whether the
guest is using page poisoning. The guest writes to the poison_val config
field to tell the host about the page poisoning value in use.

Signed-off-by: Wei Wang 
Suggested-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 197f865..aea91ab 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -730,6 +730,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -775,6 +776,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->stop_cmd_id = cpu_to_virtio32(vb->vdev,
VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+   }
}
 
vb->nb.notifier_call = virtballoon_oom_notify;
@@ -893,6 +899,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -902,6 +911,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index b2d86c2..8b93581 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4
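The memset in the probe path replicates a one-byte poison pattern across the 32-bit poison_val field before it is written to the config space. A minimal user-space sketch of that byte replication follows; the 0xaa value is an assumed PAGE_POISON pattern for illustration, not taken from this patch.

```c
#include <stdint.h>
#include <string.h>

/* Assumed one-byte poison pattern (illustrative stand-in for the
 * kernel's PAGE_POISON). */
#define PAGE_POISON 0xaa

/* Replicate the one-byte poison pattern across the 32-bit config
 * field, mirroring what the probe path does before virtio_cwrite(). */
static uint32_t fill_poison_val(void)
{
    uint32_t poison_val;
    memset(&poison_val, PAGE_POISON, sizeof(poison_val));
    return poison_val;
}
```

memset fills every byte, so the host sees the pattern repeated in all four bytes of the field regardless of endianness.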


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v32 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-04-09 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index e83fd44..762b472 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4





[virtio-dev] [PATCH v32 1/4] mm: support reporting free page blocks

2018-04-09 Thread Wei Wang
This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One example use of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michael S. Tsirkin 
Acked-by: Michal Hocko 
---
 include/linux/mm.h |  6 
 mm/page_alloc.c| 97 ++
 2 files changed, 103 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f945dff..4698df1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1951,6 +1951,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern int walk_free_mem_block(void *opaque,
+  int min_order,
+  int (*report_pfn_range)(void *opaque,
+  unsigned long pfn,
+  unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ea0182..83e50fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4912,6 +4912,103 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+static int walk_free_page_list(void *opaque,
+  struct zone *zone,
+  int order,
+  enum migratetype mt,
+  int (*report_pfn_range)(void *,
+  unsigned long,
+  unsigned long))
+{
+   struct page *page;
+   struct list_head *list;
+   unsigned long pfn, flags;
+   int ret = 0;
+
+   spin_lock_irqsave(&zone->lock, flags);
+   list = &zone->free_area[order].free_list[mt];
+   list_for_each_entry(page, list, lru) {
+   pfn = page_to_pfn(page);
+   ret = report_pfn_range(opaque, pfn, 1 << order);
+   if (ret)
+   break;
+   }
+   spin_unlock_irqrestore(&zone->lock, flags);
+
+   return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns a non-zero value, stop iterating the list of free
+ * page blocks. Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+int walk_free_mem_block(void 
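The callback contract documented above — return 0 to continue, non-zero to stop the walk — can be sketched against a mock free list. The names and the budget-based callback below are illustrative, not from the patch; the real walk iterates zone->free_area under zone->lock.

```c
#include <stddef.h>

/* Mock of the report_pfn_range contract: a non-zero return from the
 * callback stops the walk, 0 continues. The "free list" here is a
 * fixed array of pfn ranges standing in for zone free lists. */
struct pfn_range { unsigned long pfn; unsigned long num; };

static int walk_mock_free_list(const struct pfn_range *list, size_t n,
                               void *opaque,
                               int (*report)(void *opaque,
                                             unsigned long pfn,
                                             unsigned long num))
{
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        ret = report(opaque, list[i].pfn, list[i].num);
        if (ret)
            break;  /* callback asked to stop, as walk_free_page_list does */
    }
    return ret;
}

/* Example callback: consume reported pages until a budget runs out,
 * then stop the walk by returning non-zero. */
static int report_until_budget(void *opaque, unsigned long pfn,
                               unsigned long num)
{
    unsigned long *budget = opaque;

    (void)pfn;
    if (num > *budget)
        return 1;           /* non-zero: stop the walk */
    *budget -= num;
    return 0;
}
```

This mirrors how a caller like virtio-balloon can bound the amount of free-page reporting it does in one pass.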

[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
 wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu  wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referencing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify why not exposing the lower netdevs is critical for cloud
>> service providers when introducing a new feature, and that we are not
>> hiding anything but exposing it in a way that doesn't break existing
>> userspace applications, while still making the new feature possible
>> with old userspace kept unmodified.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> > Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip uevents for the lower netdevs, while also making new udevd
able to name the bypass_master interface by referencing the PCI slot
information, which is only present in the sysfs entry for the
IFF_AUTO_MANAGED net device.

The problem with using a network namespace is that no sysfs entry will
be around for an IFF_AUTO_MANAGED netdev if we isolate it out to a
separate netns, unless we generalize the scope of what netns is
designed for (isolation, I mean). For auto-managed netdevs we don't
necessarily want strict isolation, but rather a way of sticking to the
1-netdev model for the strict backward compatibility requirement of the
old userspace, while exposing the information in a way new userspace
understands.

Thanks,
-Siwei




[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referencing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it out loudly I'd run it within netdev to
clarify why not exposing the lower netdevs is critical for cloud
service providers when introducing a new feature, and that we are not
hiding anything but exposing it in a way that doesn't break existing
userspace applications, while still making the new feature possible
with old userspace kept unmodified.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
> Andrew




[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn  wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referencing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of the driver model core, it's
a rather big-hammer approach, I'd think. If there exists a better
implementation that allows adding a separate layer of in-kernel device
namespace, I'd be more than happy to hear about it.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. With the current RFC patch, I've
added two lists for netdevs under the kernel namespace: dev_cmpl_list
and name_cmpl_hlist. As a result, all userspace netdevs that get
registered will be added to two types of lists: the userspace list,
e.g. dev_list, and also the kernelspace list, e.g. dev_cmpl_list (I can
rename it to something more accurate). The IFF_AUTO_MANAGED device will
only be added to the kernelspace list, e.g. dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei


>
>Andrew




Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts

2018-04-09 Thread Jonathan Helman



On 03/22/2018 07:38 PM, Jason Wang wrote:



On 2018-03-22 11:10, Michael S. Tsirkin wrote:

On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:

On 2018-03-20 12:26, Jonathan Helman wrote:

On Mar 19, 2018, at 7:31 PM, Jason Wang  wrote:



On 2018-03-20 06:14, Jonathan Helman wrote:

Export the number of successful and failed hugetlb page
allocations via the virtio balloon driver. These 2 counts
come directly from the vm_events HTLB_BUDDY_PGALLOC and
HTLB_BUDDY_PGALLOC_FAIL.

Signed-off-by: Jonathan Helman

Reviewed-by: Jason Wang

Thanks.


---
   drivers/virtio/virtio_balloon.c | 6 ++
   include/uapi/linux/virtio_balloon.h | 4 +++-
   2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c

index dfe5684..6b237e3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -272,6 +272,12 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)

   pages_to_bytes(events[PSWPOUT]));
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, 
events[PGMAJFAULT]);
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, 
events[PGFAULT]);

+#ifdef CONFIG_HUGETLB_PAGE
+    update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
+    events[HTLB_BUDDY_PGALLOC]);
+    update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
+    events[HTLB_BUDDY_PGALLOC_FAIL]);
+#endif
   #endif
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
   pages_to_bytes(i.freeram));
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h

index 4e8b830..40297a3 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -53,7 +53,9 @@ struct virtio_balloon_config {
   #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
#define VIRTIO_BALLOON_S_AVAIL    6   /* Available memory as in /proc */

   #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
-#define VIRTIO_BALLOON_S_NR   8
+#define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
+#define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */

+#define VIRTIO_BALLOON_S_NR   10
 /*
    * Memory statistics structure.
Not for this patch, but it looks to me that exporting such nr 
through uapi is fragile.

Sorry, can you explain what you mean here?

Jon
Spec said "Within an output buffer submitted to the statsq, the
device MUST ignore entries with tag values that it does not
recognize". So exporting VIRTIO_BALLOON_S_NR seems useless, and a
device implementation cannot depend on such a number in uapi.

Thanks

Suggestions? I don't like breaking the build for people ...



Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR 
unchanged, and add a comment here.


Thanks


I think Jason's comment is for a future patch. Didn't see this patch get 
applied, so wondering if it could be.


Thanks,
Jon
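The spec rule quoted in this thread — a device MUST ignore stats entries with tag values it does not recognize — is what makes VIRTIO_BALLOON_S_NR advisory rather than a wire contract. A device-side sketch of such a tag scan follows; the struct layout and names are simplified stand-ins, not the actual virtio_balloon_stat.

```c
#include <stdint.h>
#include <stddef.h>

/* Tag values from the patch under discussion. */
#define VIRTIO_BALLOON_S_HTLB_PGALLOC 8
#define VIRTIO_BALLOON_S_HTLB_PGFAIL  9

/* Simplified stand-in for a stats entry (the real uapi struct is a
 * packed {__virtio16 tag; __virtio64 val;}). */
struct balloon_stat { uint16_t tag; uint64_t val; };

/* Device-side sketch: scan a stats buffer for one tag, silently
 * skipping any tag the device does not recognize. This is why a guest
 * adding new tags (and bumping VIRTIO_BALLOON_S_NR) does not break an
 * older device. Returns 0 on success, -1 if the tag is absent. */
static int find_stat(const struct balloon_stat *s, size_t n,
                     uint16_t tag, uint64_t *val)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].tag == tag) {
            *val = s[i].val;
            return 0;
        }
        /* unknown or uninteresting tag: ignore, per the spec */
    }
    return -1;
}
```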




[virtio-dev] Re: [virtio] [PATCH 4/5] packed-ring: reposition drivernormative on driver notifications

2018-04-09 Thread Paolo Bonzini
On 09/04/2018 20:58, Halil Pasic wrote:
>  The driver has to be careful to expose the new \field{flags}
>  value before checking if notifications are suppressed.

This paragraph should also be reworked to be part of the normative text
below, I think?  The memory barrier must be "between two things", and
only with this paragraph do you know that it's between the flags write and
the suppression structure read.

Thanks,

Paolo

> +\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a 
> Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / 
> Notifying The Device}
> +The driver MUST perform a suitable memory barrier before reading
> +the Driver Event Suppression structure, to avoid missing a notification.
> +
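The ordering being made normative can be sketched with C11 atomics standing in for the driver's memory barrier. The two fields below are simplified stand-ins for the descriptor flags write and the Driver Event Suppression read, not the spec layouts.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified stand-ins: desc_flags is written by the driver, suppress
 * is written by the device (1 = notifications suppressed). */
struct ring_state {
    _Atomic uint16_t desc_flags;
    _Atomic uint16_t suppress;
};

/* Returns 1 if the driver should notify the device. The fence sits
 * between the flags write and the suppression read, matching the
 * normative wording: expose the new flags value, then check whether
 * notifications are suppressed. */
static int should_notify(struct ring_state *r, uint16_t new_flags)
{
    atomic_store_explicit(&r->desc_flags, new_flags, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the "suitable memory barrier" */
    return !atomic_load_explicit(&r->suppress, memory_order_relaxed);
}
```

Without the fence, the suppression read could be ordered before the flags write and the device could suppress a notification the driver still owes it.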





[virtio-dev] [PATCH 5/5] packed-ring: tweak driver code example

2018-04-09 Thread Halil Pasic
This patch does three things: it emphasises that the example is
pseudo-code (as we otherwise tend to stick to C) and tweaks the wording
accordingly, fixes a typo in the pseudo-code, and removes a line of code
that I can see neither the necessity for nor the sense of.

Signed-off-by: Halil Pasic 
---
 packed-ring.tex |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/packed-ring.tex b/packed-ring.tex
index 388850b..f619069 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -604,9 +604,11 @@ value before checking if notifications are suppressed.
 The driver MUST perform a suitable memory barrier before reading
 the Driver Event Suppression structure, to avoid missing a notification.
 
-\subsubsection{Implementation Example}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Implementation Example}
+\subsubsection{Pseudo-Code Example}\label{sec:Basic Facilities of a
+Virtio Device / Packed Virtqueues / Supplying Buffers to The Device /
+Pseudo-Code Example}
 
-Below is a driver code example. It does not attempt to reduce
+Below is a driver pseudo-code example supplying buffers to the device. It does not attempt to reduce
 the number of device interrupts, neither does it support
 the VIRTIO_F_RING_EVENT_IDX feature.
 
@@ -621,7 +623,6 @@ sgs = 0;
 for (each buffer element b) {
 sgs++;
 
-vq->ids[vq->next_avail] = -1;
 vq->desc[vq->next_avail].address = get_addr(b);
 vq->desc[vq->next_avail].len = get_len(b);
 
@@ -645,7 +646,7 @@ for (each buffer element b) {
 
 if (vq->next_avail >= vq->size) {
 vq->next_avail = 0;
-vq->avail_wrap_count \^= 1;
+vq->avail_wrap_count ^= 1;
 }
 }
 vq->sgs[id] = sgs;
-- 
1.7.1





[virtio-dev] [PATCH 2/5] consistency: clean up stale wording for v1.1 terms

2018-04-09 Thread Halil Pasic
Occasionally we still say "used ring" and "available ring" instead of
Device Area and Driver Area respectively. The same goes for "descriptor
table", with the difference that, depending on the context, we need either
Descriptor Area or descriptor ring.

Let's clean these up, at least where the cleanup is trivial.

Signed-off-by: Halil Pasic 
---
 content.tex |   14 +++---
 packed-ring.tex |   10 ++
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/content.tex b/content.tex
index 7a92cb1..e3ac662 100644
--- a/content.tex
+++ b/content.tex
@@ -1247,7 +1247,7 @@ The driver typically does this as follows, for each 
virtqueue a device has:
 
 \item Optionally, select a smaller virtqueue size and write it to 
\field{queue_size}.
 
-\item Allocate and zero Descriptor Table, Available and Used rings for the
+\item Allocate and zero Descriptor Area, Device Area and Driver Area for the
virtqueue in contiguous physical memory.
 
 \item Optionally, if MSI-X capability is present and enabled on the
@@ -1769,9 +1769,9 @@ nor behaviour:
 writing to \field{QueueSel}.
   }
   \hline
-  \mmioreg{QueueAlign}{Used Ring alignment in the virtual queue}{0x03c}{W}{%
+  \mmioreg{QueueAlign}{Device Area alignment in the virtual queue}{0x03c}{W}{%
 Writing to this register notifies the device about alignment
-boundary of the Used Ring in bytes. This value should be a power
+boundary of the Device Area in bytes. This value should be a power
 of 2 and applies to the queue selected by writing to \field{QueueSel}.
   }
   \hline
@@ -1779,7 +1779,7 @@ nor behaviour:
 Writing to this register notifies the device about location of the
 virtual queue in the Guest's physical address space. This value
 is the index number of a page starting with the queue
-Descriptor Table. Value zero (0x0) means physical address zero
+Descriptor Area. Value zero (0x0) means physical address zero
 (0x) and is illegal. When the driver stops using the
 queue it writes zero (0x0) to this register.
 Reading from this register returns the currently used page
@@ -1830,7 +1830,7 @@ The virtual queue is configured as follows:
queue is not available.
 
 \item Allocate and zero the queue pages in contiguous virtual
-   memory, aligning the Used Ring to an optimal boundary (usually
+   memory, aligning the Device Area to an optimal boundary (usually
page size). The driver should choose a queue size smaller than or
equal to \field{QueueNumMax}.
 
@@ -2066,8 +2066,8 @@ struct vq_info_block {
 \end{lstlisting}
 
 \field{desc}, \field{driver} and \field{device} contain the guest
-addresses for the descriptor area,
-available area and used area for queue \field{index}, respectively. The actual
+addresses for the Descriptor Area,
+Driver Area and Device Area for queue \field{index}, respectively. The actual
 virtqueue size (number of allocated buffers) is transmitted in \field{num}.
 
 \devicenormative{\paragraph}{Configuring a Virtqueue}{Virtio Transport Options 
/ Virtio over channel I/O / Device Initialization / Configuring a Virtqueue}
diff --git a/packed-ring.tex b/packed-ring.tex
index eb006b0..e656fed 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -445,7 +445,9 @@ struct pvirtq_event_suppress {
 };
 \end{lstlisting}
 
-\devicenormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities 
of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table}
+\devicenormative{\subsection}{The Virtqueue Descriptor Ring}{Basic
+Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue
+Descriptor Ring}
 A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT
 read a device-writable buffer.
 A device MUST NOT use a descriptor unless it observes
@@ -454,7 +456,7 @@ the VIRTQ_DESC_F_AVAIL bit in its \field{flags} being 
changed
A device MUST NOT change a descriptor after changing
the VIRTQ_DESC_F_USED bit in its \field{flags}.
 
-\drivernormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities 
of a Virtio Device / PAcked Virtqueues / The Virtqueue Descriptor Table}
+\drivernormative{\subsection}{The Virtqueue Descriptor Ring}{Basic Facilities 
of a Virtio Device / PAcked Virtqueues / The Virtqueue Descriptor Ring}
 A driver MUST NOT change a descriptor unless it observes
 the VIRTQ_DESC_F_USED bit in its \field{flags} being changed.
 A driver MUST NOT change a descriptor after changing
@@ -498,7 +500,7 @@ were made available by the driver.
 The device MAY limit the number of buffers it will allow in a
 list.
 
-\drivernormative{\subsection}{Indirect Descriptors}{Basic Facilities of a 
Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table / Indirect 
Descriptors}
+\drivernormative{\subsection}{Indirect Descriptors}{Basic Facilities of a 
Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Ring / Indirect 
Descriptors}
 The driver MUST NOT set the DESC_F_INDIRECT flag 

[virtio-dev] [PATCH 1/5] packed-ring: fix used descriptors but meant buffers

2018-04-09 Thread Halil Pasic
The subsection is talking about virtio buffers, that is, buffers possibly
mapped as scatter-gather lists by multiple descriptors. Mistakenly,
in places we say descriptors instead of buffers.

While at it fix the title case capitalization. We seem to use
title case capitalization for subsections in most cases.

Signed-off-by: Halil Pasic 
---
 packed-ring.tex |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/packed-ring.tex b/packed-ring.tex
index 2ef9559..eb006b0 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -272,10 +272,10 @@ Buffer ID is also reserved and is ignored by the device.
 In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
 is reserved and is ignored by the device.
 
-\subsection{In-order use of descriptors}
-\label{sec:Packed Virtqueues / In-order use of descriptors}
+\subsection{In-Order Use of Buffers}
+\label{sec:Packed Virtqueues / In-Order Use of Buffers}
 
-Some devices always use descriptors in the same order in which
+Some devices always use buffers in the same order in which
 they have been made available. These devices can offer the
 VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows
 devices to notify the use of a batch of buffers to the driver by
-- 
1.7.1





[virtio-dev] [PATCH 4/5] packed-ring: reposition drivernormative on driver notifications

2018-04-09 Thread Halil Pasic
The drivernormative was simply in the wrong place. It does not belong
to the code example.

Signed-off-by: Halil Pasic 
---
 packed-ring.tex |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/packed-ring.tex b/packed-ring.tex
index d93df3b..388850b 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -600,6 +600,10 @@ Suppression Structure Format}.
 The driver has to be careful to expose the new \field{flags}
 value before checking if notifications are suppressed.
 
+\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a 
Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying 
The Device}
+The driver MUST perform a suitable memory barrier before reading
+the Driver Event Suppression structure, to avoid missing a notification.
+
 \subsubsection{Implementation Example}\label{sec:Basic Facilities of a Virtio 
Device / Packed Virtqueues / Supplying Buffers to The Device / Implementation 
Example}
 
 Below is a driver code example. It does not attempt to reduce
@@ -658,11 +662,6 @@ if (vq->device_event.flags != RING_EVENT_FLAGS_DISABLE) {
 
 \end{lstlisting}
 
-
-\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a 
Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying 
The Device}
-The driver MUST perform a suitable memory barrier before reading
-the Driver Event Suppression structure, to avoid missing a notification.
-
 \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities 
of a Virtio Device / Packed Virtqueues / Receiving Used Buffers From The Device}
 
 Once the device has used buffers referred to by a descriptor (read from or 
written to them, or
-- 
1.7.1





[virtio-dev] [PATCH 3/5] packed-packed: fix supplying buffers description

2018-04-09 Thread Halil Pasic
There was a discrepancy between the pseudo code and the
textual description of the algorithm describing how a driver
shall supply buffers to the device.

The textual description did not make sense to me, so I've
changed it to be more in sync with the pseudo code.

Signed-off-by: Halil Pasic 
---
Still a bit shaky. Please review carefully! This patch
might well be rooted in a terrible misunderstanding.
---
 packed-ring.tex |   64 ++-
 1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/packed-ring.tex b/packed-ring.tex
index e656fed..d93df3b 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -527,27 +527,12 @@ when using the packed virtqueue format in more detail.
 The driver offers buffers to one of the device's virtqueues as follows:
 
 \begin{enumerate}
-\item The driver places the buffer into free descriptor(s) in the Descriptor 
Ring.
-
-\item The driver performs a suitable memory barrier to ensure that it updates
-  the descriptor(s) before checking for notification suppression.
-
-\item If notifications are not suppressed, the driver notifies the device
-of the new available buffers.
-\end{enumerate}
-
-What follows are the requirements of each stage in more detail.
-
-\subsubsection{Placing Available Buffers Into The Descriptor 
Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying 
Buffers to The Device / Placing Available Buffers Into The Descriptor Ring}
-
-For each buffer element, b:
-
-\begin{enumerate}
-\item Get the next descriptor ring entry, d
 \item Get the next free buffer id value
+\item For each buffer element, b:
+\begin{enumerate}
+\item Get the next free descriptor ring entry, d
 \item Set \field{d.addr} to the physical address of the start of b
 \item Set \field{d.len} to the length of b.
-\item Set \field{d.id} to the buffer id
 \item Calculate the flags as follows:
 \begin{enumerate}
 \item If b is device-writable, set the VIRTQ_DESC_F_WRITE bit to 1, otherwise 0
@@ -556,20 +541,41 @@ For each buffer element, b:
 \end{enumerate}
 \item Perform a memory barrier to ensure that the descriptor has
   been initialized
-\item Set \field{d.flags} to the calculated flags value
+\item If b is not the first buffer element set \field{d.flags} to the
+  calculated flags value, otherwise save it
 \item If d is the last descriptor in the ring, toggle the
-  Driver Ring Wrap Counter
+  Driver Ring Wrap Counter and reset d to the head of the ring
 \item Otherwise, increment d to point at the next descriptor
 \end{enumerate}
+\item Point d to the descriptor corresponding to the last buffer element
+  and set \field{d.id} to the buffer id
+\item Save some value associated with the buffer id that makes possible
+  the skipping forward described in section \ref{sec:Packed Virtqueues /
+  Indirect Flag: Scatter-Gather Support} and in section \ref{sec:Packed
+  Virtqueues / In-Order Use of Buffers}
+\item The driver performs a suitable memory barrier to ensure that all
+  writes done up to this point precede all writes done after this
+  point as observed by the device.
+\item Point d to the descriptor corresponding to the first buffer element
+  and set \field{d.flags} to the value calculated and saved 
+\item The driver performs a suitable memory barrier to ensure that it updates
+  the descriptor(s) before checking for notification suppression.
+\item If notifications are not suppressed, the driver notifies the device
+of the new available buffers.
+\end{enumerate}
 
-This makes a single descriptor buffer available. However, in
-general the driver MAY make use of a batch of descriptors as part
+This makes a single virtio buffer available. However, in
+general the driver MAY make use of a batch of buffers as part
 of a single request. In that case, it defers updating
-the descriptor flags for the first descriptor
-(and the previous memory barrier) until after the rest of
-the descriptors have been initialized.
+the descriptor flags for the first descriptor of the first
+virtio buffer (and the preceding memory barrier) until after the rest of
+the descriptors have been written. The notification steps are performed
+only after the flags of the first descriptor of the first
+virtio buffer are set.
 
-Once the descriptor \field{flags} field is updated by the driver, this exposes
+Once the \field{flags} field of the first descriptor is updated by the driver, 
this exposes
 the descriptor and its contents.  The device MAY
 access the descriptor and any following descriptors the driver created and the
 memory they refer to immediately.
@@ -577,9 +583,9 @@ memory they refer to immediately.
 \drivernormative{\paragraph}{Updating flags}{Basic Facilities of
 a Virtio Device / Packed Virtqueues / 

[virtio-dev] [PATCH 0/5] fixes and tweaks for virtio v1.1

2018-04-09 Thread Halil Pasic
Some of these are more and some are less important. Please
be careful; there is always a chance that I misunderstood something
and am trying to change things for the worse.

I've identified some other weak points too, but tried to proceed
in order: the graver the issue, and the better I think I understand
it (and thus the fix I propose), the higher its priority.

Some stuff I've left for my notifiers patch set. This one
and that one will have to get merged if both turn out to
be good.

Halil Pasic (5):
  packed-ring: fix used descriptors but meant buffers
  consistency: clean up stale wording for v1.1 terms
  packed-packed: fix supplying buffers description
  packed-ring: reposition drivernormative on driver notifications
  packed-ring: tweak driver code example

 content.tex |   14 
 packed-ring.tex |   96 +-
 2 files changed, 59 insertions(+), 51 deletions(-)





[virtio-dev] Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-09 Thread Samudrala, Sridhar

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with the 3 netdev model, but this check has to be
done by netvsc, as its bypass_netdev is the same as the backup_netdev.
I will add a flag while registering with the bypass module to indicate whether
the driver is doing a 2 netdev or 3 netdev model, and based on that flag this
check can be done in the bypass module for the 3 netdev scenario.

Just let me understand it clearly. What I expect the difference to be
between the 2 netdev and 3 netdev models is this:
2netdev:
   bypass_master
  /
 /
VF_slave

3netdev:
   bypass_master
  / \
 /   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both
models.
In the 3 netdev model, the bypass_master netdev is created, and VF_slave and
backup_slave are
marked as the 2 slaves of this new netdev.

In the 2 netdev model, backup_slave acts as bypass_master, and the bypass module
doesn't
have access to the netdev_priv of the backup_slave.

Once I moved all the ndo_ops of the master netdev to the bypass module, I
realized that we can
move the create/destroy of the upper netdev to bypass.c as well.
That way the changes to virtio_net become very minimal.

With these updates, the bypass module now supports both models by exporting 2
sets of
functions.
3 netdev:
   int bypass_master_create(struct net_device *backup_netdev,
struct bypass_master **pbypass_master);
   void bypass_master_destroy(struct bypass_master *bypass_master);


2 netdev:
   int bypass_master_register(struct net_device *backup_netdev, struct 
bypass_ops *ops,
  struct bypass_master **pbypass_master);
   void bypass_master_unregister(struct bypass_master *bypass_master);

Will send the next revision in a day or two.

Thanks
Sridhar





RE: [virtio-dev] Dynamic assignment of vhost-user to a VM

2018-04-09 Thread Avi Cohen (A)
Thank you Paolo
Avi 

> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Monday, 09 April, 2018 12:43 PM
> To: Avi Cohen (A); virtio-dev@lists.oasis-open.org
> Subject: Re: [virtio-dev] Dynamic assignment of vhost-user to a VM
> 
> On 09/04/2018 08:33, Avi Cohen (A) wrote:
> > Hello All,
> > Is it  possible to assign a vhost-user frontend to a VM *after* the VM was
> launched ?
> > Best Regards
> > Avi
> 
> If you're referring to QEMU then yes, you can dynamically create all of
> character devices (to connect to the vhost-user server), network devices (to
> create a vhost-user client) and NICs (that peer with the vhost-user server).
> 
> If you're using libvirt to run QEMU, you can in fact invoke a single "virsh 
> attach-
> device" command to handle all three steps.
> 
> Thanks,
> 
> Paolo


[virtio-dev] Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-09 Thread Wei Wang

On 04/09/2018 02:03 PM, Michael S. Tsirkin wrote:

On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote:

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free page hints by sending a new cmd
id to the guest via the free_page_report_cmd_id configuration register.

When the guest starts to report, the first element added to the free page
vq is the cmd id given by host. When the guest finishes the reporting
of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
to the vq to tell host that the reporting is done. Host polls the free
page vq after sending the starting cmd id, so the guest doesn't need to
kick after filling an element to the vq.

Host may also request the guest to stop the reporting in advance by
sending the stop cmd id to the guest via the configuration register.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 


Pretty good by now. Minor comments below.


Thanks for the comments.




---
  drivers/virtio/virtio_balloon.c | 272 +++-
  include/uapi/linux/virtio_balloon.h |   4 +
  2 files changed, 240 insertions(+), 36 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index dfe5684..aef73ee 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
  static struct vfsmount *balloon_mnt;
  #endif
  
+enum virtio_balloon_vq {

+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
  struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
  
  	/* The balloon servicing is delegated to a freezable workqueue. */

struct work_struct update_balloon_stats_work;
@@ -63,6 +76,13 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
  
+	/* The new cmd id received from host */

+   uint32_t cmd_id_received;
+   /* The cmd id that is in use */
+   __virtio32 cmd_id_use;

I'd prefer cmd_id_active but it's not critical.


OK, will change.



+
+static void report_free_page_func(struct work_struct *work)
+{
+   struct virtio_balloon *vb;
+   struct virtqueue *vq;
+   unsigned int unused;
+   int ret;
+
+   vb = container_of(work, struct virtio_balloon, report_free_page_work);
+   vq = vb->free_page_vq;
+
+   /* Start by sending the received cmd id to host with an outbuf. */
+   ret = send_start_cmd_id(vb, vb->cmd_id_received);
+   if (unlikely(ret))
+   goto err;
+
+   ret = walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+   if (unlikely(ret == -EIO))
+   goto err;
why is EIO special? I think you should special-case EINTR maybe.


Actually EINTR isn't an error that needs to bail out. It just means
the vq is full and that hint isn't added. Maybe it is not 
necessary to treat the "vq full" case as an error.
How about just returning "0" when the vq is full, instead of returning 
"EINTR"? (The next hint will continue to be added.)







+
+   /* End by sending a stop id to host with an outbuf. */
+   ret = send_stop_cmd_id(vb);
+   if (likely(!ret)) {

What happens on failure? Don't we need to detach anyway?


Yes. Please see below, we could make some more change.



+   /* Ending: detach all the used buffers from the vq. */
+   while (vq->num_free != virtqueue_get_vring_size(vq))
+   virtqueue_get_buf(vq, &unused);

This isn't all that happens here. It also waits for buffers to
be consumed. Is this by design? And why is it good idea to
busy poll while doing it?


Because host and guest operations happen asynchronously, when the guest 
reaches here the host may not have put anything in the vq yet. How about 
doing this via the free page vq handler?
The host will send a free page vq interrupt before exiting the optimization. 
I think this would be nicer.



Best,
Wei




Re: [virtio-dev] Dynamic assignment of vhost-user to a VM

2018-04-09 Thread Paolo Bonzini
On 09/04/2018 08:33, Avi Cohen (A) wrote:
> Hello All,
> Is it  possible to assign a vhost-user frontend to a VM *after* the VM was 
> launched ?
> Best Regards
> Avi

If you're referring to QEMU then yes, you can dynamically create all of
the needed pieces: character devices (to connect to the vhost-user server),
network devices (to create a vhost-user client) and NICs (that peer with
the vhost-user server).

If you're using libvirt to run QEMU, you can in fact invoke a single
"virsh attach-device" command to handle all three steps.

Thanks,

Paolo




[virtio-dev] Dynamic assignment of vhost-user to a VM

2018-04-09 Thread Avi Cohen (A)
Hello All,
Is it possible to assign a vhost-user frontend to a VM *after* the VM was 
launched?
Best Regards
Avi




[virtio-dev] Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-04-09 Thread Michael S. Tsirkin
On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
> support of reporting hints of guest free pages to host via virtio-balloon.
> 
> Host requests the guest to report free page hints by sending a new cmd
> id to the guest via the free_page_report_cmd_id configuration register.
> 
> When the guest starts to report, the first element added to the free page
> vq is the cmd id given by host. When the guest finishes the reporting
> of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
> to the vq to tell host that the reporting is done. Host polls the free
> page vq after sending the starting cmd id, so the guest doesn't need to
> kick after filling an element to the vq.
> 
> Host may also request the guest to stop the reporting in advance by
> sending the stop cmd id to the guest via the configuration register.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michael S. Tsirkin 
> Cc: Michal Hocko 


Pretty good by now. Minor comments below.

> ---
>  drivers/virtio/virtio_balloon.c | 272 
> +++-
>  include/uapi/linux/virtio_balloon.h |   4 +
>  2 files changed, 240 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index dfe5684..aef73ee 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +enum virtio_balloon_vq {
> + VIRTIO_BALLOON_VQ_INFLATE,
> + VIRTIO_BALLOON_VQ_DEFLATE,
> + VIRTIO_BALLOON_VQ_STATS,
> + VIRTIO_BALLOON_VQ_FREE_PAGE,
> + VIRTIO_BALLOON_VQ_MAX
> +};
> +
>  struct virtio_balloon {
>   struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +
> + /* Balloon's own wq for cpu-intensive work items */
> + struct workqueue_struct *balloon_wq;
> + /* The free page reporting work item submitted to the balloon wq */
> + struct work_struct report_free_page_work;
>  
>   /* The balloon servicing is delegated to a freezable workqueue. */
>   struct work_struct update_balloon_stats_work;
> @@ -63,6 +76,13 @@ struct virtio_balloon {
>   spinlock_t stop_update_lock;
>   bool stop_update;
>  
> + /* The new cmd id received from host */
> + uint32_t cmd_id_received;
> + /* The cmd id that is in use */
> + __virtio32 cmd_id_use;

I'd prefer cmd_id_active but it's not critical.

> + /* Buffer to store the stop sign */
> + __virtio32 stop_cmd_id;
> +
>   /* Waiting for host to ack the pages we released. */
>   wait_queue_head_t acked;
>  
> @@ -320,17 +340,6 @@ static void stats_handle_request(struct virtio_balloon 
> *vb)
>   virtqueue_kick(vq);
>  }
>  
> -static void virtballoon_changed(struct virtio_device *vdev)
> -{
> - struct virtio_balloon *vb = vdev->priv;
> - unsigned long flags;
> -
> - spin_lock_irqsave(&vb->stop_update_lock, flags);
> - if (!vb->stop_update)
> - queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> - spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> -}
> -
>  static inline s64 towards_target(struct virtio_balloon *vb)
>  {
>   s64 target;
> @@ -347,6 +356,34 @@ static inline s64 towards_target(struct virtio_balloon 
> *vb)
>   return target - vb->num_pages;
>  }
>  
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> + struct virtio_balloon *vb = vdev->priv;
> + unsigned long flags;
> + s64 diff = towards_target(vb);
> +
> + if (diff) {
> + spin_lock_irqsave(&vb->stop_update_lock, flags);
> + if (!vb->stop_update)
> + queue_work(system_freezable_wq,
> +&vb->update_balloon_size_work);
> + spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> + }
> +
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> + virtio_cread(vdev, struct virtio_balloon_config,
> +  free_page_report_cmd_id, &vb->cmd_id_received);
> + if (vb->cmd_id_received !=
> + VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
> + spin_lock_irqsave(&vb->stop_update_lock, flags);
> + if (!vb->stop_update)
> + queue_work(vb->balloon_wq,
> +&vb->report_free_page_work);
> + spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> + }
> + }
> +}
> +
>  static void update_balloon_size(struct virtio_balloon *vb)
>  {
>   u32 actual = vb->num_pages;
> @@ -421,42 +458,178 @@ static void