Re: [PATCH] xen: introduce xen_vring_use_dma

2020-06-25 Thread Stefano Stabellini
On Wed, 24 Jun 2020, Michael S. Tsirkin wrote:
> On Wed, Jun 24, 2020 at 02:53:54PM -0700, Stefano Stabellini wrote:
> > On Wed, 24 Jun 2020, Michael S. Tsirkin wrote:
> > > On Wed, Jun 24, 2020 at 10:59:47AM -0700, Stefano Stabellini wrote:
> > > > On Wed, 24 Jun 2020, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 24, 2020 at 05:17:32PM +0800, Peng Fan wrote:
> > > > > > Export xen_swiotlb for all platforms using xen swiotlb
> > > > > > 
> > > > > > Use xen_swiotlb to determine when vring should use dma APIs
> > > > > > to map the ring: when xen_swiotlb is enabled the dma API is
> > > > > > required. When it is disabled, it is not required.
> > > > > > 
> > > > > > Signed-off-by: Peng Fan 
> > > > > 
> > > > > Isn't there some way to use VIRTIO_F_IOMMU_PLATFORM for this?
> > > > > Xen was there first, but everyone else is using that now.
> > > > 
> > > > Unfortunately it is complicated and it is not related to
> > > > VIRTIO_F_IOMMU_PLATFORM :-(
> > > > 
> > > > 
> > > > The Xen subsystem in Linux uses dma_ops via swiotlb_xen to translate
> > > > foreign mappings (memory coming from other VMs) to physical addresses.
> > > > On x86, it also uses dma_ops to translate Linux's idea of a physical
> > > > address into a real physical address (this is unneeded on ARM.)
> > > > 
> > > > 
> > > > So regardless of VIRTIO_F_IOMMU_PLATFORM, dma_ops should always be
> > > > used on Xen/x86, and on Xen/ARM if Linux is Dom0 (because it has
> > > > foreign mappings). That is why we have the "if (xen_domain()) return
> > > > true;" in vring_use_dma_api.
> > > 
> > > VIRTIO_F_IOMMU_PLATFORM makes guest always use DMA ops.
> > > 
> > > Xen hack predates VIRTIO_F_IOMMU_PLATFORM so it *also*
> > > forces DMA ops even if VIRTIO_F_IOMMU_PLATFORM is clear.
> > >
> > > Unfortunately as a result Xen never got around to
> > > properly setting VIRTIO_F_IOMMU_PLATFORM.
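
For reference, the check being discussed is vring_use_dma_api() in
drivers/virtio/virtio_ring.c; a trimmed sketch of its logic around this
time (not the verbatim source):

        static bool vring_use_dma_api(struct virtio_device *vdev)
        {
                /* VIRTIO_F_IOMMU_PLATFORM negotiated: the device is "not
                 * special", so it must be accessed through the DMA API. */
                if (!virtio_has_iommu_quirk(vdev))
                        return true;

                /* Otherwise we are left to guess. This is the
                 * pre-ACCESS_PLATFORM Xen hack discussed above: force the
                 * DMA API for every Xen guest so that swiotlb-xen can
                 * translate foreign mappings. */
                if (xen_domain())
                        return true;

                return false;
        }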
> > 
> > I don't think VIRTIO_F_IOMMU_PLATFORM would be correct for this because
> > the usage of swiotlb_xen is not a property of virtio,
> 
> 
> Basically any device without VIRTIO_F_ACCESS_PLATFORM
> (that is its name in the latest virtio spec; VIRTIO_F_IOMMU_PLATFORM is
> what Linux calls it) is declared as "special, don't follow normal rules
> for access".
> 
> So yes, swiotlb_xen is not a property of virtio, but what *is* a property
> of virtio is that it's not special, just a regular device from the DMA
> point of view.

I am trying to understand what you meant but I think I am missing
something.

Are you saying that modern virtio should always have
VIRTIO_F_ACCESS_PLATFORM, and hence use normal dma_ops like any other
device?

If that is the case, how is it possible that virtio breaks on ARM using
the default dma_ops? The breakage is not Xen related (except that Xen
turns dma_ops on). The original message from Peng was:

  vring_map_one_sg -> vring_use_dma_api
    -> dma_map_page
      -> __swiotlb_map_page
        -> swiotlb_map_page
          -> __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);

  However, we are using a per-device DMA area for rpmsg, and phys_to_virt
  cannot return a correct virtual address for an address in the vmalloc
  area. The kernel then panics.

I must be missing something. Maybe it is because it has to do with RPMsg?
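
To spell out the failure mode: the swiotlb map path assumes it can recover
a CPU virtual address from a DMA address through the linear map, and that
assumption does not hold for a per-device DMA pool, which is remapped into
the vmalloc area. A sketch of the offending step (illustrative, not the
verbatim arm64 code):

        /* roughly what the swiotlb map path does on arm64: */
        phys_addr_t phys = dma_to_phys(dev, dev_addr);
        void *vaddr = phys_to_virt(phys);  /* valid only for linear-mapped RAM */
        __dma_map_area(vaddr, size, dir);  /* cache maintenance on a bogus pointer */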
 

> > > > You might have noticed that I missed one possible case above: Xen/ARM
> > > > DomU :-)
> > > > 
> > > > Xen/ARM domUs don't need swiotlb_xen; it is not even initialized.
> > > > So "if (xen_domain()) return true;" would give the wrong answer in
> > > > that case. Linux would end up calling the "normal" dma_ops, not
> > > > swiotlb-xen, and the "normal" dma_ops fail.
> > > > 
> > > > 
> > > > The solution I suggested was to make the check in vring_use_dma_api
> > > > more flexible by returning true if swiotlb_xen is supposed to be
> > > > used, not in general for all Xen domains, because that is what the
> > > > check was really meant to do.
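
For context, that is what the patch at the top of this thread does: it
exports a helper reporting whether swiotlb-xen is in use, and
vring_use_dma_api calls it instead of unconditionally returning true for
Xen. A minimal sketch consistent with the subject line and patch
description; the exact file placement and form are an assumption:

        /* Xen code (sketch): true only when swiotlb-xen has been set up */
        bool xen_vring_use_dma(void)
        {
                return xen_swiotlb;
        }
        EXPORT_SYMBOL_GPL(xen_vring_use_dma);

        /* drivers/virtio/virtio_ring.c, in vring_use_dma_api() (sketch): */
        if (xen_domain())
                return xen_vring_use_dma();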
> > > 
> > > Why not fix DMA ops so they DTRT (nop) on Xen/ARM DomU? What is
> > > wrong with that?
> > 
> > swiotlb-xen is not used on Xen/ARM DomU, the default dma_ops are the
> > ones that are used. So you are saying, why don't we fix the default
> > dma_ops to work with virtio?
> > 
> > It is bad that the default dma_ops crash with virtio, so yes I think it
> > would be good to fix that. However, even if we fixed that, the if
> > (xen_domain()) check in vring_use_dma_api is still a problem.
> 
> Why is it a problem? It just makes virtio use DMA API.
> If that in turn works, problem solved.

You are correct in the sense that it would work. However, I do think it
is wrong for vring_use_dma_api to enable dma_ops/swiotlb-xen for Xen/ARM
DomUs that don't need it. There are many different types of Xen guests;
Xen x86 is drastically different from Xen ARM, and it seems wrong to
treat them the same way.



Anyway, re-reading the last messages of the original thread [1], it
looks like Peng

[RFC 3/3] virtio-blk: use NUMA-aware memory allocation in probe

2020-06-25 Thread Stefan Hajnoczi
Allocate frequently-accessed data structures from the NUMA node
associated with this device to avoid slow cross-NUMA node memory
accesses.

Only the following memory allocations are made NUMA-aware:

1. Called during probe. If called in the data path then hopefully we're
   executing on a CPU in the same NUMA node as the device. If the CPU is
   not in the right NUMA node then it's unclear whether forcing memory
   allocations to use the device's NUMA node will increase or decrease
   performance.

2. Memory will be frequently accessed from the data path. There is no
   need to worry about data that is not accessed from
   performance-critical code paths.
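
A note on the calls used in the diff below: dev_to_node() simply reads
dev->numa_node and may return NUMA_NO_NODE when the device has no known
affinity, in which case the *_node allocator variants behave like their
plain counterparts. A minimal sketch of the pattern:

        int node = dev_to_node(&vdev->dev);  /* NUMA_NO_NODE (-1) if unknown */

        /* with NUMA_NO_NODE this is equivalent to plain kmalloc() */
        vblk = kmalloc_node(sizeof(*vblk), GFP_KERNEL, node);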

Signed-off-by: Stefan Hajnoczi 
---
 drivers/block/virtio_blk.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9d21bf0f155e..40845e9ad3b1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -482,6 +482,7 @@ static int init_vq(struct virtio_blk *vblk)
unsigned short num_vqs;
struct virtio_device *vdev = vblk->vdev;
struct irq_affinity desc = { 0, };
+   int node = dev_to_node(&vdev->dev);
 
err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
   struct virtio_blk_config, num_queues,
@@ -491,7 +492,8 @@ static int init_vq(struct virtio_blk *vblk)
 
num_vqs = min_t(unsigned int, nr_cpu_ids, num_vqs);
 
-   vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
+   vblk->vqs = kmalloc_array_node(num_vqs, sizeof(*vblk->vqs),
+  GFP_KERNEL, node);
if (!vblk->vqs)
return -ENOMEM;
 
@@ -683,6 +685,7 @@ module_param_named(queue_depth, virtblk_queue_depth, uint, 0444);
 
 static int virtblk_probe(struct virtio_device *vdev)
 {
+   int node = dev_to_node(&vdev->dev);
struct virtio_blk *vblk;
struct request_queue *q;
int err, index;
@@ -714,7 +717,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 
/* We need an extra sg elements at head and tail. */
sg_elems += 2;
-   vdev->priv = vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
+   vdev->priv = vblk = kmalloc_node(sizeof(*vblk), GFP_KERNEL, node);
if (!vblk) {
err = -ENOMEM;
goto out_free_index;
-- 
2.26.2



[RFC 1/3] virtio-pci: use NUMA-aware memory allocation in probe

2020-06-25 Thread Stefan Hajnoczi
Allocate frequently-accessed data structures from the NUMA node
associated with this virtio-pci device. This avoids slow cross-NUMA node
memory accesses.

Only the following memory allocations are made NUMA-aware:

1. Called during probe. If called in the data path then hopefully we're
   executing on a CPU in the same NUMA node as the device. If the CPU is
   not in the right NUMA node then it's unclear whether forcing memory
   allocations to use the device's NUMA node will increase or decrease
   performance.

2. Memory will be frequently accessed from the data path. There is no
   need to worry about data that is not accessed from
   performance-critical code paths.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/virtio/virtio_pci_common.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 222d630c41fc..cc6e49f9c698 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -178,11 +178,13 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned index,
 u16 msix_vec)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   struct virtio_pci_vq_info *info = kmalloc(sizeof *info, GFP_KERNEL);
+   int node = dev_to_node(&vdev->dev);
+   struct virtio_pci_vq_info *info;
struct virtqueue *vq;
unsigned long flags;
 
/* fill out our structure that represents an active queue */
+   info = kmalloc_node(sizeof *info, GFP_KERNEL, node);
if (!info)
return ERR_PTR(-ENOMEM);
 
@@ -283,10 +285,12 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
struct irq_affinity *desc)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   int node = dev_to_node(&vdev->dev);
u16 msix_vec;
int i, err, nvectors, allocated_vectors, queue_idx = 0;
 
-   vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL);
+   vp_dev->vqs = kcalloc_node(nvqs, sizeof(*vp_dev->vqs),
+  GFP_KERNEL, node);
if (!vp_dev->vqs)
return -ENOMEM;
 
@@ -355,9 +359,11 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned nvqs,
const char * const names[], const bool *ctx)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   int node = dev_to_node(&vdev->dev);
int i, err, queue_idx = 0;
 
-   vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL);
+   vp_dev->vqs = kcalloc_node(nvqs, sizeof(*vp_dev->vqs),
+  GFP_KERNEL, node);
if (!vp_dev->vqs)
return -ENOMEM;
 
@@ -513,10 +519,12 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
const struct pci_device_id *id)
 {
struct virtio_pci_device *vp_dev, *reg_dev = NULL;
+   int node = dev_to_node(&pci_dev->dev);
int rc;
 
/* allocate our structure and fill it out */
-   vp_dev = kzalloc(sizeof(struct virtio_pci_device), GFP_KERNEL);
+   vp_dev = kzalloc_node(sizeof(struct virtio_pci_device),
+ GFP_KERNEL, node);
if (!vp_dev)
return -ENOMEM;
 
-- 
2.26.2



[RFC 0/3] virtio: NUMA-aware memory allocation

2020-06-25 Thread Stefan Hajnoczi
These patches are not ready to be merged because I was unable to measure a
performance improvement. I'm publishing them so they are archived in case
someone picks up this work again in the future.

The goal of these patches is to allocate virtqueues and driver state from the
device's NUMA node for optimal memory access latency. Only guests with a vNUMA
topology and virtio devices spread across vNUMA nodes benefit from this.  In
other cases the memory placement is fine and we don't need to take NUMA into
account inside the guest.

These patches could be extended to virtio_net.ko and other devices in the
future. I only tested virtio_blk.ko.

The benchmark configuration was designed to trigger worst-case NUMA placement:
 * Physical NVMe storage controller on host NUMA node 0
 * IOThread pinned to host NUMA node 0
 * virtio-blk-pci device in vNUMA node 1
 * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
 * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1

The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci
device. Applying these patches fixes memory placement so that virtqueues and
driver state are allocated in vNUMA node 1, where the virtio-blk-pci device
is located.

The fio 4KB randread benchmark results do not show a significant improvement:

Name             IOPS       Error
virtio-blk       42373.79   ± 0.54%
virtio-blk-numa  42517.07   ± 0.79%

Stefan Hajnoczi (3):
  virtio-pci: use NUMA-aware memory allocation in probe
  virtio_ring: use NUMA-aware memory allocation in probe
  virtio-blk: use NUMA-aware memory allocation in probe

 include/linux/gfp.h                |  2 +-
 drivers/block/virtio_blk.c         |  7 +++++--
 drivers/virtio/virtio_pci_common.c | 16 ++++++++++++----
 drivers/virtio/virtio_ring.c       | 26 +++++++++++++++++---------
 mm/page_alloc.c                    |  2 +-
 5 files changed, 36 insertions(+), 17 deletions(-)

-- 
2.26.2



[RFC 2/3] virtio_ring: use NUMA-aware memory allocation in probe

2020-06-25 Thread Stefan Hajnoczi
Allocate frequently-accessed data structures from the NUMA node
associated with this device to avoid slow cross-NUMA node memory
accesses.

Only the following memory allocations are made NUMA-aware:

1. Called during probe. If called in the data path then hopefully we're
   executing on a CPU in the same NUMA node as the device. If the CPU is
   not in the right NUMA node then it's unclear whether forcing memory
   allocations to use the device's NUMA node will increase or decrease
   performance.

2. Memory will be frequently accessed from the data path. There is no
   need to worry about data that is not accessed from
   performance-critical code paths.

This patch adds a non-meminit alloc_pages_exact_nid() caller so I've
removed the __meminit added by commit e19318116048 ("mm/page_alloc.c:
add __meminit to alloc_pages_exact_nid()").
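
For readers unfamiliar with the annotation: when CONFIG_MEMORY_HOTPLUG is
disabled, __meminit becomes __init, so the function's text is discarded
after boot. A runtime caller such as a driver probe path, which can run at
module load long after init memory is freed, therefore requires the
annotation to go. Illustrative sketch:

        /* before: discarded after boot when !CONFIG_MEMORY_HOTPLUG */
        void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);

        /* after: always resident, safe to call from probe at any time */
        void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);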

Cc: Fabian Frederick 
Cc: Andrew Morton 
Cc: Mel Gorman 
Signed-off-by: Stefan Hajnoczi 
---
I have included the alloc_pages_exact_nid() __meminit removal in this
patch to provide context for reviewers.
---
 include/linux/gfp.h          |  2 +-
 drivers/virtio/virtio_ring.c | 26 +++++++++++++++++---------
 mm/page_alloc.c              |  2 +-
 3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4aba4c86c626..9b69df707c7a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -563,7 +563,7 @@ extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
 void free_pages_exact(void *virt, size_t size);
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
+void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 #define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 58b96baa8d48..d06b42309bed 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -276,7 +276,9 @@ static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
return dma_alloc_coherent(vdev->dev.parent, size,
  dma_handle, flag);
} else {
-   void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag);
+   int node = dev_to_node(&vdev->dev);
+   void *queue = alloc_pages_exact_nid(node, PAGE_ALIGN(size),
+   flag);
 
if (queue) {
phys_addr_t phys_addr = virt_to_phys(queue);
@@ -1567,6 +1569,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
struct vring_packed_desc_event *driver, *device;
dma_addr_t ring_dma_addr, driver_event_dma_addr, device_event_dma_addr;
size_t ring_size_in_bytes, event_size_in_bytes;
+   int node = dev_to_node(&vdev->dev);
unsigned int i;
 
ring_size_in_bytes = num * sizeof(struct vring_packed_desc);
@@ -1591,7 +1594,7 @@ static struct virtqueue *vring_create_virtqueue_packed(
if (!device)
goto err_device;
 
-   vq = kmalloc(sizeof(*vq), GFP_KERNEL);
+   vq = kmalloc_node(sizeof(*vq), GFP_KERNEL, node);
if (!vq)
goto err_vq;
 
@@ -1639,9 +1642,10 @@ static struct virtqueue *vring_create_virtqueue_packed(
vq->packed.event_flags_shadow = 0;
vq->packed.avail_used_flags = 1 << VRING_PACKED_DESC_F_AVAIL;
 
-   vq->packed.desc_state = kmalloc_array(num,
+   vq->packed.desc_state = kmalloc_array_node(num,
sizeof(struct vring_desc_state_packed),
-   GFP_KERNEL);
+   GFP_KERNEL,
+   node);
if (!vq->packed.desc_state)
goto err_desc_state;
 
@@ -1653,9 +1657,10 @@ static struct virtqueue *vring_create_virtqueue_packed(
for (i = 0; i < num-1; i++)
vq->packed.desc_state[i].next = i + 1;
 
-   vq->packed.desc_extra = kmalloc_array(num,
+   vq->packed.desc_extra = kmalloc_array_node(num,
sizeof(struct vring_desc_extra_packed),
-   GFP_KERNEL);
+   GFP_KERNEL,
+   node);
if (!vq->packed.desc_extra)
goto err_desc_extra;
 
@@ -2059,13 +2064,14 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *),
const char *name)
 {
+   int node = dev_to_node(&vdev->dev);
unsigned int i;
struct vring_virtqueue *vq;
 
if (virtio_has_feature(vdev, VIRTIO_F_RING_PACKED))
return NULL;
 
-   vq = kmalloc(sizeof(*vq), GFP_KERNEL);
+   vq = kmalloc_node(sizeof(*vq), GFP_KERNEL, node);
if (!vq)
return NULL;
 
@@ -2110,8 +2116,10 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,