Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

2021-09-28 Thread Leon Romanovsky
On Wed, Sep 29, 2021 at 02:28:08AM +0300, Max Gurtovoy wrote:
> 
> On 9/28/2021 7:27 PM, Leon Romanovsky wrote:
> > On Tue, Sep 28, 2021 at 06:59:15PM +0300, Max Gurtovoy wrote:
> > > On 9/27/2021 9:23 PM, Leon Romanovsky wrote:
> > > > On Mon, Sep 27, 2021 at 08:25:09PM +0300, Max Gurtovoy wrote:
> > > > > On 9/27/2021 2:34 PM, Leon Romanovsky wrote:
> > > > > > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > > > > > To optimize performance, set the affinity of the block device
> > > > > > > tagset according to the virtio device affinity.
> > > > > > > 
> > > > > > > Signed-off-by: Max Gurtovoy 
> > > > > > > ---
> > > > > > > drivers/block/virtio_blk.c | 2 +-
> > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > > > > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > > > > > --- a/drivers/block/virtio_blk.c
> > > > > > > +++ b/drivers/block/virtio_blk.c
> > > > > > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > > > > > >   memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > > > > > >   vblk->tag_set.ops = &virtio_mq_ops;
> > > > > > >   vblk->tag_set.queue_depth = queue_depth;
> > > > > > > - vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > > > > > + vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > > > > > I'm afraid that by doing this, you will increase the chances of
> > > > > > hitting OOM, because with NUMA_NO_NODE, MM will try to allocate
> > > > > > memory from the whole system, while in the latter mode only from a
> > > > > > specific NUMA node, which can be depleted.
> > > > > This is a common methodology we use in the block layer and in the
> > > > > NVMe subsystem, and we aren't afraid of the OOM issue you raised.
> > > > There are many reasons for that, but we are talking about virtio here
> > > > and not about NVMe.
> > > OK. What reasons?
> > For example, NVMe devices are physical devices that rely on DMA
> > operations, PCI connectivity, etc., to operate. Such systems can indeed
> > benefit from NUMA locality hints. In the end, these devices are
> > physically connected to that NUMA node.
> 
> FYI, virtio devices are also physical devices that have a PCI interface
> and rely on DMA operations.
> 
> From the virtio spec: "Virtio devices use normal bus mechanisms of
> interrupts and DMA which should be familiar to any device driver author".

Yes, this is how a bus in Linux is implemented; there is nothing new here.

> 
> Also, we develop virtio HW at NVIDIA for blk and net devices with our
> SNAP technology.
> 
> These devices are connected via PCI bus to the host.

How is all of this related to the general virtio-blk implementation?

> 
> We also support SRIOV.
> 
> The same is also true for paravirt devices that are emulated by QEMU,
> but the guest still sees them as PCI devices.

Yes, the key word here is "emulated".

> 
> > 
> > In our case, virtio-blk is a software interface that doesn't have all
> > of these limitations. On the contrary, a virtio-blk device can be
> > created on one CPU and later moved to be close to the QEMU process,
> > which can run on another NUMA node.
> 
> Not at all. virtio is a HW interface.

Virtio devices are para-virtualized devices that are presented as HW
interfaces to the guest OS. They don't need to be real devices in the
hypervisor, which is my (and probably most of the world's) use case.

My QEMU command line contains something like this: "-drive file=IMAGE.img,if=virtio".

> 
> I don't understand what you are saying here.
> 
> > 
> > Also, this patch increases the chances of OOM by a factor of the
> > number of NUMA nodes.
> 
> This is common practice in Linux for storage drivers. Why does it bother
> you at all?

Do I need a reason to ask for clarification on a publicly posted patch
on an open mailing list?

I use virtio and care about it.

> 
> I already decreased the memory footprint for virtio-blk devices.

As I wrote before, you decreased it by several KB, but with this patch
you limit the available memory by orders of magnitude.

> 
> 
> > Before your patch, virtio_blk could allocate from X memory; after your
> > patch it will be X/NUM_NUMA_NODES.
> 
> So go ahead and change the whole block layer if it bothers you so much.
> 
> Also please change the NVMe subsystem when you do it.

I suggest a less radical approach: don't take patches without a proven
benefit.

We are in 2021; let's rely on the NUMA node policy.

> 
> And let's see what the community says.

Stefan asked you for performance data too. I'm not alone here.

> 
> > In addition, it has every chance of even hurting performance.
> > 
> > So yes, post a v2, but as Stefan and I asked, please provide supporting
> > performance results, because what was done for another subsystem
> > doesn't mean it will be applicable here.
> 
> I will measure the perf, but even if we don't see an improvement, since
> it might not be the bottleneck, these changes should be merged.
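
(For context: virtio_dev_to_node() is added by patch 1/2 of this series,
which is not part of this excerpt. A minimal sketch of what such a helper
presumably looks like, assuming it simply resolves the NUMA node of the
virtio device's parent, e.g., the underlying PCI device:

static inline int virtio_dev_to_node(struct virtio_device *vdev)
{
	struct device *parent = vdev->dev.parent;

	/* Transports without a parent device carry no NUMA affinity. */
	if (!parent)
		return NUMA_NO_NODE;
	return dev_to_node(parent);
}

With NUMA_NO_NODE as the fallback, behavior is unchanged on hosts where
the parent device reports no node.)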

Re: [PATCH] scsi: virtio_scsi: Fix spelling mistake "Unsupport" -> "Unsupported"

2021-09-28 Thread Martin K. Petersen
On Sat, 25 Sep 2021 00:03:30 +0100, Colin King wrote:

> From: Colin Ian King 
> 
> There are a couple of spelling mistakes in pr_info and pr_err messages.
> Fix them.
> 
> 

Applied to 5.15/scsi-fixes, thanks!

[1/1] scsi: virtio_scsi: Fix spelling mistake "Unsupport" -> "Unsupported"
  https://git.kernel.org/mkp/scsi/c/cced4c0ec7c0

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v1 1/8] x86/xen: update xen_oldmem_pfn_is_ram() documentation

2021-09-28 Thread Boris Ostrovsky


On 9/28/21 2:22 PM, David Hildenbrand wrote:
> The callback is only used for the vmcore nowadays.
>
> Signed-off-by: David Hildenbrand 


Reviewed-by: Boris Ostrovsky 




Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-28 Thread Boris Ostrovsky

On 9/28/21 2:22 PM, David Hildenbrand wrote:
> Let's simplify return handling.
>
> Signed-off-by: David Hildenbrand 
> ---
>  arch/x86/xen/mmu_hvm.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
> index b242d1f4b426..eb61622df75b 100644
> --- a/arch/x86/xen/mmu_hvm.c
> +++ b/arch/x86/xen/mmu_hvm.c
> @@ -21,23 +21,16 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
>   .domid = DOMID_SELF,
>   .pfn = pfn,
>   };
> - int ram;
>  
>   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
>   return -ENXIO;
>  
>   switch (a.mem_type) {
>   case HVMMEM_mmio_dm:
> - ram = 0;
> - break;
> - case HVMMEM_ram_rw:
> - case HVMMEM_ram_ro:
> + return 0;
>   default:
> - ram = 1;
> - break;
> + return 1;
>   }
> -
> - return ram;
>  }
>  #endif
>  


How about

    return a.mem_type != HVMMEM_mmio_dm;


The result will be promoted to int, and this has the added benefit of not
requiring changes in patch 4.


-boris
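
(With Boris's suggestion applied, and keeping the int return type used at
this point in the series, the whole body would collapse to roughly this
sketch:

static int xen_oldmem_pfn_is_ram(unsigned long pfn)
{
	struct xen_hvm_get_mem_type a = {
		.domid = DOMID_SELF,
		.pfn = pfn,
	};

	if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
		return -ENXIO;

	/* Only pfns backed by emulated MMIO are not RAM. */
	return a.mem_type != HVMMEM_mmio_dm;
})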


Re: [PATCH v3 08/19] drivers/hv: map and unmap guest memory

2021-09-28 Thread Olaf Hering
On Tue, 28 Sep 2021 11:31:04 -0700, Nuno Das Neves wrote:

> +++ b/include/asm-generic/hyperv-tlfs.h
> -#define HV_HYP_PAGE_SHIFT  12
> +#define HV_HYP_PAGE_SHIFT	12

I think that if this change is really required, it should be in a
separate patch.


Olaf



[PATCH v1 8/8] virtio-mem: kdump mode to sanitize /proc/vmcore access

2021-09-28 Thread David Hildenbrand
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently. [1]

Let's register a vmcore callback, to allow vmcore code to check if a PFN
belonging to a virtio-mem device is either currently plugged and should
be dumped or is currently unplugged and should not be accessed, instead
mapping the shared zeropage or returning zeroes when reading.

This is important when /proc/vmcore is not captured via tools like
"makedumpfile", which can identify logically unplugged virtio-mem memory
via PG_offline in the memmap, but instead simply by, e.g., copying the
file.

Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.

With this series, we'll send one virtio-mem state request for every
~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend
to read/map.

In the future, we might want to allow building virtio-mem for kdump
mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way,
we could support special stripped-down kdump kernels that have many
other config options disabled; we'll tackle that once required. Further,
we might want to try sensing bigger blocks (e.g., memory sections)
first before falling back to device blocks on demand.

Tested with Fedora rawhide, which contains a recent kexec-tools version
(considering "System RAM (virtio_mem)" when creating the vmcore header) and
a recent dracut version (including the virtio_mem module in the kdump
initrd).

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 136 
 1 file changed, 124 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 76d8aef3cfd2..ec0b2ab37acb 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -223,6 +223,9 @@ struct virtio_mem {
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
 * plugged/unplugged.
+*
+* In kdump mode, used to serialize requests, last_block_addr and
+* last_block_plugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -230,6 +233,9 @@ struct virtio_mem {
/* An error occurred we cannot handle - stop processing requests. */
bool broken;
 
+   /* Cached value of is_kdump_kernel() when the device was probed. */
+   bool in_kdump;
+
/* The driver is being removed. */
spinlock_t removal_lock;
bool removing;
@@ -243,6 +249,13 @@ struct virtio_mem {
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
 
+#ifdef CONFIG_PROC_VMCORE
+   /* vmcore callback for /proc/vmcore handling in kdump mode */
+   struct vmcore_cb vmcore_cb;
+   uint64_t last_block_addr;
+   bool last_block_plugged;
+#endif /* CONFIG_PROC_VMCORE */
+
/* Next device in the list of virtio-mem devices. */
struct list_head next;
 };
@@ -2293,6 +2306,12 @@ static void virtio_mem_run_wq(struct work_struct *work)
uint64_t diff;
int rc;
 
+   if (unlikely(vm->in_kdump)) {
+   dev_warn_once(&vm->vdev->dev,
+"unexpected workqueue run in kdump kernel\n");
+   return;
+   }
+
hrtimer_cancel(&vm->retry_timer);
 
if (vm->broken)
@@ -2521,6 +2540,86 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
return rc;
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static int virtio_mem_send_state_request(struct virtio_mem *vm, uint64_t addr,
+uint64_t size)
+{
+   const uint64_t nb_vm_blocks = size / vm->device_block_size;
+   const struct virtio_mem_req req = {
+   .type = cpu_to_virtio16(vm->vdev, VIRTIO_MEM_REQ_STATE),
+   .u.state.addr = cpu_to_virtio64(vm->vdev, addr),
+   .u.state.nb_blocks = cpu_to_virtio16(vm->vdev, nb_vm_blocks),
+   };
+   int rc = -ENOMEM;
+
+   dev_dbg(&vm->vdev->dev, "requesting state: 0x%llx - 0x%llx\n", addr,
+   addr + size - 1);
+
+   switch (virtio_mem_send_request(vm, &req)) {
+   case VIRTIO_MEM_RESP_ACK:
+   return virtio16_to_cpu(vm->vdev, vm->resp.u.state.state);
+   case VIRTIO_MEM_RESP_ERROR:
+   rc = -EINVAL;
+   break;
+   default:
+  
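
(The remainder of this hunk is cut off in the archive. Based on the fields
added above (vmcore_cb, last_block_addr, last_block_plugged) and on
virtio_mem_send_state_request(), the pfn_is_ram callback presumably looks
roughly like the sketch below; the explicit range check against vm->addr
and vm->region_size is an assumption based on the device properties read
at init time:

static bool virtio_mem_vmcore_pfn_is_ram(struct vmcore_cb *cb,
					 unsigned long pfn)
{
	struct virtio_mem *vm = container_of(cb, struct virtio_mem,
					     vmcore_cb);
	uint64_t addr = PFN_PHYS(pfn);
	bool is_ram;
	int rc;

	/* PFNs outside the device-managed range are none of our business. */
	if (addr < vm->addr || addr >= vm->addr + vm->region_size)
		return true;

	mutex_lock(&vm->hotplug_mutex);

	/* One state request per device block; cache the last answer. */
	addr = ALIGN_DOWN(addr, vm->device_block_size);
	if (addr != vm->last_block_addr) {
		rc = virtio_mem_send_state_request(vm, addr,
						   vm->device_block_size);
		/* On any kind of error, signal "not RAM" (read as zeroes). */
		vm->last_block_plugged = rc == VIRTIO_MEM_STATE_PLUGGED;
		vm->last_block_addr = addr;
	}
	is_ram = vm->last_block_plugged;

	mutex_unlock(&vm->hotplug_mutex);
	return is_ram;
}

This matches the behavior described in the cover letter: one state request
per device block, with repeated reads within the same block served from the
cached state.)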

[PATCH v1 7/8] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 1be3ee7f684d..76d8aef3cfd2 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2667,9 +2667,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return rc;
 }
 
-static void virtio_mem_remove(struct virtio_device *vdev)
+static void virtio_mem_deinit_hotplug(struct virtio_mem *vm)
 {
-   struct virtio_mem *vm = vdev->priv;
unsigned long mb_id;
int rc;
 
@@ -2716,7 +2715,8 @@ static void virtio_mem_remove(struct virtio_device *vdev)
 * away. Warn at least.
 */
if (virtio_mem_has_memory_added(vm)) {
-   dev_warn(&vdev->dev, "device still has system memory added\n");
+   dev_warn(&vm->vdev->dev,
+"device still has system memory added\n");
} else {
virtio_mem_delete_resource(vm);
kfree_const(vm->resource_name);
@@ -2730,6 +2730,13 @@ static void virtio_mem_remove(struct virtio_device *vdev)
} else {
vfree(vm->bbm.bb_states);
}
+}
+
+static void virtio_mem_remove(struct virtio_device *vdev)
+{
+   struct virtio_mem *vm = vdev->priv;
+
+   virtio_mem_deinit_hotplug(vm);
 
/* reset the device and cleanup the queues */
vdev->config->reset(vdev);
-- 
2.31.1



[PATCH v1 6/8] virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 87 +++--
 1 file changed, 45 insertions(+), 42 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 2ba7e8d6ba8d..1be3ee7f684d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -260,6 +260,8 @@ static void virtio_mem_fake_offline_going_offline(unsigned 
long pfn,
 static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
   unsigned long nr_pages);
 static void virtio_mem_retry(struct virtio_mem *vm);
+static int virtio_mem_create_resource(struct virtio_mem *vm);
+static void virtio_mem_delete_resource(struct virtio_mem *vm);
 
 /*
  * Register a virtio-mem device so it will be considered for the online_page
@@ -2395,7 +2397,8 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
 static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
-   uint64_t sb_size, addr;
+   uint64_t unit_pages, sb_size, addr;
+   int rc;
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2474,7 +2477,48 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
dev_info(&vm->vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
 
+   /* create the parent resource for all memory */
+   rc = virtio_mem_create_resource(vm);
+   if (rc)
+   return rc;
+
+   /* use a single dynamic memory group to cover the whole memory device */
+   if (vm->in_sbm)
+   unit_pages = PHYS_PFN(memory_block_size_bytes());
+   else
+   unit_pages = PHYS_PFN(vm->bbm.bb_size);
+   rc = memory_group_register_dynamic(vm->nid, unit_pages);
+   if (rc < 0)
+   goto out_del_resource;
+   vm->mgid = rc;
+
+   /*
+* If we still have memory plugged, we have to unplug all memory first.
+* Registering our parent resource makes sure that this memory isn't
+* actually in use (e.g., trying to reload the driver).
+*/
+   if (vm->plugged_size) {
+   vm->unplug_all_required = true;
+   dev_info(&vm->vdev->dev, "unplugging all memory is required\n");
+   }
+
+   /* register callbacks */
+   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
+   rc = register_memory_notifier(&vm->memory_notifier);
+   if (rc)
+   goto out_unreg_group;
+   rc = register_virtio_mem_device(vm);
+   if (rc)
+   goto out_unreg_mem;
+
return 0;
+out_unreg_mem:
+   unregister_memory_notifier(&vm->memory_notifier);
+out_unreg_group:
+   memory_group_unregister(vm->mgid);
+out_del_resource:
+   virtio_mem_delete_resource(vm);
+   return rc;
 }
 
 static int virtio_mem_init(struct virtio_mem *vm)
@@ -2578,7 +2622,6 @@ static bool virtio_mem_has_memory_added(struct virtio_mem 
*vm)
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
struct virtio_mem *vm;
-   uint64_t unit_pages;
int rc;
 
BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
@@ -2608,40 +2651,6 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
-   /* create the parent resource for all memory */
-   rc = virtio_mem_create_resource(vm);
-   if (rc)
-   goto out_del_vq;
-
-   /* use a single dynamic memory group to cover the whole memory device */
-   if (vm->in_sbm)
-   unit_pages = PHYS_PFN(memory_block_size_bytes());
-   else
-   unit_pages = PHYS_PFN(vm->bbm.bb_size);
-   rc = memory_group_register_dynamic(vm->nid, unit_pages);
-   if (rc < 0)
-   goto out_del_resource;
-   vm->mgid = rc;
-
-   /*
-* If we still have memory plugged, we have to unplug all memory first.
-* Registering our parent resource makes sure that this memory isn't
-* actually in use (e.g., trying to reload the driver).
-*/
-   if (vm->plugged_size) {
-   vm->unplug_all_required = true;
-   dev_info(&vm->vdev->dev, "unplugging all memory is required\n");
-   }
-
-   /* register callbacks */
-   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
-   rc = register_memory_notifier(&vm->memory_notifier);
-   if (rc)
-   goto out_unreg_group;
-   rc = register_virtio_mem_device(vm);
-   if (rc)
-   goto out_unreg_mem;
-
virtio_device_ready(vdev);
 
/* trigger a config update to start processing the requested_size */
@@ -2649,12 +2658,6 @

[PATCH v1 5/8] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 81 -
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index bef8ad6bf466..2ba7e8d6ba8d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2392,41 +2392,10 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
return 0;
 }
 
-static int virtio_mem_init(struct virtio_mem *vm)
+static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
uint64_t sb_size, addr;
-   uint16_t node_id;
-
-   if (!vm->vdev->config->get) {
-   dev_err(&vm->vdev->dev, "config access disabled\n");
-   return -EINVAL;
-   }
-
-   /*
-* We don't want to (un)plug or reuse any memory when in kdump. The
-* memory is still accessible (but not mapped).
-*/
-   if (is_kdump_kernel()) {
-   dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
-   return -EBUSY;
-   }
-
-   /* Fetch all properties that can't change. */
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
-   &vm->plugged_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
-   &vm->device_block_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
-   &node_id);
-   vm->nid = virtio_mem_translate_node_id(vm, node_id);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
-   &vm->region_size);
-
-   /* Determine the nid for the device based on the lowest address. */
-   if (vm->nid == NUMA_NO_NODE)
-   vm->nid = memory_add_physaddr_to_nid(vm->addr);
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2496,10 +2465,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
  vm->offline_threshold);
}
 
-   dev_info(&vm->vdev->dev, "start address: 0x%llx", vm->addr);
-   dev_info(&vm->vdev->dev, "region size: 0x%llx", vm->region_size);
-   dev_info(&vm->vdev->dev, "device block size: 0x%llx",
-(unsigned long long)vm->device_block_size);
dev_info(&vm->vdev->dev, "memory block size: 0x%lx",
 memory_block_size_bytes());
if (vm->in_sbm)
@@ -2508,10 +2473,52 @@ static int virtio_mem_init(struct virtio_mem *vm)
else
dev_info(&vm->vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
+
+   return 0;
+}
+
+static int virtio_mem_init(struct virtio_mem *vm)
+{
+   uint16_t node_id;
+
+   if (!vm->vdev->config->get) {
+   dev_err(&vm->vdev->dev, "config access disabled\n");
+   return -EINVAL;
+   }
+
+   /*
+* We don't want to (un)plug or reuse any memory when in kdump. The
+* memory is still accessible (but not mapped).
+*/
+   if (is_kdump_kernel()) {
+   dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
+   return -EBUSY;
+   }
+
+   /* Fetch all properties that can't change. */
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
+   &vm->plugged_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
+   &vm->device_block_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
+   &node_id);
+   vm->nid = virtio_mem_translate_node_id(vm, node_id);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
+   &vm->region_size);
+
+   /* Determine the nid for the device based on the lowest address. */
+   if (vm->nid == NUMA_NO_NODE)
+   vm->nid = memory_add_physaddr_to_nid(vm->addr);
+
+   dev_info(&vm->vdev->dev, "start address: 0x%llx", vm->addr);
+   dev_info(&vm->vdev->dev, "region size: 0x%llx", vm->region_size);
+   dev_info(&vm->vdev->dev, "device block size: 0x%llx",
+(unsigned long long)vm->device_block_size);
if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
dev_info(&vm->vdev->dev, "nid: %d", vm->nid);
 
-   return 0;
+   return virtio_mem_init_hotplug(vm);
 }
 
 static int virtio_mem_create_resource(struct virtio_mem *vm)
-- 
2.31.1


[PATCH v1 4/8] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks

2021-09-28 Thread David Hildenbrand
Let's support multiple registered callbacks, making sure that
registering vmcore callbacks cannot fail. Make the callback return a
bool instead of an int, handling how to deal with errors internally.
Drop unused HAVE_OLDMEM_PFN_IS_RAM.

We soon want to make use of this infrastructure from other drivers:
virtio-mem, registering one callback for each virtio-mem device, to
prevent reading unplugged virtio-mem memory.

Handle it via a generic vmcore_cb structure, prepared for future
extensions: for example, once we support virtio-mem on s390x where the
vmcore is completely constructed in the second kernel, we want to detect
and add plugged virtio-mem memory ranges to the vmcore in order for them
to get dumped properly.
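
(Aside: the vmcore_cb structure itself is declared in
include/linux/crash_dump.h; that hunk is truncated further down. Judging
from its uses in this patch, it presumably looks like:

struct vmcore_cb {
	/* Decide whether a pfn of the old kernel may be read/mapped. */
	bool (*pfn_is_ram)(struct vmcore_cb *cb, unsigned long pfn);
	/* Entry in the list of registered callbacks, vmcore_cb_list. */
	struct list_head next;
};

Passing the vmcore_cb pointer to the callback lets implementations recover
their private state via container_of(), as virtio-mem does in patch #8.)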

Handle corner cases that are unexpected and shouldn't happen in sane
setups: registering a callback after the vmcore has already been opened
(warn only) and unregistering a callback after the vmcore has already been
opened (warn and essentially read only zeroes from that point on).

Signed-off-by: David Hildenbrand 
---
 arch/x86/kernel/aperture_64.c | 13 -
 arch/x86/xen/mmu_hvm.c| 15 +++---
 fs/proc/vmcore.c  | 99 ---
 include/linux/crash_dump.h| 26 +++--
 4 files changed, 113 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 10562885f5fc..af3ba08b684b 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -73,12 +73,23 @@ static int gart_mem_pfn_is_ram(unsigned long pfn)
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+   return !!gart_mem_pfn_is_ram(pfn);
+}
+
+static struct vmcore_cb gart_vmcore_cb = {
+   .pfn_is_ram = gart_oldmem_pfn_is_ram,
+};
+#endif
+
 static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+   register_vmcore_cb(&gart_vmcore_cb);
 #endif
 #ifdef CONFIG_PROC_KCORE
WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index eb61622df75b..49bd4a6a5858 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -12,10 +12,10 @@
  * The kdump kernel has to check whether a pfn of the crashed kernel
  * was a ballooned page. vmcore is using this function to decide
  * whether to access a pfn of the crashed kernel.
- * Returns 0 if the pfn is not backed by a RAM page, the caller may
+ * Returns "false" if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-static int xen_oldmem_pfn_is_ram(unsigned long pfn)
+static bool xen_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
 {
struct xen_hvm_get_mem_type a = {
.domid = DOMID_SELF,
@@ -23,15 +23,18 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
};
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
-   return -ENXIO;
+   return true;
 
switch (a.mem_type) {
case HVMMEM_mmio_dm:
-   return 0;
+   return false;
default:
-   return 1;
+   return true;
}
 }
+static struct vmcore_cb xen_vmcore_cb = {
+   .pfn_is_ram = xen_vmcore_pfn_is_ram,
+};
 #endif
 
 static void xen_hvm_exit_mmap(struct mm_struct *mm)
@@ -65,6 +68,6 @@ void __init xen_hvm_init_mmu_ops(void)
if (is_pagetable_dying_supported())
pv_ops.mmu.exit_mmap = xen_hvm_exit_mmap;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram));
+   register_vmcore_cb(&xen_vmcore_cb);
 #endif
 }
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a9bd80ab670e..7a04b2eca287 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,46 +62,75 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-/*
- * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
- * The called function has to take care of module refcounting.
- */
-static int (*oldmem_pfn_is_ram)(unsigned long pfn);
-
-int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn))
+static DECLARE_RWSEM(vmcore_cb_rwsem);
+/* List of registered vmcore callbacks. */
+static LIST_HEAD(vmcore_cb_list);
+/* Whether we had a surprise unregistration of a callback. */
+static bool vmcore_cb_unstable;
+/* Whether the vmcore has been opened once. */
+static bool vmcore_opened;
+
+void register_vmcore_cb(struct vmcore_cb *cb)
 {
-   if (oldmem_pfn_is_ram)
-   return -EBUSY;
-   oldmem_pfn_is_ram = fn;
-   return 0;
+   down_write(&vmcore_cb_rwsem);
+   IN
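
(The hunk is truncated at this point. Given the bookkeeping variables
introduced above, registration presumably continues roughly as follows; a
sketch, not the verbatim patch:

void register_vmcore_cb(struct vmcore_cb *cb)
{
	down_write(&vmcore_cb_rwsem);
	INIT_LIST_HEAD(&cb->next);
	list_add_tail(&cb->next, &vmcore_cb_list);
	/*
	 * Registering a callback after the vmcore was already opened is
	 * unusual (e.g., manual driver loading): warn only, per the
	 * commit message.
	 */
	if (vmcore_opened)
		pr_warn_once("Unexpected vmcore callback registration\n");
	up_write(&vmcore_cb_rwsem);
}

The matching unregister path would additionally set vmcore_cb_unstable when
a callback disappears after the vmcore was opened, so subsequent reads
return zeroes instead of trusting a stale answer.)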

[PATCH v1 3/8] proc/vmcore: let pfn_is_ram() return a bool

2021-09-28 Thread David Hildenbrand
The callback should deal with errors internally; it doesn't make sense to
expose these via pfn_is_ram(). We'll rework the callbacks next. Right now
we treat errors as if "it's RAM"; no functional change.

Signed-off-by: David Hildenbrand 
---
 fs/proc/vmcore.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 9a15334da208..a9bd80ab670e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -84,11 +84,11 @@ void unregister_oldmem_pfn_is_ram(void)
 }
 EXPORT_SYMBOL_GPL(unregister_oldmem_pfn_is_ram);
 
-static int pfn_is_ram(unsigned long pfn)
+static bool pfn_is_ram(unsigned long pfn)
 {
int (*fn)(unsigned long pfn);
/* pfn is ram unless fn() checks pagetype */
-   int ret = 1;
+   bool ret = true;
 
/*
 * Ask hypervisor if the pfn is really ram.
@@ -97,7 +97,7 @@ static int pfn_is_ram(unsigned long pfn)
 */
fn = oldmem_pfn_is_ram;
if (fn)
-   ret = fn(pfn);
+   ret = !!fn(pfn);
 
return ret;
 }
@@ -124,7 +124,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (pfn_is_ram(pfn) == 0)
+   if (!pfn_is_ram(pfn))
memset(buf, 0, nr_bytes);
else {
if (encrypted)
-- 
2.31.1



[PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-28 Thread David Hildenbrand
Let's simplify return handling.

Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index b242d1f4b426..eb61622df75b 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -21,23 +21,16 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
.domid = DOMID_SELF,
.pfn = pfn,
};
-   int ram;
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
return -ENXIO;
 
switch (a.mem_type) {
case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
+   return 0;
default:
-   ram = 1;
-   break;
+   return 1;
}
-
-   return ram;
 }
 #endif
 
-- 
2.31.1



[PATCH v1 1/8] x86/xen: update xen_oldmem_pfn_is_ram() documentation

2021-09-28 Thread David Hildenbrand
The callback is only used for the vmcore nowadays.

Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index 57409373750f..b242d1f4b426 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -9,12 +9,9 @@
 
 #ifdef CONFIG_PROC_VMCORE
 /*
- * This function is used in two contexts:
- * - the kdump kernel has to check whether a pfn of the crashed kernel
- *   was a ballooned page. vmcore is using this function to decide
- *   whether to access a pfn of the crashed kernel.
- * - the kexec kernel has to check whether a pfn was ballooned by the
- *   previous kernel. If the pfn is ballooned, handle it properly.
+ * The kdump kernel has to check whether a pfn of the crashed kernel
+ * was a ballooned page. vmcore is using this function to decide
+ * whether to access a pfn of the crashed kernel.
  * Returns 0 if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-- 
2.31.1



[PATCH v1 0/8] proc/vmcore: sanitize access to virtio-mem memory

2021-09-28 Thread David Hildenbrand
As so often with virtio-mem changes that mess with common MM
infrastructure, this might be a good candidate to go via Andrew's tree.

--

After removing /dev/kmem, sanitizing /proc/kcore, and handling /dev/mem,
this series tackles the last sane way a VM could accidentally access
logically unplugged memory managed by a virtio-mem device: /proc/vmcore.

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory that
is part of a virtio-mem device.

Patches #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram
mechanism. Patches #5-#7 are virtio-mem refactorings for patch #8, which
implements the virtio-mem logic to query the state of device blocks.

Patch #8:

"
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.
"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort. We'll never be able to control what runs inside
the second kernel, really, but we also don't have to care: we only care
about sane setups where we don't want our VM getting zapped once we
touch the wrong memory location while dumping. While we usually expect sane
setups to use "makedumpfile", nothing really speaks against just copying
/proc/vmcore, especially in environments where HW poisoning isn't typically
expected. Also, we really don't want to put all our trust completely on the
memmap, so sanitizing also makes sense when just using "makedumpfile".

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: "Rafael J. Wysocki" 
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: ke...@lists.infradead.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org

David Hildenbrand (8):
  x86/xen: update xen_oldmem_pfn_is_ram() documentation
  x86/xen: simplify xen_oldmem_pfn_is_ram()
  proc/vmcore: let pfn_is_ram() return a bool
  proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore
callbacks
  virtio-mem: factor out hotplug specifics from virtio_mem_init() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_probe() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_remove() into
virtio_mem_deinit_hotplug()
  virtio-mem: kdump mode to sanitize /proc/vmcore access

 arch/x86/kernel/aperture_64.c |  13 +-
 arch/x86/xen/mmu_hvm.c        |  31 ++--
 drivers/virtio/virtio_mem.c   | 297 --
 fs/proc/vmcore.c              | 105 
 include/linux/crash_dump.h    |  26 ++-
 5 files changed, 332 insertions(+), 140 deletions(-)


base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.31.1



Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

2021-09-28 Thread Leon Romanovsky
On Tue, Sep 28, 2021 at 06:59:15PM +0300, Max Gurtovoy wrote:
> 
> On 9/27/2021 9:23 PM, Leon Romanovsky wrote:
> > On Mon, Sep 27, 2021 at 08:25:09PM +0300, Max Gurtovoy wrote:
> > > On 9/27/2021 2:34 PM, Leon Romanovsky wrote:
> > > > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > > > To optimize performance, set the affinity of the block device tagset
> > > > > according to the virtio device affinity.
> > > > > 
> > > > > Signed-off-by: Max Gurtovoy 
> > > > > ---
> > > > >drivers/block/virtio_blk.c | 2 +-
> > > > >1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > > > --- a/drivers/block/virtio_blk.c
> > > > > +++ b/drivers/block/virtio_blk.c
> > > > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > > > >   memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > > > >   vblk->tag_set.ops = &virtio_mq_ops;
> > > > >   vblk->tag_set.queue_depth = queue_depth;
> > > > > - vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > > > + vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > > > I'm afraid that by doing this, you will increase the chances of
> > > > hitting OOM, because with NUMA_NO_NODE, MM will try to allocate
> > > > memory from the whole system, while in the latter mode only from a
> > > > specific NUMA node, which can be depleted.
> > > This is a common methodology we use in the block layer and in the
> > > NVMe subsystem, and we aren't afraid of the OOM issue you raised.
> > There are many reasons for that, but we are talking about virtio here
> > and not about NVMe.
> 
> OK. What reasons?

For example, NVMe devices are physical devices that rely on DMA
operations, PCI connectivity, etc., to operate. Such systems can indeed
benefit from NUMA locality hints. In the end, these devices are
physically connected to that NUMA node.

In our case, virtio-blk is a software interface that doesn't have all
of these limitations. On the contrary, a virtio-blk device can be
created on one CPU and later moved to be close to the QEMU process,
which can run on another NUMA node.

Also, this patch increases the chances of OOM by a factor of the number
of NUMA nodes. Before your patch, virtio_blk could allocate from X
memory; after your patch it will be X/NUM_NUMA_NODES.

In addition, it has every chance of even hurting performance.

So yes, post a v2, but as Stefan and I asked, please provide supporting
performance results, because what was done for another subsystem doesn't
mean it will be applicable here.

Thanks
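
(For readers weighing the OOM argument: tag_set.numa_node is handed down to
blk-mq's per-hctx allocations. A minimal illustration of what changes,
assuming standard kernel NUMA allocation semantics; note that
kzalloc_node() with a specific node only *prefers* that node and can still
fall back to others unless __GFP_THISNODE is set:

/* Illustration only: roughly what blk-mq does with tag_set->numa_node. */
static void *alloc_tag_data(struct blk_mq_tag_set *set, size_t size)
{
	/*
	 * NUMA_NO_NODE: no preference, any node may satisfy the request.
	 * A specific node: allocations gravitate to that node, which helps
	 * locality but concentrates pressure on that node's memory.
	 */
	return kzalloc_node(size, GFP_KERNEL, set->numa_node);
}

So the "X/NUM_NUMA_NODES" framing applies strictly to allocations that
insist on the node; plain node-preferring allocations degrade to remote
nodes rather than OOM.)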


Re: [GIT PULL] virtio,vdpa: fixes

2021-09-28 Thread pr-tracker-bot
The pull request you sent on Mon, 27 Sep 2021 18:30:03 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/d33bec7b3dfa36691ed53ccaaf187d90b53be852

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html