Re: [PATCH v2] virtio_balloon: Fix endless deflation and inflation on arm64

2023-10-02 Thread David Hildenbrand

On 25.09.23 01:58, Gavin Shan wrote:

Hi David and Michael,

On 8/31/23 11:10, Gavin Shan wrote:

A deflation request with a target that isn't aligned to the guest
page size causes endless deflation and inflation. For example, we
receive a flood of QMP events for changes to the memory balloon's
size after a deflation request with an unaligned target is sent to
an ARM64 guest, where we have a 64KB base page size.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
-accel kvm -machine virt,gic-version=host -cpu host  \
-smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
-m 1024M,slots=16,maxmem=64G \
-object memory-backend-ram,id=mem0,size=512M \
-object memory-backend-ram,id=mem1,size=512M \
-numa node,nodeid=0,memdev=mem0,cpus=0-3 \
-numa node,nodeid=1,memdev=mem1,cpus=4-7 \
  :  \
-device virtio-balloon-pci,id=balloon0,bus=pcie.10

{ "execute" : "balloon", "arguments": { "value" : 1073672192 } }
{"return": {}}
{"timestamp": {"seconds": 1693272173, "microseconds": 88667},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272174, "microseconds": 89704},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272175, "microseconds": 90819},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272176, "microseconds": 91961},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272177, "microseconds": 93040},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
{"timestamp": {"seconds": 1693272178, "microseconds": 94117},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
{"timestamp": {"seconds": 1693272179, "microseconds": 95337},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272180, "microseconds": 96615},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
{"timestamp": {"seconds": 1693272181, "microseconds": 97626},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272182, "microseconds": 98693},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
{"timestamp": {"seconds": 1693272183, "microseconds": 99698},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272184, "microseconds": 100727},  \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272185, "microseconds": 90430},   \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{"timestamp": {"seconds": 1693272186, "microseconds": 102999},  \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   :


Fix it by aligning the target up to the guest page size, 64KB in this
specific case. With this applied, no flooding QMP events are observed
and the memory balloon's size stabilizes at 0x3ffe soon
after the deflation request is sent.

{ "execute" : "balloon", "arguments": { "value" : 1073672192 } }
{"return": {}}
{"timestamp": {"seconds": 1693273328, "microseconds": 793075},  \
 "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
{ "execute" : "query-balloon" }
{"return": {"actual": 1073610752}}

Signed-off-by: Gavin Shan 
Tested-by: Zhenyu Zhang 
---
v2: Align @num_pages up to the guest page size in towards_target()
  directly as David suggested.
---
   drivers/virtio/virtio_balloon.c | 6 +-
   1 file changed, 5 insertions(+), 1 deletion(-)



If the patch looks good, could you please merge this into Linux 6.6-rc4, since
it's something needed by our downstream? I hope it can land upstream as early
as possible, thanks a lot.


@MST, I cannot spot it in your usual vhost git yet. Should I pick it up 
or what are your plans?


--
Cheers,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] virtio_balloon: Fix endless deflation and inflation on arm64

2023-08-31 Thread David Hildenbrand

On 31.08.23 03:10, Gavin Shan wrote:

A deflation request with a target that isn't aligned to the guest
page size causes endless deflation and inflation. For example, we
receive a flood of QMP events for changes to the memory balloon's
size after a deflation request with an unaligned target is sent to
an ARM64 guest, where we have a 64KB base page size.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
   -m 1024M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=512M \
   -object memory-backend-ram,id=mem1,size=512M \
   -numa node,nodeid=0,memdev=mem0,cpus=0-3 \
   -numa node,nodeid=1,memdev=mem1,cpus=4-7 \
 :  \
   -device virtio-balloon-pci,id=balloon0,bus=pcie.10

   { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
   {"return": {}}
   {"timestamp": {"seconds": 1693272173, "microseconds": 88667},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272174, "microseconds": 89704},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272175, "microseconds": 90819},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272176, "microseconds": 91961},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272177, "microseconds": 93040},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272178, "microseconds": 94117},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272179, "microseconds": 95337},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272180, "microseconds": 96615},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272181, "microseconds": 97626},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272182, "microseconds": 98693},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272183, "microseconds": 99698},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272184, "microseconds": 100727},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272185, "microseconds": 90430},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272186, "microseconds": 102999},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  :
   

Fix it by aligning the target up to the guest page size, 64KB in this
specific case. With this applied, no flooding QMP events are observed
and the memory balloon's size stabilizes at 0x3ffe soon
after the deflation request is sent.

   { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
   {"return": {}}
   {"timestamp": {"seconds": 1693273328, "microseconds": 793075},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   { "execute" : "query-balloon" }
   {"return": {"actual": 1073610752}}

Signed-off-by: Gavin Shan 
Tested-by: Zhenyu Zhang 
---
v2: Align @num_pages up to the guest page size in towards_target()
 directly as David suggested.
---
  drivers/virtio/virtio_balloon.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..2d5d252ef419 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -395,7 +395,11 @@ static inline s64 towards_target(struct virtio_balloon *vb)
virtio_cread_le(vb->vdev, struct virtio_balloon_config, num_pages,
&num_pages);
  
-	target = num_pages;
+   /*
+* Aligned up to guest page size to avoid inflating and deflating
+* balloon endlessly.
+*/
+   target = ALIGN(num_pages, VIRTIO_BALLOON_PAGES_PER_PAGE);
return target - vb->num_pages;
  }
  


Thanks!

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH] virtio_balloon: Fix endless deflation and inflation on arm64

2023-08-30 Thread David Hildenbrand

On 29.08.23 03:54, Gavin Shan wrote:

A deflation request with a target that isn't aligned to the guest
page size causes endless deflation and inflation. For example, we
receive a flood of QMP events for changes to the memory balloon's
size after a deflation request with an unaligned target is sent to
an ARM64 guest, where we have a 64KB base page size.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
   -m 1024M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=512M \
   -object memory-backend-ram,id=mem1,size=512M \
   -numa node,nodeid=0,memdev=mem0,cpus=0-3 \
   -numa node,nodeid=1,memdev=mem1,cpus=4-7 \
 :  \
   -device virtio-balloon-pci,id=balloon0,bus=pcie.10

   { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
   {"return": {}}
   {"timestamp": {"seconds": 1693272173, "microseconds": 88667},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272174, "microseconds": 89704},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272175, "microseconds": 90819},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272176, "microseconds": 91961},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272177, "microseconds": 93040},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272178, "microseconds": 94117},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272179, "microseconds": 95337},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272180, "microseconds": 96615},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272181, "microseconds": 97626},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272182, "microseconds": 98693},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
   {"timestamp": {"seconds": 1693272183, "microseconds": 99698},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272184, "microseconds": 100727},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272185, "microseconds": 90430},   \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   {"timestamp": {"seconds": 1693272186, "microseconds": 102999},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  :
   

Fix it by having the target aligned to the guest page size, 64KB
in this specific case. With this applied, no flooding QMP events
are observed and the memory balloon's size stabilizes at
0x3ffe soon after the deflation request is sent.

   { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
   {"return": {}}
   {"timestamp": {"seconds": 1693273328, "microseconds": 793075},  \
"event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
   { "execute" : "query-balloon" }
   {"return": {"actual": 1073610752}}

Signed-off-by: Gavin Shan 
---
  drivers/virtio/virtio_balloon.c | 13 -
  1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..625caac35264 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -386,6 +386,17 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
  }
  
+static inline s64 align_pages_up(s64 diff)
+{
+   if (diff == 0)
+   return diff;
+
+   if (diff > 0)
+   return ALIGN(diff, VIRTIO_BALLOON_PAGES_PER_PAGE);
+
+   return -ALIGN(-diff, VIRTIO_BALLOON_PAGES_PER_PAGE);
+}
+
  static inline s64 towards_target(struct virtio_balloon *vb)
  {
s64 target;
@@ -396,7 +407,7 @@ static inline s64 towards_target(struct virtio_balloon *vb)
&num_pages);
  
  	target = num_pages;
-   return target - vb->num_pages;

We know that vb->num_pages is always multiples of 
VIRTIO_BALLOON_PAGES_PER_PAGE.


Why not simply align the target up?

target = ALIGN(num_pages, VIRTIO_BALLOON_PAGES_PER_PAGE);
return target - vb->num_pages;


--
Cheers,

David / dhildenb



Re: [virtio-comment] virtio queue numbering and optional queues

2023-08-22 Thread David Hildenbrand

On 22.08.23 15:40, Stefan Hajnoczi wrote:

On Mon, Aug 21, 2023 at 03:18:50PM -0700, Daniel Verkamp wrote:

Hello virtio folks,


Hi Daniel,
I have CCed those involved in the free page hint and page reporting
features.

Stefan



I noticed a mismatch between the way the specification defines
device-specific virtqueue indexes and the way device and driver
implementers have interpreted the specification. As a practical example,
consider the traditional memory balloon device [1]. The first two queues
(indexes 0 and 1) are available as part of the baseline device, but the
rest of the queues are tied to feature bits.

Section 5.5.2, "Virtqueues", gives a list that appears to be a mapping from
queue index to queue name/function, defining queue index 3 as free_page_vq
and index 4 as reporting_vq, and declaring that "free_page_vq only exists
if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set" and "reporting_vq only exists if
VIRTIO_BALLOON_F_PAGE_REPORTING is set." This wording is a bit vague, but I
assume "is set" means "is negotiated" (not just "advertised by the
device").


Staring at QEMU: the queues are added when the virtio-balloon device is
*created*. Queues are added based on feature configuration, if they are
part of the device feature set.

That should translate to "is advertised", not "is negotiated".

The queue ordering is as follows:

* inflate queue, baseline device
* deflate queue, baseline device
* driver memory statistics, VIRTIO_BALLOON_F_STATS_VQ
* free page hinting, VIRTIO_BALLOON_F_FREE_PAGE_HINT
* free page reporting, VIRTIO_BALLOON_F_REPORTING

QEMU always supports the first 3, so they use indices 0-2. The other two
can be configured for the device.

So the queue indices vary based on actual feature presence.


Also presumably "exists" means something like "may only be used
by the driver if the feature bit is negotiated" and "should be ignored by
the device if the feature bit is not negotiated", although it would be nice
to have a proper definition in the spec somewhere.

Section 5.5.3, "Feature bits", gives definitions of the feature bits, with
similar descriptions of the relationship between the feature bits and
virtqueue availability, although the wording is slightly different
("present" rather than "exists"). No dependency between feature bits is
defined, so it seems like it should be valid for a device or driver to
support or accept one of the higher-numbered features while not supporting
a lower-numbered one.


Yes, that's my understanding.




Notably, there is no mention of queue index assignments changing based on
negotiated features in either of these sections. Hence a reader can only
assume that the queue index assignments are fixed (i.e. stats_vq will
always be vq index 2 if F_STATS_VQ is negotiated, regardless of any other
feature bits).


And that does not seem to be the case. At least QEMU assigns them sequentially,
based on actually configured features for the device.

If I read the kernel code correctly (drivers/virtio/virtio_balloon.c:init_vqs)
it also behaves that way: if the device has a certain feature
"virtio_has_feature", it gets the next index. Otherwise, the next index goes
to another feature:

/*
 * Inflateq and deflateq are used unconditionally. The names[]
 * will be NULL if the related feature is not enabled, which will
 * cause no allocation for the corresponding virtqueue in find_vqs.
 */
callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
callbacks[VIRTIO_BALLOON_VQ_STATS] = NULL;
names[VIRTIO_BALLOON_VQ_STATS] = NULL;
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;


if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
}

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
}

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
}

err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
  callbacks, names, NULL);
if (err)
return err;



Now consider a scenario where VIRTIO_BALLOON_F_STATS_VQ and
VIRTIO_BALLOON_F_PAGE_REPORTING are negotiated but
VIRTIO_BALLOON_F_FREE_PAGE_HINT is not (perhaps the device supports all of
the defined features but the driver only wants to use reporting_vq, not
free_pag

Re: [PATCH] virtio-balloon: correct the comment of virtballoon_migratepage()

2023-08-14 Thread David Hildenbrand

On 13.08.23 16:07, Xueshi Hu wrote:

After commit 68f2736a8583 ("mm: Convert all PageMovable users to
movable_operations"), the execution path has been changed to

move_to_new_folio
  movable_operations->migrate_page
    balloon_page_migrate
      balloon_page_migrate->balloon_page_migrate
        balloon_page_migrate

Correct the outdated comment.

Signed-off-by: Xueshi Hu 
---
  drivers/virtio/virtio_balloon.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..f5aac6cf3aa9 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -741,7 +741,7 @@ static void report_free_page_func(struct work_struct *work)
   *  2) update the host about the old page removed from vb->pages list;
   *
   * This function preforms the balloon page migration task.
- * Called through balloon_mapping->a_ops->migratepage
+ * Called through movable_operations->migrate_page
   */
  static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
struct page *newpage, struct page *page, enum migrate_mode mode)


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v1 0/4] virtio-mem: memory unplug/offlining related cleanups

2023-07-14 Thread David Hildenbrand

On 13.07.23 17:03, Michael S. Tsirkin wrote:

On Thu, Jul 13, 2023 at 04:55:47PM +0200, David Hildenbrand wrote:

Some cleanups+optimizations primarily around offline_and_remove_memory().

Patch #1 drops the "unsafe unplug" feature where we might get stuck in
offline_and_remove_memory() forever.

Patch #2 handles unexpected errors from offline_and_remove_memory() a bit
nicer.

Patch #3 handles the case where offline_and_remove_memory() failed and
we want to retry later to remove a completely unplugged Linux memory
block, to not have them waste memory forever.

Patch #4 something I had lying around for longer, which reacts faster
on config changes when unplugging memory.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 


This looks like something that's reasonable to put in this linux, right?
These are fixes even though they are for theoretical issues.


Yes, but these are not high-priority fixes+optimizations. So if you feel 
like we should be delaying them, fine with me.


On the other hand, getting them in now also shouldn't really hurt. 
Especially patch #1 might be better off just going in sooner rather than later.


--
Cheers,

David / dhildenb



[PATCH v1 4/4] virtio-mem: check if the config changed before fake offlining memory

2023-07-13 Thread David Hildenbrand
If we repeatedly fail to fake offline memory to unplug it, we won't be
sending any unplug requests to the device. However, we only check if the
config changed when sending such (un)plug requests.

We could end up trying for a long time to unplug memory, even though
the config changed already and we're not supposed to unplug memory
anymore. For example, the hypervisor might detect a low-memory situation
while unplugging memory and decide to replug some memory. Continuing
trying to unplug memory in that case can be problematic.

So let's check on a more regular basis.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a5cf92e3e5af..fa5226c198cc 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1189,7 +1189,8 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
  * Try to allocate a range, marking pages fake-offline, effectively
  * fake-offlining them.
  */
-static int virtio_mem_fake_offline(unsigned long pfn, unsigned long nr_pages)
+static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
+  unsigned long nr_pages)
 {
const bool is_movable = is_zone_movable_page(pfn_to_page(pfn));
int rc, retry_count;
@@ -1202,6 +1203,14 @@ static int virtio_mem_fake_offline(unsigned long pfn, unsigned long nr_pages)
 * some guarantees.
 */
for (retry_count = 0; retry_count < 5; retry_count++) {
+   /*
+* If the config changed, stop immediately and go back to the
+* main loop: avoid trying to keep unplugging if the device
+* might have decided to not remove any more memory.
+*/
+   if (atomic_read(&vm->config_changed))
+   return -EAGAIN;
+
rc = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE,
GFP_KERNEL);
if (rc == -ENOMEM)
@@ -1951,7 +1960,7 @@ static int virtio_mem_sbm_unplug_sb_online(struct virtio_mem *vm,
start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
 sb_id * vm->sbm.sb_size);
 
-   rc = virtio_mem_fake_offline(start_pfn, nr_pages);
+   rc = virtio_mem_fake_offline(vm, start_pfn, nr_pages);
if (rc)
return rc;
 
@@ -2149,7 +2158,7 @@ static int virtio_mem_bbm_offline_remove_and_unplug_bb(struct virtio_mem *vm,
if (!page)
continue;
 
-   rc = virtio_mem_fake_offline(pfn, PAGES_PER_SECTION);
+   rc = virtio_mem_fake_offline(vm, pfn, PAGES_PER_SECTION);
if (rc) {
end_pfn = pfn;
goto rollback;
-- 
2.41.0



[PATCH v1 2/4] virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY

2023-07-13 Thread David Hildenbrand
Just like we do with alloc_contig_range(), let's convert all unknown
errors to -EBUSY, but WARN so we can look into the issue. For example,
offline_pages() could fail with -EINTR, which would be unexpected in our
case.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index ed15d2a4bd96..1a76ba2bc118 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -741,11 +741,15 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
 * immediately instead of waiting.
 */
virtio_mem_retry(vm);
-   } else {
-   dev_dbg(&vm->vdev->dev,
-   "offlining and removing memory failed: %d\n", rc);
+   return 0;
}
-   return rc;
+   dev_dbg(&vm->vdev->dev, "offlining and removing memory failed: %d\n", rc);
+   /*
+* We don't really expect this to fail, because we fake-offlined all
+* memory already. But it could fail in corner cases.
+*/
+   WARN_ON_ONCE(rc != -ENOMEM && rc != -EBUSY);
+   return rc == -ENOMEM ? -ENOMEM : -EBUSY;
 }
 
 /*
-- 
2.41.0



[PATCH v1 3/4] virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block Mode (SBM)

2023-07-13 Thread David Hildenbrand
In case offline_and_remove_memory() fails in SBM, we leave a completely
unplugged Linux memory block stick around until we try plugging memory
again. We won't try removing that memory block again.

offline_and_remove_memory() may, for example, fail if we're racing with
another alloc_contig_range() user, if allocating temporary memory fails,
or if some memory notifier rejected the offlining request.

Let's handle that case better, by simply retrying to offline and remove
such memory.

Tested using CONFIG_MEMORY_NOTIFIER_ERROR_INJECT.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 92 +
 1 file changed, 73 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 1a76ba2bc118..a5cf92e3e5af 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -168,6 +168,13 @@ struct virtio_mem {
/* The number of subblocks per Linux memory block. */
uint32_t sbs_per_mb;
 
+   /*
+* Some of the Linux memory blocks tracked as "partially
+* plugged" are completely unplugged and can be offlined
+* and removed -- which previously failed.
+*/
+   bool have_unplugged_mb;
+
/* Summary of all memory block states. */
unsigned long mb_count[VIRTIO_MEM_SBM_MB_COUNT];
 
@@ -765,6 +772,34 @@ static int virtio_mem_sbm_offline_and_remove_mb(struct virtio_mem *vm,
return virtio_mem_offline_and_remove_memory(vm, addr, size);
 }
 
+/*
+ * Try (offlining and) removing memory from Linux in case all subblocks are
+ * unplugged. Can be called on online and offline memory blocks.
+ *
+ * May modify the state of memory blocks in virtio-mem.
+ */
+static int virtio_mem_sbm_try_remove_unplugged_mb(struct virtio_mem *vm,
+ unsigned long mb_id)
+{
+   int rc;
+
+   /*
+* Once all subblocks of a memory block were unplugged, offline and
+* remove it.
+*/
+   if (!virtio_mem_sbm_test_sb_unplugged(vm, mb_id, 0, vm->sbm.sbs_per_mb))
+   return 0;
+
+   /* offline_and_remove_memory() works for online and offline memory. */
+   mutex_unlock(&vm->hotplug_mutex);
+   rc = virtio_mem_sbm_offline_and_remove_mb(vm, mb_id);
+   mutex_lock(&vm->hotplug_mutex);
+   if (!rc)
+   virtio_mem_sbm_set_mb_state(vm, mb_id,
+   VIRTIO_MEM_SBM_MB_UNUSED);
+   return rc;
+}
+
 /*
  * See virtio_mem_offline_and_remove_memory(): Try to offline and remove a
  * all Linux memory blocks covered by the big block.
@@ -1988,20 +2023,10 @@ static int virtio_mem_sbm_unplug_any_sb_online(struct virtio_mem *vm,
}
 
 unplugged:
-   /*
-* Once all subblocks of a memory block were unplugged, offline and
-* remove it. This will usually not fail, as no memory is in use
-* anymore - however some other notifiers might NACK the request.
-*/
-   if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
-   mutex_unlock(&vm->hotplug_mutex);
-   rc = virtio_mem_sbm_offline_and_remove_mb(vm, mb_id);
-   mutex_lock(&vm->hotplug_mutex);
-   if (!rc)
-   virtio_mem_sbm_set_mb_state(vm, mb_id,
-   VIRTIO_MEM_SBM_MB_UNUSED);
-   }
-
+   rc = virtio_mem_sbm_try_remove_unplugged_mb(vm, mb_id);
+   if (rc)
+   vm->sbm.have_unplugged_mb = 1;
+   /* Ignore errors, this is not critical. We'll retry later. */
return 0;
 }
 
@@ -2253,12 +2278,13 @@ static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
 
 /*
  * Try to unplug all blocks that couldn't be unplugged before, for example,
- * because the hypervisor was busy.
+ * because the hypervisor was busy. Further, offline and remove any memory
+ * blocks where we previously failed.
  */
-static int virtio_mem_unplug_pending_mb(struct virtio_mem *vm)
+static int virtio_mem_cleanup_pending_mb(struct virtio_mem *vm)
 {
unsigned long id;
-   int rc;
+   int rc = 0;
 
if (!vm->in_sbm) {
virtio_mem_bbm_for_each_bb(vm, id,
@@ -2280,6 +2306,27 @@ static int virtio_mem_unplug_pending_mb(struct virtio_mem *vm)
VIRTIO_MEM_SBM_MB_UNUSED);
}
 
+   if (!vm->sbm.have_unplugged_mb)
+   return 0;
+
+   /*
+* Let's retry (offlining and) removing completely unplugged Linux
+* memory blocks.
+*/
+   vm->sbm.have_unplugged_mb = false;
+
+   mutex_lock(&vm->hotplug_mutex);
+  

[PATCH v1 0/4] virtio-mem: memory unplug/offlining related cleanups

2023-07-13 Thread David Hildenbrand
Some cleanups+optimizations primarily around offline_and_remove_memory().

Patch #1 drops the "unsafe unplug" feature where we might get stuck in
offline_and_remove_memory() forever.

Patch #2 handles unexpected errors from offline_and_remove_memory() a bit
nicer.

Patch #3 handles the case where offline_and_remove_memory() failed and
we want to retry later to remove a completely unplugged Linux memory
block, to not have them waste memory forever.

Patch #4 something I had lying around for longer, which reacts faster
on config changes when unplugging memory.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 

David Hildenbrand (4):
  virtio-mem: remove unsafe unplug in Big Block Mode (BBM)
  virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY
  virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub
Block Mode (SBM)
  virtio-mem: check if the config changed before fake offlining memory

 drivers/virtio/virtio_mem.c | 168 
 1 file changed, 112 insertions(+), 56 deletions(-)


base-commit: 3f01e9fed8454dcd89727016c3e5b2fbb8f8e50c
-- 
2.41.0



[PATCH v1 1/4] virtio-mem: remove unsafe unplug in Big Block Mode (BBM)

2023-07-13 Thread David Hildenbrand
When "unsafe unplug" is enabled, we don't fake-offline all memory ahead of
actual memory offlining using alloc_contig_range(). Instead, we rely on
offline_pages() to also perform actual page migration, which might fail
or take a very long time.

In that case, it's possible to easily run into endless loops that cannot be
aborted anymore (as offlining is triggered by a workqueue then): For
example, a single (accidentally) permanently unmovable page in
ZONE_MOVABLE results in an endless loop. For ZONE_NORMAL, races between
isolating the pageblock (and checking for unmovable pages) and
concurrent page allocation are possible and similarly result in endless
loops.

The idea of the unsafe unplug mode was to make it possible to more
reliably unplug large memory blocks. However, (a) we really should be
tackling that differently, by extending the alloc_contig_range()-based
mechanism; and (b) this mode is not the default and as far as I know,
it's unused either way.

So let's simply get rid of it.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 51 +++--
 1 file changed, 20 insertions(+), 31 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 835f6cc2fb66..ed15d2a4bd96 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -38,11 +38,6 @@ module_param(bbm_block_size, ulong, 0444);
 MODULE_PARM_DESC(bbm_block_size,
 "Big Block size in bytes. Default is 0 (auto-detection).");
 
-static bool bbm_safe_unplug = true;
-module_param(bbm_safe_unplug, bool, 0444);
-MODULE_PARM_DESC(bbm_safe_unplug,
-"Use a safe unplug mechanism in BBM, avoiding long/endless loops");
-
 /*
  * virtio-mem currently supports the following modes of operation:
  *
@@ -2111,38 +2106,32 @@ static int virtio_mem_bbm_offline_remove_and_unplug_bb(struct virtio_mem *vm,
 VIRTIO_MEM_BBM_BB_ADDED))
return -EINVAL;
 
-   if (bbm_safe_unplug) {
-   /*
-* Start by fake-offlining all memory. Once we marked the device
-* block as fake-offline, all newly onlined memory will
-* automatically be kept fake-offline. Protect from concurrent
-* onlining/offlining until we have a consistent state.
-*/
-   mutex_lock(&vm->hotplug_mutex);
-   virtio_mem_bbm_set_bb_state(vm, bb_id,
-   VIRTIO_MEM_BBM_BB_FAKE_OFFLINE);
+   /*
+* Start by fake-offlining all memory. Once we marked the device
+* block as fake-offline, all newly onlined memory will
+* automatically be kept fake-offline. Protect from concurrent
+* onlining/offlining until we have a consistent state.
+*/
+   mutex_lock(&vm->hotplug_mutex);
+   virtio_mem_bbm_set_bb_state(vm, bb_id, VIRTIO_MEM_BBM_BB_FAKE_OFFLINE);
 
-   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-   page = pfn_to_online_page(pfn);
-   if (!page)
-   continue;
+   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   page = pfn_to_online_page(pfn);
+   if (!page)
+   continue;
 
-   rc = virtio_mem_fake_offline(pfn, PAGES_PER_SECTION);
-   if (rc) {
-   end_pfn = pfn;
-   goto rollback_safe_unplug;
-   }
+   rc = virtio_mem_fake_offline(pfn, PAGES_PER_SECTION);
+   if (rc) {
+   end_pfn = pfn;
+   goto rollback;
}
-   mutex_unlock(&vm->hotplug_mutex);
}
+   mutex_unlock(&vm->hotplug_mutex);
 
rc = virtio_mem_bbm_offline_and_remove_bb(vm, bb_id);
if (rc) {
-   if (bbm_safe_unplug) {
-   mutex_lock(&vm->hotplug_mutex);
-   goto rollback_safe_unplug;
-   }
-   return rc;
+   mutex_lock(&vm->hotplug_mutex);
+   goto rollback;
}
 
rc = virtio_mem_bbm_unplug_bb(vm, bb_id);
@@ -2154,7 +2143,7 @@ static int virtio_mem_bbm_offline_remove_and_unplug_bb(struct virtio_mem *vm,
VIRTIO_MEM_BBM_BB_UNUSED);
return rc;
 
-rollback_safe_unplug:
+rollback:
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
page = pfn_to_online_page(pfn);
if (!page)
-- 
2.41.0



Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

2023-06-27 Thread David Hildenbrand

On 27.06.23 16:17, Michal Hocko wrote:

On Tue 27-06-23 15:14:11, David Hildenbrand wrote:

On 27.06.23 14:40, Michal Hocko wrote:

On Tue 27-06-23 13:22:18, David Hildenbrand wrote:

John Hubbard writes [1]:

  Some device drivers add memory to the system via memory hotplug.
  When the driver is unloaded, that memory is hot-unplugged.

  However, memory hot unplug can fail. And these days, it fails a
  little too easily, with respect to the above case. Specifically, if
  a signal is pending on the process, hot unplug fails.

  [...]

  So in this case, other things (unmovable pages, un-splittable huge
  pages) can also cause the above problem. However, those are
  demonstrably less common than simply having a pending signal. I've
  got bug reports from users who can trivially reproduce this by
  killing their process with a "kill -9", for example.


This looks like a bug of the said driver no? If the tear down process is
killed it could very well happen right before offlining so you end up in
the very same state. Or what am I missing?


IIUC (John can correct me if I am wrong):

1) The process holds the device node open
2) The process gets killed or quits
3) As the process gets torn down, it closes the device node
4) Closing the device node results in the driver removing the device and
calling offline_and_remove_memory()

So it's not a "tear down process" that triggers that offlining_removal
somehow explicitly, it's just a side-product of it letting go of the device
node as the process gets torn down.


Isn't that just fragile? The operation might fail for other reasons. Why
can't there be a hold on the resource to control the tear down
explicitly?


I'll let John comment on that. But from what I understood, in most 
setups where ZONE_MOVABLE gets used for hotplugged memory 
offline_and_remove_memory() succeeds and allows for reusing the device 
later without a reboot.


For the cases where it doesn't work, a reboot is required.




Especially with ZONE_MOVABLE, offlining is supposed to work in most
cases when offlining actually hotplugged (not boot) memory, and only fail
in rare corner cases (e.g., some driver holds a reference to a page in
ZONE_MOVABLE, turning it unmovable).

In these corner cases we really don't want to be stuck forever in
offline_and_remove_memory(). But in the general cases, we really want to
do our best to make memory offlining succeed -- in a reasonable
timeframe.

Reliably failing in the described case when there is a fatal signal pending
is sub-optimal. The pending signal check is mostly only relevant when user
space explicitly triggers offlining of memory using sysfs device attributes
("state" or "online" attribute), but not when coming via
offline_and_remove_memory().

So let's use a timer instead and ignore fatal signals, because they are
not really expressive for offline_and_remove_memory() users. Let's default
to 30 seconds if no timeout was specified, and limit the timeout to 120
seconds.


I really hate having timeouts back. They have just proven to be hard to get
right, and it is essentially a policy implemented in the kernel. They
simply do not belong in kernel space, IMHO.


As much as I agree with you in terms of offlining triggered from user space
(e.g., writing the "state" or "online" attribute), where user space is actually
in charge and can do something reasonable (timeout, retry, whatever), in
the offline_and_remove_memory() case it's the driver that wants a
best-effort memory offlining+removal.

If it times out, virtio-mem will simply try another block or retry later.
Right now, it could get stuck forever in offline_and_remove_memory(), which
is obviously "not great". Fortunately, for virtio-mem it's configurable and
we use the alloc_contig_range()-method for now as default.


It seems that offline_and_remove_memory is using a wrong operation then.
If it wants an opportunistic offlining with some sort of policy. Timeout
might be just one policy to use but failure mode or a retry count might
be a better fit for some users. So rather than (ab)using offline_pages,
would be make more sense to extract basic offlining steps and allow
drivers like virtio-mem to reuse them and define their own policy?


virtio-mem, in default operation, does that: use alloc_contig_range() to 
logically unplug ("fake offline") that memory and then just trigger 
offline_and_remove_memory() to make it "officially offline".


In that mode, offline_and_remove_memory() cannot really time out and is 
almost always going to succeed (except memory notifiers and some hugetlb 
dissolving).


Right now we also allow the admin to configure ordinary offlining 
directly (without prior fake offlining) when bigger memory blocks are 
used: offline_pages() is more reliab

Re: [PATCH v1 1/5] mm/memory_hotplug: check for fatal signals only in offline_pages()

2023-06-27 Thread David Hildenbrand

On 27.06.23 14:34, Michal Hocko wrote:

On Tue 27-06-23 13:22:16, David Hildenbrand wrote:

Let's check for fatal signals only. That looks cleaner and still keeps
the documented use case for manual user-space triggered memory offlining
working. From Documentation/admin-guide/mm/memory-hotplug.rst:

% timeout $TIMEOUT offline_block | failure_handling

In fact, we even document there: "the offlining context can be terminated
by sending a fatal signal".


We should be fixing the documentation instead. This could break users who do
have a SIGALRM signal handler installed.


You mean because timeout will send a SIGALRM, which is not considered 
fatal in case a signal handler is installed?


At least the "traditional" tools I am aware of don't set a timeout at 
all (crossing fingers that they never end up stuck):

* chmem
* QEMU guest agent
* powerpc-utils

libdaxctl also doesn't seem to implement an easy-to-spot timeout for 
memory offlining, but it also doesn't configure SIGALRM.



Of course, that doesn't mean that there isn't somewhere a program that 
does that; I merely assume that it would be pretty unlikely to find such 
a program.


But no strong opinion: we can also keep it like that, update the doc and 
add a comment why this one here is different than most other signal 
backoff checks.



Thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

2023-06-27 Thread David Hildenbrand

On 27.06.23 14:40, Michal Hocko wrote:

On Tue 27-06-23 13:22:18, David Hildenbrand wrote:

John Hubbard writes [1]:

 Some device drivers add memory to the system via memory hotplug.
 When the driver is unloaded, that memory is hot-unplugged.

 However, memory hot unplug can fail. And these days, it fails a
 little too easily, with respect to the above case. Specifically, if
 a signal is pending on the process, hot unplug fails.

 [...]

 So in this case, other things (unmovable pages, un-splittable huge
 pages) can also cause the above problem. However, those are
 demonstrably less common than simply having a pending signal. I've
 got bug reports from users who can trivially reproduce this by
 killing their process with a "kill -9", for example.


This looks like a bug of the said driver no? If the tear down process is
killed it could very well happen right before offlining so you end up in
the very same state. Or what am I missing?


IIUC (John can correct me if I am wrong):

1) The process holds the device node open
2) The process gets killed or quits
3) As the process gets torn down, it closes the device node
4) Closing the device node results in the driver removing the device and
   calling offline_and_remove_memory()

So it's not a "tear down process" that triggers that offlining_removal 
somehow explicitly, it's just a side-product of it letting go of the 
device node as the process gets torn down.


  

Especially with ZONE_MOVABLE, offlining is supposed to work in most
cases when offlining actually hotplugged (not boot) memory, and only fail
in rare corner cases (e.g., some driver holds a reference to a page in
ZONE_MOVABLE, turning it unmovable).

In these corner cases we really don't want to be stuck forever in
offline_and_remove_memory(). But in the general cases, we really want to
do our best to make memory offlining succeed -- in a reasonable
timeframe.

Reliably failing in the described case when there is a fatal signal pending
is sub-optimal. The pending signal check is mostly only relevant when user
space explicitly triggers offlining of memory using sysfs device attributes
("state" or "online" attribute), but not when coming via
offline_and_remove_memory().

So let's use a timer instead and ignore fatal signals, because they are
not really expressive for offline_and_remove_memory() users. Let's default
to 30 seconds if no timeout was specified, and limit the timeout to 120
seconds.


I really hate having timeouts back. They have just proven to be hard to get
right, and it is essentially a policy implemented in the kernel. They
simply do not belong in kernel space, IMHO.


As much as I agree with you in terms of offlining triggered from user 
space (e.g., writing the "state" or "online" attribute), where user space is 
actually in charge and can do something reasonable (timeout, retry, 
whatever), in the offline_and_remove_memory() case it's the driver 
that wants a best-effort memory offlining+removal.


If it times out, virtio-mem will simply try another block or retry 
later. Right now, it could get stuck forever in 
offline_and_remove_memory(), which is obviously "not great". 
Fortunately, for virtio-mem it's configurable and we use the 
alloc_contig_range()-method for now as default.


If it would time out for John's driver, we most certainly don't want to 
be stuck in offline_and_remove_memory(), blocking device/driver 
unloading (and even a reboot IIRC) possibly forever.



I much rather have offline_and_remove_memory() indicate "timeout" to an 
in-kernel user a bit earlier than getting stuck in there forever. The 
timeout parameter allows for giving the in-kernel users a bit of 
flexibility, which I showcase for virtio-mem, which unplugs smaller 
blocks and rather wants to fail fast and retry later.



Sure, we could make the timeout configurable to optimize for some corner 
cases, but that's not really what offline_and_remove_memory() users want 
and I doubt anybody would fine-tune that: they want a best-effort 
attempt. And that's IMHO not really a policy, it's an implementation 
detail of these drivers.


For example, the driver from John could simply never call 
offline_and_remove_memory() and always require a reboot when wanting to 
reuse a device. But that's definitely not what users want.


virtio-mem could simply never call offline_and_remove_memory() and 
indicate "I don't support unplug of these online memory blocks". But 
that's *definitely* not what users want.



I'm very open for alternatives regarding offline_and_remove_memory(), so 
far this was the only reasonable thing I could come up with that 
actually achieves what we want for these users: not get stuck in there 
forever but rath

[PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

2023-06-27 Thread David Hildenbrand
John Hubbard writes [1]:

Some device drivers add memory to the system via memory hotplug.
When the driver is unloaded, that memory is hot-unplugged.

However, memory hot unplug can fail. And these days, it fails a
little too easily, with respect to the above case. Specifically, if
a signal is pending on the process, hot unplug fails.

[...]

So in this case, other things (unmovable pages, un-splittable huge
pages) can also cause the above problem. However, those are
demonstrably less common than simply having a pending signal. I've
got bug reports from users who can trivially reproduce this by
killing their process with a "kill -9", for example.

Especially with ZONE_MOVABLE, offlining is supposed to work in most
cases when offlining actually hotplugged (not boot) memory, and only fail
in rare corner cases (e.g., some driver holds a reference to a page in
ZONE_MOVABLE, turning it unmovable).

In these corner cases we really don't want to be stuck forever in
offline_and_remove_memory(). But in the general cases, we really want to
do our best to make memory offlining succeed -- in a reasonable
timeframe.

Reliably failing in the described case when there is a fatal signal pending
is sub-optimal. The pending signal check is mostly only relevant when user
space explicitly triggers offlining of memory using sysfs device attributes
("state" or "online" attribute), but not when coming via
offline_and_remove_memory().

So let's use a timer instead and ignore fatal signals, because they are
not really expressive for offline_and_remove_memory() users. Let's default
to 30 seconds if no timeout was specified, and limit the timeout to 120
seconds.

This change is also valuable for virtio-mem in BBM (Big Block Mode) with
"bbm_safe_unplug=off", to avoid endless loops when stuck forever in
offline_and_remove_memory().

While at it, drop the "extern" from offline_and_remove_memory() to make
it fit into a single line.

[1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubb...@nvidia.com

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c|  2 +-
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c| 50 --
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index cb8bc6f6aa90..f8792223f1db 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -738,7 +738,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
 
-   rc = offline_and_remove_memory(addr, size);
+   rc = offline_and_remove_memory(addr, size, 0);
if (!rc) {
atomic64_sub(size, &vm->offline_size);
/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9fcbf5706595..d5f9e8b5a4a4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -307,7 +307,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 struct zone *zone, struct memory_group *group);
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
-extern int offline_and_remove_memory(u64 start, u64 size);
+int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms);
 
 #else
 static inline void try_offline_node(int nid) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d2151df4ee1..ca635121644a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -152,6 +152,22 @@ void put_online_mems(void)
 
 bool movable_node_enabled = false;
 
+/*
+ * Protected by the device hotplug lock: offline_and_remove_memory()
+ * will activate a timer such that offlining cannot be stuck forever.
+ *
+ * With an active timer, fatal signals will be ignored, because they can be
+ * counter-productive when dying user space triggers device unplug/driver
+ * unloading that ends up offlining+removing device memory.
+ */
+static bool mhp_offlining_timer_active;
+static atomic_t mhp_offlining_timer_expired;
+
+static void mhp_offline_timer_fn(struct timer_list *unused)
+{
+   atomic_set(&mhp_offlining_timer_expired, 1);
+}
+
 #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
 int mhp_default_online_type = MMOP_OFFLINE;
 #else
@@ -1879,7 +1895,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
do {
pfn = start_pfn;
do {
-   if (fatal_signal_pending(current)) {
+   /*
+* If a timer is active, we're coming via
+* offline_and_remove_memory() and want to ignore even

[PATCH v1 4/5] virtio-mem: set the timeout for offline_and_remove_memory() to 10 seconds

2023-06-27 Thread David Hildenbrand
Currently we use the default (30 seconds); let's reduce it to 10
seconds. In BBM, we barely deal with blocks larger than 1/2 GiB, and
after 10 seconds it's most probably best to give up on that memory block
and try another one (or retry this one later).

In the common fake-offline case where we effectively fake-offline memory
using alloc_contig_range() first (SBM or BBM with bbm_safe_unplug=on),
we expect offline_and_remove_memory() to be blazingly fast and never take
anywhere close to 10 seconds -- so this should only affect BBM with
bbm_safe_unplug=off.

While at it, update the parameter description to clarify the relationship
to unmovable pages.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index f8792223f1db..7468b4a907e3 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -41,7 +41,7 @@ MODULE_PARM_DESC(bbm_block_size,
 static bool bbm_safe_unplug = true;
 module_param(bbm_safe_unplug, bool, 0444);
 MODULE_PARM_DESC(bbm_safe_unplug,
-"Use a safe unplug mechanism in BBM, avoiding long/endless loops");
+"Use a safe/fast unplug mechanism in BBM, failing faster on unmovable pages");
 
 /*
  * virtio-mem currently supports the following modes of operation:
@@ -738,7 +738,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
 
-   rc = offline_and_remove_memory(addr, size, 0);
+   rc = offline_and_remove_memory(addr, size, 10 * MSEC_PER_SEC);
if (!rc) {
atomic64_sub(size, &vm->offline_size);
/*
-- 
2.40.1



[PATCH v1 5/5] virtio-mem: check if the config changed before (fake) offlining memory

2023-06-27 Thread David Hildenbrand
If we repeatedly fail to (fake) offline memory, we won't be sending
any unplug requests to the device. However, we only check if the config
changed when sending such (un)plug requests.

So we could end up trying for a long time to offline memory, even though
the config changed already and we're not supposed to unplug memory
anymore.

Let's optimize for that case, identified while testing the
offline_and_remove_memory() timeout and simulating repeatedly running
into the timeout.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 7468b4a907e3..247fb3e0ce61 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1922,6 +1922,10 @@ static int virtio_mem_sbm_unplug_sb_online(struct virtio_mem *vm,
unsigned long start_pfn;
int rc;
 
+   /* Stop fake offlining attempts if the config changed. */
+   if (atomic_read(&vm->config_changed))
+   return -EAGAIN;
+
start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
 sb_id * vm->sbm.sb_size);
 
@@ -2233,6 +2237,10 @@ static int virtio_mem_bbm_unplug_request(struct virtio_mem *vm, uint64_t diff)
virtio_mem_bbm_for_each_bb_rev(vm, bb_id, VIRTIO_MEM_BBM_BB_ADDED) {
cond_resched();
cond_resched();
 
+   /* Stop (fake) offlining attempts if the config changed. */
+   if (atomic_read(&vm->config_changed))
+   return -EAGAIN;
+
/*
 * As we're holding no locks, these checks are racy,
 * but we don't care.
-- 
2.40.1



[PATCH v1 2/5] virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY

2023-06-27 Thread David Hildenbrand
Let's prepare for offline_and_remove_memory() to return other error
codes that effectively translate to -EBUSY, such as -ETIMEDOUT.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 835f6cc2fb66..cb8bc6f6aa90 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -750,7 +750,15 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
dev_dbg(&vm->vdev->dev,
"offlining and removing memory failed: %d\n", rc);
}
-   return rc;
+
+   switch (rc) {
+   case 0:
+   case -ENOMEM:
+   case -EINVAL:
+   return rc;
+   default:
+   return -EBUSY;
+   }
 }
 
 /*
-- 
2.40.1



[PATCH v1 0/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

2023-06-27 Thread David Hildenbrand
As raised by John Hubbard [1], offline_and_remove_memory() failing on
fatal signals can be sub-optimal for out-of-tree drivers: dying user space
might be the last one holding a device node open.

As that device node gets closed, the driver might unplug the device
and trigger offline_and_remove_memory() to unplug previously
hotplugged device memory. This, however, will fail reliably when fatal
signals are pending on the dying process, turning the device unusable until
the machine gets rebooted.

That can be optimized easily by ignoring fatal signals. In fact, checking
for fatal signals in the case of offline_and_remove_memory() doesn't
make too much sense; the check makes sense when offlining is triggered
directly via sysfs. However, we actually do want a way to not end up
stuck in offline_and_remove_memory() forever.

What offline_and_remove_memory() users actually want is to fail after some
given timeout and not care about fatal signals.

So let's implement that, optimizing virtio-mem along the way.

Cc: Andrew Morton 
Cc: "Michael S. Tsirkin" 
Cc: John Hubbard 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Jason Wang 
Cc: Xuan Zhuo 

[1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubb...@nvidia.com

David Hildenbrand (5):
  mm/memory_hotplug: check for fatal signals only in offline_pages()
  virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY
  mm/memory_hotplug: make offline_and_remove_memory() timeout instead of
failing on fatal signals
  virtio-mem: set the timeout for offline_and_remove_memory() to 10
seconds
  virtio-mem: check if the config changed before (fake) offlining memory

 drivers/virtio/virtio_mem.c| 22 +--
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c| 50 --
 3 files changed, 68 insertions(+), 6 deletions(-)


base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
-- 
2.40.1



[PATCH v1 1/5] mm/memory_hotplug: check for fatal signals only in offline_pages()

2023-06-27 Thread David Hildenbrand
Let's check for fatal signals only. That looks cleaner and still keeps
the documented use case for manual user-space triggered memory offlining
working. From Documentation/admin-guide/mm/memory-hotplug.rst:

% timeout $TIMEOUT offline_block | failure_handling

In fact, we even document there: "the offlining context can be terminated
by sending a fatal signal".

Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8e0fa209d533..0d2151df4ee1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1879,7 +1879,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
do {
pfn = start_pfn;
do {
-   if (signal_pending(current)) {
+   if (fatal_signal_pending(current)) {
ret = -EINTR;
reason = "signal backoff";
goto failed_removal_isolated;
-- 
2.40.1



Re: [PATCH 08/26] virtio-mem: use array_size

2023-06-26 Thread David Hildenbrand

On 23.06.23 23:14, Julia Lawall wrote:

Use array_size to protect against multiplication overflows.

The changes were done using the following Coccinelle semantic patch:

// 
@@
 expression E1, E2;
 constant C1, C2;
 identifier alloc = {vmalloc,vzalloc};
@@
 
(

   alloc(C1 * C2,...)
|
   alloc(
-   (E1) * (E2)
+   array_size(E1, E2)
   ,...)
)
// 

Signed-off-by: Julia Lawall 

---
  drivers/virtio/virtio_mem.c |6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)




Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support

2023-05-22 Thread David Hildenbrand

On 21.05.23 11:15, Björn Töpel wrote:

Hi David and Anshuman!

Björn Töpel  writes:


David Hildenbrand  writes:


On 12.05.23 16:57, Björn Töpel wrote:

From: Björn Töpel 

Memory Hot(Un)Plug support for the RISC-V port
==


[...]



Cool stuff! I'm fairly busy right now, so some high-level questions upfront:


No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
folks, and I've got tons of (offline) comments from Alex -- and with
your comments below some more details to figure out.


One of the major issues with my v1 patch is around init_mm page table
synchronization, and that'll be part of the v2.

I've noticed there's a quite a difference between x86-64 and arm64 in
terms of locking, when updating (add/remove) the init_mm table. x86-64
uses the usual page table locking mechanisms (used by the generic
kernel functions), whereas arm64 does not.

How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
and fences in set_p?d+friends), with the generic kernel ones that uses
the regular page locking mechanism?

I'm obviously missing something about the locking rules for memory hot
add/remove... I've been reading the arm64 memory hot add/remove
series, but none the wiser! ;-)


In general, memory hot(un)plug is serialized on a high level using the 
mem_hotplug_lock. For example, in pagemap_range() or in 
add_memory_resource(), we grab that lock in write mode. So we'll never 
see memory getting added/removed concurrently from the direct map.


From what I recall, the locking on the arch level is required for 
concurrent (direct mapping) page table modifications that target virtual 
address ranges adjacent to the ranges we hot(un)plug:

CONFIG_ARCH_HAS_SET_DIRECT_MAP and vmalloc come to mind.

For example, if a range would be mapped using a large PUD, but we have 
to unplug it partially (unplugging memory part of bootmem), we'd have to 
replace the large PUD by a PMD table first. That change (that could 
affect other concurrent page table walkers/operations) has to be 
synchronized.


I guess to which degree this applies to riscv depends on the virtual 
memory layout, direct mapping granularity and features (e.g., 
CONFIG_ARCH_HAS_SET_DIRECT_MAP).



One trick that arm64 implements is that it only allows hotunplugging 
memory that was hotplugged (see prevent_bootmem_remove_notifier()). That 
might just rule out the problematic cases that require locking 
completely, making the high-level mem_hotplug_lock sufficient.


--
Thanks,

David / dhildenb


Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support

2023-05-17 Thread David Hildenbrand

On 12.05.23 16:57, Björn Töpel wrote:

From: Björn Töpel 

Memory Hot(Un)Plug support for the RISC-V port
==

Introduction


To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but it would be nice with some early input.

Implementation
--

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
call into the memory hot plugging generic code via __add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in-use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.

The page table unmap/teardown functions are heavily based (copied!)
from the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

On RISC-V, the PGD level kernel mappings need to be synchronized with
all page tables (e.g. via sync_kernel_mappings()). Synchronization
involves special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86 tree):
pre-allocate the PGD leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.

Pre-allocating the PGD-leaves waste some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. The number pages, potentially unused, are
~128 * 4K.
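The worst-case overhead of this pre-allocation can be sketched from the figures in the text (128 pages of 4 KiB each; the constants below simply restate the cover letter's numbers):

```python
PAGE_SIZE = 4096        # 4 KiB base pages on RISC-V (Sv39/Sv48 with 4K pages)
PREALLOC_PAGES = 128    # ~number of pre-allocated PGD leaves (from the cover letter)

# Total memory potentially wasted when CONFIG_MEMORY_HOTPLUG is enabled
overhead_bytes = PREALLOC_PAGES * PAGE_SIZE
print(overhead_bytes // 1024, "KiB")  # 512 KiB
```

So the cost of avoiding runtime PGD synchronization is on the order of half a megabyte, paid once at mem_init().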

Patch 1: Preparation for hotplugging support, by pre-allocating the
  PGD leaves.

Patch 2: Changes the __init attribute to __meminit, to avoid that the
  functions are removed after init. __meminit keeps the
  functions after init, if memory hotplugging is enabled for
  the build.

Patch 3: Refactor the direct map setup, so it can be used for hot add.

Patch 4: The actual add/remove code. Mostly a page-table-walk
  exercise.

Patch 5: Turn on the arch support in Kconfig.

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
  usable for RISC-V.

Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
  need for vmalloc faulting.
RFC
---

  * TLB flushes. The current series uses Big Hammer flush-it-all.
  * Pre-allocation vs explicit syncs

Testing
---

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

  * Enable virtio-mem for RISC-V
  * Add proper hotplug support for virtio-mem
  
The patch for Qemu is commit 5d90a7ef1bc0
("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here:

   https://github.com/bjoto/qemu/tree/riscv-virtio-mem

I will try to upstream that work in parallel with this.
Thanks to David Hildenbrand for valuable input for the Qemu side of
things.

The series is based on the RISC-V fixes tree
   https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes



Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which implies the memory block size)?
This implies the minimum DIMM granularity and the high-level
granularity in which virtio-mem adds memory.


What is the pageblock size, implying the minimum granularity that 
virtio-mem can operate on?


On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
physical address where we can see memory getting hotplugged. [1] From 
that, we can derive the "max_possible_pfn" and prepare the kernel 
virtual memory layout (especially, the direct map).


Is something similar required on RISC-V? On s390x, I'm planning on 
adding a paravirtualized mechanism to detect where memory devices might 
be located. (I had a running RFC, but was distracted by all other kinds 
of stuff)



[1] https://virtio-mem.gitlab.io/developer-guide.html

--
Thanks,

David / dhildenb


Re: [PATCH v3 6/7] mm/gup: remove vmas parameter from pin_user_pages()

2023-04-17 Thread David Hildenbrand

On 15.04.23 14:09, Lorenzo Stoakes wrote:

After the introduction of FOLL_SAME_FILE we no longer require vmas for any
invocation of pin_user_pages(), so eliminate this parameter from the
function and all callers.

This clears the way to removing the vmas parameter from GUP altogether.

Signed-off-by: Lorenzo Stoakes 
---


Ideally, we'd avoid FOLL_SAME_FILE as well

Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb



Re: [RFC PATCH 01/19] mm: Introduce vm_account

2023-01-31 Thread David Hildenbrand

On 24.01.23 06:42, Alistair Popple wrote:

Kernel drivers that pin pages should account these pages against
either user->locked_vm or mm->pinned_vm and fail the pinning if
RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.

Currently drivers open-code this accounting and use various methods to
update the atomic variables and check against the limits leading to
various bugs and inconsistencies. To fix this introduce a standard
interface for charging pinned and locked memory. As this involves
taking references on kernel objects such as mm_struct or user_struct
we introduce a new vm_account struct to hold these references. Several
helper functions are then introduced to grab references and check
limits.

As the way these limits are charged and enforced is visible to
userspace we need to be careful not to break existing applications by
charging to different counters. As a result the vm_account functions
support accounting to different counters as required.

A future change will extend this to also account against a cgroup for
pinned pages.


The term "vm_account" is misleading, no? VM_ACCOUNT is for accounting 
towards the commit limit 


--
Thanks,

David / dhildenb



Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

2023-01-24 Thread David Hildenbrand

On 24.01.23 00:04, Sudarshan Rajagopalan wrote:
[...]

One problematic thing is that adding memory to Linux by virtio-mem
eventually consumes memory (e.g., the memmap), especially when having
to add a completely new memory block to Linux.


Yes, we have thought about this issue as well, wherein, when the system
is under heavy memory pressure, it would require some memory for memmap
metadata, and a few other places in memory hotplug also need to
alloc_pages for hot-plugging. I think this path in memory_hotplug may be
fixed so that it doesn't rely on allocating some small portion of memory
for hotplugging. But then, memory_hotplug itself wasn't meant for
plugging memory while the system is under memory pressure :).


Some small allocations might be classified as "urgent" and go to atomic 
reserves (e.g., resource tree node, memory device node). The big 
allocations (memmap, page-ext if enabled, eventually page tables for 
direct map when not mapping huge pages) are the problematic "memory 
consumers" I think.






So if you're already under severe memory pressure, these allocations
to bring up new memory can fail. The question is, if PSI can notify
"early" enough such that this barely happens in practice.

There are some possible ways to mitigate:

1) Always keep spare memory blocks by virtio-mem added to Linux, that
   don't expose any memory yet. Memory from these blocks can be handed
   over to Linux without additional Linux allocations. Of course, they
   consume metadata, so one might want to limit them.

2) Implement memmap_on_memory support for virtio-mem. This might help in
   some setups, where the device block size is suitable.

Did you run into that scenario already during your experiments, and
how did you deal with that?


We are exactly implementing 2) you had mentioned, i.e. enabling
memmap_on_memory support for virtio-mem. This guarantees that free
memory is always present for memmap metadata while hotplugging. But
this required us to increase the memory block size to 256MB (from
128MB) to satisfy the alignment requirement of memory hotplug to enable
memmap_on_memory, for the 4K page size configuration. Option 1) you
mentioned also seems


The memmap of 128 MiB is 2 MiB. Assuming the pageblock size is 2 MiB, 
and virtio-mem supports a device block size of 2 MiB, it should "in 
theory" also work with 128 MiB memory blocks.
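The 2 MiB figure can be re-derived; this sketch assumes a 64-byte struct page (typical on 64-bit kernels, an assumption here, since the mail does not state it):

```python
MIB = 1024 * 1024
STRUCT_PAGE = 64   # bytes per struct page on 64-bit kernels (assumed)
BASE_PAGE = 4096   # 4 KiB base pages

def memmap_size(block_bytes):
    """Memory consumed by the memmap of one Linux memory block."""
    return (block_bytes // BASE_PAGE) * STRUCT_PAGE

print(memmap_size(128 * MIB) // MIB)  # 2 (MiB) for a 128 MiB block
print(memmap_size(256 * MIB) // MIB)  # 4 (MiB) for a 256 MiB block
```

Since 2 MiB matches a 2 MiB pageblock and device block size, a 128 MiB memory block should indeed be workable in theory, as the mail argues.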


So I'd be curious why the change to 256 MiB was required. Maybe that 
kernel config ends up with a pageblock size of 4 MiB (IIRC that can 
happen without CONFIG_HUGETLB -- which we should most probably change to 
also be PMD_ORDER due to THP).



interesting - it's good to have some spare memory in hand when the
system is heavily under memory pressure so that this memory can be
handed over immediately on PSI pressure and doesn't have to wait for a
memory plug-in request roundtrip from the Primary VM.


The idea was that you'd still do the roundtrip to request plugging of 
device memory blocks, but that you could immediately expose memory to 
the system (without requiring allocations), to eventually immediately 
prepare the next Linux memory block while "fresh" memory is available.


This way you could handle most allocations that happen when adding a 
Linux memory block.


The main idea was to always have at least one spare one lying around. 
And as soon as you start exposing memory from one of them to the page 
allocator, immediately prepare the next one.
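The "always have at least one spare lying around" policy can be illustrated with a toy model (all names are made up for illustration; the real driver would track device blocks, not counters):

```python
class SpareBlockPool:
    """Toy model: keep at least one prepared-but-unexposed memory block.

    'Prepared' blocks already have their metadata allocated, so exposing
    one to the page allocator needs no further allocations; the next
    spare is prepared immediately, while fresh memory is available."""

    def __init__(self):
        self.prepared = 1   # start with one spare block ready
        self.exposed = 0

    def expose_block(self):
        assert self.prepared >= 1, "invariant: a spare is always ready"
        self.prepared -= 1
        self.exposed += 1
        # Immediately prepare the next spare using the just-exposed memory.
        self.prepared += 1

pool = SpareBlockPool()
for _ in range(3):
    pool.expose_block()
print(pool.exposed, pool.prepared)  # 3 1
```

The invariant is that a request under pressure can always be satisfied from the prepared spare, and replenishment happens when memory is plentiful again.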




Do you think having memmap_on_memory support for virtio-mem is useful to
have? If so, we can send the patch that supports this in virtio-mem?



I think yes. However, last time I though about adding support, I 
realized that there are some ugly corner cases to handle cleanly.


You have to make sure that the device memory blocks to-be-used as memmap 
are "plugged" even before calling add_memory_driver_managed(). And you 
can only "unplug" these device memory blocks after the memory block was 
removed via offline_and_remove_memory().


So the whole order of events and management of plugged device blocks 
changes quite a bit ...


... and what to do if the device block size is, say 4MiB, but the memmap 
is 2 MiB? Of course, one could simply skip the optimization then.


Having that said, if you managed to get it running and it's not too 
ugly, please share.


--
Thanks,

David / dhildenb



Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

2023-01-23 Thread David Hildenbrand


1. This will be a native userspace daemon that will be running only in
the Linux VM which will use virtio-mem driver that uses memory hotplug
to add/remove memory. The VM (aka Secondary VM, SVM) will request for
memory from the host which is Primary VM, PVM via the backend hypervisor
which takes care of cross-VM communication.

2. This will be guest driver. This daemon will use PSI mechanism to
monitor memory pressure to keep track of memory demands in the system.
It will register to few memory pressure events and make an educated
guess on when demand for memory in system is increasing.


Is that running in the primary or the secondary VM?


The userspace PSI daemon will be running on the secondary VM. It will
talk to a kernel driver (running on the secondary VM itself) via ioctl.
This kernel driver will talk to a slightly modified version of the
virtio-mem driver where it can call the
virtio_mem_config_changed(virtiomem_device) function for resizing the
secondary VM. So it's mainly "guest driven" now.


Okay, thanks.

[...]



This daemon is currently in just Beta stage now and we have basic
functionality running. We are yet to add more flesh to this scheme to


Good to hear that the basics are running with virtio-mem (I assume :) ).


make sure any potential risks or security concerns are taken care of as
well.


It would be great to draw/explain the architecture in more detail.


We will be looking into solving any potential security concerns where
the hypervisor would restrict certain resizing actions. Right now, we
are experimenting to see if the PSI mechanism itself can be used as a
way of detecting memory pressure in the system and adding memory to the
secondary VM when memory is in need, taking into account all the
latencies involved in the PSI scheme (i.e. the time from when one does
a malloc call till when extra memory gets added to the SVM system). We
wanted to know upstream's opinion on such a scheme using the PSI
mechanism for detecting memory pressure and resizing the SVM
accordingly.


One problematic thing is that adding memory to Linux by virtio-mem 
eventually consumes memory (e.g., the memmap), especially when having 
to add a completely new memory block to Linux.


So if you're already under severe memory pressure, these allocations to 
bring up new memory can fail. The question is, if PSI can notify "early" 
enough such that this barely happens in practice.


There are some possible ways to mitigate:

1) Always keep spare memory blocks by virtio-mem added to Linux, that
   don't expose any memory yet. Memory from these block can be handed
   over to Linux without additional Linux allocations. Of course, they
   consume metadata, so one might want to limit them.

2) Implement memmap_on_memory support for virtio-mem. This might help in
   some setups, where the device block size is suitable.

Did you run into that scenario already during your experiments, and how 
did you deal with that?


--
Thanks,

David / dhildenb



Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

2023-01-17 Thread David Hildenbrand

On 15.01.23 04:57, Sudarshan Rajagopalan wrote:

Hello all,



Hi,

I'll focus on the virtio-mem side of things :)


We’re from the Linux memory team here at Qualcomm. We are currently
devising a VM memory resizing feature where we dynamically inflate or
deflate the Linux VM based on ongoing memory demands in the VM. We
wanted to propose few details about this userspace daemon in form of RFC
and wanted to know the upstream’s opinion. Here are few details –


I'd avoid using the terminology of inflating/deflating VM memory when 
talking about virtio-mem. Just call it "dynamically resizing VM memory". 
virtio-mem is one way of doing it using memory devices.


Inflation/deflation, in contrast, reminds one of a traditional balloon 
driver, along the lines of virtio-balloon.




1. This will be a native userspace daemon that will be running only in
the Linux VM which will use virtio-mem driver that uses memory hotplug
to add/remove memory. The VM (aka Secondary VM, SVM) will request for
memory from the host which is Primary VM, PVM via the backend hypervisor
which takes care of cross-VM communication.

2. This will be guest driver. This daemon will use PSI mechanism to
monitor memory pressure to keep track of memory demands in the system.
It will register to few memory pressure events and make an educated
guess on when demand for memory in system is increasing.


Is that running in the primary or the secondary VM?



3. Currently, min PSI window size is 500ms, so PSI monitor sampling
period will be 50ms. In order to get quick response time from PSI, we’ve
reduced the min window size to 50ms so that as small as 5ms increase in
memory pressure can be reported to userspace by PSI.

/* PSI trigger definitions */
-#define WINDOW_MIN_US 500000   /* Min window size is 500ms */
+#define WINDOW_MIN_US 50000    /* Min window size is 50ms */
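Per the text above, the PSI monitor samples at one tenth of the window, so the window/period relationship before and after the change works out as follows (microsecond values follow from the millisecond figures in the text):

```python
def sampling_period_us(window_us):
    """The text states the PSI monitor sampling period is window / 10."""
    return window_us // 10

WINDOW_MIN_US_BEFORE = 500_000  # 500 ms: upstream minimum window
WINDOW_MIN_US_AFTER = 50_000    # 50 ms: the proposed reduced minimum

print(sampling_period_us(WINDOW_MIN_US_BEFORE) // 1000)  # 50 (ms)
print(sampling_period_us(WINDOW_MIN_US_AFTER) // 1000)   # 5 (ms)
```

That is, shrinking the minimum window from 500 ms to 50 ms shrinks the smallest detectable stall reporting interval from 50 ms to 5 ms.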

4. Detecting increase in memory demand – when a certain usecase starts
in the VM that does memory allocations, it will stall, causing the PSI
mechanism to generate a memory pressure event to userspace. To put it
simply, when pressure increases past a certain set threshold, we can
make an educated guess that a memory-requiring usecase has run and the
VM system needs memory to be added.

5. Detecting decrease in memory pressure – the reverse part, where we
give back memory to the PVM when memory is no longer needed, is a bit
tricky. We look for pressure decay and see if the PSI averages (avg10,
avg60, avg300) go down, and along with other memory stats (such as free
memory etc.) we make an educated guess that the usecase has ended and
memory has been freed by the usecase, and this memory can be given back
to the PVM when it's no longer needed.

6. I’m skimming over much of the logic and intelligence, but the daemon
relies on the PSI mechanism to know when memory demand is going up and
down, and communicates with the virtio-mem driver for
hot-plugging/unplugging memory.


For now, the hypervisor is in charge of triggering a virtio-mem device 
resize request. Will the Linux VM expose a virtio-mem device to the SVM 
and request to resize the SVM memory via that virtio-mem device?



We also factor in the latency involved with roundtrips between SVM<->PVM,
so we size the memory chunk that needs to be plugged in accordingly.

7. The whole purpose of the daemon using the PSI mechanism is to make
this guest driven rather than host driven, which currently is the case
mostly with virtio-mem users. The memory pressure and usage monitoring
happens inside the SVM and the SVM makes the decisions to request
memory from the PVM. This avoids any intervention, such as an admin in
the PVM monitoring and controlling the knobs. We have also set a max
limit on how much SVMs can grow in terms of memory, so that a rogue VM
would not abuse this scheme.


Something I envisioned at some point is to
1) Have a virtio-mem guest driver to request a size change. The
   hypervisor will react accordingly by adjusting the requested size.

   Such a driver<->device request could be communicated via any other
   communication mechanism to the hypervisor, but it already came up a
   couple of times to do it via the virtio-mem protocol directly.

2) Configure the hypervisor to have a lower/upper range. Within that
   range, resize requests by the driver can be granted. The current
   values of these properties can be exposed via the device to the
   driver as well.

Is that what you are also proposing here? If so, great.
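The lower/upper-range policy in 2) above amounts to the hypervisor clamping driver resize requests on the device side; a minimal sketch (function and parameter names are hypothetical):

```python
def grant_resize(requested_size, lower, upper):
    """Hypervisor side: grant a driver's resize request only within the
    configured [lower, upper] range, clamping out-of-range requests."""
    return max(lower, min(requested_size, upper))

# Requests below the range are raised to the lower bound,
# requests above it are capped at the upper bound (sizes in MiB here).
print(grant_resize(256, lower=512, upper=4096))   # 512
print(grant_resize(8192, lower=512, upper=4096))  # 4096
print(grant_resize(2048, lower=512, upper=4096))  # 2048
```

Exposing `lower` and `upper` via the device config, as suggested, lets the driver know in advance which requests can be granted.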



This daemon is currently in just Beta stage now and we have basic
functionality running. We are yet to add more flesh to this scheme to


Good to hear that the basics are running with virtio-mem (I assume :) ).


make sure any potential risks or security concerns are taken care of as well.


It would be great to draw/explain the architecture in more detail.

--
Thanks,

David / dhildenb


Re: [PATCH] virtio_balloon: high order allocation

2023-01-02 Thread David Hildenbrand

On 29.12.22 06:31, Soichiro Ueda wrote:

Hi David.


How does this affect page migration / balloon compaction etc?


I guess this patch doesn't affect balloon compaction. When allocating
pages using alloc_pages(), it skips compaction by masking out
__GFP_RECLAIM if the order is larger than 0.

As for page migration, in the current implementation it migrates a
0-order page to another 0-order page. With this patch, it may migrate a
high-order page to another same-order page.

But I noticed that the migrated high-order page is handled as a 0-order
page in virtballoon_migratepage().

  >     balloon_page_insert(vb_dev_info, newpage);



Yes, I suspected that it's broken.

We also might want to handle OOM accordingly by splitting the page and 
retrying migration. Almost nothing should stop a balloon page from 
getting migrated.


One thing to try is allocating a higher-order page and immediately 
splitting it into base pages, and enqueuing base pages only. Only 
inflation would be faster, because you could only deflate base pages.
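The "allocate high-order, split immediately, enqueue base pages only" idea could look roughly like this toy model (a list stands in for the balloon's page list; in the kernel this would be split_page() after alloc_pages()):

```python
def inflate_via_split(order):
    """Allocate one order-`order` chunk and enqueue its 2**order base
    pages individually (toy model of the suggested scheme)."""
    nr_base_pages = 1 << order
    queue = []
    for pfn in range(nr_base_pages):
        # Each base page is enqueued on its own, so deflation and
        # migration keep working on 0-order pages as before.
        queue.append(("base_page", pfn))
    return queue

q = inflate_via_split(4)
print(len(q))  # 16 base pages from one order-4 allocation
```

This keeps the fast single allocation on the inflate path while sidestepping the migration and partial-deflate problems that whole high-order balloon pages introduce.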



We should put the newpage into a page list of the corresponding order,
like this.

      balloon_page_enqueue(vb_dev_info, newpage, order);

I'll fix it in the v2 patch.



Note that I have some more concerns:
* We might end up stealing all higher-order pages from the guest instead
  of eating all of the "small" leftover pieces first. This might be
  undesirable. We discussed this in the past in the context of hugepage
  ballooning [not able to locate the abandoned patches quickly].
* PG_offline won't work as expected anymore and result in kdump reading
  inflated memory, which is undesirable. One workaround would be setting
  PG_offline on all base pages, but this needs some more thought.
* How are we handling a scenario where we are requested to deflate e.g.,
  a single 4 KiB page but we only have higher-order pages allocated?
  I suspect you would over-deflate right now, which is undesirable.

--
Thanks,

David / dhildenb


Re: [PATCH] virtio_balloon: high order allocation

2022-12-23 Thread David Hildenbrand

On 23.12.22 10:35, Soichiro Ueda wrote:

At present, the VirtIO balloon device driver allocates pages
one by one using alloc_page(), and frees them using put_page().

This changes it so that the driver allocates high order pages
by using alloc_pages(), and frees them using __free_pages() if possible.
By doing so, the CPU performance of inflation and deflation
should be improved.

The effect of this change has been confirmed by benchmarks that measure
the elapsed time of inflation and deflation.

The results are here:

16 pages inflation:
   before: 119,779 ns
   after : 115,655 ns (-3.4%)
64 pages inflation:
   before: 156,977 ns
   after : 150,961 ns (-3.8%)
256 pages inflation:
   before: 218,649 ns
   after : 208,490 ns (-4.6%)
16 pages deflation:
   before: 78,112 ns
   after : 68,288 ns (-12.6%)
64 pages deflation:
   before: 97,205 ns
   after : 80,491 ns (-17.2%)
256 pages deflation:
   before: 122,742 ns
   after : 107,526 ns (-12.4%)
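The quoted deflation percentages can be re-derived from the raw numbers:

```python
# (before_ns, after_ns) pairs taken from the benchmark results above
results = {
    "16 pages deflation": (78_112, 68_288),
    "64 pages deflation": (97_205, 80_491),
    "256 pages deflation": (122_742, 107_526),
}

for name, (before, after) in results.items():
    change = (after - before) / before * 100
    print(f"{name}: {change:.1f}%")
# 16 pages deflation: -12.6%
# 64 pages deflation: -17.2%
# 256 pages deflation: -12.4%
```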


How does this affect page migration / balloon compaction etc?

--
Thanks,

David / dhildenb



Re: [PATCH 2/4] virtio-mem: Fix probe failed when modprobe virtio_mem

2022-11-28 Thread David Hildenbrand

On 28.11.22 03:10, Li Zetao wrote:

When doing the following test steps, an error was found:
   step 1: modprobe virtio_mem succeeded
 # modprobe virtio_mem  <-- OK

   step 2: fault injection in virtio_mem_init()
 # modprobe -r virtio_mem   <-- OK
 # ...
   CPU: 0 PID: 1837 Comm: modprobe Not tainted
   6.1.0-rc6-00285-g6a1e40c4b995-dirty
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
   Call Trace:

should_fail.cold+0x5/0x1f
...
virtio_mem_init_hotplug+0x9ae/0xe57 [virtio_mem]
virtio_mem_init.cold+0x327/0x339 [virtio_mem]
virtio_mem_probe+0x48e/0x910 [virtio_mem]
virtio_dev_probe+0x608/0xae0
...

   virtio_mem virtio4: could not reserve device region
   virtio_mem: probe of virtio4 failed with error -16

   step 3: modprobe virtio_net failed


virtio_net ?


 # modprobe virtio_mem   <-- failed
   virtio_mem: probe of virtio4 failed with error -16

The root cause of the problem is that the virtqueues are not
stopped on the error handling path when virtio_mem_init()
fails in virtio_mem_probe(), resulting in an error "-ENOENT"
returned in the next modprobe call in setup_vq().

virtio_pci_modern_device uses virtqueues to send or
receive message, and "queue_enable" records whether the
queues are available. In vp_modern_find_vqs(), all queues
will be selected and activated, but once queues are enabled
there is no way to go back except reset.

Fix it by resetting the virtio device on the error handling path. After
virtio_mem_init_vq() has succeeded, all virtqueues should be stopped on
the error handling path.

Fixes: 1fcf0512c9c8 ("virtio_pci: modern driver")


That commit is from 2014; virtio-mem was merged in 2020. You probably want:

Fixes: 5f1f79bbc9e2 ("virtio-mem: Paravirtualized memory hotplug")


Signed-off-by: Li Zetao 
---
  drivers/virtio/virtio_mem.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 0c2892ec6817..c7f09c2ce982 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2793,6 +2793,7 @@ static int virtio_mem_probe(struct virtio_device *vdev)
  
  	return 0;

  out_del_vq:
+   virtio_reset_device(vdev);
vdev->config->del_vqs(vdev);
  out_free_vm:
    kfree(vm);



Apart from that

Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb



Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-14 Thread David Hildenbrand


Other problem is that there are drivers that do not use
adjust_managed_page_count().


Which ones? Do we care?


VMWare and Virtio balloon drivers. I recently proposed to unify them and
the objection was that it would break existing users - which is valid,
so we must care, I guess.


I'm confused, I think we care about actual adjustment of the total pages 
available here, that we want to notify the system about. These 
approaches (vmware, virtio-balloon with deflate-on-oom) don't adjust 
totalpages, because the assumption is that we can get back the inflated 
memory any time we really need it automatically.


--
Thanks,

David / dhildenb



Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-14 Thread David Hildenbrand

On 14.10.22 14:50, Alexander Atanasov wrote:

Hello,

On 11.10.22 12:23, David Hildenbrand wrote:

Sounds to me that all you want is some notifier to be called from
adjust_managed_page_count(). What am I missing?


Notifier will act as an accumulator to report size of change and it
will make things easier for the drivers and users wrt locking.
Notifier is similar to the memory hotplug notifier.


Overall, I am not convinced that there is any value of separating the
value
and the notifier. You can batch both or not batch both. In addition,
as I
mentioned, having two values seems racy.


I have identified two users so far above - there may be more to come.
One type needs the value to adjust. Also, having the value is necessary
to report it to users and OOM. There are options with callbacks and so
on, but they would complicate things with no real gain. You are right
about the atomicity, but I guess if that's a problem for some user it
could find a way to ensure it. I am yet to find such a place.



I haven't followed the whole discussion, but I just wanted to raise that
having a generic mechanism to notify on such changes could be valuable.

For example, virtio-mem also uses adjust_managed_page_count() and might
sometimes not trigger memory hotplug notifiers when adding more memory
(essentially, when it fake-adds memory part of an already added Linux
memory block).

What might make sense is schedule some kind of deferred notification on
adjust_managed_page_count() changes. This way, we could notify without
caring about locking and would naturally batch notifications.

adjust_managed_page_count() users would not require changes.


Making it deferred will bring issues for both the users of
adjust_managed_page_count() and the receivers of the notification -
locking first of all. And it is hard to know when the adjustment will
finish; some of the drivers wait and retry in blocks. It will bring
complexity, and it will not be possible to convert users in small steps.


What exactly is the issue about handling that deferred? Who needs an 
immediate, 100% precise notification?


Locking from a separate workqueue shouldn't be too hard, or what am I 
missing?




Other problem is that there are drivers that do not use
adjust_managed_page_count().


Which ones? Do we care?

--
Thanks,

David / dhildenb



Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-11 Thread David Hildenbrand

Sounds to me that all you want is some notifier to be called from
adjust_managed_page_count(). What am I missing?


Notifier will act as an accumulator to report size of change and it will make 
things easier for the drivers and users wrt locking.
Notifier is similar to the memory hotplug notifier.


Overall, I am not convinced that there is any value of separating the value
and the notifier. You can batch both or not batch both. In addition, as I
mentioned, having two values seems racy.


I have identified two users so far above - there may be more to come.
One type needs the value to adjust. Also, having the value is necessary
to report it to users and OOM. There are options with callbacks and so
on, but they would complicate things with no real gain. You are right
about the atomicity, but I guess if that's a problem for some user it
could find a way to ensure it. I am yet to find such a place.



I haven't followed the whole discussion, but I just wanted to raise that 
having a generic mechanism to notify on such changes could be valuable.


For example, virtio-mem also uses adjust_managed_page_count() and might 
sometimes not trigger memory hotplug notifiers when adding more memory 
(essentially, when it fake-adds memory part of an already added Linux 
memory block).


What might make sense is schedule some kind of deferred notification on 
adjust_managed_page_count() changes. This way, we could notify without 
caring about locking and would naturally batch notifications.


adjust_managed_page_count() users would not require changes.
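The deferred, batched notification described above can be modeled like this (a toy model; class and method names are made up, and a workqueue is reduced to an explicit `run_deferred_work()` call):

```python
class ManagedPagesNotifier:
    """Toy model of a deferred notifier on managed-page adjustments.

    Callers only record deltas; listeners receive one batched
    notification with the accumulated delta when the deferred
    work runs, outside the callers' locks."""

    def __init__(self):
        self.pending_delta = 0
        self.listeners = []

    def adjust_managed_page_count(self, delta):
        # Existing callers need no changes: they just record the delta.
        self.pending_delta += delta

    def run_deferred_work(self):
        # One batched notification per deferred run.
        delta, self.pending_delta = self.pending_delta, 0
        if delta:
            for cb in self.listeners:
                cb(delta)

seen = []
n = ManagedPagesNotifier()
n.listeners.append(seen.append)
n.adjust_managed_page_count(+512)   # e.g., balloon deflate
n.adjust_managed_page_count(-128)   # e.g., balloon inflate
n.run_deferred_work()
print(seen)  # [384]
```

Batching this way sidesteps the locking concerns raised in the thread, at the cost of the notification being slightly stale.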

--
Thanks,

David / dhildenb



Re: [RFC] how the ballooned memory should be accounted by the drivers inside the guests? (was:[PATCH v6 1/2] Create debugfs file with virtio balloon usage information)

2022-08-09 Thread David Hildenbrand
On 09.08.22 11:36, Alexander Atanasov wrote:
> Hello,
> 
> On 2.08.22 16:48, David Hildenbrand wrote:
>>>>
>>>> In case of Hyper-V I remember a customer BUG report that requested that
>>>> exact behavior, however, I'm not able to locate the BZ quickly.
>>>> [1] 
>>>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-November/057767.html
>>>> (note that I can't easily find the original mail in the archives)
>>>
>>> VMWare does not, Xen do, HV do (but it didn't) - Virtio does both.
>>>
>>> For me the confusion comes from mixing ballooning and hot plug.
>>
>> For example, QEMU (and even libvirt) doesn't even have built in support
>> for any kind of automatic balloon resizing on guest memory pressure (and
>> I'm happy that we don't implement any such heuristics). As a user/admin,
>> all you can do is manually adjust the logical VM size by requesting to
>> inflate/deflate the balloon. For virtio-balloon, we cannot derive what
>> the hypervisor/admin might or might not do -- and whether the admin
>> intends to use memory ballooning for memory hotunplug or for optimizing
>> memory overcommit.
> 
> Is the lack of proper hotplug/unplug leading the admins to use it like 
> this?

Yes. Especially unplug is tricky and hard to get working reliably in
practice because of unmovable pages. Ballooning is an easy hack to get
unplug somewhat working.

For reference: under Windows ballooning was (and maybe still mostly is)
the only way to unplug memory again. Unplug of DIMMs is not supported.

> I really don't understand why you don't like automatic resizing - 
> both HyperV and VMWare do it ?

You need heuristics to guess what the guest might be doing next and
deflate fast enough to avoid any kind of OOMs in the guest. I pretty
much dislike that concept, because it just screams to be fragile.

In short: I don't like ballooning, to me it's a technology from ancient
times before we were able to do any better. In comparison, I do like
free page reporting for optimizing memory overcommit, but it still has
some drawbacks (caches consuming too much memory), that people are
working on to improve. So we're stuck with ballooning for now for some
use cases.

Personally, I consider any balloon extensions (inflate/deflate, not
things like free page reporting) a waste of time and effort, but that's
just my humble opinion. So I keep reviewing and maintaining them ;)

> 
>> As another example, HV dynamic memory actually combines memory hotplug
>> with memory ballooning: use memory hotplug to add more memory on demand
>> and use memory ballooning to logically unplug memory again.
> 
> Have both as options - min/max memory, percentage free to keep as a 
> minimum, fixed size - and have hot add - a kind of hot plug to go above 
> the initial max - which tries to manage it in a dynamic way.
> 
>> The VMWare balloon is a bit special, because it actually usually
>> implements deflate-on-oom semantics in the hypervisor. IIRC, the
>> hypervisor will actually adjust the balloon size based on guest memory
>> stats, and there isn't really an interface to manually set the balloon
>> size for an admin. But I might be wrong regarding the latter.
> 
> For me this is what makes sense - you have a limited amount of
> physical RAM that you want to be used optimally by the guests.
> Waiting for the admin to adjust the balloon would not give very
> optimal results.

Exactly. For the use case of optimizing memory overcommit in the
hypervisor, you want deflate-on-oom semantics if you go with balloon
inflation/deflation.

> 
>>
>>>
>>> Ballooning is like a heap inside the guest from which the host can
>>> allocate/deallocate pages; whether there is a mechanism for the guest to
>>> ask the host for more pages/to free pages, or the host has a heuristic to
>>> monitor the guest and inflate/deflate the guest, is a matter of implementation.
>>
>> Please don't assume that the use case for memory ballooning is only
>> optimizing memory overcommit in the hypervisor under memory pressure.
> 
> I understood that - currently it is used as a way to do 
> hotplug/unplug. The question is whether that is the right way to use it.

People use it like that, and we have no control over that. I've heard of
people using hotplug of DIMMs to increase VM memory and balloon
inflation to hotunplug memory, to decrease VM memory.

> 
>>>
>>> Hot plug is adding to MemTotal and it is not a random event either in
>>> real or virtual environment - so you can act upon it. MemTotal goes
>>> down on hot unplug and if pages get marked as faulty RAM.

Re: [RFC] how the ballooned memory should be accounted by the drivers inside the guests? (was: [PATCH v6 1/2] Create debugfs file with virtio balloon usage information)

2022-08-02 Thread David Hildenbrand
>>
>> In case of Hyper-V I remember a customer BUG report that requested that
>> exact behavior, however, I'm not able to locate the BZ quickly.
>> [1] 
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-November/057767.html
>> (note that I can't easily find the original mail in the archives)
> 
> VMWare does not, Xen does, HV does (but it didn't) - Virtio does both.
> 
> For me the confusion comes from mixing ballooning and hot plug.

For example, QEMU (and even libvirt) doesn't even have built in support
for any kind of automatic balloon resizing on guest memory pressure (and
I'm happy that we don't implement any such heuristics). As a user/admin,
all you can do is manually adjust the logical VM size by requesting to
inflate/deflate the balloon. For virtio-balloon, we cannot derive what
the hypervisor/admin might or might not do -- and whether the admin
intends to use memory ballooning for memory hotunplug or for optimizing
memory overcommit.

As another example, HV dynamic memory actually combines memory hotplug
with memory ballooning: use memory hotplug to add more memory on demand
and use memory ballooning to logically unplug memory again.

The VMWare balloon is a bit special, because it actually usually
implements deflate-on-oom semantics in the hypervisor. IIRC, the
hypervisor will actually adjust the balloon size based on guest memory
stats, and there isn't really an interface to manually set the balloon
size for an admin. But I might be wrong regarding the latter.

> 
> Ballooning is like a heap inside the guest from which the host can 
> allocate/deallocate pages, if there is a mechanism for the guest to ask 
> the host for more/to free/ pages or the host have a heuristic to monitor 
> the guest and inflate/deflate the guest it is a matter of implementation.

Please don't assume that the use case for memory ballooning is only
optimizing memory overcommit in the hypervisor under memory pressure.

> 
> Hot plug is adding to MemTotal and it is not a random event either in 
> real or virtual environment - so you can act upon it. MemTotal goes 
> down on hot unplug and if pages get marked as faulty RAM.

"not a random event either" -- sure, with ppc dlpar, xen balloon, hv
balloon or virtio-mem ... which all are able to hotplug memory fairly
randomly based on hypervisor decisions.

In physical environments, it's not really a random event, I agree.

> 
> Historically MemTotal is a stable value (I agree with most of David 
> Stevens' points) and user space expects it to be stable, 
> initialized at startup, and not to change.

Just like some apps are not prepared for memory hot(un)plug. Some apps
just don't work in environments with variable physical memory sizes:
examples include databases, where memory ballooning might essentially be
completely useless (there is a paper about application-aware memory
ballooning for that exact use case).

> 
> Used is what changes and that is what user space expects to change.
> 
> Deflate on oom might have been a mistake, but it is there, and if anything 
> depends on changing MemTotal it will be broken by that option. How 
> can that be fixed?

I didn't quite get your concern here. Deflate-on-oom in virtio-balloon
won't adjust MemTotal, so under which condition would something be broken?

> 
> I agree that the host cannot reclaim what is marked as used, but should 
> it be able to? Maybe it would be good to teach the OOM killer that there 
> can be such RAM that cannot be reclaimed.
> 
>> Note: I suggested under [1] to expose inflated pages via /proc/meminfo
>> directly. We could do that consistently over all balloon drivers ...
>> doesn't sound too crazy.
> 
> Initially I wanted to do exactly this BUT:
> - some drivers prefer to expose some more internal information in the file.

They always can have an extended debugfs interface in addition.

> - a lot of user space is using meminfo, so better keep it as is to avoid 
> breaking something; ballooning is not very frequently used.

We can always extend. Just recently, we exposed Zswap data:

commit f6498b776d280b30a4614d8261840961e993c2c8
Author: Johannes Weiner 
Date:   Thu May 19 14:08:53 2022 -0700

mm: zswap: add basic meminfo and vmstat coverage


Exposing information about inflated pages in a generic way doesn't sound
completely wrong to me, but there might be people that object.

> 
> 
> Please, share your view on how the ballooned memory should be accounted by 
> the drivers inside the guests so we can work towards consistent behaviour:
> 
> Should the inflated memory be accounted as Used or MemTotal be adjusted?

I hope I was able to make it clear that it completely depends on how
memory ballooning is actually intended to be used. It's not uncommon to
use it as a form of fake memory hotunplug, where that memory is actually
gone for good and won't simply come back when under memory pressure.

> 
> Should the inflated memory be added to /proc/meminfo ?

My gut feeling is yes.

Re: [PATCH v6 1/2] Create debugfs file with virtio balloon usage information

2022-08-01 Thread David Hildenbrand
>>> +
>>> +   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>>> +   num_pages = -num_pages;
>> With VIRTIO_BALLOON_F_DEFLATE_ON_OOM this will now always report "0".
>>
>> which would be the same as "num_pages = 0;" and would deserve a comment
>> explaining why we don't indicate that as "inflated: ".
>>
>> Thanks.
> 
> Except if "now" refers to some commit that I am missing, it does not report 0.

/me facepalm

I read "-= num_pages"

> 
> 
> with qemu-system-x86_64  -enable-kvm -device 
> virtio-balloon,id=balloon0,deflate-on-oom=on,notify_on_empty=on,page-per-vq=on
>  -m 3G 
> 
> / # cat /sys/kernel/debug/virtio-balloon
> inflated: 0 kB
> / # QEMU 4.2.1 monitor - type 'help' for more information
> (qemu) balloon 1024
> (qemu)
> 
> / # cat /sys/kernel/debug/virtio-balloon
> inflated: 2097152 kB
> / #
> 
> with qemu-system-x86_64  -enable-kvm -device 
> virtio-balloon,id=balloon0,deflate-on-oom=off,notify_on_empty=on,page-per-vq=on
>  
> -m 3G ...
> 
> / # cat /sys/kernel/debug/virtio-balloon
> inflated: 0 kB
> / # QEMU 4.2.1 monitor - type 'help' for more information
> (qemu) balloon 1024
> (qemu)
> / # cat /sys/kernel/debug/virtio-balloon
> inflated: -2097152 kB

What's the rationale of making it negative?

> 
> To join the threads:
> 
>>> Always account inflated memory as used for both cases - with and
>>> without deflate on oom. Do not change total ram which can confuse
>>> userspace and users.
> 
>> Sorry, but NAK.
> 
> Ok.
> 
>> This would affect existing users / user space / balloon stats. For example
>> HV just recently switch to properly using adjust_managed_page_count()
> 
> 
> I am wondering what the rationale behind this is; I have never seen users
> that expect it to work like this. Do you have any pointers to such users, so
> I can understand why they do so?

We adjust total pages and managed pages to simulate what memory is
actually available to the system (just like during memory hot(un)plug).
Even though the pages are "allocated" by the driver, they are actually
unusable for the system, just as if they had been offlined.
Strictly speaking, the guest OS can kill as many processes as it wants,
it cannot reclaim that memory, as it's logically no longer available.

There is nothing (valid, well, except driver unloading) the guest can do
to reuse these pages. The hypervisor has to get involved first to grant
access to some of these pages again (deflate the balloon).

It's different with deflate-on-oom: the guest will *itself* decide to
reuse inflated pages to deflate them. So the allocated pages can easily
become usable again. There was a recent discussion for virtio-balloon to
change that behavior when it's known that the hypervisor essentially
implements "deflate-on-oom" by looking at guest memory stats and
adjusting the balloon size accordingly; however, as long as we don't
know what the hypervisor does or doesn't do, we have to keep the
existing behavior.

Note that most balloon drivers under Linux share that behavior.

In case of Hyper-V I remember a customer BUG report that requested that
exact behavior, however, I'm not able to locate the BZ quickly.


[1]
https://lists.linuxfoundation.org/pipermail/virtualization/2021-November/057767.html
(note that I can't easily find the original mail in the archives)

Note: I suggested under [1] to expose inflated pages via /proc/meminfo
directly. We could do that consistently over all balloon drivers ...
doesn't sound too crazy.

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v6 1/2] Create debugfs file with virtio balloon usage information

2022-08-01 Thread David Hildenbrand
On 26.07.22 16:08, Alexander Atanasov wrote:
> Allow the guest to know how much it is ballooned by the host
> and how that memory is accounted.
> 
> It is useful when debugging out of memory conditions,
> as well for userspace processes that monitor the memory pressure
> and for nested virtualization.
> 
> Depending on the deflate-on-oom flag, memory is accounted in two ways.
> If the flag is set, the inflated pages are accounted as used memory.
> If the flag is not set, the inflated pages are subtracted from totalram.
> To make clear how inflated pages are accounted, the value is sign-prefixed:
> negative means subtracted from totalram and positive means
> they are accounted as used.
> 
> Signed-off-by: Alexander Atanasov 
> ---
>  drivers/virtio/virtio_balloon.c | 59 +
>  1 file changed, 59 insertions(+)
> 
> V2:
>   - fixed coding style
>   - removed pretty print
> V3:
>   - removed duplicate of features
>   - made the comment about ballooned_pages clearer
>   - convert host pages to balloon pages
> V4:
>   - added a define for BALLOON_PAGE_SIZE to make things clear
> V5:
>   - Made the calculations work properly for both ways of memory accounting
> with or without deflate_on_oom
>   - dropped comment
> V6:
>   - reduced the file to minimum
>   - unify accounting
> 
> I didn't understand whether you agree to change the accounting to be the
> same, so the following part 2 is about it.
> 
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f4c34a2a6b8e..97d3b29cb9f1 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -731,6 +732,59 @@ static void report_free_page_func(struct work_struct 
> *work)
>   }
>  }
>  
> +/*
> + * DEBUGFS Interface
> + */
> +#ifdef CONFIG_DEBUG_FS
> +
> +/**
> + * virtio_balloon_debug_show - shows statistics of balloon operations.
> + * @f: pointer to the &struct seq_file.
> + * @offset: ignored.
> + *
> + * Provides the statistics that can be accessed in virtio-balloon in the 
> debugfs.
> + *
> + * Return: zero on success or an error code.
> + */
> +

Superfluous empty line. Personally, I'd just get rid of this
(comparatively obvious) documentation completely.

> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
> +{
> + struct virtio_balloon *vb = f->private;
> + s64 num_pages = vb->num_pages << (VIRTIO_BALLOON_PFN_SHIFT - 10);

Rather call this "inflated_kb" then, it's no longer "pages".

> +
> + if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> + num_pages = -num_pages;

With VIRTIO_BALLOON_F_DEFLATE_ON_OOM this will now always report "0".

which would be the same as "num_pages = 0;" and would deserve a comment
explaining why we don't indicate that as "inflated: ".

Thanks.

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 2/2] Unify how inflated memory is accounted in virtio balloon driver

2022-08-01 Thread David Hildenbrand
On 26.07.22 16:10, Alexander Atanasov wrote:
> Always account inflated memory as used for both cases - with and
> without deflate on oom. Do not change total ram which can confuse
> userspace and users.

Sorry, but NAK.

This would affect existing users / user space / balloon stats. For example
HV just recently switch to properly using adjust_managed_page_count()


commit d1df458cbfdb0c3384c03c7fbcb1689bc02a746c
Author: Vitaly Kuznetsov 
Date:   Wed Dec 2 17:12:45 2020 +0100

hv_balloon: do adjust_managed_page_count() when ballooning/un-ballooning

Unlike virtio_balloon/virtio_mem/xen balloon drivers, Hyper-V balloon driver
does not adjust managed pages count when ballooning/un-ballooning, and this
leads to incorrect stats being reported, e.g. unexpected 'free' output.

Note, the calculation in post_status() seems to remain correct: ballooned
out pages are never 'available' and we manually add dm->num_pages_ballooned
to 'committed'.


-- 
Thanks,

David / dhildenb



Re: [PATCH v5 1/1] Create debugfs file with virtio balloon usage information

2022-07-25 Thread David Hildenbrand
On 25.07.22 13:27, Alexander Atanasov wrote:
> Hi,
> 
> On 18/07/2022 14:35, David Hildenbrand wrote:
>> On 14.07.22 15:20, Alexander Atanasov wrote:
>>> Allow the guest to know how much it is ballooned by the host.
>>> It is useful when debugging out of memory conditions.
>>>
>>> When host gets back memory from the guest it is accounted
> as used memory in the guest, but the guest has no way to know
>>> how much it is actually ballooned.
>>>
>>> Signed-off-by: Alexander Atanasov 
>>> ---
>>>   drivers/virtio/virtio_balloon.c | 79 +
>>>   include/uapi/linux/virtio_balloon.h |  1 +
>>>   2 files changed, 80 insertions(+)
>>>
>>> V2:
>>>   - fixed coding style
>>>   - removed pretty print
>>> V3:
>>>   - removed duplicate of features
>>>   - made the comment about ballooned_pages clearer
>>>   - convert host pages to balloon pages
>>> V4:
>>>   - added a define for BALLOON_PAGE_SIZE to make things clear
>>> V5:
>>>   - Made the calculations work properly for both ways of memory accounting
>>> with or without deflate_on_oom
>>>   - dropped comment
>>>
> []
>>> +   u32 num_pages, total_pages, current_pages;
>>> +   struct sysinfo i;
>>> +
>>> +   si_meminfo(&i);
>>> +
>>> +   seq_printf(f, "%-22s: %d\n", "page_size", VIRTIO_BALLOON_PAGE_SIZE);
>> Why? Either export all in ordinary page size or in kB. No need to
>> over-complicate the interface with a different page size that users
>> don't actually care about.
>>
>> I'd just stick to /proc/meminfo and use kB.
> 
> The point is to expose some internal balloon data. Balloon works with 
> pages, not with kB, and users can easily calculate kB.

Pages translate to kB. kB is easier for humans to consume than pages
with variable page sizes.

> 
>>> +
>>> +   seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);
>>> +
>>> +   if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) {
>>> +   total_pages = guest_to_balloon_pages(i.totalram);
>>> +   current_pages = guest_to_balloon_pages(i.totalram) - num_pages;
>>> +   } else {
>>> +   total_pages = guest_to_balloon_pages(i.totalram) +  num_pages;
>>> +   current_pages = guest_to_balloon_pages(i.totalram);
>>> +   }
>>> +
>>> +   /* Total Memory for the guest from host */
>>> +   seq_printf(f, "%-22s: %u\n", "total_pages", total_pages);
>>> +
>>> +   /* Current memory for the guest */
>>> +   seq_printf(f, "%-22s: %u\n", "current_pages", current_pages);
>> The thing I detest about that interface (total/current) is that it's
>> simply not correct -- because i.totalram for example doesn't include
>> things like (similar to MemTotal in /proc/meminfo)
>>
>> a) crashkernel
>> b) early boot allocations (e.g., memmap)
>> c) any kind of possible memory (un)hotplug in the future
>>
>> I'd really suggest to just KIS and not mess with i.totalram.
>>
>> Exposing how much memory is inflated and some way to identify how that
>> memory is accounted in /proc/meminfo should be good enough.
>>
>> E.g., the output here could simply be
>>
>> "Inflated: 1024 kB"
>> "MemTotalReduced: 1024 kB"
>>
>> That would even allow in the future for flexibility when it comes to how
>> much/what was actually subtracted from MemTotal.
> 
> 
> The point of the debug interface is to expose some of the balloon driver 
> internals to the guest.
> 
> The users of this are user space processes that try to work according to 
> the memory pressure and nested virtualization.
> 
> I haven't seen one piece of userspace software that expects total RAM to 
> change; it should be constant, with one exception: hotplug. But if you do 
> hotplug RAM, you know that and you can restart what you need to restart.
> 
> So instead of messing with totalram by adding or removing pages (it 
> would even be an optimization, since now it is done page by page), I 
> suggest just accounting the inflated memory as used, as in the 
> deflate_on_oom case now. It is confusing and inconsistent as it is. How 
> do you explain to a VPS user why his RAM is constantly changing?
> 
> And the file can go just to
> 
> inflated: 
> 
> balloon_page_size: 4096
> 
> or
> 
> inflated: kB
> 
> I 

Re: [PATCH v5 1/1] Create debugfs file with virtio balloon usage information

2022-07-18 Thread David Hildenbrand
On 14.07.22 15:20, Alexander Atanasov wrote:
> Allow the guest to know how much it is ballooned by the host.
> It is useful when debugging out of memory conditions.
> 
> When host gets back memory from the guest it is accounted
> as used memory in the guest, but the guest has no way to know
> how much it is actually ballooned.
> 
> Signed-off-by: Alexander Atanasov 
> ---
>  drivers/virtio/virtio_balloon.c | 79 +
>  include/uapi/linux/virtio_balloon.h |  1 +
>  2 files changed, 80 insertions(+)
> 
> V2:
>  - fixed coding style
>  - removed pretty print
> V3:
>  - removed duplicate of features
>  - made the comment about ballooned_pages clearer
>  - convert host pages to balloon pages
> V4:
>  - added a define for BALLOON_PAGE_SIZE to make things clear
> V5:
>  - Made the calculations work properly for both ways of memory accounting
>with or without deflate_on_oom
>  - dropped comment 
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..e17f8cc71ba4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -731,6 +732,79 @@ static void report_free_page_func(struct work_struct 
> *work)
>   }
>  }
>  
> +/*
> + * DEBUGFS Interface
> + */
> +#ifdef CONFIG_DEBUG_FS
> +
> +#define guest_to_balloon_pages(i) ((i)*VIRTIO_BALLOON_PAGES_PER_PAGE)
> +/**
> + * virtio_balloon_debug_show - shows statistics of balloon operations.
> + * @f: pointer to the &struct seq_file.
> + * @offset: ignored.
> + *
> + * Provides the statistics that can be accessed in virtio-balloon in the 
> debugfs.
> + *
> + * Return: zero on success or an error code.
> + */
> +
> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
> +{
> + struct virtio_balloon *b = f->private;

Most other functions call this "vb".

> + u32 num_pages, total_pages, current_pages;
> + struct sysinfo i;
> +
> + si_meminfo(&i);
> +
> + seq_printf(f, "%-22s: %d\n", "page_size", VIRTIO_BALLOON_PAGE_SIZE);

Why? Either export all in ordinary page size or in kB. No need to
over-complicate the interface with a different page size that users
don't actually care about.

I'd just stick to /proc/meminfo and use kB.

> +
> + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual,
> + &num_pages);

What's wrong with vb->num_pages? I'd prefer not doing config access, if
it can be avoided.

> +
> + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) {
> + total_pages = guest_to_balloon_pages(i.totalram);
> + current_pages = guest_to_balloon_pages(i.totalram) - num_pages;
> + } else {
> + total_pages = guest_to_balloon_pages(i.totalram) +  num_pages;
> + current_pages = guest_to_balloon_pages(i.totalram);
> + }
> +
> + /* Total Memory for the guest from host */
> + seq_printf(f, "%-22s: %u\n", "total_pages", total_pages);
> +
> + /* Current memory for the guest */
> + seq_printf(f, "%-22s: %u\n", "current_pages", current_pages);

The think I detest about that interface (total/current) is that it's
simply not correct -- because i.totalram for example doesn't include
things like (similar to MemTotal in /proc/meminfo)

a) crashkernel
b) early boot allocations (e.g., memmap)
c) any kind of possible memory (un)hotplug in the future

I'd really suggest to just KIS and not mess with i.totalram.

Exposing how much memory is inflated and some way to identify how that
memory is accounted in /proc/meminfo should be good enough.

E.g., the output here could simply be

"Inflated: 1024 kB"
"MemTotalReduced: 1024 kB"

That would even allow in the future for flexibility when it comes to how
much/what was actually subtracted from MemTotal.

> +
> + return 0;
> +}
> +
> +DEFINE_SHOW_ATTRIBUTE(virtio_balloon_debug);
> +
> +static void  virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> + debugfs_create_file("virtio-balloon", 0444, NULL, b,
> + &virtio_balloon_debug_fops);
> +}
> +
> +static void  virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> + debugfs_remove(debugfs_lookup("virtio-balloon", NULL));
> +}
> +
> +#else
> +
> +static inline void virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> +}
> +
> +static inline void virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> +}
> +
> +#endif   /* CONFIG_DEBUG_FS */

[...]

> diff --git a/include/uapi/linux/virtio_balloon.h 
> b/include/uapi/linux/virtio_balloon.h
> index ddaa45e723c4..f3ff7c4e5884 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -40,6 +40,7 @@
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> +#define VIRT

Re: [PATCH v4 1/1] Create debugfs file with virtio balloon usage information

2022-07-14 Thread David Hildenbrand
On 14.07.22 15:20, Alexander Atanasov wrote:
> Hello,
> 
> On 14/07/2022 14:35, David Hildenbrand wrote:
>> On 05.07.22 10:36, Alexander Atanasov wrote:
>>> Allow the guest to know how much it is ballooned by the host.
>>> It is useful when debugging out of memory conditions.
>>>
>>> When host gets back memory from the guest it is accounted
>>> as used memory in the guest, but the guest has no way to know
>>> how much it is actually ballooned.
>>>
>>> Signed-off-by: Alexander Atanasov 
>>> ---
>>>   drivers/virtio/virtio_balloon.c | 77 +
>>>   include/uapi/linux/virtio_balloon.h |  1 +
>>>   2 files changed, 78 insertions(+)
>>>
>>> V2:
>>>   - fixed coding style
>>>   - removed pretty print
>>> V3:
>>>   - removed duplicate of features
>>>   - made the comment about ballooned_pages clearer
>>>   - convert host pages to balloon pages
>>> V4:
>>>   - added a define for BALLOON_PAGE_SIZE to make things clear
>>>
>>> diff --git a/drivers/virtio/virtio_balloon.c 
>>> b/drivers/virtio/virtio_balloon.c
>>> index b9737da6c4dd..dc4ad584b947 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -10,6 +10,7 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>   #include 
>>> @@ -731,6 +732,77 @@ static void report_free_page_func(struct work_struct 
>>> *work)
>>> }
>>>   }
>>>   
>>> +/*
>>> + * DEBUGFS Interface
>>> + */
>>> +#ifdef CONFIG_DEBUG_FS
>>> +
>>> +#define guest_to_balloon_pages(i) ((i)*VIRTIO_BALLOON_PAGES_PER_PAGE)
>>> +/**
>>> + * virtio_balloon_debug_show - shows statistics of balloon operations.
>>> + * @f: pointer to the &struct seq_file.
>>> + * @offset: ignored.
>>> + *
>>> + * Provides the statistics that can be accessed in virtio-balloon in the 
>>> debugfs.
>>> + *
>>> + * Return: zero on success or an error code.
>>> + */
>>> +
>>> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
>>> +{
>>> +   struct virtio_balloon *b = f->private;
>>> +   u32 num_pages;
>>> +   struct sysinfo i;
>>> +
>>> +   si_meminfo(&i);
>>> +
>>> +   seq_printf(f, "%-22s: %d\n", "page_size", VIRTIO_BALLOON_PAGE_SIZE);
>>> +
>>> +   virtio_cread_le(b->vdev, struct virtio_balloon_config, actual,
>>> +   &num_pages);
>>> +   /*
>>> +* Pages allocated by host from the guest memory.
>>> +* Host inflates the balloon to get more memory.
>>> +* Guest needs to deflate the balloon to get more memory.
>>> +*/
>> Please drop that comment. This is basic virtio-balloon operation that
>> must not be explained at this point.
> 
> Ok
> 
>>> +   seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);
>>> +
>>> +   /* Total Memory for the guest from host */
>>> +   seq_printf(f, "%-22s: %lu\n", "total_pages",
>>> +   guest_to_balloon_pages(i.totalram));
>> totalram is calculated from totalram_pages().
>>
>> When we inflate/deflate, we adjust totalram as well via
>> adjust_managed_page_count().
> 
> That is true only when not using DEFLATE_ON_OOM.
> 
> Otherwise inflated memory is accounted as used and total ram stays the same.
>> Consequently, this doesn't calculate what you actually want?
>> Total memory would be totalram+inflated, current would be totalram.
> 
> My calculations are correct for the case deflate_on_oom is enabled.
> 

Which is the corner case. You'd have to special-case on DEFLATE_ON_OOM
availability.

>> But, TBH, only export num_pages. User space can just lookup the other
>> information (totalram) via /proc/meminfo.
> 
> I had missed that the memory accounting is done differently depending 
> on a flag.
> 
> Since the calculations are different, I'd prefer to have the values 
> calculated and printed there.

What about an indication instead: whether or not inflated pages are
accounted into total? That would be slightly cleaner IMHO.

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 1/1] Create debugfs file with virtio balloon usage information

2022-07-14 Thread David Hildenbrand
On 05.07.22 10:36, Alexander Atanasov wrote:
> Allow the guest to know how much it is ballooned by the host.
> It is useful when debugging out of memory conditions.
> 
> When host gets back memory from the guest it is accounted
> as used memory in the guest, but the guest has no way to know
> how much it is actually ballooned.
> 
> Signed-off-by: Alexander Atanasov 
> ---
>  drivers/virtio/virtio_balloon.c | 77 +
>  include/uapi/linux/virtio_balloon.h |  1 +
>  2 files changed, 78 insertions(+)
> 
> V2:
>  - fixed coding style
>  - removed pretty print
> V3:
>  - removed duplicate of features
>  - made the comment about ballooned_pages clearer
>  - convert host pages to balloon pages
> V4:
>  - added a define for BALLOON_PAGE_SIZE to make things clear
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..dc4ad584b947 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -731,6 +732,77 @@ static void report_free_page_func(struct work_struct 
> *work)
>   }
>  }
>  
> +/*
> + * DEBUGFS Interface
> + */
> +#ifdef CONFIG_DEBUG_FS
> +
> +#define guest_to_balloon_pages(i) ((i)*VIRTIO_BALLOON_PAGES_PER_PAGE)
> +/**
> + * virtio_balloon_debug_show - shows statistics of balloon operations.
> + * @f: pointer to the &struct seq_file.
> + * @offset: ignored.
> + *
> + * Provides the statistics that can be accessed in virtio-balloon in the 
> debugfs.
> + *
> + * Return: zero on success or an error code.
> + */
> +
> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
> +{
> + struct virtio_balloon *b = f->private;
> + u32 num_pages;
> + struct sysinfo i;
> +
> + si_meminfo(&i);
> +
> + seq_printf(f, "%-22s: %d\n", "page_size", VIRTIO_BALLOON_PAGE_SIZE);
> +
> + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual,
> + &num_pages);
> + /*
> +  * Pages allocated by host from the guest memory.
> +  * Host inflates the balloon to get more memory.
> +  * Guest needs to deflate the balloon to get more memory.
> +  */

Please drop that comment. This is basic virtio-balloon operation that
must not be explained at this point.

> + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);
> +
> + /* Total Memory for the guest from host */
> + seq_printf(f, "%-22s: %lu\n", "total_pages",
> + guest_to_balloon_pages(i.totalram));

totalram is calculated from totalram_pages().

When we inflate/deflate, we adjust totalram as well via
adjust_managed_page_count().

Consequently, this doesn't calculate what you actually want?

Total memory would be totalram+inflated, current would be totalram.


But, TBH, only export num_pages. User space can just lookup the other
information (totalram) via /proc/meminfo.

-- 
Thanks,

David / dhildenb



Re: virtio_balloon regression in 5.19-rc3

2022-06-21 Thread David Hildenbrand
On 20.06.22 20:49, Ben Hutchings wrote:
> I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
> pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
> 
> 
>function="0x0"/>
> 
> 
> but the virtio_balloon driver fails to bind to it:
> 
> virtio_balloon virtio4: init_vqs: add stat_vq failed
> virtio_balloon: probe of virtio4 failed with error -5
> 

Hmm, I don't see any recent changes to drivers/virtio/virtio_balloon.c

virtqueue_add_outbuf() fails with -EIO if I'm not wrong. That's the
first call of virtqueue_add_outbuf() when virtio_balloon initializes.


Maybe something in generic virtio code changed?

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-10 Thread David Hildenbrand
On 09.06.22 16:35, Matthew Wilcox wrote:
> On Thu, Jun 09, 2022 at 02:50:20PM +0200, David Hildenbrand wrote:
>> On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
>>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>>> index c0fe711f14d3..3d28b23676bd 100644
>>> --- a/Documentation/filesystems/locking.rst
>>> +++ b/Documentation/filesystems/locking.rst
>>> @@ -253,7 +253,8 @@ prototypes::
>>> void (*free_folio)(struct folio *);
>>> int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>>> bool (*isolate_page) (struct page *, isolate_mode_t);
>>> -   int (*migratepage)(struct address_space *, struct page *, struct page *);
>>> +   int (*migrate_folio)(struct address_space *, struct folio *dst,
>>> +   struct folio *src, enum migrate_mode);
>>> void (*putback_page) (struct page *);
>>
>> isolate_page/putback_page are leftovers from the previous patch, no?
> 
> Argh, right, I completely forgot I needed to update the documentation in
> that patch.
> 
>>> +++ b/Documentation/vm/page_migration.rst
>>> @@ -181,22 +181,23 @@ which are function pointers of struct address_space_operations.
>>> Once page is successfully isolated, VM uses page.lru fields so driver
>>> shouldn't expect to preserve values in those fields.
>>>  
>>> -2. ``int (*migratepage) (struct address_space *mapping,``
>>> -|  ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
>>> -
>>> -   After isolation, VM calls migratepage() of driver with the isolated page.
>>> -   The function of migratepage() is to move the contents of the old page to the
>>> -   new page
>>> -   and set up fields of struct page newpage. Keep in mind that you should
>>> -   indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
>>> -   under page_lock if you migrated the oldpage successfully and returned
>>> -   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver
>>> -   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time
>>> -   because VM interprets -EAGAIN as "temporary migration failure". On returning
>>> -   any error except -EAGAIN, VM will give up the page migration without
>>> -   retrying.
>>> -
>>> -   Driver shouldn't touch the page.lru field while in the migratepage() function.
>>> +2. ``int (*migrate_folio) (struct address_space *mapping,``
>>> +|  ``struct folio *dst, struct folio *src, enum migrate_mode);``
>>> +
>>> +   After isolation, VM calls the driver's migrate_folio() with the
>>> +   isolated folio.  The purpose of migrate_folio() is to move the contents
>>> +   of the source folio to the destination folio and set up the fields
>>> +   of destination folio.  Keep in mind that you should indicate to the
>>> +   VM the source folio is no longer movable via __ClearPageMovable()
>>> +   under folio if you migrated the source successfully and returned
>>> +   MIGRATEPAGE_SUCCESS.  If driver cannot migrate the folio at the
>>> +   moment, driver can return -EAGAIN. On -EAGAIN, VM will retry folio
>>> +   migration in a short time because VM interprets -EAGAIN as "temporary
>>> +   migration failure".  On returning any error except -EAGAIN, VM will
>>> +   give up the folio migration without retrying.
>>> +
>>> +   Driver shouldn't touch the folio.lru field while in the migrate_folio()
>>> +   function.
>>>  
>>>  3. ``void (*putback_page)(struct page *);``
>>
>> Hmm, here it's a bit more complicated now, because we essentially have
>> two paths: LRU+migrate_folio or !LRU+movable_ops
>> (isolate/migrate/putback page)
> 
> Oh ... actually, this is just documenting the driver side of things.
> I don't really like how it's written.  Here, have some rewritten
> documentation (which is now part of the previous patch):
> 

LGTM, thanks.


-- 
Thanks,

David / dhildenb



[PATCH v1] drivers/virtio: Clarify CONFIG_VIRTIO_MEM for unsupported architectures

2022-06-10 Thread David Hildenbrand
Let's make it clearer that simply unlocking CONFIG_VIRTIO_MEM on an
architecture is most probably not sufficient to have it working as
expected.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Gavin Shan 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index b5adf6abd241..f86b6a988b63 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -115,9 +115,11 @@ config VIRTIO_MEM
 This driver provides access to virtio-mem paravirtualized memory
 devices, allowing to hotplug and hotunplug memory.
 
-This driver was only tested under x86-64 and arm64, but should
-theoretically work on all architectures that support memory hotplug
-and hotremove.
+This driver currently only supports x86-64 and arm64. Although it
+should compile on other architectures that implement memory
+hot(un)plug, architecture-specific and/or common
+code changes may be required for virtio-mem, kdump and kexec to work as
+expected.
 
 If unsure, say M.
 
-- 
2.35.3



Re: [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> Provide a folio-based replacement for aops->migratepage.  Update the
> documentation to document migrate_folio instead of migratepage.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  Documentation/filesystems/locking.rst |  5 ++--
>  Documentation/filesystems/vfs.rst | 13 ++-
>  Documentation/vm/page_migration.rst   | 33 ++-
>  include/linux/fs.h|  4 +++-
>  mm/compaction.c   |  4 +++-
>  mm/migrate.c  | 11 +
>  6 files changed, 40 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index c0fe711f14d3..3d28b23676bd 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -253,7 +253,8 @@ prototypes::
>   void (*free_folio)(struct folio *);
>   int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
>   bool (*isolate_page) (struct page *, isolate_mode_t);
> - int (*migratepage)(struct address_space *, struct page *, struct page *);
> + int (*migrate_folio)(struct address_space *, struct folio *dst,
> + struct folio *src, enum migrate_mode);
>   void (*putback_page) (struct page *);

isolate_page/putback_page are leftovers from the previous patch, no?

>   int (*launder_folio)(struct folio *);
>   bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
> @@ -281,7 +282,7 @@ release_folio:yes
>  free_folio:  yes
>  direct_IO:
>  isolate_page:yes
> -migratepage: yes (both)
> +migrate_folio:   yes (both)
>  putback_page:yes

Dito.

>  launder_folio:   yes
>  is_partially_uptodate:   yes
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index a08c652467d7..3ae1b039b03f 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -740,7 +740,8 @@ cache in your filesystem.  The following members are defined:
>   /* isolate a page for migration */
>   bool (*isolate_page) (struct page *, isolate_mode_t);
>   /* migrate the contents of a page to the specified target */
> - int (*migratepage) (struct page *, struct page *);
> + int (*migrate_folio)(struct mapping *, struct folio *dst,
> + struct folio *src, enum migrate_mode);
>   /* put migration-failed page back to right list */
>   void (*putback_page) (struct page *);

Dito.

>   int (*launder_folio) (struct folio *);
> @@ -935,12 +936,12 @@ cache in your filesystem.  The following members are defined:
>   is successfully isolated, VM marks the page as PG_isolated via
>   __SetPageIsolated.
>  
> -``migrate_page``
> +``migrate_folio``
>   This is used to compact the physical memory usage.  If the VM
> - wants to relocate a page (maybe off a memory card that is
> - signalling imminent failure) it will pass a new page and an old
> - page to this function.  migrate_page should transfer any private
> - data across and update any references that it has to the page.
> + wants to relocate a folio (maybe from a memory device that is
> + signalling imminent failure) it will pass a new folio and an old
> + folio to this function.  migrate_folio should transfer any private
> + data across and update any references that it has to the folio.
>  
>  ``putback_page``
>   Called by the VM when isolated page's migration fails.

Dito.

> diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst
> index 8c5cb8147e55..e0f73ddfabb1 100644
> --- a/Documentation/vm/page_migration.rst
> +++ b/Documentation/vm/page_migration.rst
> @@ -181,22 +181,23 @@ which are function pointers of struct address_space_operations.
> Once page is successfully isolated, VM uses page.lru fields so driver
> shouldn't expect to preserve values in those fields.
>  
> -2. ``int (*migratepage) (struct address_space *mapping,``
> -|``struct page *newpage, struct page *oldpage, enum migrate_mode);``
> -
> -   After isolation, VM calls migratepage() of driver with the isolated page.
> -   The function of migratepage() is to move the contents of the old page to the
> -   new page
> -   and set up fields of struct page newpage. Keep in mind that you should
> -   indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
> -   under page_lock if you migrated the oldpage successfully and returned
> -   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver
> -   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time
> -   because VM interprets -EAGAIN as "temporary mi

Re: [PATCH v2 01/19] secretmem: Remove isolate_page

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> The isolate_page operation is never called for filesystems, only
> for device drivers which call SetPageMovable.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/secretmem.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..1c7f1775b56e 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -133,11 +133,6 @@ static const struct file_operations secretmem_fops = {
>   .mmap   = secretmem_mmap,
>  };
>  
> -static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
> -{
> - return false;
> -}
> -
>  static int secretmem_migratepage(struct address_space *mapping,
>struct page *newpage, struct page *page,
>enum migrate_mode mode)
> @@ -155,7 +150,6 @@ const struct address_space_operations secretmem_aops = {
>   .dirty_folio= noop_dirty_folio,
>   .free_folio = secretmem_free_folio,
>   .migratepage= secretmem_migratepage,
> - .isolate_page   = secretmem_isolate_page,
>  };
>  
>  static int secretmem_setattr(struct user_namespace *mnt_userns,

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 02/19] mm: Convert all PageMovable users to movable_operations

2022-06-09 Thread David Hildenbrand
On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> These drivers are rather uncomfortably hammered into the
> address_space_operations hole.  They aren't filesystems and don't behave
> like filesystems.  They just need their own movable_operations structure,
> which we can point to directly from page->mapping.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  arch/powerpc/platforms/pseries/cmm.c |  60 +---
>  drivers/misc/vmw_balloon.c   |  61 +---
>  drivers/virtio/virtio_balloon.c  |  47 +---
>  include/linux/balloon_compaction.h   |   6 +-
>  include/linux/fs.h   |   2 -
>  include/linux/migrate.h  |  26 +--
>  include/linux/page-flags.h   |   2 +-
>  include/uapi/linux/magic.h   |   4 --
>  mm/balloon_compaction.c  |  10 ++-
>  mm/compaction.c  |  29 
>  mm/migrate.c |  24 +++
>  mm/util.c|   4 +-
>  mm/z3fold.c  |  82 +++--
>  mm/zsmalloc.c| 102 ++-
>  14 files changed, 94 insertions(+), 365 deletions(-)

You probably should have cc'ed the relevant maintainers (including me :P ).

For everything except z3fold.c and zsmalloc.c,

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH 15/20] balloon: Convert to migrate_folio

2022-06-08 Thread David Hildenbrand
On 07.06.22 21:21, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 03:24:15PM +0100, Matthew Wilcox wrote:
>> On Tue, Jun 07, 2022 at 09:36:21AM +0200, David Hildenbrand wrote:
>>> On 06.06.22 22:40, Matthew Wilcox (Oracle) wrote:
>>>>  const struct address_space_operations balloon_aops = {
>>>> -  .migratepage = balloon_page_migrate,
>>>> +  .migrate_folio = balloon_migrate_folio,
>>>>.isolate_page = balloon_page_isolate,
>>>>.putback_page = balloon_page_putback,
>>>>  };
>>>
>>> I assume you're working on conversion of the other callbacks as well,
>>> because otherwise, this ends up looking a bit inconsistent and confusing :)
>>
>> My intention was to finish converting aops for the next merge window.
>>
>> However, it seems to me that we goofed back in 2016 by merging
>> commit bda807d44454.  isolate_page() and putback_page() should
>> never have been part of address_space_operations.
>>
>> I'm about to embark on creating a new migrate_operations struct
>> for drivers to use that contains only isolate/putback/migrate.
>> No filesystem uses isolate/putback, so those can just be deleted.
>> Both migrate_operations & address_space_operations will contain a
>> migrate callback.

That makes sense to me. I wonder if there was a design
decision/discussion behind that. CCing Rafael.

@Rafael, full mail at

https://lkml.kernel.org/r/yp+lu55h4igav...@casper.infradead.org

> 
> Well, that went more smoothly than I thought it would.
> 
> I can't see a nice way to split this patch up (other than making secretmem
> its own patch).  We just don't have enough bits in struct page to support
> both ways of handling PageMovable at the same time, so we can't convert
> one driver at a time.  The diffstat is pretty compelling.

Yes, splitting rather overcomplicates stuff.

> 
> The patch is on top of this patch series; I think it probably makes
> sense to shuffle it to be first, to avoid changing these drivers to
> folios, then changing them back.

Absolutely.

> 
> Questions:
> 
> Is what I've done with zsmalloc acceptable?  The locking in that
> file is rather complex.
> 
> Can we now eliminate balloon_mnt / balloon_fs from cmm.c?  I haven't even
> compiled that file, but it seems like the filesystem serves no use now.
> 
> Similar question for vmw_balloon.c, although I have compiled that.

IIRC, virtio-balloon, cmm and vmw_balloon all have the mnt/fs just for
page migration purposes. So if one can get rid of them, all should be
able to get rid of them. Essentially everything that uses the balloon
compaction framework.

That should go into separate patches then.

> 
> ---
> 
> I just spotted a bug with zs_unregister_migration(); it won't compile
> without CONFIG_MIGRATION.  I'll fix that up if the general approach is
> acceptable.
> 
>  arch/powerpc/platforms/pseries/cmm.c |   13 
>  drivers/misc/vmw_balloon.c   |   10 --
>  include/linux/balloon_compaction.h   |6 +---
>  include/linux/fs.h   |2 -
>  include/linux/migrate.h  |   27 ++
>  include/linux/page-flags.h   |2 -
>  mm/balloon_compaction.c  |   18 ++--
>  mm/compaction.c  |   29 ---
>  mm/migrate.c |   23 ---
>  mm/secretmem.c   |6 
>  mm/util.c|4 +-
>  mm/z3fold.c  |   45 ++
>  mm/zsmalloc.c|   52 
> +--
>  13 files changed, 83 insertions(+), 154 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/cmm.c b/arch/powerpc/platforms/pseries/cmm.c
> index 15ed8206c463..2ecbab3db723 100644
> --- a/arch/powerpc/platforms/pseries/cmm.c
> +++ b/arch/powerpc/platforms/pseries/cmm.c
> @@ -578,23 +578,10 @@ static int cmm_balloon_compaction_init(void)
>   return rc;
>   }
>  
> - b_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
> - if (IS_ERR(b_dev_info.inode)) {
> - rc = PTR_ERR(b_dev_info.inode);
> - b_dev_info.inode = NULL;
> - kern_unmount(balloon_mnt);
> - balloon_mnt = NULL;
> - return rc;
> - }
> -
> - b_dev_info.inode->i_mapping->a_ops = &balloon_aops;


Are you missing similar updates to drivers/virtio/virtio_balloon.c ?

At least, there we're also using balloon_aops, so this patch shouldn't
compile.


Skimming over it, nothing else jumped at me.

-- 
Thanks,

David / dhildenb



Re: [PATCH 15/20] balloon: Convert to migrate_folio

2022-06-07 Thread David Hildenbrand
On 06.06.22 22:40, Matthew Wilcox (Oracle) wrote:
> This is little more than changing the types over; there's no real work
> being done in this function.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/balloon_compaction.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> index 4b8eab4b3f45..3f75b876ad76 100644
> --- a/mm/balloon_compaction.c
> +++ b/mm/balloon_compaction.c
> @@ -230,11 +230,10 @@ static void balloon_page_putback(struct page *page)
>  
>  
>  /* move_to_new_page() counterpart for a ballooned page */
> -static int balloon_page_migrate(struct address_space *mapping,
> - struct page *newpage, struct page *page,
> - enum migrate_mode mode)
> +static int balloon_migrate_folio(struct address_space *mapping,
> + struct folio *dst, struct folio *src, enum migrate_mode mode)
>  {
> - struct balloon_dev_info *balloon = balloon_page_device(page);
> + struct balloon_dev_info *balloon = balloon_page_device(&src->page);
>  
>   /*
>* We can not easily support the no copy case here so ignore it as it
> @@ -244,14 +243,14 @@ static int balloon_page_migrate(struct address_space *mapping,
>   if (mode == MIGRATE_SYNC_NO_COPY)
>   return -EINVAL;
>  
> - VM_BUG_ON_PAGE(!PageLocked(page), page);
> - VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> + VM_BUG_ON_FOLIO(!folio_test_locked(src), src);
> + VM_BUG_ON_FOLIO(!folio_test_locked(dst), dst);
>  
> - return balloon->migratepage(balloon, newpage, page, mode);
> + return balloon->migratepage(balloon, &dst->page, &src->page, mode);
>  }
>  
>  const struct address_space_operations balloon_aops = {
> - .migratepage = balloon_page_migrate,
> + .migrate_folio = balloon_migrate_folio,
>   .isolate_page = balloon_page_isolate,
>   .putback_page = balloon_page_putback,
>  };

I assume you're working on conversion of the other callbacks as well,
because otherwise, this ends up looking a bit inconsistent and confusing :)

Change LGTM.

-- 
Thanks,

David / dhildenb



Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

2022-06-02 Thread David Hildenbrand
On 02.06.22 11:28, zhenwei pi wrote:
> On 6/1/22 15:59, David Hildenbrand wrote:
>> On 01.06.22 04:17, zhenwei pi wrote:
>>> On 5/31/22 12:08, Jue Wang wrote:
>>>> On Mon, May 30, 2022 at 8:49 AM Peter Xu  wrote:
>>>>>
>>>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>>>>> A VM uses RAM of 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz)) occurs,
>>>>>> the 2M ([HVAx,HVAz)) of the hypervisor becomes inaccessible, but the guest
>>>>>> poisons 4K (@GPAy in [GPAx, GPAz)) only; it may hit another 511 MCEs
>>>>>> ([GPAx, GPAz) except GPAy). This is the worst case, so I want to add
>>>>>> '__le32 corrupted_pages' in struct virtio_balloon_config, used in the
>>>>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, so the guest
>>>>>> has a chance to isolate the other 511 pages ahead of time. And the guest
>>>>>> actually loses 2M, so fixing 512*4K seems to help significantly.
>>>>>
>>>>> It sounds hackish to teach a virtio device to assume one page will always
>>>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>>>> not virtio itself.
>>>>>
>>>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>>>> be ready the assumption can go away, and that does sound like a better
>>>>> approach towards this problem.
>>>>
>>>> +1.
>>>>
>>>> A hypervisor should always strive to minimize the guest memory loss.
>>>>
>>>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>>>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>>>> solution here. To be completely transparent, it's not _strictly_
>>>> required to poison the page (whatever the granularity it is) on the
>>>> host side, as long as the following are true:
>>>>
>>>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) poison to the guest.
>>>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>>>> or in some other way) and prevented from being reused by other
>>>> processes.
>>>>
>>>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>>>> poisoning is a good solution.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I assume when talking about "the performance memory drops a lot", you
>>>>>>> imply that this patch set can mitigate that performance drop?
>>>>>>>
>>>>>>> But why do you see a performance drop? Because we might lose some
>>>>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>>>>> does holes? I assume you'll see a performance drop simply because
>>>>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>>>>
>>>>>>> If you have some numbers to share, especially before/after this change,
>>>>>>> that would be great.
>>>>>>>
>>>>>>
>>>>>> The CE storm leads to 2 problems I have seen:
>>>>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>>>>> instruction of CPU increases a lot.
>>>>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>>>>> lot of time to handle IRQ.
>>>>>
>>>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>>>> whether it's necessary to handle the interrupts that frequently.  When I
>>>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>>>
>>>>> /*
>>>>>* The interrupt handler. This is called on every event.
>>>>>* Just call the poller directly to log any events.
>>>>>* This could in theory increase the threshold under high load,
>>>>>* but doesn't for now.
>>>>&

Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

2022-06-01 Thread David Hildenbrand
On 01.06.22 04:17, zhenwei pi wrote:
> On 5/31/22 12:08, Jue Wang wrote:
>> On Mon, May 30, 2022 at 8:49 AM Peter Xu  wrote:
>>>
>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
 A VM uses RAM of 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz)) occurs, the
 2M ([HVAx,HVAz)) of the hypervisor becomes inaccessible, but the guest poisons
 4K (@GPAy in [GPAx, GPAz)) only; it may hit another 511 MCEs ([GPAx, GPAz)
 except GPAy). This is the worst case, so I want to add
 '__le32 corrupted_pages' in struct virtio_balloon_config, used in the
 next step: reporting 512 * 4K 'corrupted_pages' to the guest; the guest has
 a chance to isolate the other 511 pages ahead of time. And the guest
 actually loses 2M, so fixing 512*4K seems to help significantly.
>>>
>>> It sounds hackish to teach a virtio device to assume one page will always
>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>> not virtio itself.
>>>
>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>> be ready the assumption can go away, and that does sound like a better
>>> approach towards this problem.
>>
>> +1.
>>
>> A hypervisor should always strive to minimize the guest memory loss.
>>
>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>> solution here. To be completely transparent, it's not _strictly_
>> required to poison the page (whatever the granularity it is) on the
>> host side, as long as the following are true:
>>
>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) poison to the guest.
>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>> or in some other way) and prevented from being reused by other
>> processes.
>>
>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>> poisoning is a good solution.
>>
>>>

>
> I assume when talking about "the performance memory drops a lot", you
> imply that this patch set can mitigate that performance drop?
>
> But why do you see a performance drop? Because we might lose some
> possible THP candidates (in the host or the guest) and you want to plug
> does holes? I assume you'll see a performance drop simply because
> poisoning memory is expensive, including migrating pages around on CE.
>
> If you have some numbers to share, especially before/after this change,
> that would be great.
>

 The CE storm leads to 2 problems I have seen:
 1, the memory bandwidth slows down to 10%~20%, and the cycles per
 instruction of CPU increases a lot.
 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
 lot of time to handle IRQ.
>>>
>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>> whether it's necessary to handle the interrupts that frequently.  When I
>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>
>>> /*
>>>   * The interrupt handler. This is called on every event.
>>>   * Just call the poller directly to log any events.
>>>   * This could in theory increase the threshold under high load,
>>>   * but doesn't for now.
>>>   */
>>> static void intel_threshold_interrupt(void)
>>>
>>> I think that matches with what I was thinking..  I mean for 2) not sure
>>> whether it can be seen as a CMCI problem and potentially can be optimized
>>> by adjust the cmci threshold dynamically.
>>
>> The CE storm caused performance drop is caused by the extra cycles
>> spent by the ECC steps in memory controller, not in CMCI handling.
>> This is observed in the Google fleet as well. A good solution is to
>> monitor the CE rate closely in user space via /dev/mcelog and migrate
>> all VMs to another host once the CE rate exceeds some threshold.
>>
>> CMCI is a _background_ interrupt that is not handled in the process
>> execution context and its handler is setup to switch to poll (1 / 5
>> min) mode if there are more than ~ a dozen CEs reported via CMCI per
>> second.
>>>
>>> --
>>> Peter Xu
>>>
> 
> Hi, Andrew, David, Naoya
> 
> According to the suggestions, I'd give up the improvement of memory 
> failure on huge page in this series.
> 
> Is it worth recovering corrupted pages for the guest kernel? I'd follow 
> your decision.

Well, as I said, I am not sure if we really need/want this for a handful
of 4k poisoned pages in a VM. As I suspected, doing so might primarily
be interesting for some sort of de-fragmentation (allow again a higher
order page to be placed at the affected PFNs), not because of the slight
reduction of available memory. A simple VM reboot would get the job
similarly done.

As the poisoning refcount code is already a bit shaky as I learned
recently in t
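As an aside on the user-space mitigation suggested in the quoted thread — watch the corrected-error rate and evacuate the host once it crosses a threshold — such a policy could be sketched as a sliding-window rate check. The threshold, window length, and class name here are assumptions for illustration, not an existing tool:

```python
# Sketch of the user-space CE-rate policy suggested in the thread: track
# corrected-error timestamps and flag the host for VM evacuation once
# the rate exceeds a threshold. Values are illustrative.
from collections import deque

class CERateMonitor:
    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if the host should
        be evacuated (rate over threshold within the window)."""
        self.events.append(timestamp)
        # Drop events that fell out of the sliding window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.max_events

if __name__ == "__main__":
    mon = CERateMonitor(max_events=12, window_seconds=1.0)
    # 13 CEs within one second: the 13th crosses the threshold.
    hits = [mon.record(t * 0.05) for t in range(13)]
    print(hits[-1])   # -> True
```

In practice the timestamps would come from an event source such as /dev/mcelog or rasdaemon; that integration is omitted here.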

Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

2022-05-30 Thread David Hildenbrand
On 25.05.22 01:32, zhenwei pi wrote:
> 
> 
> On 5/25/22 03:35, Sean Christopherson wrote:
>> On Fri, May 20, 2022, zhenwei pi wrote:
>>> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
>>> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
>>>   };
>>>   
>>> +/* the request body to communicate with the host side */
>>> +struct __virtio_balloon_recover {
>>> +   struct virtio_balloon_recover vbr;
>>> +   __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
>>
>> I assume this is copied from virtio_balloon.pfns, which also uses
>> __virtio32, but isn't that horribly broken?  PFNs are 'unsigned long',
>> i.e. 64 bits on 64-bit kernels.  x86-64 at least most definitely
>> generates 64-bit PFNs.  Unless there's magic I'm missing,
>> page_to_balloon_pfn() will truncate PFNs and feed the host bad info.
>>
> 
> Yes, I also noticed this point; I suppose the balloon device cannot
> work on a virtual machine whose physical addresses are larger than 16TB.

Yes, that's a historical artifact and we never ran into it in practice
-- because 16TB VMs are still rare, especially when paired with
virtio-balloon inflation/deflation. Most probably the guest should just
stop inflating when hitting such a big PFN. In the future, we might want
a proper sg interface instead.
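For reference, the 16TB ceiling follows directly from the wire format: a __virtio32 PFN with 4 KiB balloon pages addresses at most 2^32 × 4 KiB = 16 TiB. A small sketch of the truncation (illustrative arithmetic only, not driver code):

```python
# Illustrative arithmetic for the truncation discussed above:
# virtio-balloon carries PFNs as __virtio32, so a 64-bit PFN is
# effectively masked to 32 bits on the wire.

VIRTIO_BALLOON_PFN_SHIFT = 12        # 4 KiB balloon page size

def to_virtio32_pfn(pfn):
    """Model storing an unsigned long PFN into a __virtio32 field."""
    return pfn & 0xFFFFFFFF

# Highest guest-physical address representable without truncation:
limit = (1 << 32) << VIRTIO_BALLOON_PFN_SHIFT   # 2^32 pages * 4 KiB
print(limit == 16 << 40)                        # -> True: exactly 16 TiB

# A page just above 16 TiB wraps to a bogus low PFN:
print(to_virtio32_pfn((1 << 32) + 5))           # -> 5: host gets the wrong page
```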

-- 
Thanks,

David / dhildenb



Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

2022-05-30 Thread David Hildenbrand


> +
>  struct virtio_balloon {
>   struct virtio_device *vdev;
>   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> @@ -126,6 +133,16 @@ struct virtio_balloon {
>   /* Free page reporting device */
>   struct virtqueue *reporting_vq;
>   struct page_reporting_dev_info pr_dev_info;
> +
> + /* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
> + struct virtqueue *recover_vq;
> + spinlock_t recover_vq_lock;
> + struct notifier_block memory_failure_nb;
> + struct list_head corrupted_page_list;
> + struct list_head recovered_page_list;
> + spinlock_t recover_page_list_lock;
> + struct __virtio_balloon_recover in_vbr;
> + struct work_struct unpoison_memory_work;

I assume we want all that only with CONFIG_MEMORY_FAILURE.

>  };
>  
>  static const struct virtio_device_id id_table[] = {
> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
>   queue_work(system_freezable_wq, work);
>  }
>  
> +/*
> + * virtballoon_memory_failure - notified by memory failure, try to fix the
> + *  corrupted page.
> + * The memory failure notifier is designed to call back only when the kernel
> + * handled the failure successfully; WARN_ON_ONCE on the unlikely condition
> + * of finding an error (memory error handling is a best effort, not 100%
> + * covered).
> + */
> +static int virtballoon_memory_failure(struct notifier_block *notifier,
> +   unsigned long pfn, void *parm)
> +{
> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
> +  memory_failure_nb);
> + struct page *page;
> + struct __virtio_balloon_recover *out_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + page = pfn_to_online_page(pfn);
> + if (WARN_ON_ONCE(!page))
> + return NOTIFY_DONE;
> +
> + if (PageHuge(page))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(!PageHWPoison(page)))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(page_count(page) != 1))
> + return NOTIFY_DONE;

Relying on page_count() being 1 for correctness is usually a bit
shaky, for example, when racing against isolate_movable_page(), which
might temporarily bump up the refcount.

> +
> + get_page(page); /* balloon reference */
> +
> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);

Are we always guaranteed to be able to use GFP_KERNEL out of MCE
context? (IOW, are we never atomic?)

> + if (WARN_ON_ONCE(!out_vbr))
> + return NOTIFY_BAD;
> +
> + spin_lock(&vb->recover_page_list_lock);
> + balloon_page_push(&vb->corrupted_page_list, page);
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;

This makes me wonder if we should have a more generic guest->host
request queue, similar to what e.g., virtio-mem uses, instead of adding
a separate VIRTIO_BALLOON_VQ_RECOVER vq.

> + set_page_pfns(vb, out_vbr->pfns, page);
> + sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return NOTIFY_DONE;
> + }
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return NOTIFY_OK;
> +}
> +
> +static int recover_vq_get_response(struct virtio_balloon *vb)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + in_vbr = &vb->in_vbr;
> + memset(in_vbr, 0x00, sizeof(*in_vbr));
> + sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
> + err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return err;
> + }
> +
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return 0;
> +}
> +
> +static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned 
> int len)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct virtio_balloon_recover *vbr;
> + struct page *page;
> + unsigned int pfns;
> + u32 pfn0, pfn1;
> + __u8 status;
> +
> + /* the response is not expected */
> + if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
> + return;
> +
> + in_vbr = &vb->in_vbr;
> + vbr = &in_vbr->vbr;
> + if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
> + return;
> +
> + /* to make sure the balloon PFNs are contiguous */
> + for (pfn

Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

2022-05-30 Thread David Hildenbrand
On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here.  The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
> 
> I'm maintaining 1,000,000+ virtual machines; from my experience:
> a UE is quite unusual and occurs randomly, and I have not hit a UE storm 
> case in the past years. Memory also shows no obvious performance drop 
> after hitting a UE.
> 
> I have hit several CE storm cases, where memory performance drops a lot. 
> But I can't find an obvious relationship between UE and CE.
> 
> So from my point of view, fixing the corrupted page for the VM seems 
> good enough. And yes, unpoisoning several pages does not help 
> significantly, but it is still a chance to make virtualization better.
> 

I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?


I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

-- 
Thanks,

David / dhildenb



Re: [PATCH] virtio_balloon: check virtqueue_add_outbuf() return value

2022-05-30 Thread David Hildenbrand
On 30.05.22 04:16, Bo Liu (刘波)-浪潮信息 wrote:
> Adding this patch can avoid unnecessary VM exits, reducing their number.
> 

... in corner cases where virtqueue_add_outbuf() fails? Why do we care
about that corner case?

Looks like unnecessary code churn to me, unless I am missing something
important.

-- 
Thanks,

David / dhildenb


Re: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

2022-05-29 Thread David Hildenbrand
On 20.05.22 09:06, zhenwei pi wrote:
> Originally, unpoison_memory() is only used by hwpoison-inject, and
> unpoisons a page which was poisoned by hwpoison-inject too. The kernel PTE
> entry does not change during software poison/unpoison.
> 
> On a virtualization platform, it's possible for the hypervisor to fix a
> hardware-corrupted page; typically the hypervisor remaps the erroneous HVA
> (host virtual address). So add a new parameter 'const char *reason' to
> record the caller's reason.
> 
> Once the corrupted page gets fixed, the guest kernel needs to put the page
> back to buddy. Reusing the page hits the following issue (Intel Platinum 8260):
>  BUG: unable to handle page fault for address: 888061646000
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800f9e9b9062
>  Oops: 0002 [#1] PREEMPT SMP NOPTI
>  CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G   M   OE 
> 5.18.0-rc6.bm.1-amd64 #6
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>  RIP: 0010:clear_page_erms+0x7/0x10
> 
> The kernel PTE entry of the fixed page is still uncorrected, so the kernel
> hits a page fault during prep_new_page. Add 'bool reset_kpte' to get a
> chance to fix the PTE entry if the page is fixed by the hypervisor.

Why don't we want to do that for the hwpoison case?

> 
> Signed-off-by: zhenwei pi 
> ---
>  include/linux/mm.h   |  2 +-
>  mm/hwpoison-inject.c |  2 +-
>  mm/memory-failure.c  | 26 +++---
>  3 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 665873c2788c..7ba210e86401 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3191,7 +3191,7 @@ enum mf_flags {
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
> -extern int unpoison_memory(unsigned long pfn);
> +extern int unpoison_memory(unsigned long pfn, bool reset_kpte, const char 
> *reason);
>  extern int sysctl_memory_failure_early_kill;
>  extern int sysctl_memory_failure_recovery;
>  extern void shake_page(struct page *p);
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 5c0cddd81505..0dd17ba98ade 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -57,7 +57,7 @@ static int hwpoison_unpoison(void *data, u64 val)
>   if (!capable(CAP_SYS_ADMIN))
>   return -EPERM;
>  
> - return unpoison_memory(val);
> + return unpoison_memory(val, false, "hwpoison-inject");

s/hwpoison-inject/hwpoison/

or maybe

s/hwpoison-inject/debugfs/

>  }
>  
>  DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 95c218bb0a37..a46de3be1dd7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2132,21 +2132,26 @@ core_initcall(memory_failure_init);
>  /**
>   * unpoison_memory - Unpoison a previously poisoned page
>   * @pfn: Page number of the to be unpoisoned page
> + * @reset_kpte: Reset the PTE entry for kmap
> + * @reason: The caller's reason for unpoisoning the page
>   *
> - * Software-unpoison a page that has been poisoned by
> - * memory_failure() earlier.
> + * Unpoison a page that has been poisoned by memory_failure() earlier.
>   *
> - * This is only done on the software-level, so it only works
> - * for linux injected failures, not real hardware failures
> + * For Linux-injected failures, there is no need to reset the PTE entry.
> + * It's possible to fix a hardware memory failure on a virtualization platform;
> + * once the hypervisor fixes the failure, the guest needs to put the page back
> + * to buddy and reset the PTE entry in the kernel.

Why can't we do this unconditionally? Just check if the PTE is poisoned,
and if so, reset it.

-- 
Thanks,

David / dhildenb



Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

2022-05-24 Thread David Hildenbrand
On 20.05.22 09:06, zhenwei pi wrote:
> Hi,
> 
> I'm trying to recover hardware corrupted page by virtio balloon, the
> workflow of this feature like this:
> 
> Guest  5.MF -> 6.RVQ FE10.Unpoison page
> /   \/
> ---+-+--+---
>| |  |
> 4.MCE7.RVQ BE   9.RVQ Event
>  QEMU /   \   /
>  3.SIGBUS  8.Remap
> /
> +
> |
> +--2.MF
>  Host   /
>1.HW error
> 
> 1, HardWare page error occurs randomly.
> 2, host side handles the corrupted page by the Memory Failure mechanism, and
>sends SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
> 4, QEMU tries to inject MCE into guest.
> 5, guest handles memory failure again.
> 
> 1-5 is already supported for a long time, the next steps are supported
> in this patch(also related driver patch):
> 
> 6, guest balloon driver gets notified of the corrupted PFN, and sends
>request to host side by Recover VQ FrontEnd.
> 7, QEMU handles request from Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA to fix the memory failure, then:
> 9, QEMU acks the guest side the result by Recover VQ.
> 10, guest unpoisons the page if the corrupted page gets recovered
> successfully.
> 
> Test:
> This patch set can be tested with QEMU(also in developing):
> https://github.com/pizhenwei/qemu/tree/balloon-recover
> 
> Emulate MCE by QEMU(guest RAM normal page only, hugepage is not supported):
> virsh qemu-monitor-command vm --hmp mce 0 9 0xbdc0 0xd 0x61646678 
> 0x8c
> 
> The guest works fine(on Intel Platinum 8260):
>  mce: [Hardware Error]: Machine check events logged
>  Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
>  virtio_balloon virtio5: recovered pfn 0x61646
>  Unpoison: Unpoisoned page 0x61646 by virtio-balloon
>  MCE: Killing stress:24502 due to hardware memory corruption fault at 
> 7f5be2e5a010
> 
> And the 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
> 
> About the protocol of virtio balloon recover VQ, it's undefined and in
> developing currently:
> - 'struct virtio_balloon_recover' defines the structure which is used to
>   exchange message between guest and host.
> - '__le32 corrupted_pages' in struct virtio_balloon_config is used in the next
>   step:
>   1, a VM uses RAM of 2M huge pages; once an MCE occurs, the 2M becomes
>  inaccessible. Reporting 512 * 4K 'corrupted_pages' to the guest, the 
> guest
>  has a chance to isolate the 512 pages ahead of time.
> 
>   2, after migrating to another host, the corrupted pages are actually 
> recovered,
>  once the guest gets the 'corrupted_pages' with 0, then the guest could
>  unpoison all the poisoned pages which are recorded in the balloon driver.
> 

Hi,

I'm still on vacation this week, I'll try to have a look when I'm back
(and flushed out my overflowing inbox :D).


-- 
Thanks,

David / dhildenb



Re: [PATCH 3/4] mm/memofy-failure.c: optimize hwpoison_filter

2022-05-06 Thread David Hildenbrand
On 06.05.22 15:38, zhenwei pi wrote:
> 
> 
> On 5/6/22 16:59, Naoya Horiguchi wrote:
>> On Fri, Apr 29, 2022 at 10:22:05PM +0800, zhenwei pi wrote:
>>> In the memory failure procedure, hwpoison_filter() has higher priority;
>>> if hwpoison_filter() filters out the error event, there is no need to do
>>> the further work.
>>
>> Could you clarify what problem you are trying to solve (what does
>> "optimize" mean in this context or what is the benefit)?
>>
> 
> OK. The background of this work:
> As is well known, the memory failure mechanism handles memory-corruption 
> events, and tries to send SIGBUS to the user process which uses the 
> corrupted page.
> 
> For the virtualization case, QEMU catches SIGBUS and tries to inject MCE 
> into the guest, and the guest handles memory failure again. Thus the 
> guest gets the minimal effect from hardware memory corruption.
> 
> The further steps I'm working on:
> 1, try to modify the code to decrease poisoned pages in a single place 
> (mm/memory-failure.c: simplify num_poisoned_pages_dec in this series).
> 
> 2, try to use page_handle_poison() to handle SetPageHWPoison() and 
> num_poisoned_pages_inc() together. It would be best to call 
> num_poisoned_pages_inc() in a single place too. I'm not sure if this is 
> possible or not, please correct me if I misunderstand.
> 
> 3, introduce a memory failure notifier list in memory-failure.c: notify 
> the corrupted PFN to whoever registers on this list.
> If I can complete parts [1] and [2], [3] will be quite easy (just call the 
> notifier list after increasing the poisoned-page count).
> 
> 4, introduce memory recover VQ for memory balloon device, and registers 
> memory failure notifier list. During the guest kernel handles memory 
> failure, balloon device gets notified by memory failure notifier list, 
> and tells the host to recover the corrupted PFN(GPA) by the new VQ.

Most probably you might want to do that asynchronously, and once the
callback succeeds, un-poison the page.

> 
> 5, host side remaps the corrupted page (HVA), and tells the guest side to 
> unpoison the PFN (GPA). Then the guest fixes the corrupted page (GPA) 
> dynamically.

I think QEMU already does that during reboots. Now it would be triggered
by the guest for individual pages.

> 
> Because [4] and [5] are related to balloon device, also CC Michael, 
> David and Jason.

Doesn't sound too crazy for me, although it's a shame that we always
have to use virtio-balloon for such fairly balloon-unrelated things.

-- 
Thanks,

David / dhildenb



Re: [PATCH v10 2/5] mm: page_isolation: check specified range for unmovable pages

2022-04-12 Thread David Hildenbrand
On 12.04.22 17:01, Zi Yan wrote:
> On 12 Apr 2022, at 10:49, David Hildenbrand wrote:
> 
>> On 12.04.22 16:07, Zi Yan wrote:
>>> On 12 Apr 2022, at 9:10, David Hildenbrand wrote:
>>>
>>>> On 06.04.22 17:18, Zi Yan wrote:
>>>>> From: Zi Yan 
>>>>>
>>>>> Enable set_migratetype_isolate() to check specified sub-range for
>>>>> unmovable pages during isolation. Page isolation is done
>>>>> at MAX_ORDER_NR_PAGES granularity, but not all pages within that
>>>>> granularity are intended to be isolated. For example,
>>>>> alloc_contig_range(), which uses page isolation, allows ranges without
>>>>> alignment. This commit makes unmovable page check only look for
>>>>> interesting pages, so that page isolation can succeed for any
>>>>> non-overlapping ranges.
>>>>>
>>>>> Signed-off-by: Zi Yan 
>>>>> ---
>>>>
>>>> [...]
>>>>
>>>>>  /*
>>>>> - * This function checks whether pageblock includes unmovable pages or 
>>>>> not.
>>>>> + * This function checks whether the range [start_pfn, end_pfn) includes
>>>>> + * unmovable pages or not. The range must fall into a single pageblock 
>>>>> and
>>>>> + * consequently belong to a single zone.
>>>>>   *
>>>>>   * PageLRU check without isolation or lru_lock could race so that
>>>>>   * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
>>>>> @@ -28,12 +30,14 @@
>>>>>   * cannot get removed (e.g., via memory unplug) concurrently.
>>>>>   *
>>>>>   */
>>>>> -static struct page *has_unmovable_pages(struct zone *zone, struct page 
>>>>> *page,
>>>>> -  int migratetype, int flags)
>>>>> +static struct page *has_unmovable_pages(unsigned long start_pfn, 
>>>>> unsigned long end_pfn,
>>>>> + int migratetype, int flags)
>>>>>  {
>>>>> - unsigned long iter = 0;
>>>>> - unsigned long pfn = page_to_pfn(page);
>>>>> - unsigned long offset = pfn % pageblock_nr_pages;
>>>>> + unsigned long pfn = start_pfn;
>>>>> + struct page *page = pfn_to_page(pfn);
>>>>
>>>>
>>>> Just do
>>>>
>>>> struct page *page = pfn_to_page(start_pfn);
>>>> struct zone *zone = page_zone(page);
>>>>
>>>> here. No need to lookup the zone again in the loop because, as you
>>>> document "must ... belong to a single zone.".
>>>>
>>>> Then, there is also no need to initialize "pfn" here. In the loop header
>>>> is sufficient.
>>>>
>>>
>>> Sure.
>>>
>>>>> +
>>>>> + VM_BUG_ON(ALIGN_DOWN(start_pfn, pageblock_nr_pages) !=
>>>>> +   ALIGN_DOWN(end_pfn - 1, pageblock_nr_pages));
>>>>>
>>>>>   if (is_migrate_cma_page(page)) {
>>>>>   /*
>>>>> @@ -47,8 +51,11 @@ static struct page *has_unmovable_pages(struct zone 
>>>>> *zone, struct page *page,
>>>>>   return page;
>>>>>   }
>>>>>
>>>>> - for (; iter < pageblock_nr_pages - offset; iter++) {
>>>>> - page = pfn_to_page(pfn + iter);
>>>>> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>>>> + struct zone *zone;
>>>>> +
>>>>> + page = pfn_to_page(pfn);
>>>>> + zone = page_zone(page);
>>>>>
>>>>>   /*
>>>>>* Both, bootmem allocations and memory holes are marked
>>>>> @@ -85,7 +92,7 @@ static struct page *has_unmovable_pages(struct zone 
>>>>> *zone, struct page *page,
>>>>>   }
>>>>>
>>>>>   skip_pages = compound_nr(head) - (page - head);
>>>>> - iter += skip_pages - 1;
>>>>> + pfn += skip_pages - 1;
>>>>>   continue;
>>>>>   }
>>>>>
>>>>> @@ -97,7 +104,7 @@ static struct page *has_unmovable_pages(struct zone 
>>>>> *zone, struct page *page,
>>>>>*/
>>>>>   i

Re: [PATCH v10 2/5] mm: page_isolation: check specified range for unmovable pages

2022-04-12 Thread David Hildenbrand
On 12.04.22 16:07, Zi Yan wrote:
> On 12 Apr 2022, at 9:10, David Hildenbrand wrote:
> 
>> On 06.04.22 17:18, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> Enable set_migratetype_isolate() to check specified sub-range for
>>> unmovable pages during isolation. Page isolation is done
>>> at MAX_ORDER_NR_PAGES granularity, but not all pages within that
>>> granularity are intended to be isolated. For example,
>>> alloc_contig_range(), which uses page isolation, allows ranges without
>>> alignment. This commit makes unmovable page check only look for
>>> interesting pages, so that page isolation can succeed for any
>>> non-overlapping ranges.
>>>
>>> Signed-off-by: Zi Yan 
>>> ---
>>
>> [...]
>>
>>>  /*
>>> - * This function checks whether pageblock includes unmovable pages or not.
>>> + * This function checks whether the range [start_pfn, end_pfn) includes
>>> + * unmovable pages or not. The range must fall into a single pageblock and
>>> + * consequently belong to a single zone.
>>>   *
>>>   * PageLRU check without isolation or lru_lock could race so that
>>>   * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
>>> @@ -28,12 +30,14 @@
>>>   * cannot get removed (e.g., via memory unplug) concurrently.
>>>   *
>>>   */
>>> -static struct page *has_unmovable_pages(struct zone *zone, struct page 
>>> *page,
>>> -int migratetype, int flags)
>>> +static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned 
>>> long end_pfn,
>>> +   int migratetype, int flags)
>>>  {
>>> -   unsigned long iter = 0;
>>> -   unsigned long pfn = page_to_pfn(page);
>>> -   unsigned long offset = pfn % pageblock_nr_pages;
>>> +   unsigned long pfn = start_pfn;
>>> +   struct page *page = pfn_to_page(pfn);
>>
>>
>> Just do
>>
>> struct page *page = pfn_to_page(start_pfn);
>> struct zone *zone = page_zone(page);
>>
>> here. No need to lookup the zone again in the loop because, as you
>> document "must ... belong to a single zone.".
>>
>> Then, there is also no need to initialize "pfn" here. In the loop header
>> is sufficient.
>>
> 
> Sure.
> 
>>> +
>>> +   VM_BUG_ON(ALIGN_DOWN(start_pfn, pageblock_nr_pages) !=
>>> + ALIGN_DOWN(end_pfn - 1, pageblock_nr_pages));
>>>
>>> if (is_migrate_cma_page(page)) {
>>> /*
>>> @@ -47,8 +51,11 @@ static struct page *has_unmovable_pages(struct zone 
>>> *zone, struct page *page,
>>> return page;
>>> }
>>>
>>> -   for (; iter < pageblock_nr_pages - offset; iter++) {
>>> -   page = pfn_to_page(pfn + iter);
>>> +   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>> +   struct zone *zone;
>>> +
>>> +   page = pfn_to_page(pfn);
>>> +   zone = page_zone(page);
>>>
>>> /*
>>>  * Both, bootmem allocations and memory holes are marked
>>> @@ -85,7 +92,7 @@ static struct page *has_unmovable_pages(struct zone 
>>> *zone, struct page *page,
>>> }
>>>
>>> skip_pages = compound_nr(head) - (page - head);
>>> -   iter += skip_pages - 1;
>>> +   pfn += skip_pages - 1;
>>> continue;
>>> }
>>>
>>> @@ -97,7 +104,7 @@ static struct page *has_unmovable_pages(struct zone 
>>> *zone, struct page *page,
>>>  */
>>> if (!page_ref_count(page)) {
>>> if (PageBuddy(page))
>>> -   iter += (1 << buddy_order(page)) - 1;
>>> +   pfn += (1 << buddy_order(page)) - 1;
>>> continue;
>>> }
>>>
>>> @@ -134,11 +141,18 @@ static struct page *has_unmovable_pages(struct zone 
>>> *zone, struct page *page,
>>> return NULL;
>>>  }
>>>
>>> -static int set_migratetype_isolate(struct page *page, int migratetype, int 
>>> isol_flags)
>>> +/*
>>> + * This function sets the pageblock migratetype to isolate if no unmovable
>>> + * page is present in [start_pfn, end_pfn). The pageblock must intersect
>>> + * with [start_pfn, end_pfn).

Re: [PATCH v10 2/5] mm: page_isolation: check specified range for unmovable pages

2022-04-12 Thread David Hildenbrand
On 06.04.22 17:18, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at MAX_ORDER_NR_PAGES granularity, but not all pages within that
> granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.
> 
> Signed-off-by: Zi Yan 
> ---

[...]

>  /*
> - * This function checks whether pageblock includes unmovable pages or not.
> + * This function checks whether the range [start_pfn, end_pfn) includes
> + * unmovable pages or not. The range must fall into a single pageblock and
> + * consequently belong to a single zone.
>   *
>   * PageLRU check without isolation or lru_lock could race so that
>   * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
> @@ -28,12 +30,14 @@
>   * cannot get removed (e.g., via memory unplug) concurrently.
>   *
>   */
> -static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
> -  int migratetype, int flags)
> +static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned 
> long end_pfn,
> + int migratetype, int flags)
>  {
> - unsigned long iter = 0;
> - unsigned long pfn = page_to_pfn(page);
> - unsigned long offset = pfn % pageblock_nr_pages;
> + unsigned long pfn = start_pfn;
> + struct page *page = pfn_to_page(pfn);


Just do

struct page *page = pfn_to_page(start_pfn);
struct zone *zone = page_zone(page);

here. No need to lookup the zone again in the loop because, as you
document "must ... belong to a single zone.".

Then, there is also no need to initialize "pfn" here. In the loop header
is sufficient.

> +
> + VM_BUG_ON(ALIGN_DOWN(start_pfn, pageblock_nr_pages) !=
> +   ALIGN_DOWN(end_pfn - 1, pageblock_nr_pages));
>  
>   if (is_migrate_cma_page(page)) {
>   /*
> @@ -47,8 +51,11 @@ static struct page *has_unmovable_pages(struct zone *zone, 
> struct page *page,
>   return page;
>   }
>  
> - for (; iter < pageblock_nr_pages - offset; iter++) {
> - page = pfn_to_page(pfn + iter);
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + struct zone *zone;
> +
> + page = pfn_to_page(pfn);
> + zone = page_zone(page);
>  
>   /*
>* Both, bootmem allocations and memory holes are marked
> @@ -85,7 +92,7 @@ static struct page *has_unmovable_pages(struct zone *zone, 
> struct page *page,
>   }
>  
>   skip_pages = compound_nr(head) - (page - head);
> - iter += skip_pages - 1;
> + pfn += skip_pages - 1;
>   continue;
>   }
>  
> @@ -97,7 +104,7 @@ static struct page *has_unmovable_pages(struct zone *zone, 
> struct page *page,
>*/
>   if (!page_ref_count(page)) {
>   if (PageBuddy(page))
> - iter += (1 << buddy_order(page)) - 1;
> + pfn += (1 << buddy_order(page)) - 1;
>   continue;
>   }
>  
> @@ -134,11 +141,18 @@ static struct page *has_unmovable_pages(struct zone 
> *zone, struct page *page,
>   return NULL;
>  }
>  
> -static int set_migratetype_isolate(struct page *page, int migratetype, int 
> isol_flags)
> +/*
> + * This function sets the pageblock migratetype to isolate if no unmovable
> + * page is present in [start_pfn, end_pfn). The pageblock must intersect with
> + * [start_pfn, end_pfn).
> + */
> +static int set_migratetype_isolate(struct page *page, int migratetype, int 
> isol_flags,
> + unsigned long start_pfn, unsigned long end_pfn)

I think we might be able to do better, eventually not passing start_pfn at
all. Hmm.

I think we want to pull out the
start_isolate_page_range()/undo_isolate_page_range() interface change
into a separate patch.

Let me try to give it a shot, I'll try hacking something up real quick
to see if we can do better.

-- 
Thanks,

David / dhildenb



Re: [PATCH v10 0/5] Use pageblock_order for cma and alloc_contig_range alignment.

2022-04-12 Thread David Hildenbrand
On 06.04.22 17:18, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi David,

Hi!

> 
> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-05-15-54.

Sorry for the late reply, I've got way too many irons in the fire right now.

> 
> I also added "Content-Type: text/plain; charset=UTF-8" to all email bodies
> explicitly, please let me know if you still cannot see the emails in a
> proper format.

Oh, thanks! But no need to work around Mimecast mailing issues on your
side. This just has to be fixed for good on the RH side ...


I have yet to give #3 a thorough review, sorry for not commenting on
that earlier. It's a bit more involved, especially with all the
possible corner cases :)

-- 
Thanks,

David / dhildenb



Re: [PATCH v8 2/5] mm: page_isolation: check specified range for unmovable pages

2022-03-22 Thread David Hildenbrand
On 21.03.22 19:23, Zi Yan wrote:
> On 21 Mar 2022, at 13:30, David Hildenbrand wrote:
> 
>> On 17.03.22 16:37, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> Enable set_migratetype_isolate() to check specified sub-range for
>>> unmovable pages during isolation. Page isolation is done
>>> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
>>> pages within that granularity are intended to be isolated. For example,
>>> alloc_contig_range(), which uses page isolation, allows ranges without
>>> alignment. This commit makes unmovable page check only look for
>>> interesting pages, so that page isolation can succeed for any
>>> non-overlapping ranges.
>>>
>>> Signed-off-by: Zi Yan 
>>> ---
>>>  include/linux/page-isolation.h | 10 +
>>>  mm/page_alloc.c| 13 +--
>>>  mm/page_isolation.c| 69 --
>>>  3 files changed, 51 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>>> index e14eddf6741a..eb4a208fe907 100644
>>> --- a/include/linux/page-isolation.h
>>> +++ b/include/linux/page-isolation.h
>>> @@ -15,6 +15,16 @@ static inline bool is_migrate_isolate(int migratetype)
>>>  {
>>> return migratetype == MIGRATE_ISOLATE;
>>>  }
>>> +static inline unsigned long pfn_max_align_down(unsigned long pfn)
>>> +{
>>> +   return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
>>> +}
>>> +
>>> +static inline unsigned long pfn_max_align_up(unsigned long pfn)
>>> +{
>>> +   return ALIGN(pfn, MAX_ORDER_NR_PAGES);
>>> +}
>>> +
>>>  #else
>>>  static inline bool has_isolate_pageblock(struct zone *zone)
>>>  {
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 6de57d058d3d..680580a40a35 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -8937,16 +8937,6 @@ void *__init alloc_large_system_hash(const char 
>>> *tablename,
>>>  }
>>>
>>>  #ifdef CONFIG_CONTIG_ALLOC
>>> -static unsigned long pfn_max_align_down(unsigned long pfn)
>>> -{
>>> -   return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
>>> -}
>>> -
>>> -static unsigned long pfn_max_align_up(unsigned long pfn)
>>> -{
>>> -   return ALIGN(pfn, MAX_ORDER_NR_PAGES);
>>> -}
>>> -
>>>  #if defined(CONFIG_DYNAMIC_DEBUG) || \
>>> (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
>>>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
>>> @@ -9091,8 +9081,7 @@ int alloc_contig_range(unsigned long start, unsigned 
>>> long end,
>>>  * put back to page allocator so that buddy can use them.
>>>  */
>>>
>>> -   ret = start_isolate_page_range(pfn_max_align_down(start),
>>> -  pfn_max_align_up(end), migratetype, 0);
>>> +   ret = start_isolate_page_range(start, end, migratetype, 0);
>>> if (ret)
>>> return ret;
>>
>> Shouldn't we similarly adjust undo_isolate_page_range()? IOW, all users
>> of pfn_max_align_down()/pfn_max_align_up(). would be gone from that file
>> and you can move these defines into mm/page_isolation.c instead of
>> include/linux/page-isolation.h?
> 
> undo_isolate_page_range() faces much simpler situation, just needing
> to unset migratetype. We can just pass pageblock_nr_pages aligned range
> to it. For start_isolate_page_range(), start and end are also used for
> has_unmovable_pages() for precise unmovable page identification, so
> they cannot be pageblock_nr_pages aligned. But for readability and symmetry,
> yes, I can change undo_isolate_page_range() too.
Yeah, we should call both with the same range and any extension of the
range should be handled internally.

I thought about some corner cases, especially once we relax some (CMA)
alignment thingies -- then, the CMA area might be placed at weird
locations. I haven't checked to which degree they apply, but we should
certainly keep them in mind whenever we're extending the isolation range.

We can assume that the contig range we're allocating:
a) Belongs to a single zone
b) Does not contain any memory holes / mmap holes

Let's double-check:


1) Different zones in extended range

   ... ZONE A ][        ZONE B ...
[ Pageblock X ][ Pageblock Y ][ Pageblock Z ]
[ MAX_ORDER - 1              ]

We can never create a higher-order page between X and Y, because they
are in different zones.

Re: [PATCH v8 2/5] mm: page_isolation: check specified range for unmovable pages

2022-03-21 Thread David Hildenbrand
On 17.03.22 16:37, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.
> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/page-isolation.h | 10 +
>  mm/page_alloc.c| 13 +--
>  mm/page_isolation.c| 69 --
>  3 files changed, 51 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index e14eddf6741a..eb4a208fe907 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -15,6 +15,16 @@ static inline bool is_migrate_isolate(int migratetype)
>  {
>   return migratetype == MIGRATE_ISOLATE;
>  }
> +static inline unsigned long pfn_max_align_down(unsigned long pfn)
> +{
> + return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
> +}
> +
> +static inline unsigned long pfn_max_align_up(unsigned long pfn)
> +{
> + return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> +}
> +
>  #else
>  static inline bool has_isolate_pageblock(struct zone *zone)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de57d058d3d..680580a40a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8937,16 +8937,6 @@ void *__init alloc_large_system_hash(const char 
> *tablename,
>  }
>  
>  #ifdef CONFIG_CONTIG_ALLOC
> -static unsigned long pfn_max_align_down(unsigned long pfn)
> -{
> - return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
> -static unsigned long pfn_max_align_up(unsigned long pfn)
> -{
> - return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
>  #if defined(CONFIG_DYNAMIC_DEBUG) || \
>   (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
> @@ -9091,8 +9081,7 @@ int alloc_contig_range(unsigned long start, unsigned 
> long end,
>* put back to page allocator so that buddy can use them.
>*/
>  
> - ret = start_isolate_page_range(pfn_max_align_down(start),
> -pfn_max_align_up(end), migratetype, 0);
> + ret = start_isolate_page_range(start, end, migratetype, 0);
>   if (ret)
>   return ret;

Shouldn't we similarly adjust undo_isolate_page_range()? IOW, all users
of pfn_max_align_down()/pfn_max_align_up() would be gone from that file
and you can move these defines into mm/page_isolation.c instead of
include/linux/page-isolation.h?

Maybe perform this change in a separate patch for
start_isolate_page_range() and undo_isolate_page_range()?

>  
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b34f1310aeaa..419c805dbdcd 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -16,7 +16,8 @@
>  #include 
>  
>  /*
> - * This function checks whether pageblock includes unmovable pages or not.
> + * This function checks whether pageblock within [start_pfn, end_pfn) 
> includes
> + * unmovable pages or not.

I think we still want to limit that to a single pageblock (see below),
as we're going to isolate individual pageblocks. Then an updated
description could be:

"This function checks whether the range [start_pfn, end_pfn) includes
unmovable pages or not. The range must fall into a single pageblock and
consequently belong to a single zone."

>   *
>   * PageLRU check without isolation or lru_lock could race so that
>   * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
> @@ -28,27 +29,26 @@
>   * cannot get removed (e.g., via memory unplug) concurrently.
>   *
>   */
> -static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
> -  int migratetype, int flags)
> +static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned 
> long end_pfn,
> + int migratetype, int flags)
>  {
> - unsigned long iter = 0;
> - unsigned long pfn = page_to_pfn(page);
> - unsigned long offset = pfn % pageblock_nr_pages;
> + unsigned long pfn = start_pfn;
>  
> - if (is_migrate_cma_page(page)) {
> - /*
> -  * CMA allocations (alloc_contig_range) really need to mark
> -  * isolate CMA pageblocks even when they are not movable in fact
> -  * so consider them movable here.
> -  */
> - if (is_migrate_cma(migratetype))
> - return NULL;

If we're really dealing with a range that falls into a single pageblock,
then you can leave the is_migrate_cma_page() in place and also look up
the zone only once. That should speed up things and
Re: [PATCH v7 2/5] mm: page_isolation: check specified range for unmovable pages

2022-03-14 Thread David Hildenbrand
On 11.03.22 19:36, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.
> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/page-isolation.h | 10 
>  mm/page_alloc.c| 13 +-
>  mm/page_isolation.c| 47 +-
>  3 files changed, 40 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index e14eddf6741a..eb4a208fe907 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -15,6 +15,16 @@ static inline bool is_migrate_isolate(int migratetype)
>  {
>   return migratetype == MIGRATE_ISOLATE;
>  }
> +static inline unsigned long pfn_max_align_down(unsigned long pfn)
> +{
> + return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
> +}
> +
> +static inline unsigned long pfn_max_align_up(unsigned long pfn)
> +{
> + return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> +}
> +
>  #else
>  static inline bool has_isolate_pageblock(struct zone *zone)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de57d058d3d..680580a40a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8937,16 +8937,6 @@ void *__init alloc_large_system_hash(const char 
> *tablename,
>  }
>  
>  #ifdef CONFIG_CONTIG_ALLOC
> -static unsigned long pfn_max_align_down(unsigned long pfn)
> -{
> - return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
> -static unsigned long pfn_max_align_up(unsigned long pfn)
> -{
> - return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> -}
> -
>  #if defined(CONFIG_DYNAMIC_DEBUG) || \
>   (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
>  /* Usage: See admin-guide/dynamic-debug-howto.rst */
> @@ -9091,8 +9081,7 @@ int alloc_contig_range(unsigned long start, unsigned 
> long end,
>* put back to page allocator so that buddy can use them.
>*/
>  
> - ret = start_isolate_page_range(pfn_max_align_down(start),
> -pfn_max_align_up(end), migratetype, 0);
> + ret = start_isolate_page_range(start, end, migratetype, 0);
>   if (ret)
>   return ret;
>  
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b34f1310aeaa..e0afc3ee8cf9 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -16,7 +16,8 @@
>  #include 
>  
>  /*
> - * This function checks whether pageblock includes unmovable pages or not.
> + * This function checks whether pageblock within [start_pfn, end_pfn) 
> includes
> + * unmovable pages or not.
>   *
>   * PageLRU check without isolation or lru_lock could race so that
>   * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
> @@ -29,11 +30,14 @@
>   *
>   */
>  static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
> -  int migratetype, int flags)
> +  int migratetype, int flags,
> +  unsigned long start_pfn, unsigned long end_pfn)
>  {
> - unsigned long iter = 0;
> - unsigned long pfn = page_to_pfn(page);
> - unsigned long offset = pfn % pageblock_nr_pages;
> + unsigned long first_pfn = max(page_to_pfn(page), start_pfn);
> + unsigned long pfn = first_pfn;
> + unsigned long last_pfn = min(ALIGN(pfn + 1, pageblock_nr_pages), 
> end_pfn);
> +
> + page = pfn_to_page(pfn);

I think we should get rid of the page argument completely. The caller
should pass in a reasonable [start_pfn, end_pfn) range, and to any
necessary fixups to the range outside of this function.

The goal should be to have

pfn = start_pfn

and replacing last_pfn by end_pfn.


Ideally we'd end up with "This function checks whether the range
[start_pfn, end_pfn) contains unmovable pages or not."


What would be missing to achieve that?

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 1/5] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

2022-03-14 Thread David Hildenbrand
On 11.03.22 19:36, Zi Yan wrote:
> From: Zi Yan 
> 
> has_unmovable_pages() is only used in mm/page_isolation.c. Move it from
> mm/page_alloc.c and make it static.
> 
> Signed-off-by: Zi Yan 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: Mike Rapoport 

Acked-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH] virtio: drop default for virtio-mem

2022-02-25 Thread David Hildenbrand
On 25.02.22 12:48, Michael S. Tsirkin wrote:
> There's no special reason why virtio-mem needs a default that's
> different from what kconfig provides, any more than e.g. virtio blk.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 34f80b7a8a64..492fc26f0b65 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -105,7 +105,6 @@ config VIRTIO_BALLOON
>  
>  config VIRTIO_MEM
>   tristate "Virtio mem driver"
> - default m
>   depends on X86_64
>   depends on VIRTIO
>   depends on MEMORY_HOTPLUG

Yeah, why not

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH] Virtio-balloon: add user space API for sizing

2022-02-15 Thread David Hildenbrand
On 14.02.22 20:59, Kameron Lutes wrote:
> This new linux API will allow user space applications to directly
> control the size of the virtio-balloon. This is useful in
> situations where the guest must quickly respond to drastically
> increased memory pressure and cannot wait for the host to adjust
> the balloon's size.
> 
> Under the current wording of the Virtio spec, guest driven
> behavior such as this is permitted:
> 
> VIRTIO Version 1.1 Section 5.5.6
> "The device is driven either by the receipt of a configuration
> change notification, or by changing guest memory needs, such as
> performing memory compaction or responding to out of memory
> conditions."

Not quite. num_pages is determined by the hypervisor only and the guest
is not expected to change it, and if it does, it's ignored.

5.5.6 does not indicate at all that the guest may change it or that it
would have any effect. num_pages is examined only, actual is updated by
the driver.

5.5.6.1 documents what's allowed, e.g.,

  The driver SHOULD supply pages to the balloon when num_pages is
  greater than the actual number of pages in the balloon.

  The driver MAY use pages from the balloon when num_pages is less than
  the actual number of pages in the balloon.

and special handling for VIRTIO_BALLOON_F_DEFLATE_ON_OOM.

Especially, we have

  The driver MUST update actual after changing the number of pages in
  the balloon.

  The driver MAY update actual once after multiple inflate and deflate
  operations.

That's also why QEMU never syncs back the num_pages value from the guest
when writing the config.


Current spec does not allow for what you propose.


> 
> The intended use case for this API is one where the host
> communicates a deflation limit to the guest. The guest may then
> choose to respond to memory pressure by deflating its balloon down
> to the guest's allowable limit.

It would be good to have a full proposal and a proper spec update. I'd
assume you'd want separate values for a soft vs. hard num_pages -- if
that's what we really want.

BUT

There seems to be recent interest in handling memory pressure in a
better way (although how to really detect "serious memory pressure" vs
"ordinary reclaim" in Linux is still to be figured out). There is
already a discussion going on how that could happen. Adding random user
space toggles might not be the best idea. We might want a single
mechanism to achieve that.

https://lists.oasis-open.org/archives/virtio-comment/202201/msg00139.html

-- 
Thanks,

David / dhildenb



[PATCH v1 0/2] mm: enforce pageblock_order < MAX_ORDER

2022-02-14 Thread David Hildenbrand
Having pageblock_order >= MAX_ORDER can happen in corner cases, and some
parts of the kernel are not prepared for it.

For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run
into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right during
boot.

We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which case
we are no longer talking about huge pages but instead gigantic pages.

Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of
such gigantic pages more likely to succeed.

Reliable use of gigantic pages either requires boot-time allocation or CMA;
no need to overcomplicate some places in the kernel to optimize for corner
cases that are broken in other areas of the kernel.

Let's enforce pageblock_order < MAX_ORDER and simplify.

Especially patch #1 can be regarded a cleanup before:
[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
alignment. [2]

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com
[2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi@sent.com

Cc: Andrew Morton 
Cc: Aneesh Kumar K V 
Cc: Zi Yan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: "Michael S. Tsirkin" 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Minchan Kim 
Cc: Vlastimil Babka 
Cc: linuxppc-...@lists.ozlabs.org
Cc: devicet...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: io...@lists.linux-foundation.org
Cc: linux...@kvack.org

David Hildenbrand (2):
  cma: factor out minimum alignment requirement
  mm: enforce pageblock_order < MAX_ORDER

 arch/powerpc/include/asm/fadump-internal.h |  5 
 arch/powerpc/kernel/fadump.c   |  2 +-
 drivers/of/of_reserved_mem.c   |  9 ++
 drivers/virtio/virtio_mem.c|  9 ++
 include/linux/cma.h|  8 ++
 include/linux/pageblock-flags.h|  7 +++--
 kernel/dma/contiguous.c|  4 +--
 mm/Kconfig |  3 ++
 mm/cma.c   | 20 --
 mm/page_alloc.c| 32 ++
 10 files changed, 37 insertions(+), 62 deletions(-)


base-commit: 754e0b0e35608ed5206d6a67a791563c631cec07
-- 
2.34.1



[PATCH v1 1/2] cma: factor out minimum alignment requirement

2022-02-14 Thread David Hildenbrand
Let's factor out determining the minimum alignment requirement for CMA
and add a helpful comment.

No functional change intended.

Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/fadump-internal.h |  5 -
 arch/powerpc/kernel/fadump.c   |  2 +-
 drivers/of/of_reserved_mem.c   |  9 +++--
 include/linux/cma.h|  9 +
 kernel/dma/contiguous.c|  4 +---
 mm/cma.c   | 20 +---
 6 files changed, 19 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index 52189928ec08..81bcb9abb371 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -19,11 +19,6 @@
 
 #define memblock_num_regions(memblock_type)(memblock.memblock_type.cnt)
 
-/* Alignment per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT   (PAGE_SIZE <<   \
-max_t(unsigned long, MAX_ORDER - 1,\
-pageblock_order))
-
 /* FAD commands */
 #define FADUMP_REGISTER1
 #define FADUMP_UNREGISTER  2
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d03e488cfe9c..7eb67201ea41 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -544,7 +544,7 @@ int __init fadump_reserve_mem(void)
if (!fw_dump.nocma) {
fw_dump.boot_memory_size =
ALIGN(fw_dump.boot_memory_size,
- FADUMP_CMA_ALIGNMENT);
+ CMA_MIN_ALIGNMENT_BYTES);
}
 #endif
 
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 9c0fb962c22b..75caa6f5d36f 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "of_private.h"
 
@@ -116,12 +117,8 @@ static int __init __reserved_mem_alloc_size(unsigned long 
node,
if (IS_ENABLED(CONFIG_CMA)
&& of_flat_dt_is_compatible(node, "shared-dma-pool")
&& of_get_flat_dt_prop(node, "reusable", NULL)
-   && !nomap) {
-   unsigned long order =
-   max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-   align = max(align, (phys_addr_t)PAGE_SIZE << order);
-   }
+   && !nomap)
+   align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
 
prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
if (prop) {
diff --git a/include/linux/cma.h b/include/linux/cma.h
index bd801023504b..75fe188ec4a1 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -20,6 +20,15 @@
 
 #define CMA_MAX_NAME 64
 
+/*
+ * TODO: once the buddy -- especially pageblock merging and 
alloc_contig_range()
+ * -- can deal with only some pageblocks of a higher-order page being
+ *  MIGRATE_CMA, we can use pageblock_nr_pages.
+ */
+#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
+ pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
+
 struct cma;
 
 extern unsigned long totalcma_pages;
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..6ea80ae42622 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -399,8 +399,6 @@ static const struct reserved_mem_ops rmem_cma_ops = {
 
 static int __init rmem_cma_setup(struct reserved_mem *rmem)
 {
-   phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
-   phys_addr_t mask = align - 1;
unsigned long node = rmem->fdt_node;
bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
struct cma *cma;
@@ -416,7 +414,7 @@ static int __init rmem_cma_setup(struct reserved_mem *rmem)
of_get_flat_dt_prop(node, "no-map", NULL))
return -EINVAL;
 
-   if ((rmem->base & mask) || (rmem->size & mask)) {
+   if (!IS_ALIGNED(rmem->base | rmem->size, CMA_MIN_ALIGNMENT_BYTES)) {
pr_err("Reserved memory: incorrect alignment of CMA region\n");
return -EINVAL;
}
diff --git a/mm/cma.c b/mm/cma.c
index bc9ca8f3c487..5a2cd5851658 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -168,7 +168,6 @@ int __init cma_init_reserved_mem(phys_addr_t base, 
phys_addr_t size,
 struct cma **res_cma)
 {
struct cma *cma;
-   phys_addr_t alignment;
 
/* Sanity checks */
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
@@ -179,15 +178,12 @@ int __init cma_init_reserved_mem

[PATCH v1 2/2] mm: enforce pageblock_order < MAX_ORDER

2022-02-14 Thread David Hildenbrand
Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) With CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
   pages via __free_pages_core(), which cannot possibly work.

2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
   start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
   pageblock_order, we could have a single pageblock partially managed by
   two zones.

3) compaction code runs into __fragmentation_index() with order
   >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]

4) mm/page_reporting.c won't be reporting any pages with default
   page_reporting_order == pageblock_order, as we'll be skipping the
   reporting loop inside page_reporting_process_zone().

5) __rmqueue_fallback() will never be able to steal with
   ALLOC_NOFRAGMENT.

pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allocation of
gigantic pages, a little more reliable to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected setups
should be using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c |  9 +++--
 include/linux/cma.h |  3 +--
 include/linux/pageblock-flags.h |  7 +--
 mm/Kconfig  |  3 +++
 mm/page_alloc.c | 32 
 5 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 38becd8d578c..e7d6b679596d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2476,13 +2476,10 @@ static int virtio_mem_init_hotplug(struct virtio_mem 
*vm)
  VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
/*
-* We want subblocks to span at least MAX_ORDER_NR_PAGES and
-* pageblock_nr_pages pages. This:
-* - Is required for now for alloc_contig_range() to work reliably -
-*   it doesn't properly handle smaller granularity on ZONE_NORMAL.
+* TODO: once alloc_contig_range() works reliably with pageblock
+* granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
 */
-   sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
-   pageblock_nr_pages) * PAGE_SIZE;
+   sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
if (sb_size < memory_block_size_bytes() && !force_bbm) {
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 75fe188ec4a1..b1ba94f1cc9c 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -25,8 +25,7 @@
  * -- can deal with only some pageblocks of a higher-order page being
  *  MIGRATE_CMA, we can use pageblock_nr_pages.
  */
-#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
- pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES
 #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
 
 struct cma;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..83c7248053a1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -37,8 +37,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-/* Huge pages are a constant size */
-#define pageblock_orderHUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_ordermin_t(unsigned int, HUGETLB_PAGE_ORDER, 
MAX_ORDER - 1)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..4c91b92e7537 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE
  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes 
available
  on a platform.
 
+ Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+ clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..04cf964b57b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1072,14 +1072,12 @@ static inline void __free_one_page(struct page *page,
int migratetype, fpi_t fpi_flags)
 {
struct capture_control *capc = task_capc(zone);
+   unsigned int max_order = pageblock_order;
unsigned long buddy_pfn;
unsigned long combined_pfn;
-   unsigned int max_order;

Re: [PATCH v3] drivers/virtio: Enable virtio mem for ARM64

2022-02-04 Thread David Hildenbrand
On 04.02.22 15:04, Michael S. Tsirkin wrote:
> On Fri, Feb 04, 2022 at 02:29:39PM +0100, David Hildenbrand wrote:
>> On 04.02.22 14:24, Michael S. Tsirkin wrote:
>>> On Wed, Jan 19, 2022 at 09:35:05AM +0100, David Hildenbrand wrote:
>>>> On 19.01.22 08:46, Gavin Shan wrote:
>>>>> Hi Michael,
>>>>>
>>>>> On 1/19/22 3:39 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Jan 19, 2022 at 09:05:51AM +0800, Gavin Shan wrote:
>>>>>>> This enables virtio-mem device support by allowing to enable the
>>>>>>> corresponding kernel config option (CONFIG_VIRTIO_MEM) on the
>>>>>>> architecture.
>>>>>>>
>>>>>>> Signed-off-by: Gavin Shan 
>>>>>>> Acked-by: David Hildenbrand 
>>>>>>> Acked-by: Jonathan Cameron 
>>>>>>> Acked-by: Michael S. Tsirkin 
>>>>>>> ---
>>>>>>> v3: Pick ack-by tags from Jonathan and Michael
>>>>>>> ---
>>>>>>>   drivers/virtio/Kconfig | 7 ---
>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>>>>>> index 34f80b7a8a64..74c8b0c7bc33 100644
>>>>>>> --- a/drivers/virtio/Kconfig
>>>>>>> +++ b/drivers/virtio/Kconfig
>>>>>>> @@ -106,7 +106,7 @@ config VIRTIO_BALLOON
>>>>>>>   config VIRTIO_MEM
>>>>>>> tristate "Virtio mem driver"
>>>>>>> default m
>>>>>>> -   depends on X86_64
>>>>>>> +   depends on X86_64 || ARM64
>>>>>>> depends on VIRTIO
>>>>>>> depends on MEMORY_HOTPLUG
>>>>>>> depends on MEMORY_HOTREMOVE
>>>>>>> @@ -116,8 +116,9 @@ config VIRTIO_MEM
>>>>>>>  This driver provides access to virtio-mem paravirtualized 
>>>>>>> memory
>>>>>>>  devices, allowing to hotplug and hotunplug memory.
>>>>>>>   
>>>>>>> -This driver was only tested under x86-64, but should 
>>>>>>> theoretically
>>>>>>> -work on all architectures that support memory hotplug and 
>>>>>>> hotremove.
>>>>>>> +This driver was only tested under x86-64 and arm64, but should
>>>>>>> +theoretically work on all architectures that support memory 
>>>>>>> hotplug
>>>>>>> +and hotremove.
>>>>>>>   
>>>>>>
>>>>>> BTW isn't there a symbol saying "memory hotplug" that we can depend on?
>>>>>>
>>>>
>>>> You mean
>>>>
>>>>depends on MEMORY_HOTPLUG
>>>>depends on MEMORY_HOTREMOVE
>>>>
>>>> We still need a manual opt-in from architectures, because devil's in the
>>>> detail. (e.g., memblock stuff we had to adjust)
>>>
>>> Is there any chance of documenting some of this here? The current comment 
>>> makes it
>>> look like it's just a question of whitelisting an architecture.
>>
>> I can send a patch to add more details.
> 
> oks so that will be a patch on top?

Yes, it's independent of arm64 support.

-- 
Thanks,

David / dhildenb



Re: [PATCH v3] drivers/virtio: Enable virtio mem for ARM64

2022-02-04 Thread David Hildenbrand
On 04.02.22 14:24, Michael S. Tsirkin wrote:
> On Wed, Jan 19, 2022 at 09:35:05AM +0100, David Hildenbrand wrote:
>> On 19.01.22 08:46, Gavin Shan wrote:
>>> Hi Michael,
>>>
>>> On 1/19/22 3:39 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Jan 19, 2022 at 09:05:51AM +0800, Gavin Shan wrote:
>>>>> This enables virtio-mem device support by allowing to enable the
>>>>> corresponding kernel config option (CONFIG_VIRTIO_MEM) on the
>>>>> architecture.
>>>>>
>>>>> Signed-off-by: Gavin Shan 
>>>>> Acked-by: David Hildenbrand 
>>>>> Acked-by: Jonathan Cameron 
>>>>> Acked-by: Michael S. Tsirkin 
>>>>> ---
>>>>> v3: Pick ack-by tags from Jonathan and Michael
>>>>> ---
>>>>>   drivers/virtio/Kconfig | 7 ---
>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>>>> index 34f80b7a8a64..74c8b0c7bc33 100644
>>>>> --- a/drivers/virtio/Kconfig
>>>>> +++ b/drivers/virtio/Kconfig
>>>>> @@ -106,7 +106,7 @@ config VIRTIO_BALLOON
>>>>>   config VIRTIO_MEM
>>>>>   tristate "Virtio mem driver"
>>>>>   default m
>>>>> - depends on X86_64
>>>>> + depends on X86_64 || ARM64
>>>>>   depends on VIRTIO
>>>>>   depends on MEMORY_HOTPLUG
>>>>>   depends on MEMORY_HOTREMOVE
>>>>> @@ -116,8 +116,9 @@ config VIRTIO_MEM
>>>>>This driver provides access to virtio-mem paravirtualized 
>>>>> memory
>>>>>devices, allowing to hotplug and hotunplug memory.
>>>>>   
>>>>> -  This driver was only tested under x86-64, but should theoretically
>>>>> -  work on all architectures that support memory hotplug and hotremove.
>>>>> +  This driver was only tested under x86-64 and arm64, but should
>>>>> +  theoretically work on all architectures that support memory hotplug
>>>>> +  and hotremove.
>>>>>   
>>>>
>>>> BTW isn't there a symbol saying "memory hotplug" that we can depend on?
>>>>
>>
>> You mean
>>
>>  depends on MEMORY_HOTPLUG
>>  depends on MEMORY_HOTREMOVE
>>
>> We still need a manual opt-in from architectures, because devil's in the
>> detail. (e.g., memblock stuff we had to adjust)
> 
> Is there any chance of documenting some of this here? The current comment 
> makes it
> look like it's just a question of whitelisting an architecture.

I can send a patch to add more details.

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-02-02 Thread David Hildenbrand
On 02.02.22 13:18, Oscar Salvador wrote:
> On Wed, Jan 19, 2022 at 02:06:19PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Enable set_migratetype_isolate() to check specified sub-range for
>> unmovable pages during isolation. Page isolation is done
>> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
>> pages within that granularity are intended to be isolated. For example,
>> alloc_contig_range(), which uses page isolation, allows ranges without
>> alignment. This commit makes unmovable page check only look for
>> interesting pages, so that page isolation can succeed for any
>> non-overlapping ranges.
> 
> Another thing that came to my mind.
> Prior to this patch, has_unmovable_pages() was checking on pageblock
> granularity, starting at pfn#0 of the pageblock.
> With this patch, you no longer check on pageblock granularity, which
> means you might isolate a pageblock, but some pages that sneaked in
> might actually be unmovable.
> 
> E.g:
> 
> Let's say you have a pageblock that spans (pfn#512,pfn#1024),
> and you pass alloc_contig_range() (pfn#514,pfn#1024).
> has_unmovable_pages() will start checking the pageblock at pfn#514,
> and so it will miss pfn#512 and pfn#513. Isn't that a problem, if those
> pfns turn out to be actually unmovable?

That's the whole idea for being able to allocate parts of an unmovable
pageblock that are movable.

If the first part is unmovable but the second part is movable, nothing
should stop us from trying to allocate the second part.

Of course, we want to remember the original migratetype of the
pageblock, to restore that after isolation -- otherwise we'll end up
converting all such pageblocks to MIGRATE_MOVABLE. The next patch does
that IIRC.

However, devil is in the detail, and I still have to review those parts
of this series.


Note that there are no current users of alloc_contig_range() that
allocate < MAX_ORDER - 1 -- except CMA, but for CMA we immediately exit
has_unmovable_pages() either way.

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm/balloon_compaction: make balloon page compaction callbacks static

2022-01-25 Thread David Hildenbrand
On 25.01.22 14:22, Miaohe Lin wrote:
> Since commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
> feature"), these functions are called via balloon_aops callbacks. They're
> not called directly outside this file. So make them static and clean up
> the relevant code.
> 
> Signed-off-by: Miaohe Lin 

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH v3] drivers/virtio: Enable virtio mem for ARM64

2022-01-19 Thread David Hildenbrand
On 19.01.22 08:46, Gavin Shan wrote:
> Hi Michael,
> 
> On 1/19/22 3:39 PM, Michael S. Tsirkin wrote:
>> On Wed, Jan 19, 2022 at 09:05:51AM +0800, Gavin Shan wrote:
>>> This enables virtio-mem device support by allowing to enable the
>>> corresponding kernel config option (CONFIG_VIRTIO_MEM) on the
>>> architecture.
>>>
>>> Signed-off-by: Gavin Shan 
>>> Acked-by: David Hildenbrand 
>>> Acked-by: Jonathan Cameron 
>>> Acked-by: Michael S. Tsirkin 
>>> ---
>>> v3: Pick ack-by tags from Jonathan and Michael
>>> ---
>>>   drivers/virtio/Kconfig | 7 ---
>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>> index 34f80b7a8a64..74c8b0c7bc33 100644
>>> --- a/drivers/virtio/Kconfig
>>> +++ b/drivers/virtio/Kconfig
>>> @@ -106,7 +106,7 @@ config VIRTIO_BALLOON
>>>   config VIRTIO_MEM
>>> tristate "Virtio mem driver"
>>> default m
>>> -   depends on X86_64
>>> +   depends on X86_64 || ARM64
>>> depends on VIRTIO
>>> depends on MEMORY_HOTPLUG
>>> depends on MEMORY_HOTREMOVE
>>> @@ -116,8 +116,9 @@ config VIRTIO_MEM
>>>  This driver provides access to virtio-mem paravirtualized memory
>>>  devices, allowing to hotplug and hotunplug memory.
>>>   
>>> -This driver was only tested under x86-64, but should theoretically
>>> -work on all architectures that support memory hotplug and hotremove.
>>> +This driver was only tested under x86-64 and arm64, but should
>>> +theoretically work on all architectures that support memory hotplug
>>> +and hotremove.
>>>   
>>
>> BTW isn't there a symbol saying "memory hotplug" that we can depend on?
>>

You mean

depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE

We still need a manual opt-in from architectures, because devil's in the
detail. (e.g., memblock stuff we had to adjust)

-- 
Thanks,

David / dhildenb



Re: [PATCH] drivers/virtio: Enable virtio mem for ARM64

2022-01-18 Thread David Hildenbrand
On 18.01.22 11:43, Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2022 at 09:38:21AM +0100, David Hildenbrand wrote:
>> On 18.01.22 02:34, Gavin Shan wrote:
>>> This enables virtio-mem device support by allowing to enable the
>>> corresponding kernel config option (CONFIG_VIRTIO_MEM) on the
>>> architecture.
>>>
>>> Signed-off-by: Gavin Shan 
>>> ---
>>>  drivers/virtio/Kconfig | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>> index 34f80b7a8a64..bf3f6ebdaa3b 100644
>>> --- a/drivers/virtio/Kconfig
>>> +++ b/drivers/virtio/Kconfig
>>> @@ -106,7 +106,7 @@ config VIRTIO_BALLOON
>>>  config VIRTIO_MEM
>>> tristate "Virtio mem driver"
>>> default m
>>> -   depends on X86_64
>>> +   depends on X86_64 || ARM64
>>> depends on VIRTIO
>>> depends on MEMORY_HOTPLUG
>>> depends on MEMORY_HOTREMOVE
>>
>> With MEMBLOCK_DRIVER_MANAGED in place upstream, kexec should be fine.
>>
>>
>> Can you adjust/rephrase the comment as well? Like
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 34f80b7a8a64..88028ca01c8f 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -116,8 +116,9 @@ config VIRTIO_MEM
>>  This driver provides access to virtio-mem paravirtualized memory
>>  devices, allowing to hotplug and hotunplug memory.
>>  
>> -This driver was only tested under x86-64, but should theoretically
>> -work on all architectures that support memory hotplug and hotremove.
>> +This driver was only tested under x86-64 and arm64, but should
>> +theoretically work on all architectures that support memory hotplug 
>> and
>> +hotremove.
>>  
>>  If unsure, say M.
>>  
>>
>>
>> Acked-by: David Hildenbrand 
> 
> with this:
> Acked-by: Michael S. Tsirkin 
> 
> I guess I will merge this? It's a small change so - let's go for this
> release straight away?

Yes, no need to wait for the next cycle.


-- 
Thanks,

David / dhildenb



Re: [PATCH] drivers/virtio: Enable virtio mem for ARM64

2022-01-18 Thread David Hildenbrand
On 18.01.22 02:34, Gavin Shan wrote:
> This enables virtio-mem device support by allowing to enable the
> corresponding kernel config option (CONFIG_VIRTIO_MEM) on the
> architecture.
> 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 34f80b7a8a64..bf3f6ebdaa3b 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -106,7 +106,7 @@ config VIRTIO_BALLOON
>  config VIRTIO_MEM
>   tristate "Virtio mem driver"
>   default m
> - depends on X86_64
> + depends on X86_64 || ARM64
>   depends on VIRTIO
>   depends on MEMORY_HOTPLUG
>   depends on MEMORY_HOTREMOVE

With MEMBLOCK_DRIVER_MANAGED in place upstream, kexec should be fine.


Can you adjust/rephrase the comment as well? Like

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 34f80b7a8a64..88028ca01c8f 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -116,8 +116,9 @@ config VIRTIO_MEM
 This driver provides access to virtio-mem paravirtualized memory
 devices, allowing to hotplug and hotunplug memory.
 
-This driver was only tested under x86-64, but should theoretically
-work on all architectures that support memory hotplug and hotremove.
+This driver was only tested under x86-64 and arm64, but should
+theoretically work on all architectures that support memory hotplug and
+    hotremove.
 
 If unsure, say M.
 


Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH] virtio_mem: break device on remove

2022-01-17 Thread David Hildenbrand
On 17.01.22 09:40, Michael S. Tsirkin wrote:
> On Mon, Jan 17, 2022 at 09:31:56AM +0100, David Hildenbrand wrote:
>> On 17.01.22 08:55, Michael S. Tsirkin wrote:
>>> On Mon, Jan 17, 2022 at 02:40:11PM +0800, Jason Wang wrote:
>>>>
>>>> 在 2022/1/15 上午5:43, Michael S. Tsirkin 写道:
>>>>> A common pattern for device reset is currently:
>>>>> vdev->config->reset(vdev);
>>>>> .. cleanup ..
>>>>>
>>>>> reset prevents new interrupts from arriving and waits for interrupt
>>>>> handlers to finish.
>>>>>
>>>>> However if - as is common - the handler queues a work request which is
>>>>> flushed during the cleanup stage, we have code adding buffers / trying
>>>>> to get buffers while device is reset. Not good.
>>>>>
>>>>> This was reproduced by running
>>>>>   modprobe virtio_console
>>>>>   modprobe -r virtio_console
>>>>> in a loop, and this reasoning seems to apply to virtio mem though
>>>>> I could not reproduce it there.
>>>>>
>>>>> Fix this up by calling virtio_break_device + flush before reset.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin 
>>>>> ---
>>>>>   drivers/virtio/virtio_mem.c | 2 ++
>>>>>   1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
>>>>> index 38becd8d578c..33b8a118a3ae 100644
>>>>> --- a/drivers/virtio/virtio_mem.c
>>>>> +++ b/drivers/virtio/virtio_mem.c
>>>>> @@ -2888,6 +2888,8 @@ static void virtio_mem_remove(struct virtio_device 
>>>>> *vdev)
>>>>>   virtio_mem_deinit_hotplug(vm);
>>>>>   /* reset the device and cleanup the queues */
>>>>> + virtio_break_device(vdev);
>>>>> + flush_work(&vm->wq);
>>>>
>>>>
>>>> We set vm->removing to true and call cancel_work_sync() in
>>>> We set vm->removing to true and call cancel_work_sync() in
>>>> virtio_mem_deinit_hotplug(). Isn't it sufficient?
>>>>
>>>> Thanks
>>>
>>>
>>> Hmm I think you are right. David, I will drop this for now.
>>> Up to you to consider whether some central capability will be
>>> helpful as a replacement for the virtio-mem specific "removing" flag.
>>
>> It's all a bit tricky because we also have to handle pending timers and
>> pending memory onlining/offlining operations in a controlled way. Maybe
>> we could convert to virtio_break_device() and use the
>> &dev->vqs_list_lock as a replacement for the removal_lock. However, I'm
>> not 100% sure if it's nice to use that lock from
>> drivers/virtio/virtio_mem.c directly.
> 
> We could add an API if you like. Or maybe it makes sense to add a
> separate one that lets you find out that removal started. Need to figure
> out how to handle suspend too then ...
> Generally we have these checks that device is not going away
> sprinkled over all drivers and I don't like it, but
> it's not easy to build a sane API to handle it, especially
> for high speed things when we can't take locks because performance.

The interesting case might be how to handle virtio_mem_retry(), whereby
we conditionally queue work if !removing.

Having that said, in an ideal world we could deny removal requests for
virtio_mem completely if there is still any memory added to the system ...


-- 
Thanks,

David / dhildenb


Re: [PATCH] virtio_mem: break device on remove

2022-01-17 Thread David Hildenbrand
On 17.01.22 08:55, Michael S. Tsirkin wrote:
> On Mon, Jan 17, 2022 at 02:40:11PM +0800, Jason Wang wrote:
>>
>> 在 2022/1/15 上午5:43, Michael S. Tsirkin 写道:
>>> A common pattern for device reset is currently:
>>> vdev->config->reset(vdev);
>>> .. cleanup ..
>>>
>>> reset prevents new interrupts from arriving and waits for interrupt
>>> handlers to finish.
>>>
>>> However if - as is common - the handler queues a work request which is
>>> flushed during the cleanup stage, we have code adding buffers / trying
>>> to get buffers while device is reset. Not good.
>>>
>>> This was reproduced by running
>>> modprobe virtio_console
>>> modprobe -r virtio_console
>>> in a loop, and this reasoning seems to apply to virtio mem though
>>> I could not reproduce it there.
>>>
>>> Fix this up by calling virtio_break_device + flush before reset.
>>>
>>> Signed-off-by: Michael S. Tsirkin 
>>> ---
>>>   drivers/virtio/virtio_mem.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
>>> index 38becd8d578c..33b8a118a3ae 100644
>>> --- a/drivers/virtio/virtio_mem.c
>>> +++ b/drivers/virtio/virtio_mem.c
>>> @@ -2888,6 +2888,8 @@ static void virtio_mem_remove(struct virtio_device 
>>> *vdev)
>>> virtio_mem_deinit_hotplug(vm);
>>> /* reset the device and cleanup the queues */
>>> +   virtio_break_device(vdev);
>>> +   flush_work(&vm->wq);
>>
>>
>> We set vm->removing to true and call cancel_work_sync() in
>> virtio_mem_deinit_hotplug(). Isn't it sufficient?
>>
>> Thanks
> 
> 
> Hmm I think you are right. David, I will drop this for now.
> Up to you to consider whether some central capability will be
> helpful as a replacement for the virtio-mem specific "removing" flag.

It's all a bit tricky because we also have to handle pending timers and
pending memory onlining/offlining operations in a controlled way. Maybe
we could convert to virtio_break_device() and use the
&dev->vqs_list_lock as a replacement for the removal_lock. However, I'm
not 100% sure if it's nice to use that lock from
drivers/virtio/virtio_mem.c directly.
-- 
Thanks,

David / dhildenb


Re: [RFC PATCH v3 7/8] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.

2022-01-14 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> alloc_contig_range() now only needs to be aligned to pageblock_order,
> drop virtio_mem size requirement that it needs to be the max of
> pageblock_order and MAX_ORDER.
> 
> Signed-off-by: Zi Yan 
> ---
>  drivers/virtio/virtio_mem.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index a6a78685cfbe..2664dc16d0f9 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2481,8 +2481,7 @@ static int virtio_mem_init_hotplug(struct virtio_mem 
> *vm)
>* - Is required for now for alloc_contig_range() to work reliably -
>*   it doesn't properly handle smaller granularity on ZONE_NORMAL.
>*/

Please also update this comment.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 5/8] mm: page_isolation: check specified range for unmovable pages during isolation.

2022-01-14 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.

Are you handling the case where we start checking in the middle of a compound
page and actually have to look up the head to figure out whether it is movable
or not?

> 
> has_unmovable_pages() is moved to mm/page_isolation.c since it is only
> used by page isolation.

Please move that into a separate patch upfront, makes this patch much
easier to review.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.

2022-01-13 Thread David Hildenbrand
On 13.01.22 16:46, Zi Yan wrote:
> On 12 Jan 2022, at 6:04, David Hildenbrand wrote:
> 
>> On 05.01.22 22:47, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> alloc_migration_target() is used by alloc_contig_range() and non-LRU
>>> movable compound pages can be migrated. Current code does not allocate the
>>> right page size for such pages. Check THP precisely using
>>> is_transparent_huge() and add allocation support for non-LRU compound
>>> pages.
>>
>> IIRC, we don't have any non-lru migratable pages that are compound
>> pages. Read: not used and not supported :)
> 
> OK, but nothing prevents one writing a driver that allocates compound
> pages and provides address_space->migratepage() and 
> address_space->isolate_page().
> 
> Actually, to test this series, I write a kernel module that allocates
> an order-10 page, gives it a fake address_space with migratepage() and
> isolate_page(), __SetPageMovable() on it, then call alloc_contig_range()
> on the page range. Apparently, my kernel module is not supported by
> the kernel, thus, I added this patch.
> 
> Do you have an alternative test to my kernel module, so that I do not
> even need this patch myself?
> 
>> Why is this required in the context of this series?
> 
> It might not be required. I will drop it.

That's why I think it would be best dropping it. If you need it in
different context, better submit it in different context.

Makes this series easier to digest :)


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.

2022-01-13 Thread David Hildenbrand
On 13.01.22 12:36, Mike Rapoport wrote:
> On Wed, Jan 12, 2022 at 11:54:49AM +0100, David Hildenbrand wrote:
>> On 05.01.22 22:47, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
>>> It prepares for the upcoming removal of the MAX_ORDER-1 alignment
>>> requirement for CMA and alloc_contig_range().
>>>
>>> MIGRATE_HIGHATOMIC should not merge with other migratetypes like
>>> MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.
>>> Also add MIGRATE_HIGHATOMIC to the fallbacks array for completeness.
>>>
>>> [1] 
>>> https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/
>>>
>>> Signed-off-by: Zi Yan 
>>> ---
>>>  include/linux/mmzone.h |  6 ++
>>>  mm/page_alloc.c| 28 ++--
>>>  2 files changed, 24 insertions(+), 10 deletions(-)
>>>
> 
> ...
> 
>>> @@ -3545,8 +3553,8 @@ int __isolate_free_page(struct page *page, unsigned 
>>> int order)
>>> struct page *endpage = page + (1 << order) - 1;
>>> for (; page < endpage; page += pageblock_nr_pages) {
>>> int mt = get_pageblock_migratetype(page);
>>> -   if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
>>> -   && !is_migrate_highatomic(mt))
>>> +   /* Only change normal pageblock */
>>> +   if (migratetype_has_fallback(mt))
>>> set_pageblock_migratetype(page,
>>>   MIGRATE_MOVABLE);
>>> }
>>
>> That part is a nice cleanup IMHO. Although the "has fallback" part is a
>> bit imprecise. "migratetype_is_mergable()" might be a bit clearer.
>> ideally "migratetype_is_mergable_with_other_types()". Can we come up
>> with a nice name for that?
> 
> migratetype_is_mergable() kinda implies "_with_other_types", no?
> 
> I like migratetype_is_mergable() more than _has_fallback().
> 
> My $0.02 to bikeshedding :)

:)

Yeah, for me migratetype_is_mergable() would also be good enough. I
think I was at first thinking one could mistake it with a dedicated
migratetype. But such functions are historically called

is_migrate_cma()/is_migrate_isolate().

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> alloc_migration_target() is used by alloc_contig_range() and non-LRU
> movable compound pages can be migrated. Current code does not allocate the
> right page size for such pages. Check THP precisely using
> is_transparent_huge() and add allocation support for non-LRU compound
> pages.

IIRC, we don't have any non-lru migratable pages that are compound
pages. Read: not used and not supported :)

Why is this required in the context of this series?


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 2/8] mm: compaction: handle non-lru compound pages properly in isolate_migratepages_block().

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> In isolate_migratepages_block(), a !PageLRU tail page can be encountered
> when the page is larger than a pageblock. Use compound head page for the
> checks inside and skip the entire compound page when isolation succeeds.
> 

This will currently never happen, due to the way we always isolate
MAX_ORDER -1 ranges, correct?

Better note that in the patch description, because currently it reads
like it's an actual fix "can be encountered".

> Signed-off-by: Zi Yan 
> ---
>  mm/compaction.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b4e94cda3019..ad9053fbbe06 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -979,19 +979,23 @@ isolate_migratepages_block(struct compact_control *cc, 
> unsigned long low_pfn,
>* Skip any other type of page
>*/
>   if (!PageLRU(page)) {
> + struct page *head = compound_head(page);
>   /*
>* __PageMovable can return false positive so we need
>* to verify it under page_lock.
>*/
> - if (unlikely(__PageMovable(page)) &&
> - !PageIsolated(page)) {
> + if (unlikely(__PageMovable(head)) &&
> + !PageIsolated(head)) {
>   if (locked) {
>   unlock_page_lruvec_irqrestore(locked, 
> flags);
>   locked = NULL;
>   }
>  
> - if (!isolate_movable_page(page, isolate_mode))
> + if (!isolate_movable_page(head, isolate_mode)) {
> + low_pfn += (1 << compound_order(head)) 
> - 1 - (page - head);
> + page = head;
>   goto isolate_success;
> + }
>   }
>  
>   goto isolate_fail;


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
> It prepares for the upcoming removal of the MAX_ORDER-1 alignment
> requirement for CMA and alloc_contig_range().
> 
> MIGRATE_HIGHATOMIC should not merge with other migratetypes like
> MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.
> Also add MIGRATE_HIGHATOMIC to the fallbacks array for completeness.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/
> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/mmzone.h |  6 ++
>  mm/page_alloc.c| 28 ++--
>  2 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..0aa549653e4e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -83,6 +83,12 @@ static inline bool is_migrate_movable(int mt)
>   return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
>  }
>  
> +/* See fallbacks[MIGRATE_TYPES][3] in page_alloc.c */
> +static inline bool migratetype_has_fallback(int mt)
> +{
> + return mt < MIGRATE_PCPTYPES;
> +}
> +
>  #define for_each_migratetype_order(order, type) \
>   for (order = 0; order < MAX_ORDER; order++) \
>   for (type = 0; type < MIGRATE_TYPES; type++)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dd6399bafb5..5193c953dbf8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1042,6 +1042,12 @@ buddy_merge_likely(unsigned long pfn, unsigned long 
> buddy_pfn,
>   return page_is_buddy(higher_page, higher_buddy, order + 1);
>  }
>  
> +static inline bool has_non_fallback_pageblock(struct zone *zone)
> +{
> + return has_isolate_pageblock(zone) || zone_cma_pages(zone) != 0 ||
> + zone->nr_reserved_highatomic != 0;
> +}

Due to zone_cma_pages(), the unlikely() below will be very wrong on many
setups. Previously, isolation really was a corner case. CMA and
highatomic are less of a corner case ...

I'm not even sure if this check is worth having around anymore at all,
or if it would be easier and cheaper to just always check the both
migration types unconditionally. Would certainly simplify the code.

Side note: we actually care about has_free_non_fallback_pageblock(), we
can only merge with free pageblocks. But that might not necessarily be
cheaper to test/track/check.

> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1117,14 +1123,15 @@ static inline void __free_one_page(struct page *page,
>   }
>   if (order < MAX_ORDER - 1) {
>   /* If we are here, it means order is >= pageblock_order.
> -  * We want to prevent merge between freepages on isolate
> -  * pageblock and normal pageblock. Without this, pageblock
> -  * isolation could cause incorrect freepage or CMA accounting.
> +  * We want to prevent merge between freepages on pageblock
> +  * without fallbacks and normal pageblock. Without this,
> +  * pageblock isolation could cause incorrect freepage or CMA
> +  * accounting or HIGHATOMIC accounting.
>*
>* We don't want to hit this code for the more frequent
>* low-order merging.
>*/
> - if (unlikely(has_isolate_pageblock(zone))) {
> + if (unlikely(has_non_fallback_pageblock(zone))) {
>   int buddy_mt;
>  
>   buddy_pfn = __find_buddy_pfn(pfn, order);
> @@ -1132,8 +1139,8 @@ static inline void __free_one_page(struct page *page,
>   buddy_mt = get_pageblock_migratetype(buddy);
>  
>   if (migratetype != buddy_mt
> - && (is_migrate_isolate(migratetype) ||
> - is_migrate_isolate(buddy_mt)))
> + && 
> (!migratetype_has_fallback(migratetype) ||
> + 
> !migratetype_has_fallback(buddy_mt)))
>   goto done_merging;
>   }
>   max_order = order + 1;
> @@ -2484,6 +2491,7 @@ static int fallbacks[MIGRATE_TYPES][3] = {
>   [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   
> MIGRATE_TYPES },
>   [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, 
> MIGRATE_TYPES },
>   [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   
> MIGRATE_TYPES },
> + [MIGRATE_HIGHATOMIC] = { MIGRATE_TYPES }, /* Never used */
>  #ifdef CONFIG_CMA
>   [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
>  #endif
> @@ -2795,8 +2803,8 @@ static void reserve_highatomic_pageblock(struct page 
> *page, struct zone *zone,
>  
>   /* Yoink! */
>   mt = get_pageblock_migratetype(page);
> - if (!is_migrate_highatomic(mt) && !is_migrate_isolate(m

Re: [RFC PATCH v2 0/7] Use pageblock_order for cma and alloc_contig_range alignment.

2021-12-10 Thread David Hildenbrand
On 10.12.21 00:04, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi all,

Hi,

thanks for working on that!

> 
> This patchset tries to remove the MAX_ORDER - 1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make 
> MAX_ORDER
> adjustable at boot time[1].
> 
> The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range()
> isolates pageblocks to remove free memory from buddy allocator but isolating
> only a subset of pageblocks within a page spanning across multiple pageblocks
> causes free page accounting issues. Isolated page might not be put into the
> right free list, since the code assumes the migratetype of the first pageblock
> as the whole free page migratetype. This is based on the discussion at [2].
> 
> To remove the requirement, this patchset:
> 1. still isolates pageblocks at MAX_ORDER - 1 granularity;
> 2. but saves the pageblock migratetypes outside the specified range of
>alloc_contig_range() and restores them after all pages within the range
>become free after __alloc_contig_migrate_range();
> 3. splits free pages spanning multiple pageblocks at the beginning and the end
>of the range and puts the split pages to the right migratetype free lists
>based on the pageblock migratetypes;
> 4. returns pages not in the range as it did before this patch.
> 
> Isolation needs to happen at MAX_ORDER - 1 granularity, because otherwise
> 1) extra code is needed to detect pages (free, PageHuge, THP, or PageCompound)
> to make sure all pageblocks belonging to a single page are isolated together 
> and later pageblocks outside the range need to have their migratetypes 
> restored;
> or 2) extra logic will need to be added during page free time to split a free
> page with multi-migratetype pageblocks.
> 
> Two optimizations might come later:
> 1. only check unmovable pages within the range instead of MAX_ORDER - 1 
> aligned
>range during isolation to increase successful rate of alloc_contig_range().

The issue with virtio-mem is that we'll need that as soon as we change
the granularity to pageblocks, because otherwise, you can heavily
degrade unplug reliability in sane setups:

Previous:
* Try unplug free 4M range (2 pageblocks): succeeds

Now:
* Try unplug 2M range (first pageblock): succeeds.
* Try unplug next 2M range (second pageblock): fails because first
contains unmovable allocations.

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 2/2] virtio-mem: prepare fake page onlining code for granularity smaller than MAX_ORDER - 1

2021-12-09 Thread David Hildenbrand
Hi Eric,

thanks for the review!

>>  if (PageDirty(page)) {
>> -virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
>> -  false);
>> -generic_online_page(page, MAX_ORDER - 1);
>> +virtio_mem_clear_fake_offline(pfn + i, 1 << order, 
>> false);
>> +generic_online_page(page, order);
>>  } else {
>> -virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
>> -  true);
>> -free_contig_range(pfn + i, max_nr_pages);
>> -adjust_managed_page_count(page, max_nr_pages);
>> +virtio_mem_clear_fake_offline(pfn + i, 1 << order, 
>> true);
>> +free_contig_range(pfn + i, 1 << order);
>> +adjust_managed_page_count(page, 1 << order);
> In the loop, pfn + i and 1 << order are repeatedly calculated: 1 << order
> is the step size, pfn + i is each step position.
> Better to compute the number once per iteration?

The compiler better be smart enough to calculate such constants once :)

> 
> LGTM.
> Reviewed-by: Eric Ren 

Thanks!


-- 
Thanks,

David / dhildenb


Re: [PATCH v1 0/2] virtio-mem: prepare for granularity smaller than MAX_ORDER - 1

2021-12-01 Thread David Hildenbrand
On 01.12.21 00:56, Michael S. Tsirkin wrote:
> On Fri, Nov 26, 2021 at 02:42:07PM +0100, David Hildenbrand wrote:
>> The virtio-mem driver currently supports logical hot(un)plug in
>> MAX_ORDER - 1 granularity (4MiB on x86-64) or bigger. We want to support
>> pageblock granularity (2MiB on x86-64), to make hot(un)plug even more
>> flexible, and to improve hotunplug when using ZONE_NORMAL.
>>
>> With pageblock granularity, we then have a granularity comparable to
>> hugepage ballooning. Further, there are ideas to increase MAX_ORDER, so
>> we really want to decouple it from MAX_ORDER.
>>
>> While ZONE_MOVABLE should mostly work already, alloc_contig_range() still
>> needs work to be able to properly handle pageblock granularity on
>> ZONE_NORMAL. This support is in the works [1], so let's prepare
>> virtio-mem for supporting smaller granularity than MAX_ORDER - 1.
> 
> is there value to merging this seprately? or should this just
> be part of that patchset?
> 

The value would be to give it additional testing ahead of time. But we
could just carry it along. Whatever you prefer. (I'd suggest merging it
right away)

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 0/2] virtio-mem: prepare for granularity smaller than MAX_ORDER - 1

2021-11-30 Thread David Hildenbrand
On 29.11.21 17:47, Zi Yan wrote:
> On 26 Nov 2021, at 8:42, David Hildenbrand wrote:
> 
>> The virtio-mem driver currently supports logical hot(un)plug in
>> MAX_ORDER - 1 granularity (4MiB on x86-64) or bigger. We want to support
>> pageblock granularity (2MiB on x86-64), to make hot(un)plug even more
>> flexible, and to improve hotunplug when using ZONE_NORMAL.
>>
>> With pageblock granularity, we then have a granularity comparable to
>> hugepage ballooning. Further, there are ideas to increase MAX_ORDER, so
>> we really want to decouple it from MAX_ORDER.
>>
>> While ZONE_MOVABLE should mostly work already, alloc_contig_range() still
>> needs work to be able to properly handle pageblock granularity on
>> ZONE_NORMAL. This support is in the works [1], so let's prepare
>> virtio-mem for supporting smaller granularity than MAX_ORDER - 1.
>>
>> Tested with ZONE_MOVABLE after removing the MAX_ORDER - 1 granularity
>> limitation in virtio-mem, and using different device block sizes (2MiB,
>> 4MiB, 8MiB).
>>
>> [1] https://lkml.kernel.org/r/2025193725.737539-1-zi@sent.com
> 
> The patchset looks good to me. Thanks. Reviewed-by: Zi Yan 

Thanks a lot!

-- 
Thanks,

David / dhildenb



[PATCH v1 2/2] virtio-mem: prepare fake page onlining code for granularity smaller than MAX_ORDER - 1

2021-11-26 Thread David Hildenbrand
Let's prepare our fake page onlining code for subblock size smaller than
MAX_ORDER - 1: we might get called for ranges not covering properly
aligned MAX_ORDER - 1 pages. We have to detect the order to use
dynamically.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 03e1c5743699..50de7582c9f8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1121,15 +1121,18 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
  */
 static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 {
-   const unsigned long max_nr_pages = MAX_ORDER_NR_PAGES;
+   unsigned long order = MAX_ORDER - 1;
unsigned long i;
 
/*
-* We are always called at least with MAX_ORDER_NR_PAGES
-* granularity/alignment (e.g., the way subblocks work). All pages
-* inside such a block are alike.
+* We might get called for ranges that don't cover properly aligned
+* MAX_ORDER - 1 pages; however, we can only online properly aligned
+* pages with an order of MAX_ORDER - 1 at maximum.
 */
-   for (i = 0; i < nr_pages; i += max_nr_pages) {
+   while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
+   order--;
+
+   for (i = 0; i < nr_pages; i += 1 << order) {
struct page *page = pfn_to_page(pfn + i);
 
/*
@@ -1139,14 +1142,12 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 * alike.
 */
if (PageDirty(page)) {
-   virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
- false);
-   generic_online_page(page, MAX_ORDER - 1);
+   virtio_mem_clear_fake_offline(pfn + i, 1 << order, false);
+   generic_online_page(page, order);
} else {
-   virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
- true);
-   free_contig_range(pfn + i, max_nr_pages);
-   adjust_managed_page_count(page, max_nr_pages);
+   virtio_mem_clear_fake_offline(pfn + i, 1 << order, true);
+   free_contig_range(pfn + i, 1 << order);
+   adjust_managed_page_count(page, 1 << order);
}
}
 }
@@ -2477,7 +2478,6 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
/*
 * We want subblocks to span at least MAX_ORDER_NR_PAGES and
 * pageblock_nr_pages pages. This:
-* - Simplifies our fake page onlining code (virtio_mem_fake_online).
 * - Is required for now for alloc_contig_range() to work reliably -
 *   it doesn't properly handle smaller granularity on ZONE_NORMAL.
 */
-- 
2.31.1



[PATCH v1 1/2] virtio-mem: prepare page onlining code for granularity smaller than MAX_ORDER - 1

2021-11-26 Thread David Hildenbrand
Let's prepare our page onlining code for subblock size smaller than
MAX_ORDER - 1: we'll get called for a MAX_ORDER - 1 page but might have
some subblocks in the range plugged and some unplugged. In that case,
fallback to subblock granularity to properly only expose the plugged
parts to the buddy.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 86 ++---
 1 file changed, 62 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 96e5a8782769..03e1c5743699 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1228,28 +1229,46 @@ static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
page_ref_inc(pfn_to_page(pfn + i));
 }
 
-static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
+static void virtio_mem_online_page(struct virtio_mem *vm,
+  struct page *page, unsigned int order)
 {
-   const unsigned long addr = page_to_phys(page);
-   unsigned long id, sb_id;
-   struct virtio_mem *vm;
+   const unsigned long start = page_to_phys(page);
+   const unsigned long end = start + PFN_PHYS(1 << order);
+   unsigned long addr, next, id, sb_id, count;
bool do_online;
 
-   rcu_read_lock();
-   list_for_each_entry_rcu(vm, &virtio_mem_devices, next) {
-   if (!virtio_mem_contains_range(vm, addr, PFN_PHYS(1 << order)))
-   continue;
+   /*
+* We can get called with any order up to MAX_ORDER - 1. If our
+* subblock size is smaller than that and we have a mixture of plugged
+* and unplugged subblocks within such a page, we have to process in
+* smaller granularity. In that case we'll adjust the order exactly once
+* within the loop.
+*/
+   for (addr = start; addr < end; ) {
+   next = addr + PFN_PHYS(1 << order);
 
if (vm->in_sbm) {
-   /*
-* We exploit here that subblocks have at least
-* MAX_ORDER_NR_PAGES size/alignment - so we cannot
-* cross subblocks within one call.
-*/
id = virtio_mem_phys_to_mb_id(addr);
sb_id = virtio_mem_phys_to_sb_id(vm, addr);
-   do_online = virtio_mem_sbm_test_sb_plugged(vm, id,
-  sb_id, 1);
+   count = virtio_mem_phys_to_sb_id(vm, next - 1) - sb_id + 1;
+
+   if (virtio_mem_sbm_test_sb_plugged(vm, id, sb_id, count)) {
+   /* Fully plugged. */
+   do_online = true;
+   } else if (count == 1 ||
+  virtio_mem_sbm_test_sb_unplugged(vm, id, sb_id, count)) {
+   /* Fully unplugged. */
+   do_online = false;
+   } else {
+   /*
+* Mixture, process sub-blocks instead. This
+* will be at least the size of a pageblock.
+* We'll run into this case exactly once.
+*/
+   order = ilog2(vm->sbm.sb_size) - PAGE_SHIFT;
+   do_online = virtio_mem_sbm_test_sb_plugged(vm, id, sb_id, 1);
+   continue;
+   }
} else {
/*
 * If the whole block is marked fake offline, keep
@@ -1260,18 +1279,38 @@ static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
VIRTIO_MEM_BBM_BB_FAKE_OFFLINE;
}
 
+   if (do_online)
+   generic_online_page(pfn_to_page(PFN_DOWN(addr)), order);
+   else
+   virtio_mem_set_fake_offline(PFN_DOWN(addr), 1 << order,
+   false);
+   addr = next;
+   }
+}
+
+static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
+{
+   const unsigned long addr = page_to_phys(page);
+   struct virtio_mem *vm;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(vm, &virtio_mem_devices, next) {
/*
-* virtio_mem_set_fake_offline() might sleep, we don't need
-* the device anymore. See virtio_mem_remove() how races
+* Pages we're onlining will never cross memory blocks and,
+* therefore, not virtio-mem devices

[PATCH v1 0/2] virtio-mem: prepare for granularity smaller than MAX_ORDER - 1

2021-11-26 Thread David Hildenbrand
The virtio-mem driver currently supports logical hot(un)plug in
MAX_ORDER - 1 granularity (4MiB on x86-64) or bigger. We want to support
pageblock granularity (2MiB on x86-64), to make hot(un)plug even more
flexible, and to improve hotunplug when using ZONE_NORMAL.

With pageblock granularity, we then have a granularity comparable to
hugepage ballooning. Further, there are ideas to increase MAX_ORDER, so
we really want to decouple it from MAX_ORDER.

While ZONE_MOVABLE should mostly work already, alloc_contig_range() still
needs work to be able to properly handle pageblock granularity on
ZONE_NORMAL. This support is in the works [1], so let's prepare
virtio-mem for supporting smaller granularity than MAX_ORDER - 1.

Tested with ZONE_MOVABLE after removing the MAX_ORDER - 1 granularity
limitation in virtio-mem, and using different device block sizes (2MiB,
4MiB, 8MiB).

[1] https://lkml.kernel.org/r/2025193725.737539-1-zi@sent.com

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Zi Yan 
Cc: Gavin Shan 
Cc: Hui Zhu 
Cc: Eric Ren 
Cc: Sebastien Boeuf 
Cc: Pankaj Gupta 
Cc: Wei Yang 
Cc: virtualization@lists.linux-foundation.org
Cc: linux...@kvack.org

David Hildenbrand (2):
  virtio-mem: prepare page onlining code for granularity smaller than
MAX_ORDER - 1
  virtio-mem: prepare fake page onlining code for granularity smaller
than MAX_ORDER - 1

 drivers/virtio/virtio_mem.c | 110 
 1 file changed, 74 insertions(+), 36 deletions(-)

-- 
2.31.1



Re: [PATCH AUTOSEL 5.15 7/7] virtio-mem: support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-26 Thread David Hildenbrand
On 26.11.21 03:30, Sasha Levin wrote:
> From: David Hildenbrand 
> 
> [ Upstream commit 61082ad6a6e1f999eef7e7e90046486c87933b1e ]
> 
> The initial virtio-mem spec states that while unplugged memory should not
> be read, the device still has to allow for reading unplugged memory inside
> the usable region. The primary motivation for this default handling was
> to simplify bringup of virtio-mem, because there were corner cases where
> Linux might have accidentally read unplugged memory inside added Linux
> memory blocks.
> 
> In the meantime, we:
> 1. Removed /dev/kmem in commit bbcd53c96071 ("drivers/char: remove
>/dev/kmem for good")
> 2. Disallowed access to virtio-mem device memory via /dev/mem in
>commit 2128f4e21aa2 ("virtio-mem: disallow mapping virtio-mem memory via
>/dev/mem")
> 3. Sanitized access to virtio-mem device memory via /proc/kcore in
>commit 0daa322b8ff9 ("fs/proc/kcore: don't read offline sections,
>logically offline pages and hwpoisoned pages")
> 4. Sanitized access to virtio-mem device memory via /proc/vmcore in
>commit ce2814622e84 ("virtio-mem: kdump mode to sanitize /proc/vmcore
>access")
> 
> "Accidential" access to unplugged memory is no longer possible; we can
> support the new VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE feature that will be
> required by some hypervisors implementing virtio-mem in the near future.
> 
> Acked-by: Michael S. Tsirkin 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Marek Kedzierski 
> Cc: Hui Zhu 
> Cc: Sebastien Boeuf 
> Cc: Pankaj Gupta 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/virtio/virtio_mem.c | 1 +
>  include/uapi/linux/virtio_mem.h | 9 ++---
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index bef8ad6bf4661..78dfdc9c98a1c 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2758,6 +2758,7 @@ static unsigned int virtio_mem_features[] = {
>  #if defined(CONFIG_NUMA) && defined(CONFIG_ACPI_NUMA)
>   VIRTIO_MEM_F_ACPI_PXM,
>  #endif
> + VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE,
>  };
>  
>  static const struct virtio_device_id virtio_mem_id_table[] = {
> diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h
> index 70e01c687d5eb..e9122f1d0e0cb 100644
> --- a/include/uapi/linux/virtio_mem.h
> +++ b/include/uapi/linux/virtio_mem.h
> @@ -68,9 +68,10 @@
>   * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).
>   *
>   * There are no guarantees what will happen if unplugged memory is
> - * read/written. Such memory should, in general, not be touched. E.g.,
> - * even writing might succeed, but the values will simply be discarded at
> - * random points in time.
> + * read/written. In general, unplugged memory should not be touched, because
> + * the resulting action is undefined. There is one exception: without
> + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, unplugged memory inside the usable
> + * region can be read, to simplify creation of memory dumps.
>   *
>   * It can happen that the device cannot process a request, because it is
>   * busy. The device driver has to retry later.
> @@ -87,6 +88,8 @@
>  
>  /* node_id is an ACPI PXM and is valid */
>  #define VIRTIO_MEM_F_ACPI_PXM 0
> +/* unplugged memory must not be accessed */
> +#define VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE  1
>  
>  
>  /* --- virtio-mem: guest -> host requests --- */
> 

As 2. and 4. are part of v5.16-rc1 but not v5.15-stable

Nacked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH AUTOSEL 5.10 3/4] virtio-mem: support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-26 Thread David Hildenbrand
On 26.11.21 03:30, Sasha Levin wrote:
> From: David Hildenbrand 
> 
> [ Upstream commit 61082ad6a6e1f999eef7e7e90046486c87933b1e ]
> 
> The initial virtio-mem spec states that while unplugged memory should not
> be read, the device still has to allow for reading unplugged memory inside
> the usable region. The primary motivation for this default handling was
> to simplify bringup of virtio-mem, because there were corner cases where
> Linux might have accidentally read unplugged memory inside added Linux
> memory blocks.
> 
> In the meantime, we:
> 1. Removed /dev/kmem in commit bbcd53c96071 ("drivers/char: remove
>/dev/kmem for good")
> 2. Disallowed access to virtio-mem device memory via /dev/mem in
>commit 2128f4e21aa2 ("virtio-mem: disallow mapping virtio-mem memory via
>/dev/mem")
> 3. Sanitized access to virtio-mem device memory via /proc/kcore in
>commit 0daa322b8ff9 ("fs/proc/kcore: don't read offline sections,
>logically offline pages and hwpoisoned pages")
> 4. Sanitized access to virtio-mem device memory via /proc/vmcore in
>commit ce2814622e84 ("virtio-mem: kdump mode to sanitize /proc/vmcore
>access")

As 2. and 4. are part of v5.16-rc1 but not v5.10-stable

Nacked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages

2021-11-24 Thread David Hildenbrand
 I'd appreciate a more generic approach for user space to figure out the
 "initial memory size" in a virtualized environment than adding
 some module parameter to virtio-balloon -- if that makes sense.

 MemTotal as is expresses how much memory the buddy currently manages,
 for example, excluding early allocations during boot, excluding actually
 unplugged memory and excluding logically unplugged memory. Adjusting that
 value makes perfect sense for virtio-balloon without deflate-on-oom.

>>
>> That's a definition of how MemTotal is implemented, but it's not
>> really a specification of the MemTotal API. The closest thing to a
>> real specification I can find is "Total usable RAM (i.e., physical RAM
>> minus a few reserved bits and the kernel binary code)", from the proc
>> man pages. I think there is quite a bit of leeway in changing how
>> exactly MemTotal is implemented without violating the (quite vague)
>> specification or changing any observable semantics of the API. In
>> particular, leaving the pages in the balloon as part of MemTotal is
>> essentially indistinguishable from simply having a non-OOM killable
>> process locking an equivalent amount of memory. So this proposal isn't
>> really introducing any fundamentally new behavior to the Linux kernel.

How to indicate MemTotal completely depends on the intended semantics.
Using balloon inflation to logically unplug memory vs. some kind of
co-operational memory management with the hypervisor.

For co-operational management I would strongly advise using free page
reporting instead if possible. It can't drain the pagecache so far, but
there are approaches being discussed on how to make that happen (e.g.,
using DAMON, or avoiding the guest page cache using virtio-pmem).

>>
> It's also not clear to me what utility the extra information would
> provide to userspace. If userspace wants to know how much memory is
> available, they should use MemAvailable. If userspace wants to have a
> rough estimate for the maximum amount of memory in the system, they
> would add together MemTotal and "logically offline pages". The value
> of MemTotal with a no-deflate-on-oom virtio-balloon is a value with a
> vague meaning that lies somewhere between the "maximum amount of
> memory" and the "current amount of memory". I don't really see any
> situations where it should clearly be used over one of MemAvailable or
> MemTotal + "logically offline pages".

 The issue is that any application that relies on MemTotal in a virtualized
 environment is most probably already suboptimal in some cases. You can
 rely on it and actually later someone will unplug (inflate balloon)
 memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
 because what about two applications that rely on that information at
 the same time?

>>>
>>> BTW, the general issue here is that "we don't know what the hypervisor
>>> will do".
>>
>> I do agree that this is a significant problem. I would expand on it a
>> bit more, to be "since we don't know what the hypervisor will do, we
>> don't know how to treat memory in the balloon". The proposed module
>> parameter is more or less a mechanism to allow the system
>> administrator to tell the virtio_balloon driver how the hypervisor
>> behaves.
> 
> 
> Now that you put it that way, it looks more like this should
> be a feature bit not a module parameter.

It will be slightly better. At least the hypervisor can indicate what
it's intending to do.

>> And if the hypervisor will give memory back to the guest when
>> the guest needs it, then I don't think it's not necessary to logically
>> unplug the memory.
> 
> Ideally we would also pair this with sending a signal to device
> that memory is needed.

Such approaches are in general problematic because once the guest is
already OOM, the hypervisor will most likely not react in time and it's
essentially too late.

So you need some policy somewhere that monitors memory consumption and
makes smart decisions. Usually this is implemented in the hypervisor by
monitoring VM stats.

IMHO the device is the wrong place. I recently discussed something
similar offline with potential virtio-mem users.

> 
>> It might be a bit cleaner to explicitly address this in the
>> virtio_balloon protocol. We could add a min_num_pages field to the
>> balloon config, with semantics along the lines of "The host will
>> respond to memory pressure in the guest by deflating the balloon down
>> to min_num_pages, unless it would cause system instability in the
>> host". Given that feature, I think it would be reasonable to only
>> consider min_num_pages as logically unplugged.
> 
> Okay. I think I would do it a bit differently though, make num_pages be
> the min_num_pages, and add an extra_num_pages field for memory that is
> nice to have but ok to drop.
> 
> 
> As long as we are here, can we add a page_shift field pl

Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-23 Thread David Hildenbrand
On 17.11.21 04:04, Zi Yan wrote:
> On 16 Nov 2021, at 3:58, David Hildenbrand wrote:
> 
>> On 15.11.21 20:37, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> Hi David,
>>
>> Hi,
>>
>> thanks for looking into this.
>>

Hi,

sorry for the delay, I wasn't "actually working" last week, so now I'm
back from holiday :)

>>>
>>> You suggested to make alloc_contig_range() deal with pageblock_order 
>>> instead of
>>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
>>> patchset is my attempt to achieve that. Please take a look and let me know 
>>> if
>>> I am doing it correctly or not.
>>>
>>> From what my understanding, cma required alignment of
>>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was 
>>> introduced,
>>> __free_one_page() does not prevent merging two different pageblocks, when
>>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() 
>>> implementation
>>> does prevent that. It should be OK to just align cma to pageblock_order.
>>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
>>> pageblock_order as alignment too.
>>
>> I wonder if that's sufficient. Especially the outer_start logic in
>> alloc_contig_range() might be problematic. There are some ugly corner
>> cases with free pages/allocations spanning multiple pageblocks and we
>> only isolated a single pageblock.
> 
> Thank you a lot for writing the list of these corner cases. They are
> very helpful!
> 
>>
>>
>> Regarding CMA, we have to keep the following cases working:
>>
>> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER
>>- 1 page:
>>[   MAX_ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
>> but not both. We have to make sure alloc_contig_range() keeps working
>> correctly. This should be the case even with your change, as we won't
>> merge pages across differing migratetypes.
> 
> Yes.
> 
>>
>> b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
>>[   MAX_ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume both are MIGRATE_CMA. Assume we want to either allocate from
>> pageblock 0 or pageblock 1. Especially, assume we want to allocate from
>> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
>> pageblock 0.
>>
>> What happens if we either have a free page spanning the MAX_ORDER - 1
>> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
>> in a free MAX_ORDER - 1 page of which only the second pageblock is
>> isolated? We would end up essentially freeing a page that has mixed
>> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
>> might be wrong but I have the feeling that this would be problematic.
>>
> 
> This could happen when start_isolate_page_range() stumbles upon a compound
> page with order >= pageblock_order or a free page with order >=
> pageblock_order, but should not. start_isolate_page_range() should check
> the actual page size, either compound page size or free page size, and set
> the migratetype across pageblocks if the page is bigger than pageblock size.
> More precisely set_migratetype_isolate() should do that.

Right -- but then we have to extend the isolation range from within
set_migratetype_isolate() :/ E.g., how should we know what we have to
unisolate later ..

> 
> 
>> c) Concurrent allocations:
>> [   MAX_ORDER - 1 ]
>> [ pageblock 0 | pageblock 1]
>>
>> Assume b) but we have two concurrent CMA allocations to pageblock 0 and
>> pageblock 1, which would now be possible as start_isolate_page_range()
>> isolate would succeed on both.
> 
> Two isolations will be serialized by the zone lock taken by
> set_migratetype_isolate(), so the concurrent allocation would not be a 
> problem.
> If it is a MAX_ORDER-1 free page, the first comer should split it and only
> isolate one of the pageblock then second one can isolate the other pageblock.

Right, the issue is essentially b) above.

> If it is a MAX_ORDER-1 compound page, the first comer should isolate both
> pageblocks, then the second one would fail. WDYT?

Possibly we could even have two independent CMA areas "colliding" within
a MAX_ORDER - 1 page. I guess the surprise would be getting an
"-EAGAIN" from isolation, but the caller should properly handle that.

May
