[virtio-dev] Re: [PATCH v4 00/20] Split device spec to its individual files

2023-01-13 Thread David Hildenbrand

On 12.01.23 22:22, Parav Pandit wrote:

Several of the more recent device specifications are maintained
in their own specification files. Such separate files enable better
maintenance of the specification overall.
However, several of the initial virtio device specifications
are located in a single file.

Hence, split them into their individual files.

Additionally, each device's driver and device conformance is
present in one giant conformance file all together.

As Michael suggested, move these device and driver conformance
sections next to the device specification in each device-specific
directory. This makes each device specification more self-contained.



Yeah, that looks much cleaner now, thanks!

--
Thanks,

David / dhildenb


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 06/20] virtio-mem-balloon: Maintain mem balloon device spec in separate directory

2023-01-13 Thread David Hildenbrand

On 12.01.23 22:22, Parav Pandit wrote:

Move virtio memory balloon device specification to its own file
similar to recent virtio devices.
While at it, place the device specification and its driver and device
conformance into its own directory to make the device specification
self-contained.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/153
Signed-off-by: Parav Pandit 


Nit: Subject s/virtio-mem-balloon/virtio-balloon/

Thanks!

--
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH v3 06/20] virtio-mem-balloon: Maintain mem balloon device spec in separate directory

2023-01-11 Thread David Hildenbrand

On 11.01.23 16:01, Parav Pandit wrote:

Hi David,


Hi Parav,




From: David Hildenbrand 
Sent: Wednesday, January 11, 2023 9:14 AM
To: Parav Pandit ; m...@redhat.com; virtio-dev@lists.oasis-open.org; coh...@redhat.com
Cc: virtio-comm...@lists.oasis-open.org
Subject: Re: [virtio-dev] [PATCH v3 06/20] virtio-mem-balloon: Maintain mem balloon device spec in separate directory

On 11.01.23 00:03, Parav Pandit wrote:

Move virtio memory balloon device specification to its own file
similar to recent virtio devices.
While at it, place the device specification and its driver and device
conformance into its own directory to make the device specification
self-contained.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/153
Signed-off-by: Parav Pandit 



There is virtio-mem and there is virtio-balloon. Calling virtio-balloon
"virtio-mem-balloon" can easily lead to quite some confusion. Any
particular reason why not to stick to "virtio-balloon" ?


Historically, the memory balloon driver in Linux lives in virtio_balloon.c


See below. id=5 has widespread "virtio-balloon" terminology use. id=13 
is what creates confusion.



In the virtio spec, the device type is named "Traditional memory balloon
device".
So, I named the directory close to the actual spec content name.
Adding legacy/traditional was too long. :)
Maybe virtio-mem-legacy is better to differentiate between the legacy and the new mem
device?


As it has nothing to do with virtio-mem, that would be confusing. Also, 
legacy doesn't quite catch the semantics.




In this patchset, directories are named with "virtio-" prefix such as 
virtio-pmem, virtio-sound.

Another option (which I prefer as I write now) is,
How about we drop "virtio-" prefix in the directory name because this is the 
virtio spec.

And have names as
device-types/sound
device-types/legacy-mem-balloon
device-types/mem
device-types/pmem

This is short and covers balloon part too?


Looking at 
https://lore.kernel.org/all/20220516204913.542894-71-...@redhat.com/


We seem to have virtio-balloon (id=5) and virtio-mem-balloon (id=13).

virtio-balloon is what's actually implemented and used. "Traditional" is 
a bit misleading here.


IMHO, we could/should

* Name it "balloon" here
* Make "id=13" reserved and remove the notion of "memory balloon" from
  the spec
* Call "id=5" "Memory Balloon" and remove the notion of "Traditional".
  It's the one that exists.

@MST?

--
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH v3 06/20] virtio-mem-balloon: Maintain mem balloon device spec in separate directory

2023-01-11 Thread David Hildenbrand

On 11.01.23 00:03, Parav Pandit wrote:

Move virtio memory balloon device specification to its own file
similar to recent virtio devices.
While at it, place the device specification and its driver and device
conformance into its own directory to make the device specification
self-contained.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/153
Signed-off-by: Parav Pandit 



There is virtio-mem and there is virtio-balloon. Calling virtio-balloon 
"virtio-mem-balloon" can easily lead to quite some confusion. Any 
particular reason why not to stick to "virtio-balloon" ?


--
Thanks,

David / dhildenb





Re: [virtio-dev] Re: [PATCH 1/4] content: Introduce driver/device auxiliary notifications

2022-08-11 Thread David Hildenbrand
On 11.08.22 16:09, Halil Pasic wrote:
> On Thu, 11 Aug 2022 10:53:35 +0200
> Cornelia Huck  wrote:
> 
>> On Wed, Aug 10 2022, Halil Pasic  wrote:
>>
>>> On Wed, 10 Aug 2022 11:54:35 +0200
>>> Cornelia Huck  wrote:
>>>  
>> These device-specific notifications are needed later when adding support
>> for virtio-vhost-user device.
>>
>> Signed-off-by: Usama Arif 
>> Signed-off-by: Stefan Hajnoczi 
>> Signed-off-by: Nikos Dragazis 
>
> I see ccw is missing. Cornelia, any suggestions?

 Hmm... I seem to be really behind on ccw things :(

 We can probably use the following:

 - for device->driver notification, use the next bit in the secondary
   indicators (bit 0 is used for config change notification)
 - for driver->device notification, maybe use a new subcode for diagnose
   0x500 (4 is probably the next free one?)
   
>>>
>>> Sounds reasonable! I will have to double check the DIAG stuff though. I'm
>>> not sure where what needs to be reserved and documented.   
>>
>> There's
>> https://gitlab.com/davidhildenbrand/s390x-os-virt-spec/-/merge_requests/1,
>> but nothing much has happened there recently. David?

Heh, I'd like to say it has high priority on my todo list ... but there
is a lot of stuff with even higher priority. I'll definitely come back to
this soonish.

> 
> Hm, that seems to be private. I land on a sign-in page.

Oh, it was private. It's public now.


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH] Virtio-balloon: add user space API for sizing

2022-02-15 Thread David Hildenbrand
On 14.02.22 20:59, Kameron Lutes wrote:
> This new linux API will allow user space applications to directly
> control the size of the virtio-balloon. This is useful in
> situations where the guest must quickly respond to drastically
> increased memory pressure and cannot wait for the host to adjust
> the balloon's size.
> 
> Under the current wording of the Virtio spec, guest driven
> behavior such as this is permitted:
> 
> VIRTIO Version 1.1 Section 5.5.6
> "The device is driven either by the receipt of a configuration
> change notification, or by changing guest memory needs, such as
> performing memory compaction or responding to out of memory
> conditions."

Not quite. num_pages is determined by the hypervisor only and the guest
is not expected to change it, and if it does, it's ignored.

5.5.6 does not indicate at all that the guest may change it or that it
would have any effect. num_pages is examined only, actual is updated by
the driver.

5.5.6.1 documents what's allowed, e.g.,

  The driver SHOULD supply pages to the balloon when num_pages is
  greater than the actual number of pages in the balloon.

  The driver MAY use pages from the balloon when num_pages is less than
  the actual number of pages in the balloon.

and special handling for VIRTIO_BALLOON_F_DEFLATE_ON_OOM.

Especially, we have

  The driver MUST update actual after changing the number of pages in
  the balloon.

  The driver MAY update actual once after multiple inflate and deflate
  operations.

That's also why QEMU never syncs back the num_pages value from the guest
when writing the config.


Current spec does not allow for what you propose.
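
The rules quoted above can be sketched in C roughly as follows. This is
only an illustration under stated assumptions, not driver code: the
struct and all helper names (balloon_cfg, balloon_pages_to_inflate,
balloon_pages_may_deflate, balloon_after_resize) are made up for this
example. The point is that num_pages is only ever read by the driver,
while actual is the one value it writes back.

```c
#include <stdint.h>

/* Hypothetical in-memory view of the two config fields discussed above. */
struct balloon_cfg {
    uint32_t num_pages; /* target set by the device; the driver only reads it */
    uint32_t actual;    /* pages currently in the balloon; written by the driver */
};

/* Pages the driver SHOULD still supply to the balloon, 0 if none. */
static uint32_t balloon_pages_to_inflate(const struct balloon_cfg *cfg)
{
    return cfg->num_pages > cfg->actual ? cfg->num_pages - cfg->actual : 0;
}

/* Pages the driver MAY take back from the balloon, 0 if none. */
static uint32_t balloon_pages_may_deflate(const struct balloon_cfg *cfg)
{
    return cfg->actual > cfg->num_pages ? cfg->actual - cfg->num_pages : 0;
}

/*
 * After inflating or deflating, the driver reports the new size through
 * actual. It never writes num_pages.
 */
static void balloon_after_resize(struct balloon_cfg *cfg, uint32_t pages_in_balloon)
{
    cfg->actual = pages_in_balloon;
}
```

In this model a guest-chosen balloon size has no field to live in, which
is the gap a spec update would have to fill.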


> 
> The intended use case for this API is one where the host
> communicates a deflation limit to the guest. The guest may then
> choose to respond to memory pressure by deflating its balloon down
> to the guest's allowable limit.

It would be good to have a full proposal and a proper spec update. I'd
assume you'd want separate values for soft vs. hard num_values -- if
that's what we really want.

BUT

There seems to be recent interest in handling memory pressure in a
better way (although how to really detect "serious memory pressure" vs
"ordinary reclaim" in Linux is still to be figured out). There is
already a discussion going on how that could happen. Adding random user
space toggles might not be the best idea. We might want a single
mechanism to achieve that.

https://lists.oasis-open.org/archives/virtio-comment/202201/msg00139.html

-- 
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH RESEND] virtio-pmem: PMEM device spec

2021-08-04 Thread David Hildenbrand

On 04.08.21 13:07, Stefan Hajnoczi wrote:

On Wed, Jul 28, 2021 at 05:04:35PM +0200, Pankaj Gupta wrote:

Posting virtio specification for virtio pmem device. Virtio pmem is a
paravirtualized device which allows the guest to bypass page cache.
Virtio pmem kernel driver is merged in Upstream Kernel 5.3. Also, Qemu
device is merged in Qemu 4.1.

Signed-off-by: Pankaj Gupta 
---
Sorry, it took me a long time to get back to this. There is
an enhancement to this spec by "Taylor Stark", CCed on the list.
Requesting feedback and merging.

RFC is posted here [1]
[1] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html


I skimmed through the review comments but pretty much reviewed this
patch from scratch. Feel free to ignore questions that others have
already raised.



  conformance.tex |  19 ++-
  content.tex |   1 +
  virtio-pmem.tex | 132 
  3 files changed, 150 insertions(+), 2 deletions(-)
  create mode 100644 virtio-pmem.tex

diff --git a/conformance.tex b/conformance.tex
index 94d7a06..818ddda 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -31,7 +31,8 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
  \ref{sec:Conformance / Driver Conformance / Sound Driver Conformance},
  \ref{sec:Conformance / Driver Conformance / Memory Driver Conformance},
  \ref{sec:Conformance / Driver Conformance / I2C Adapter Driver Conformance} or
-\ref{sec:Conformance / Driver Conformance / SCMI Driver Conformance}.
+\ref{sec:Conformance / Driver Conformance / SCMI Driver Conformance},
+\ref{sec:Conformance / Driver Conformance / PMEM Driver Conformance}.
  
  \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.

\end{itemize}
@@ -55,7 +56,8 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
  \ref{sec:Conformance / Device Conformance / Sound Device Conformance},
  \ref{sec:Conformance / Device Conformance / Memory Device Conformance},
  \ref{sec:Conformance / Device Conformance / I2C Adapter Device Conformance} or
-\ref{sec:Conformance / Device Conformance / SCMI Device Conformance}.
+\ref{sec:Conformance / Device Conformance / SCMI Device Conformance},
+\ref{sec:Conformance / Device Conformance / PMEM Driver Conformance}.
  
  \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}.

\end{itemize}
@@ -301,6 +303,19 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
  \item \ref{drivernormative:Device Types / SCMI Device / Device Operation / Setting Up eventq Buffers}
  \end{itemize}
  
+\conformance{\subsection}{PMEM Driver Conformance}\label{sec:Conformance / Driver Conformance / PMEM Driver Conformance}

+
+A PMEM driver MUST conform to the following normative statements:
+
+\begin{itemize}
+\item \ref{devicenormative:Device Types / PMEM Device / Device Initialization}
+\item \ref{drivernormative:Device Types / PMEM Driver / Driver Initialization / Direct access}
+\item \ref{drivernormative:Device Types / PMEM Driver / Driver Initialization / Virtio flush}
+\item \ref{drivernormative:Device Types / PMEM Driver / Driver Operation / Virtqueue command}
+\item \ref{devicenormative:Device Types / PMEM Device / Device Operation / Virtqueue flush}
+\item \ref{devicenormative:Device Types / PMEM Device / Device Operation / Virtqueue return}
+\end{itemize}
+
  \conformance{\section}{Device Conformance}\label{sec:Conformance / Device Conformance}
  
  A device MUST conform to the following normative statements:

diff --git a/content.tex b/content.tex
index ceb2562..6acc785 100644
--- a/content.tex
+++ b/content.tex
@@ -6583,6 +6583,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
  \input{virtio-mem.tex}
  \input{virtio-i2c.tex}
  \input{virtio-scmi.tex}
+\input{virtio-pmem.tex}
  
  \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
  
diff --git a/virtio-pmem.tex b/virtio-pmem.tex

new file mode 100644
index 000..a2b888e
--- /dev/null
+++ b/virtio-pmem.tex
@@ -0,0 +1,132 @@
+\section{PMEM Device}\label{sec:Device Types / PMEM Device}
+
+The virtio pmem is a fake persistent memory (NVDIMM) device


s/fake/virtual/ or drop "fake"? If the device persists data correctly
then it's not fake.


+used to bypass the guest page cache and provide a virtio
+based asynchronous flush mechanism.This avoids the need
+of a separate page cache in guest and keeps page cache only
+in the host. Under memory pressure, the host makes use of
+effecient memory reclaim decisions for page cache pages


s/effecient/efficient/


+of all the guests. This helps to reduce the memory footprint
+and fit more guests in the host system.


This explains the motivation for the device. It would also be nice to
explain the nature of the device:

   The virtio pmem device provides access to byte-addressable persistent
   memory. 

[virtio-dev] Re: [PATCH RESEND] virtio-pmem: PMEM device spec

2021-08-03 Thread David Hildenbrand

Driver SHOULD handle multiple FLUSH requests on the files present on
the Virtio pmem device.


Same here. I'm afraid this is not easy :(


hmm...

The driver SHOULD handle multiple FLUSH requests.


Do you want to say

The driver MUST be able to handle concurrent FLUSH requests.

?

--
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH RESEND] virtio-pmem: PMEM device spec

2021-08-03 Thread David Hildenbrand

On 03.08.21 10:26, Cornelia Huck wrote:

On Fri, Jul 30 2021, Pankaj Gupta  wrote:


+Also, configures a flush callback function with the corresponding region.


Not sure if that is too specific already... maybe something like "Also,
it configures a notification for when the corresponding region is flushed."?


Maybe will remove this line altogether as it is implementation
details?


Maybe... I think the point is to configure _something_, not sure if we
can really generalize that. Other ideas welcome.


See above for "flush callback". I'm mostly worrying about the wording
being generic enough (even though it's probably obvious enough for
non-Linux people as well.)


yes, Something below is better?

The driver MUST not enable any explicit FLUSH on the file memory
mapped from the Virtio pmem device


Hm, not sure. Would like to see feedback from others that had worked in
this area.



I think instead of describing detailed device handling in regard to 
e.g., fsync, we should document what the exact semantics are in POV of 
the driver when issuing a flush, and when it makes sense to issue a 
flush. The device is free to implement that however it likes (fsync, 
whatsoever).


Why do we care about "The driver MUST not enable any explicit FLUSH on 
the file memory mapped from the Virtio pmem device" and what exactly do 
we mean with "explicit FLUSH" here ?


--
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH RESEND v6 3/3] content: Document balloon feature free page reporting

2020-08-21 Thread David Hildenbrand
On 18.08.20 19:32, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page reporting is a feature that allows the guest to proactively
> report unused pages to the host. By making use of this feature it is
> possible to reduce the overall memory footprint of the guest in cases where
> some significant portion of the memory is idle. Add documentation for the
> free page reporting feature describing the functionality and requirements.
> 
> Reviewed-by: Cornelia Huck 
> Signed-off-by: Alexander Duyck 

Reviewed-by: David Hildenbrand 

Thanks!


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH RESEND v6 2/3] content: Document balloon feature page poison

2020-08-21 Thread David Hildenbrand
he VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated, update
> +  the \field{poison_val} configuration field.
>  
>  \item DRIVER_OK is set: device operation begins.
>  
> @@ -5494,6 +5506,41 @@ \subsubsection{Free Page Hinting}\label{sec:Device Types / Memory Balloon Device
>  endian of the guest rather than (necessarily when not using the legacy
>  interface) little-endian.
>  
> +\subsubsection{Page Poison}\label{sec:Device Types / Memory Balloon Device / Device Operation / Page Poison}
> +
> +Page Poison provides a way to notify the host that the guest is initializing
> +free pages with \field{poison_val}. When the feature is enabled, pages will
> +be immediately written to by the driver after deflating.
> +
> +If the guest is not initializing freed pages, the driver should reject the
> +VIRTIO_BALLOON_F_PAGE_POISON feature.
> +
> +If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the driver
> +will place the initialization value into the \field{poison_val}
> +configuration field data.
> +
> +\drivernormative{\paragraph}{Page Poison}{Device Types / Memory Balloon Device / Device Operation / Page Poison}
> +
> +Normative statements in this section apply if the
> +VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated.
> +
> +The driver MUST initialize the deflated pages with \field{poison_val} when
> +they are reused by the driver.
> +
> +The driver MUST populate the \field{poison_val} configuration data before
> +setting the DRIVER_OK bit.
> +
> +The driver MUST NOT modify \field{poison_val} while the DRIVER_OK bit is set.
> +
> +\devicenormative{\paragraph}{Page Poison}{Device Types / Memory Balloon Device / Device Operation / Page Poison}
> +
> +Normative statements in this section apply if the
> +VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated.
> +
> +The device MAY use the content of \field{poison_val} as a hint to guest
> +behavior.
> +>>>>>>> patched

^ looks strange

Apart from that

Reviewed-by: David Hildenbrand 

Thanks!
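
For readers following the driver requirement in the quoted text (pages
taken back out of the balloon are initialized with poison_val before
reuse), here is a minimal C sketch. It is an illustration only: the
page size, the function name sketch_poison_page and the word-wise fill
are assumptions made for this example and are not taken from the spec
text or from any driver.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this example */

/*
 * Fill one deflated page with the 32-bit pattern that was written to
 * the poison_val config field, before the page is handed out again.
 */
static void sketch_poison_page(uint32_t *page, uint32_t poison_val)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE / sizeof(uint32_t); i++) {
        page[i] = poison_val;
    }
}
```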
-- 
Thanks,

David / dhildenb


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [virtio-comment] [PATCH RESEND v6 1/3] content: Document balloon feature free page hints

2020-08-21 Thread David Hildenbrand
On 18.08.20 19:32, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page hints allow the balloon driver to provide information on what
> pages are not currently in use so that we can avoid the cost of copying
> them in migration scenarios. Add a feature description for free page hints
> describing basic functioning and requirements.
> 
> In working on this the specification has pointed out certain issues with the
> Linux driver and QEMU device implementation. The issues include:
> 1. The Linux driver does not re-initialize pages when it reuses them
> before receiving the "DONE" command, as such this can lead to possible data
> corruption.
> 2. The QEMU device is not returning the "DONE" command if a migration
> fails. This results in the guest holding onto pages until forced out by the
> shrinker.
> 
> There are also additional issues that have been found not related to the
> specification.
> 
> There is currently discussion on if the feature should be removed so this
> patch is a place-holder for if we decide to keep the feature and fix the
> issues. Otherwise this patch can be dropped and we can work on a patch to
> document the need to avoid the feature.
> 
> Signed-off-by: Alexander Duyck 

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH RESEND v6 0/3] virtio-spec: Add documentation for recently added balloon features

2020-08-18 Thread David Hildenbrand
On 18.08.20 19:32, Alexander Duyck wrote:
> I am resending this patch set with the hope of getting final reviews sorted
> out as I had no feedback on v6. If there are no further comments to be made
> I will create an issue and ask for inclusion of this patch set.
> 

You can open an issue right away and ask for inclusion in a week or so.
I'll try to have a look this week but I consider this good enough already :)

---
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v2] virtio-balloon: always indicate S_DONE when migration fails

2020-07-22 Thread David Hildenbrand
On 22.07.20 14:05, David Hildenbrand wrote:
> On 22.07.20 14:04, Michael S. Tsirkin wrote:
>> On Mon, Jun 29, 2020 at 10:06:15AM +0200, David Hildenbrand wrote:
>>> If something goes wrong during precopy, before stopping the VM, we will
>>> never send a S_DONE indication to the VM, resulting in the hinted pages
>>> not getting released to be used by the guest OS (e.g., Linux).
>>>
>>> Easy to reproduce:
>>> 1. Start migration (e.g., HMP "migrate -d 'exec:gzip -c > STATEFILE.gz'")
>>> 2. Cancel migration (e.g., HMP "migrate_cancel")
>>> 3. Observe in the guest (e.g., cat /proc/meminfo) that there is basically
>>>    no free memory left.
>>>
>>> While at it, add similar locking to virtio_balloon_free_page_done() as
>>> done in virtio_balloon_free_page_stop. Locking is still weird, but that
>>> has to be sorted out separately.
>>>
>>> There is nothing to do in the PRECOPY_NOTIFY_COMPLETE case. Add some
>>> comments regarding S_DONE handling.
>>>
>>> Fixes: c13c4153f76d ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
>>> Reviewed-by: Alexander Duyck 
>>> Cc: Wei Wang 
>>> Cc: Alexander Duyck 
>>> Signed-off-by: David Hildenbrand 
>>
>> IIUC this is superseded by Alexander's patches right?
> 
> Not that I know ... @Alex?
> 

Okay, I'm confused, that patch is already upstream (via your tree)?

dd8eeb9671fc ("virtio-balloon: always indicate S_DONE when migration fails")

Did you stumble over this mail by mistake again?

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v2] virtio-balloon: always indicate S_DONE when migration fails

2020-07-22 Thread David Hildenbrand
On 22.07.20 14:04, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2020 at 10:06:15AM +0200, David Hildenbrand wrote:
>> If something goes wrong during precopy, before stopping the VM, we will
>> never send a S_DONE indication to the VM, resulting in the hinted pages
>> not getting released to be used by the guest OS (e.g., Linux).
>>
>> Easy to reproduce:
>> 1. Start migration (e.g., HMP "migrate -d 'exec:gzip -c > STATEFILE.gz'")
>> 2. Cancel migration (e.g., HMP "migrate_cancel")
>> 3. Observe in the guest (e.g., cat /proc/meminfo) that there is basically
>>    no free memory left.
>>
>> While at it, add similar locking to virtio_balloon_free_page_done() as
>> done in virtio_balloon_free_page_stop. Locking is still weird, but that
>> has to be sorted out separately.
>>
>> There is nothing to do in the PRECOPY_NOTIFY_COMPLETE case. Add some
>> comments regarding S_DONE handling.
>>
>> Fixes: c13c4153f76d ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
>> Reviewed-by: Alexander Duyck 
>> Cc: Wei Wang 
>> Cc: Alexander Duyck 
>> Signed-off-by: David Hildenbrand 
> 
> IIUC this is superseded by Alexander's patches right?

Not that I know ... @Alex?

> If not pls rebase ...
> 



-- 
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH] virtio-balloon: Document byte ordering of poison_val

2020-07-20 Thread David Hildenbrand
On 13.07.20 22:35, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> The poison_val field in the virtio_balloon_config is treated as a
> little-endian field by the host. Since we are currently only having to deal
> with a single byte poison value this isn't a problem, however if the value
> should ever expand it would cause byte ordering issues. Document that in
> the code so that we know that if the value should ever expand we need to
> byte swap the value on big-endian architectures.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/virtio/virtio_balloon.c |5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 1f157d2f4952..d0fd8f8dc6ed 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -974,6 +974,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   /*
>* Let the hypervisor know that we are expecting a
>* specific value to be written back in balloon pages.
> +  *
> +  * If the PAGE_POISON value was larger than a byte we would
> +  * need to byte swap poison_val here to guarantee it is
> +  * little-endian. However for now it is a single byte so we
> +  * can pass it as-is.

Yeah, why not (although it's pretty fundamental that 1-byte values don't
need any swapping).

Acked-by: David Hildenbrand 
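
The byte-order point can be checked with a small C sketch. It assumes,
for illustration, that the 32-bit poison value is built by repeating a
single poison byte; the helper names sketch_poison_pattern and
sketch_bswap32 are made up here. Under that assumption the value reads
the same in either byte order, which is why no swap is needed today but
would be needed for a genuinely multi-byte value.

```c
#include <stdint.h>
#include <string.h>

/* Build a 32-bit pattern by repeating one byte (assumed layout). */
static uint32_t sketch_poison_pattern(uint8_t poison_byte)
{
    uint32_t val;

    memset(&val, poison_byte, sizeof(val));
    return val;
}

/* Plain 32-bit byte swap, as a big-endian CPU would apply for a little-endian field. */
static uint32_t sketch_bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}
```

Swapping a repeated-byte pattern such as 0xaaaaaaaa gives back the same
value, whereas a value like 0x11223344 changes.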


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v2 2/3] virtio-balloon: Add locking to prevent possible race when starting hinting

2020-07-10 Thread David Hildenbrand
On 06.07.20 23:14, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> There is already locking in place when we are stopping free page hinting
> but there are no similar protections in place when we start. I can only
> assume this was overlooked as in most cases the page hinting should not be
> occurring when we are starting the hinting, however there is still a chance
> we could be processing hints by the time we get back around to restarting
> the hinting so we are better off making sure to protect the state with the
> mutex lock rather than just updating the value with no protections.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  hw/virtio/virtio-balloon.c |4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 0c0fd7114799..b3e96a822b4d 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -593,6 +593,8 @@ static void virtio_balloon_free_page_start(VirtIOBalloon *s)
>  return;
>  }
>  
> +qemu_mutex_lock(&s->free_page_lock);
> +
>  if (s->free_page_report_cmd_id == UINT_MAX) {
>  s->free_page_report_cmd_id =
> VIRTIO_BALLOON_FREE_PAGE_REPORT_CMD_ID_MIN;
> @@ -601,6 +603,8 @@ static void virtio_balloon_free_page_start(VirtIOBalloon *s)
>  }
>  
>  s->free_page_report_status = FREE_PAGE_REPORT_S_REQUESTED;
> +qemu_mutex_unlock(&s->free_page_lock);
> +
> +
>  virtio_notify_config(vdev);
>  }
>  
> 

Yes, makes sense, thanks

Acked-by: David Hildenbrand 
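
As a standalone illustration of the locking pattern in the patch above,
here is a C sketch using a POSIX mutex instead of QEMU's wrappers. All
names are hypothetical stand-ins (sketch_balloon, SKETCH_CMD_ID_MIN,
sketch_free_page_start), the minimum command id value is invented, and
the else branch that increments the id is an assumption, since the
quoted hunk does not show it.

```c
#include <limits.h>
#include <pthread.h>

enum sketch_status { SKETCH_S_STOP, SKETCH_S_REQUESTED, SKETCH_S_DONE };

struct sketch_balloon {
    pthread_mutex_t lock;      /* stands in for free_page_lock */
    unsigned int cmd_id;       /* stands in for free_page_report_cmd_id */
    enum sketch_status status; /* stands in for free_page_report_status */
};

#define SKETCH_CMD_ID_MIN 2u /* hypothetical minimum command id */

static void sketch_free_page_start(struct sketch_balloon *s)
{
    /* Update command id and status together, under the same lock as the stop path. */
    pthread_mutex_lock(&s->lock);
    if (s->cmd_id == UINT_MAX) {
        s->cmd_id = SKETCH_CMD_ID_MIN; /* wrap around */
    } else {
        s->cmd_id++;                   /* assumed: not visible in the quoted hunk */
    }
    s->status = SKETCH_S_REQUESTED;
    pthread_mutex_unlock(&s->lock);
    /* The config-change notification would be sent here, outside the lock. */
}
```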

-- 
Thanks,

David / dhildenb





[virtio-dev] [PATCH v2] virtio-balloon: always indicate S_DONE when migration fails

2020-06-29 Thread David Hildenbrand
If something goes wrong during precopy, before stopping the VM, we will
never send a S_DONE indication to the VM, resulting in the hinted pages
not getting released to be used by the guest OS (e.g., Linux).

Easy to reproduce:
1. Start migration (e.g., HMP "migrate -d 'exec:gzip -c > STATEFILE.gz'")
2. Cancel migration (e.g., HMP "migrate_cancel")
3. Observe in the guest (e.g., cat /proc/meminfo) that there is basically
   no free memory left.

While at it, add similar locking to virtio_balloon_free_page_done() as
done in virtio_balloon_free_page_stop. Locking is still weird, but that
has to be sorted out separately.

There is nothing to do in the PRECOPY_NOTIFY_COMPLETE case. Add some
comments regarding S_DONE handling.

Fixes: c13c4153f76d ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Reviewed-by: Alexander Duyck 
Cc: Wei Wang 
Cc: Alexander Duyck 
Signed-off-by: David Hildenbrand 
---
 hw/virtio/virtio-balloon.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 10507b2a43..8a84718490 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -628,8 +628,13 @@ static void virtio_balloon_free_page_done(VirtIOBalloon *s)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
-s->free_page_report_status = FREE_PAGE_REPORT_S_DONE;
-virtio_notify_config(vdev);
+if (s->free_page_report_status != FREE_PAGE_REPORT_S_DONE) {
+/* See virtio_balloon_free_page_stop() */
+qemu_mutex_lock(&s->free_page_lock);
+s->free_page_report_status = FREE_PAGE_REPORT_S_DONE;
+qemu_mutex_unlock(&s->free_page_lock);
+virtio_notify_config(vdev);
+}
 }
 
 static int
@@ -653,17 +658,26 @@ virtio_balloon_free_page_report_notify(NotifierWithReturn *n, void *data)
 case PRECOPY_NOTIFY_SETUP:
 precopy_enable_free_page_optimization();
 break;
-case PRECOPY_NOTIFY_COMPLETE:
-case PRECOPY_NOTIFY_CLEANUP:
 case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
 virtio_balloon_free_page_stop(dev);
 break;
 case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
 if (vdev->vm_running) {
 virtio_balloon_free_page_start(dev);
-} else {
-virtio_balloon_free_page_done(dev);
+break;
 }
+/*
+ * Set S_DONE before migrating the vmstate, so the guest will reuse
+ * all hinted pages once running on the destination. Fall through.
+ */
+case PRECOPY_NOTIFY_CLEANUP:
+/*
+ * Especially, if something goes wrong during precopy or if migration
+ * is canceled, we have to properly communicate S_DONE to the VM.
+ */
+virtio_balloon_free_page_done(dev);
+break;
+case PRECOPY_NOTIFY_COMPLETE:
 break;
 default:
 virtio_error(vdev, "%s: %d reason unknown", __func__, pnd->reason);
-- 
2.26.2





[virtio-dev] [PATCH v1] virtio-balloon: always indicate S_DONE when migration fails

2020-06-26 Thread David Hildenbrand
If something goes wrong during precopy, before stopping the VM, we will
never send a S_DONE indication to the VM, resulting in the hinted pages
not getting released to be used by the guest OS (e.g., Linux).

Easy to reproduce:
1. Start migration (e.g., HMP "migrate -d 'exec:gzip -c > STATEFILE.gz'")
2. Cancel migration (e.g., HMP "migrate_cancel")
3. Observe in the guest (e.g., cat /proc/meminfo) that there is basically
   no free memory left.

While at it, add similar locking to virtio_balloon_free_page_done() as
done in virtio_balloon_free_page_stop(). Locking is still weird, but that
has to be sorted out separately.

There is nothing to do in the PRECOPY_NOTIFY_COMPLETE case. Add some
comments regarding S_DONE handling.

Fixes: c13c4153f76d ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Cc: Wei Wang 
Cc: Alexander Duyck 
Signed-off-by: David Hildenbrand 
---
 hw/virtio/virtio-balloon.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 10507b2a43..13ba208694 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -628,8 +628,13 @@ static void virtio_balloon_free_page_done(VirtIOBalloon *s)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
-    s->free_page_report_status = FREE_PAGE_REPORT_S_DONE;
-    virtio_notify_config(vdev);
+    if (s->free_page_report_status != FREE_PAGE_REPORT_S_DONE) {
+        /* See virtio_balloon_free_page_stop() */
+        qemu_mutex_lock(&s->free_page_lock);
+        s->free_page_report_status = FREE_PAGE_REPORT_S_DONE;
+        qemu_mutex_unlock(&s->free_page_lock);
+        virtio_notify_config(vdev);
+    }
 }
 
 static int
@@ -653,8 +658,6 @@ virtio_balloon_free_page_report_notify(NotifierWithReturn *n, void *data)
     case PRECOPY_NOTIFY_SETUP:
         precopy_enable_free_page_optimization();
         break;
-    case PRECOPY_NOTIFY_COMPLETE:
-    case PRECOPY_NOTIFY_CLEANUP:
     case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
         virtio_balloon_free_page_stop(dev);
         break;
@@ -662,9 +665,22 @@ virtio_balloon_free_page_report_notify(NotifierWithReturn *n, void *data)
         if (vdev->vm_running) {
             virtio_balloon_free_page_start(dev);
         } else {
+            /*
+             * Set S_DONE before migrating the vmstate, so the guest will reuse
+             * all hinted pages once running on the destination.
+             */
             virtio_balloon_free_page_done(dev);
         }
         break;
+    case PRECOPY_NOTIFY_CLEANUP:
+        /*
+         * Especially, if something goes wrong during precopy or if migration
+         * is canceled, we have to properly communicate S_DONE to the VM.
+         */
+        virtio_balloon_free_page_done(dev);
+        break;
+    case PRECOPY_NOTIFY_COMPLETE:
+        break;
     default:
         virtio_error(vdev, "%s: %d reason unknown", __func__, pnd->reason);
     }
-- 
2.26.2
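
For readers following the patch above, the notifier behavior it results in
can be summarized as a small standalone model. This is only a sketch read
off the diff: the enum names and on_precopy_notify() are hypothetical
stand-ins for illustration, not QEMU identifiers.

```c
#include <stdbool.h>

/* Hypothetical mirrors of the PRECOPY_NOTIFY_* reasons seen in the diff. */
enum precopy_reason {
    REASON_SETUP,
    REASON_BEFORE_BITMAP_SYNC,
    REASON_AFTER_BITMAP_SYNC,
    REASON_COMPLETE,
    REASON_CLEANUP
};

/* What the balloon device does in response (names are assumptions). */
enum balloon_action {
    ACT_ENABLE_OPTIMIZATION,
    ACT_STOP,
    ACT_START,
    ACT_DONE,
    ACT_NONE
};

static enum balloon_action on_precopy_notify(enum precopy_reason reason,
                                             bool vm_running)
{
    switch (reason) {
    case REASON_SETUP:
        return ACT_ENABLE_OPTIMIZATION;
    case REASON_BEFORE_BITMAP_SYNC:
        return ACT_STOP;
    case REASON_AFTER_BITMAP_SYNC:
        /* VM already stopped: signal S_DONE before the vmstate migrates. */
        return vm_running ? ACT_START : ACT_DONE;
    case REASON_CLEANUP:
        /* Failure or cancel path: the guest must still see S_DONE. */
        return ACT_DONE;
    case REASON_COMPLETE:
        return ACT_NONE;
    }
    return ACT_NONE;
}
```

The point of the model is the REASON_CLEANUP row: before the patch, cleanup
shared the "stop" handling, so a canceled migration never produced S_DONE.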





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-24 Thread David Hildenbrand



> On 24.06.2020 at 22:36, Michael S. Tsirkin wrote:
> 
> On Wed, Jun 24, 2020 at 06:01:02PM +0200, David Hildenbrand wrote:
>>> On 24.06.20 17:37, Michael S. Tsirkin wrote:
>>> On Wed, Jun 24, 2020 at 05:28:59PM +0200, David Hildenbrand wrote:
>>>>> So at the high level the idea was simple, we just clear the dirty bit
>>>>> when page is hinted, unless we sent a new command since. Implementation
>>>>> was reviewed by migration maintainers. If there's a consensus the code
>>>>> is written so badly we can't maintain it, maybe we should remove it.
>>>>> Which parts are unmaintainable in your eyes - migration or virtio ones?
>>>> 
>>>> QEMU implementation without a proper virtio specification. I hope that
>>>> we can *at least* finally document the expected behavior. Alex gave it a
>>>> shot, and I was hoping that Wei could jump in to clarify, help move this
>>>> forward ... after all he implemented (+designed?) the feature and the
>>>> virtio interface.
>>>> 
>>>>> Or maybe it's the general thing that interface was never specced
>>>>> properly.
>>>> 
>>>> Yes, a spec would be definitely a good starter ...
>>>> 
>>>> [...]
>>>> 
>>>>>> 
>>>>>> 1. If migration fails during RAM precopy, the guest will never receive a
>>>>>> DONE notification. Probably easy to fix.
>>>>>> 
>>>>>> 2. Unclear semantics. Alex tried to document what the actual semantics
>>>>>> of hinted pages are.
>>>>> 
>>>>> I'll reply to that now.
>>>>> 
>>>>>> Assume the following in the guest to a previously
>>>>>> hinted page
>>>>>> 
>>>>>> /* page was hinted and is reused now */
>>>>>> if (page[x] != Y)
>>>>>>     page[x] = Y;
>>>>>> /* migration ends, we now run on the destination */
>>>>>> BUG_ON(page[x] != Y);
>>>>>> /* BUG, because the content chan
>>>>> 
>>>>> The assumption hinting makes is that data in page is written to before
>>>>> it's used.
>>>>> 
>>>>> 
>>>>>> A guest can observe that. And that could be a random driver that just
>>>>>> allocated a page.
>>>>>> 
>>>>>> (I *assume* in Linux we might catch that using kasan, but I am not 100%
>>>>>> sure, also, the actual semantics to document are unclear - e.g., for
>>>>>> other guests)
>>>>> 
>>>>> I think it's basically simple: hinting means it's ok to
>>>>> fill page with trash unless it has been modified since the command
>>>>> ID supplied.
>>>> 
>>>> Yeah, I quite dislike the semantics, especially, as they are different
>>>> to well-known semantics as, e.g., represented by MADV_FREE. Getting changed
>>>> content when reading is really weird. But it seemed to be easier to
>>>> implement (low hanging fruit) and nobody complained back then. Well, now
>>>> we are stuck with it.
>>>> 
>>>> [..]
>>> 
>>> The difference with MADV_FREE is
>>> - asynchronous (using cmd id to synchronize)
>>> - zero not guaranteed
>>> 
>>> right?
>> 
>> *looking into man page*, yes, when reading you either get the old
>> content or zero.
>> 
>> (I remember that a re-read also makes the content stable, but looks like
>> you really have to write to a page)
>> 
>> We should most probably do what Alex suggested and initialize pages (at
>> least write a single byte) when leaking them from the shrinker in the
>> guest while hinting is active, such that the content is stable for
>> anybody to allocate and reuse a page.
> 
> Drivers ignore old content from slab though, so I don't really see
> the point.
> 

That's what we're hoping for and what we would expect. Maybe we should just
live with that assumption and hope for the best ...

>> -- 
>> Thanks,
>> 
>> David / dhildenb
> 





[virtio-dev] Re: [virtio-comment] [PATCH v4 3/3] content: Document balloon feature free page hints

2020-06-24 Thread David Hildenbrand
On 27.05.20 06:07, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page hints allow the balloon driver to provide information on what
> pages are not currently in use so that we can avoid the cost of copying
> them in migration scenarios. Add a feature description for free page hints
> describing basic functioning and requirements.
> 
> Working on the specification has pointed out certain issues with the
> Linux driver and QEMU device implementation. The issues include:
> 1. The Linux driver does not re-initialize pages when it reuses them
> before receiving the "DONE" command, as such this can lead to possible data
> corruption.
> 2. The QEMU device is not returning the "DONE" command if a migration
> fails. This results in the guest holding onto pages until forced out by the
> shrinker.
> 
> There are also additional issues that have been found not related to the
> specification.
> 
> There is currently discussion on if the feature should be removed so this
> patch is a place-holder for if we decide to keep the feature and fix the
> issues. Otherwise this patch can be dropped and we can work on a patch to
> document the need to avoid the feature.

Looks like the feature will stay; I hope we can document the expected
semantics reasonably well now and fix the remaining issues. After all, we
already spent quite some time reverse-engineering and fixing ...

[...]
>  
> +\subsubsection{Free Page Hinting}\label{sec:Device Types / Memory Balloon Device / Device Operation / Free Page Hinting}
> +
> +Free page hinting is designed to be used during migration to determine what
> +pages within the guest are currently unused so that they can be skipped over
> +while migrating the guest. The device will indicate that it is ready to start
> +performing hinting by setting the \field{free_page_hint_cmd_id} to one of the
> +non-reserved values that can be used as a command ID. The following values
> +are reserved:

Maybe mention somewhere (resulting from a discussion with Michael) that
the semantics of hinted pages are similar to MADV_FREE, except
- it's asynchronous (and the cmd_id is used to synchronize)
- when reading pages after they were hinted, the content is undefined
  (might be something besides the old content or zero).

Might help to understand what the semantics are.
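
To make the difference concrete, here is a toy model of the behavior
discussed in this thread: the dirty bit is cleared when a page is hinted,
and only dirty pages are copied on migration. All names are made up for
illustration, and the fill byte merely stands in for whatever undefined
content ends up on the destination.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Toy model of one guest page as tracked by a migrating hypervisor. */
struct page_model {
    unsigned char data[MODEL_PAGE_SIZE];
    bool dirty; /* set on guest writes, cleared when the page is hinted */
};

/* Hinting clears the dirty bit, so the page may be skipped. */
static void hint_page(struct page_model *p)
{
    p->dirty = false;
}

/* A guest write makes the page content stable again. */
static void guest_write(struct page_model *p, size_t off, unsigned char v)
{
    p->data[off] = v;
    p->dirty = true;
}

/* Copy dirty pages; clean (hinted) pages arrive with arbitrary 'fill'. */
static void migrate_page(const struct page_model *src, struct page_model *dst,
                         unsigned char fill)
{
    if (src->dirty) {
        memcpy(dst->data, src->data, MODEL_PAGE_SIZE);
    } else {
        memset(dst->data, fill, MODEL_PAGE_SIZE);
    }
    dst->dirty = false;
}
```

In this model, a guest that reads a hinted page, finds the expected value
and therefore skips the write loses that value on the destination, whereas
a guest that writes unconditionally keeps it. With MADV_FREE, the read
would have returned either the old content or zero.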

> +
> +\begin{description}
> +\item[VIRTIO_BALLOON_CMD_ID_STOP (0)] Any command ID previously supplied by
> +  the device is invalid. The driver should stop hinting free pages until a
> +  new command ID is supplied, but should not release any hinted pages for
> +  use by the guest.
> +
> +\item[VIRTIO_BALLOON_CMD_ID_DONE (1)] Any command ID previously supplied by
> +  the device is invalid. The driver should stop hinting free pages, and
> +  should release all hinted pages for use by the guest.
> +\end{description}
> +
> +A request for free page hinting proceeds as follows:
> +
> +\begin{enumerate}
> +
> +\item The driver examines the \field{free_page_hint_cmd_id} configuration field.
> +  If it contains a non-reserved value then free page hinting will begin.
> +
> +\item To supply free page hints:
> +  \begin{enumerate}
> +  \item The driver constructs an output descriptor containing the new value
> +from the \field{free_page_hint_cmd_id} configuration field and adds it to
> +the free_page_hint_vq.
> +  \item The driver maps a series of pages and adds them to the
> +free_page_hint_vq as individual scatter-gather input descriptor entries.
> +  \item When the driver is no longer able to fetch additional pages to add
> +to the free_page_hint_vq, it will construct an output descriptor
> +containing the command ID VIRTIO_BALLOON_CMD_ID_STOP.
> +  \end{enumerate}
> +
> +\item A round of hinting ends either when the driver is no longer able to
> +  supply more pages for hinting as described above, or when the device
> +  updates \field{free_page_hint_cmd_id} configuration field to contain either
> +  VIRTIO_BALLOON_CMD_ID_STOP or VIRTIO_BALLOON_CMD_ID_DONE.
> +
> +\item The device may follow VIRTIO_BALLOON_CMD_ID_STOP with a new
> +  non-reserved value for the \field{free_page_hint_cmd_id} configuration
> +  field in which case it will resume supplying free page hints.
> +
> +\item Otherwise, if the device provides VIRTIO_BALLOON_CMD_ID_DONE then
> +  hinting is complete and the driver may release all previously hinted
> +  pages for use by the guest.
> +
> +\end{enumerate}
> +
> +\drivernormative{\paragraph}{Free Page Hinting}{Device Types / Memory Balloon Device / Device Operation / Free Page Hinting}
> +
> +Normative statements in this section apply if the
> +VIRTIO_BALLOON_F_FREE_PAGE_HINT feature has been negotiated.
> +
> +The driver SHOULD supply pages to the free page hints when
> +\field{free_page_hint_cmd_id} reports a value of 2 or greater.
> +

nit: I'd avoid using the term "report" here. Maybe "specifies" or sth.
like that.
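
As an aside, the command ID handling in the quoted draft can be sketched as
a small classifier on the driver side. This is a hedged illustration only:
the macro names are shortened from the draft, and enum driver_action and
on_cmd_id() are hypothetical, not taken from the Linux driver.

```c
#include <stdint.h>

#define CMD_ID_STOP 0u /* VIRTIO_BALLOON_CMD_ID_STOP in the draft */
#define CMD_ID_DONE 1u /* VIRTIO_BALLOON_CMD_ID_DONE in the draft */

/* What the driver is expected to do for a given command ID. */
enum driver_action {
    ACT_STOP_KEEP_PAGES,    /* stop hinting, keep hinted pages */
    ACT_STOP_RELEASE_PAGES, /* stop hinting, release hinted pages */
    ACT_START_HINTING       /* non-reserved ID (2 or greater) */
};

static enum driver_action on_cmd_id(uint32_t cmd_id)
{
    switch (cmd_id) {
    case CMD_ID_STOP:
        return ACT_STOP_KEEP_PAGES;
    case CMD_ID_DONE:
        return ACT_STOP_RELEASE_PAGES;
    default:
        return ACT_START_HINTING;
    }
}
```

Only DONE allows the guest to reuse hinted pages, which is why a missing
DONE after a failed migration leaves the memory stranded.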

> +The driver MUST start hinting by providing an output descriptor
> 

[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-24 Thread David Hildenbrand
On 24.06.20 17:37, Michael S. Tsirkin wrote:
> On Wed, Jun 24, 2020 at 05:28:59PM +0200, David Hildenbrand wrote:
>>> So at the high level the idea was simple, we just clear the dirty bit
>>> when page is hinted, unless we sent a new command since. Implementation
>>> was reviewed by migration maintainers. If there's a consensus the code
>>> is written so badly we can't maintain it, maybe we should remove it.
>>> Which parts are unmaintainable in your eyes - migration or virtio ones?
>>
>> QEMU implementation without a proper virtio specification. I hope that
>> we can *at least* finally document the expected behavior. Alex gave it a
>> shot, and I was hoping that Wei could jump in to clarify, help move this
>> forward ... after all he implemented (+designed?) the feature and the
>> virtio interface.
>>
>>> Or maybe it's the general thing that interface was never specced
>>> properly.
>>
>> Yes, a spec would be definitely a good starter ...
>>
>> [...]
>>
>>>>
>>>> 1. If migration fails during RAM precopy, the guest will never receive a
>>>> DONE notification. Probably easy to fix.
>>>>
>>>> 2. Unclear semantics. Alex tried to document what the actual semantics
>>>> of hinted pages are.
>>>
>>> I'll reply to that now.
>>>
>>>> Assume the following in the guest to a previously
>>>> hinted page
>>>>
>>>> /* page was hinted and is reused now */
>>>> if (page[x] != Y)
>>>>     page[x] = Y;
>>>> /* migration ends, we now run on the destination */
>>>> BUG_ON(page[x] != Y);
>>>> /* BUG, because the content chan
>>>
>>> The assumption hinting makes is that data in page is written to before
>>> it's used.
>>>
>>>
>>>> A guest can observe that. And that could be a random driver that just
>>>> allocated a page.
>>>>
>>>> (I *assume* in Linux we might catch that using kasan, but I am not 100%
>>>> sure, also, the actual semantics to document are unclear - e.g., for
>>>> other guests)
>>>
>>> I think it's basically simple: hinting means it's ok to
>>> fill page with trash unless it has been modified since the command
>>> ID supplied.
>>
>> Yeah, I quite dislike the semantics, especially, as they are different
>> to well-known semantics as, e.g., represented by MADV_FREE. Getting changed
>> content when reading is really weird. But it seemed to be easier to
>> implement (low hanging fruit) and nobody complained back then. Well, now
>> we are stuck with it.
>>
>> [..]
> 
> The difference with MADV_FREE is
> - asynchronous (using cmd id to synchronize)
> - zero not guaranteed
> 
> right?

*looking into man page*, yes, when reading you either get the old
content or zero.

(I remember that a re-read also makes the content stable, but looks like
you really have to write to a page)

We should most probably do what Alex suggested and initialize pages (at
least write a single byte) when leaking them from the shrinker in the
guest while hinting is active, such that the content is stable for
anybody to allocate and reuse a page.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-24 Thread David Hildenbrand
> So at the high level the idea was simple, we just clear the dirty bit
> when page is hinted, unless we sent a new command since. Implementation
> was reviewed by migration maintainers. If there's a consensus the code
> is written so badly we can't maintain it, maybe we should remove it.
> Which parts are unmaintainable in your eyes - migration or virtio ones?

QEMU implementation without a proper virtio specification. I hope that
we can *at least* finally document the expected behavior. Alex gave it a
shot, and I was hoping that Wei could jump in to clarify, help move this
forward ... after all he implemented (+designed?) the feature and the
virtio interface.

> Or maybe it's the general thing that interface was never specced
> properly.

Yes, a spec would be definitely a good starter ...

[...]

>>
>> 1. If migration fails during RAM precopy, the guest will never receive a
>> DONE notification. Probably easy to fix.
>>
>> 2. Unclear semantics. Alex tried to document what the actual semantics
>> of hinted pages are.
> 
> I'll reply to that now.
> 
>> Assume the following in the guest to a previously
>> hinted page
>>
>> /* page was hinted and is reused now */
>> if (page[x] != Y)
>>     page[x] = Y;
>> /* migration ends, we now run on the destination */
>> BUG_ON(page[x] != Y);
>> /* BUG, because the content chan
> 
> The assumption hinting makes is that data in page is written to before it's
> used.
> 
> 
>> A guest can observe that. And that could be a random driver that just
>> allocated a page.
>>
>> (I *assume* in Linux we might catch that using kasan, but I am not 100%
>> sure, also, the actual semantics to document are unclear - e.g., for
>> other guests)
> 
> I think it's basically simple: hinting means it's ok to
> fill page with trash unless it has been modified since the command
> ID supplied.

Yeah, I quite dislike the semantics, especially, as they are different
to well-known semantics as, e.g., represented by MADV_FREE. Getting changed
content when reading is really weird. But it seemed to be easier to
implement (low hanging fruit) and nobody complained back then. Well, now
we are stuck with it.

[..]

> 
>> There are other concerns I had regarding the iothread (e.g., while
>> reporting is active, virtio_ballloon_get_free_page_hints() is
>> essentially a busy loop, in contrast to documented -
>> continue_to_get_hints will always be true).
> 
> So that would be a performance issue you are suggesting, right?

I misread the code, so that comment no longer applies (see other
message).

> 
>>> The appeal of hinting is that it's 0 overhead outside migration,
>>> and pains were taken to avoid keeping pages locked while
>>> hypervisor is busy.
>>>
>>> If we are to drop hinting completely we need to show that reporting
>>> can be comparable, and we'll probably want to add a mode for
>>> reporting that behaves somewhat similarly.
>>
>> Depends on the actual users. If we're dropping a feature that nobody is
>> actively using, I don't think we have to show anything.
> 
> 
> I don't know how to find out. So far it doesn't look like we found
> any common data corruptions that would indicate no one can use it safely.
> Races around reset aren't all that uncommon but I don't think that
> qualifies as a deal breaker.

As I said, there are no libvirt bindings, so at least anything using
libvirt does not use it. I'd be curious about actual users.

> 
> I find the idea of asynchronously sending hints to host without
> waiting for them to be processed intriguing. Not something
> I'd work on implementing if we had reporting originally,
> but since it's there I'm not sure we should just discard it
> at this point.
> 
>> This feature obviously saw no proper review.
> 
> I did my best but obviously missed some things.

Yeah, definitely not your fault. People cannot expect maintainers to
review everything in detail.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH 1/2] virtio-balloon: Prevent guest from starting a report when we didn't request one

2020-06-23 Thread David Hildenbrand
>>> +++ b/hw/virtio/virtio-balloon.c
>>> @@ -527,7 +527,8 @@ static bool get_free_page_hints(VirtIOBalloon *dev)
>>>              ret = false;
>>>              goto out;
>>>          }
>>> -        if (id == dev->free_page_report_cmd_id) {
>>> +        if (dev->free_page_report_status == FREE_PAGE_REPORT_S_REQUESTED &&
>>> +            id == dev->free_page_report_cmd_id) {
>>>              dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
>>>          } else {
>>>              /*
>>
>> But doesn't that mean that, after the first hint, all further ones will
>> be discarded and we'll enter the STOP state in the else case? Or am I
>> missing something?
>>
>> Shouldn't this be something like
>>
>> if (id == dev->free_page_report_cmd_id) {
>>     if (dev->free_page_report_status == FREE_PAGE_REPORT_S_REQUESTED) {
>>         dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
>>     }
>>     /* Stay in FREE_PAGE_REPORT_S_START as long as the cmd_id matches. */
>> } else { ...
> 
> There should only be one element containing an outbuf at the start of
> the report. Once that is processed we should not see the driver
> sending additional outbufs unless it is sending the STOP command ID.

Ok, I assume what Linux guests do is considered the correct protocol.

[...]

> 
>>> @@ -592,14 +593,16 @@ static void virtio_balloon_free_page_start(VirtIOBalloon *s)
>>>          return;
>>>      }
>>>
>>> -    if (s->free_page_report_cmd_id == UINT_MAX) {
>>> +    qemu_mutex_lock(&s->free_page_lock);
>>> +
>>> +    if (s->free_page_report_cmd_id++ == UINT_MAX) {
>>>          s->free_page_report_cmd_id =
>>>              VIRTIO_BALLOON_FREE_PAGE_REPORT_CMD_ID_MIN;
>>> -    } else {
>>> -        s->free_page_report_cmd_id++;
>>>      }
>>
>> Somewhat unrelated cleanup.
> 
> Agreed. I can drop it if preferred. I just took care of it because I
> was adding the lock above and below to prevent us from getting into
> any weird states where the command ID might be updated but the report
> status was not.

No hard feelings, it just makes reviewing harder, because one has to
investigate how the changes relate to the locking changes - to find out
they don't. :)

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH 1/2] virtio-balloon: Prevent guest from starting a report when we didn't request one

2020-06-22 Thread David Hildenbrand
On 19.06.20 23:53, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Based on code review it appears possible for the driver to force the device
> out of a stopped state when hinting by repeating the last ID it was
> provided.

Indeed, thanks for noticing.

> 
> Prevent this by only allowing a transition to the start state when we are
> in the requested state. This way the driver is only allowed to send one
> descriptor that will transition the device into the start state. All others
> will leave it in the stop state once it has finished.
> 
> In addition add the necessary locking to provent any potential races

s/provent/prevent/

> between the accesses of the cmd_id and the status.
> 
> Fixes: c13c4153f76d ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
> Signed-off-by: Alexander Duyck 
> ---
>  hw/virtio/virtio-balloon.c |   11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 10507b2a430a..7f3af266f674 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -527,7 +527,8 @@ static bool get_free_page_hints(VirtIOBalloon *dev)
>              ret = false;
>              goto out;
>          }
> -        if (id == dev->free_page_report_cmd_id) {
> +        if (dev->free_page_report_status == FREE_PAGE_REPORT_S_REQUESTED &&
> +            id == dev->free_page_report_cmd_id) {
>              dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
>          } else {
>              /*

But doesn't that mean that, after the first hint, all further ones will
be discarded and we'll enter the STOP state in the else case? Or am I
missing something?

Shouldn't this be something like

if (id == dev->free_page_report_cmd_id) {
    if (dev->free_page_report_status == FREE_PAGE_REPORT_S_REQUESTED) {
        dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
    }
    /* Stay in FREE_PAGE_REPORT_S_START as long as the cmd_id matches. */
} else { ...
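
To compare the two variants in isolation, here is a minimal standalone
model. It is a sketch, not QEMU code: the enum, the struct and both helper
names are hypothetical, and the else branch (elided above) is assumed to
move a started device to the stop state, as the commit message describes.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the FREE_PAGE_REPORT_S_* states. */
enum hint_status { S_STOP, S_REQUESTED, S_START, S_DONE };

struct balloon_model {
    enum hint_status status;
    uint32_t cmd_id;
};

/* Variant from the patch: only REQUESTED plus a matching ID starts. */
static void outbuf_patch(struct balloon_model *d, uint32_t id)
{
    if (d->status == S_REQUESTED && id == d->cmd_id) {
        d->status = S_START;
    } else if (d->status == S_START) {
        /* Assumed else-branch behavior: stop only once started. */
        d->status = S_STOP;
    }
}

/* Variant suggested in this reply: a matching ID never leaves START. */
static void outbuf_suggested(struct balloon_model *d, uint32_t id)
{
    if (id == d->cmd_id) {
        if (d->status == S_REQUESTED) {
            d->status = S_START;
        }
    } else if (d->status == S_START) {
        d->status = S_STOP;
    }
}
```

Both variants refuse to restart a stopped device when the old ID is
replayed. They differ only when the driver sends the matching ID a second
time while started: the patch stops hinting, the suggestion keeps going.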

> @@ -592,14 +593,16 @@ static void virtio_balloon_free_page_start(VirtIOBalloon *s)
>          return;
>      }
>  
> -    if (s->free_page_report_cmd_id == UINT_MAX) {
> +    qemu_mutex_lock(&s->free_page_lock);
> +
> +    if (s->free_page_report_cmd_id++ == UINT_MAX) {
>          s->free_page_report_cmd_id =
>              VIRTIO_BALLOON_FREE_PAGE_REPORT_CMD_ID_MIN;
> -    } else {
> -        s->free_page_report_cmd_id++;
>      }

Somewhat unrelated cleanup.
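
For what it's worth, the old and new forms of the increment behave the same
for an unsigned counter, which a small standalone check can confirm. This is
a sketch: CMD_ID_MIN is an assumed placeholder value for
VIRTIO_BALLOON_FREE_PAGE_REPORT_CMD_ID_MIN, and both function names are
invented for the comparison.

```c
#include <limits.h>

/* Assumed placeholder; the thread only names the constant. */
#define CMD_ID_MIN 0x80000000u

/* Form before the patch: explicit else branch. */
static unsigned int next_cmd_id_old(unsigned int id)
{
    if (id == UINT_MAX) {
        id = CMD_ID_MIN;
    } else {
        id++;
    }
    return id;
}

/* Form in the patch: post-increment inside the comparison. */
static unsigned int next_cmd_id_new(unsigned int id)
{
    if (id++ == UINT_MAX) {
        /* id wrapped to 0 here; overwrite it with the minimum. */
        id = CMD_ID_MIN;
    }
    return id;
}
```

Unsigned wrap-around is well defined in C, so the post-increment form is
safe; it is just harder to read next to the locking change.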

>  
>      s->free_page_report_status = FREE_PAGE_REPORT_S_REQUESTED;
> +    qemu_mutex_unlock(&s->free_page_lock);
> +
>      virtio_notify_config(vdev);
>  }
>  
> 


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-18 Thread David Hildenbrand
> 
>> 2. Unclear semantics. Alex tried to document what the actual semantics
>> of hinted pages are. Assume the following in the guest to a previously
>> hinted page
>> 
>> /* page was hinted and is reused now */
>> if (page[x] != Y)
>>     page[x] = Y;
>> /* migration ends, we now run on the destination */
>> BUG_ON(page[x] != Y);
>> /* BUG, because the content chan
>> 
>> A guest can observe that. And that could be a random driver that just
>> allocated a page.
>> 
>> (I *assume* in Linux we might catch that using kasan, but I am not 100%
>> sure, also, the actual semantics to document are unclear - e.g., for
>> other guests)
>> 
>> As Alex mentioned, it is not even guaranteed in QEMU that we receive a
>> zero page on the destination, it could also be something else (e.g.,
>> previously migrated values).
> 
> So this is only an issue for pages that are pushed out of the balloon
> as a part of the shrinker process though. So fixing it would be pretty
> straightforward as we would just have to initialize or at least dirty
> pages that are leaked as a part of the shrinker. That may have an
> impact on performance though as it would result in us dirtying pages
> that are freed as a result of the shrinker being triggered.
> 

It really depends on the desired semantics, which are unclear because there is 
no doc/spec. Either QEMU is buggy or the kernel is buggy.

>> 3. If I am not wrong, the iothread works in
>> virtio_ballloon_get_free_page_hints() on the virtqueue while holding only
>> the free_page_lock (no BQL).
>> 
>> Assume we're migrating, the iothread is active, and the guest triggers a
>> device reset.
>> 
>> virtio_balloon_device_reset() will trigger a
>> virtio_balloon_free_page_stop(s). That won't actually wait for the
>> iothread to stop, it will only temporarily lock free_page_lock and
>> update s->free_page_report_status.
>> 
>> I think there can be a race between the device reset and the iothread.
>> Once virtio_balloon_free_page_stop() returned,
>> virtio_ballloon_get_free_page_hints() can still call
>> - virtio_queue_set_notification(vq, 0);
>> - virtio_queue_set_notification(vq, 1);
>> - virtio_notify(vdev, vq);
>> - virtqueue_pop()
>> 
>> I doubt this is very nice.
> 
> And our conversation had me start looking through references to
> virtio_balloon_free_page_stop. It looks like we call it for when we
> unrealize the device or reset the device. It might make more sense for
> us to look at pushing the status to DONE and forcing the iothread to
> be flushed out.
> 
>> There are other concerns I had regarding the iothread (e.g., while
>> reporting is active, virtio_ballloon_get_free_page_hints() is
>> essentially a busy loop, in contrast to documented -
>> continue_to_get_hints will always be true).
>> 
>>> The appeal of hinting is that it's 0 overhead outside migration,
>>> and pains were taken to avoid keeping pages locked while
>>> hypervisor is busy.
>>> 
>>> If we are to drop hinting completely we need to show that reporting
>>> can be comparable, and we'll probably want to add a mode for
>>> reporting that behaves somewhat similarly.
>> 
>> Depends on the actual users. If we're dropping a feature that nobody is
>> actively using, I don't think we have to show anything.
>> 
>> This feature obviously saw no proper review.
> 
> I'm pretty sure it had some, as it went through several iterations as
> I recall. However I don't think the review of the virtio interface was
> very detailed as I think most of the attention was on the kernel
> interface.

Yes, that's what I meant. The kernel side and the migration code (QEMU) got a
lot of attention.





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-18 Thread David Hildenbrand
> There are other concerns I had regarding the iothread (e.g., while
> reporting is active, virtio_ballloon_get_free_page_hints() is
> essentially a busy loop, in contrast to documented -
> continue_to_get_hints will always be true).

FWIW, I just double checked this and my memory was bad.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-18 Thread David Hildenbrand
>>
>> Ugh, ...
>>
>> @MST, you might have missed that in another discussion, what's your
>> general opinion about removing free page hinting in QEMU (and Linux)? We
>> keep finding issues in the QEMU implementation, including non-trivial
>> ones, and have to speculate about the actual semantics. I can see that
>> e.g., libvirt does not support it yet.
> 
> Not maintaining two similar features sounds attractive.

I consider free page hinting (in QEMU) to be in an unmaintainable state
(and it looks like Alex and I are fixing a feature we don't actually
intend to use / not aware of users). In contrast to that, the free page
reporting functionality/implementation is a walk in the park.

> 
> I'm still trying to get my head around the list of issues.  So far they
> all look kind of minor to me.  Would you like to summarize them
> somewhere?

Some things I still have in my mind


1. If migration fails during RAM precopy, the guest will never receive a
DONE notification. Probably easy to fix.

2. Unclear semantics. Alex tried to document what the actual semantics
of hinted pages are. Assume the following in the guest to a previously
hinted page

/* page was hinted and is reused now */
if (page[x] != Y)
    page[x] = Y;
/* migration ends, we now run on the destination */
BUG_ON(page[x] != Y);
/* BUG, because the content chan

A guest can observe that. And that could be a random driver that just
allocated a page.

(I *assume* in Linux we might catch that using kasan, but I am not 100%
sure, also, the actual semantics to document are unclear - e.g., for
other guests)

As Alex mentioned, it is not even guaranteed in QEMU that we receive a
zero page on the destination, it could also be something else (e.g.,
previously migrated values).

3. If I am not wrong, the iothread works in
virtio_ballloon_get_free_page_hints() on the virtqueue while holding only
the free_page_lock (no BQL).

Assume we're migrating, the iothread is active, and the guest triggers a
device reset.

virtio_balloon_device_reset() will trigger a
virtio_balloon_free_page_stop(s). That won't actually wait for the
iothread to stop, it will only temporarily lock free_page_lock and
update s->free_page_report_status.

I think there can be a race between the device reset and the iothread.
Once virtio_balloon_free_page_stop() returned,
virtio_ballloon_get_free_page_hints() can still call
- virtio_queue_set_notification(vq, 0);
- virtio_queue_set_notification(vq, 1);
- virtio_notify(vdev, vq);
- virtqueue_pop()

I doubt this is very nice.

There are other concerns I had regarding the iothread (e.g., while
reporting is active, virtio_ballloon_get_free_page_hints() is
essentially a busy loop, in contrast to what is documented:
continue_to_get_hints will always be true).
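
To make item 2 concrete, here is a minimal userspace model of the problem. It is only a sketch under stated assumptions: the struct, the tiny page size, and the rule "a hinted page is skipped unless the guest dirtied it afterwards" are illustrative and are not taken from the QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES 16  /* tiny "page", for illustration only */

/* Hypothetical model: one guest page as seen on source and destination. */
struct page_model {
    unsigned char src[PAGE_BYTES];
    unsigned char dst[PAGE_BYTES];
    bool hinted;  /* guest hinted the page as free */
    bool dirty;   /* guest wrote to the page after hinting */
};

/* Guest reuses the page, but only writes if the content differs. */
static void guest_conditional_write(struct page_model *m, int x, unsigned char y)
{
    if (m->src[x] != y) {
        m->src[x] = y;
        m->dirty = true;
    }
}

/* Assumed rule: a hinted page is only sent if it was dirtied afterwards. */
static void migrate(struct page_model *m)
{
    if (!m->hinted || m->dirty)
        memcpy(m->dst, m->src, PAGE_BYTES);
}

/* True if the guest's BUG_ON(page[x] != Y) would fire on the destination. */
static bool destination_check_fails(int x, unsigned char y,
                                    unsigned char dst_fill, bool hinted)
{
    struct page_model m;

    memset(m.src, y, PAGE_BYTES);         /* page already holds Y on the source */
    memset(m.dst, dst_fill, PAGE_BYTES);  /* whatever the destination holds */
    m.hinted = hinted;
    m.dirty = false;

    guest_conditional_write(&m, x, y);    /* no write happens, so nothing is dirtied */
    migrate(&m);
    return m.dst[x] != y;
}
```

In this model the check fails only when the page was hinted and the destination happens to hold something other than Y, which is the case described above.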

> The appeal of hinting is that it's 0 overhead outside migration,
> and pains were taken to avoid keeping pages locked while
> hypervisor is busy.
> 
> If we are to drop hinting completely we need to show that reporting
> can be comparable, and we'll probably want to add a mode for
> reporting that behaves somewhat similarly.

Depends on the actual users. If we're dropping a feature that nobody is
actively using, I don't think we have to show anything.

This feature obviously saw no proper review.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-18 Thread David Hildenbrand
On 18.06.20 17:14, Alexander Duyck wrote:
> On Thu, Jun 18, 2020 at 5:54 AM David Hildenbrand  wrote:
>>
>> On 13.06.20 22:07, Alexander Duyck wrote:
>>> On Tue, May 26, 2020 at 9:14 PM Alexander Duyck
>>>  wrote:
>>>>
>>>> From: Alexander Duyck 
>>>>
>>>> In an upcoming patch a feature named Free Page Reporting is about to be
>>>> added. In order to avoid any confusion we should drop the use of the word
>>>> 'report' when referring to Free Page Hinting. So what this patch does is go
>>>> through and replace all instances of 'report' with 'hint' when we are
>>>> referring to free page hinting.
>>>>
>>>> Acked-by: David Hildenbrand 
>>>> Signed-off-by: Alexander Duyck 
>>>> ---
>>>>  hw/virtio/virtio-balloon.c         |   78 ++--
>>>>  include/hw/virtio/virtio-balloon.h |   20 +
>>>>  2 files changed, 49 insertions(+), 49 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>>>> index 3e2ac1104b5f..dc15409b0bb6 100644
>>>> --- a/hw/virtio/virtio-balloon.c
>>>> +++ b/hw/virtio/virtio-balloon.c
>>>
>>> ...
>>>
>>>> @@ -817,14 +817,14 @@ static int virtio_balloon_post_load_device(void *opaque, int version_id)
>>>>  return 0;
>>>>  }
>>>>
>>>> -static const VMStateDescription vmstate_virtio_balloon_free_page_report = {
>>>> +static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
>>>>  .name = "virtio-balloon-device/free-page-report",
>>>>  .version_id = 1,
>>>>  .minimum_version_id = 1,
>>>>  .needed = virtio_balloon_free_page_support,
>>>>  .fields = (VMStateField[]) {
>>>> -VMSTATE_UINT32(free_page_report_cmd_id, VirtIOBalloon),
>>>> -VMSTATE_UINT32(free_page_report_status, VirtIOBalloon),
>>>> +VMSTATE_UINT32(free_page_hint_cmd_id, VirtIOBalloon),
>>>> +VMSTATE_UINT32(free_page_hint_status, VirtIOBalloon),
>>>>  VMSTATE_END_OF_LIST()
>>>>  }
>>>>  };
>>>
>>> So I noticed this patch wasn't in the list of patches pulled, but that
>>> is probably for the best since I believe the change above might have
>>> broken migration as VMSTATE_UINT32 does a stringify on the first
>>> parameter.
>>
>> Indeed, it's the name of the vmstate field. But I don't think it is
>> relevant for migration. It's an indicator if a field is valid and it's
>> used in traces/error messages.
>>
>> See git grep "field->name"
>>
>> I don't think renaming this is problematic. Can you rebase and resend?
>> Thanks!
> 
> Okay, I will.
> 
>>> Any advice on how to address it, or should I just give up on renaming
>>> free_page_report_cmd_id and free_page_report_status?
>>>
>>> Looking at this I wonder why we even need to migrate these values? It
>>> seems like if we are completing a migration the cmd_id should always
>>> be "DONE" shouldn't it? It isn't as if we are going to migrate the
>>
>> The *status* should be DONE IIUC. The cmd_id might be relevant, no? It's
>> always incremented until it wraps.
> 
> The thing is, the cmd_id visible to the driver if the status is DONE
> is the cmd_id value for DONE. So as long as the driver acknowledges
> the value we could essentially start over the cmd_id without any
> negative effect. The driver would have to put down a new descriptor to
> start a block of hinting in order to begin reporting again so there
> shouldn't be any risk of us falsely hinting pages that were in a
> previous epoch.
> 
> Ugh, although now looking at it I think we might have a bug in the
> QEMU code in that the driver could in theory force its way past a
> "STOP" by just replaying the last command_id descriptor and then keep
> going. Should be a pretty easy fix though as we should only allow a
> transition to S_START if the status is S_REQUESTED/

Ugh, ...

@MST, you might have missed that in another discussion, what's your
general opinion about removing free page hinting in QEMU (and Linux)? We
keep finding issues in the QEMU implementation, including non-trivial
ones, and have to speculate about the actual semantics. I can see that
e.g., libvirt does not support it yet.
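
For reference, the guard Alexander suggests above (only allow the transition to S_START when the status is S_REQUESTED) could be sketched as follows. This is a hypothetical model: the enum, struct, and function names are made up for illustration and do not mirror the actual QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status names follow the S_STOP/S_REQUESTED/S_START/S_DONE states
 * discussed in the thread; the enum itself is illustrative. */
enum hint_status { S_STOP, S_REQUESTED, S_START, S_DONE };

struct hint_dev {
    enum hint_status status;
    uint32_t cmd_id;  /* command id the device currently expects */
};

/* The driver put down a descriptor carrying 'id'.
 * Returns true if hinting starts, false if the request is ignored. */
static bool hint_try_start(struct hint_dev *d, uint32_t id)
{
    if (d->status != S_REQUESTED)
        return false;  /* e.g., a replayed descriptor after STOP */
    if (id != d->cmd_id)
        return false;  /* stale command id */
    d->status = S_START;
    return true;
}
```

With such a check, replaying the last command id after a STOP leaves the status untouched instead of restarting hinting.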

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v25 QEMU 3/3] virtio-balloon: Replace free page hinting references to 'report' with 'hint'

2020-06-18 Thread David Hildenbrand
On 13.06.20 22:07, Alexander Duyck wrote:
> On Tue, May 26, 2020 at 9:14 PM Alexander Duyck
>  wrote:
>>
>> From: Alexander Duyck 
>>
>> In an upcoming patch a feature named Free Page Reporting is about to be
>> added. In order to avoid any confusion we should drop the use of the word
>> 'report' when referring to Free Page Hinting. So what this patch does is go
>> through and replace all instances of 'report' with 'hint' when we are
>> referring to free page hinting.
>>
>> Acked-by: David Hildenbrand 
>> Signed-off-by: Alexander Duyck 
>> ---
>>  hw/virtio/virtio-balloon.c         |   78 ++--
>>  include/hw/virtio/virtio-balloon.h |   20 +
>>  2 files changed, 49 insertions(+), 49 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 3e2ac1104b5f..dc15409b0bb6 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
> 
> ...
> 
>> @@ -817,14 +817,14 @@ static int virtio_balloon_post_load_device(void *opaque, int version_id)
>>  return 0;
>>  }
>>
>> -static const VMStateDescription vmstate_virtio_balloon_free_page_report = {
>> +static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
>>  .name = "virtio-balloon-device/free-page-report",
>>  .version_id = 1,
>>  .minimum_version_id = 1,
>>  .needed = virtio_balloon_free_page_support,
>>  .fields = (VMStateField[]) {
>> -VMSTATE_UINT32(free_page_report_cmd_id, VirtIOBalloon),
>> -VMSTATE_UINT32(free_page_report_status, VirtIOBalloon),
>> +VMSTATE_UINT32(free_page_hint_cmd_id, VirtIOBalloon),
>> +VMSTATE_UINT32(free_page_hint_status, VirtIOBalloon),
>>  VMSTATE_END_OF_LIST()
>>  }
>>  };
> 
> So I noticed this patch wasn't in the list of patches pulled, but that
> is probably for the best since I believe the change above might have
> broken migration as VMSTATE_UINT32 does a stringify on the first
> parameter.

Indeed, it's the name of the vmstate field. But I don't think it is
relevant for migration. It's an indicator if a field is valid and it's
used in traces/error messages.

See git grep "field->name"

I don't think renaming this is problematic. Can you rebase and resend?
Thanks!
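
To illustrate the point about stringification, here is a simplified stand-in for a field table. Everything in it (the struct, the FIELD_UINT32 macro, save_fields()) is hypothetical and much simpler than the real VMState machinery; it only shows how a stringified member name can be a pure label while the saved bytes depend solely on the values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified field descriptor. */
struct field_desc {
    const char *name;  /* stringified member name: a label only */
    size_t offset;     /* where the value lives in the device struct */
};

#define FIELD_UINT32(member, type) { #member, offsetof(type, member) }

struct dev_old { uint32_t free_page_report_cmd_id; uint32_t free_page_report_status; };
struct dev_new { uint32_t free_page_hint_cmd_id;   uint32_t free_page_hint_status; };

static const struct field_desc old_fields[] = {
    FIELD_UINT32(free_page_report_cmd_id, struct dev_old),
    FIELD_UINT32(free_page_report_status, struct dev_old),
};

static const struct field_desc new_fields[] = {
    FIELD_UINT32(free_page_hint_cmd_id, struct dev_new),
    FIELD_UINT32(free_page_hint_status, struct dev_new),
};

/* Write only the values, in table order; the names are never written. */
static size_t save_fields(const void *dev, const struct field_desc *f,
                          size_t n, unsigned char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        memcpy(out + len, (const unsigned char *)dev + f[i].offset,
               sizeof(uint32_t));
        len += sizeof(uint32_t);
    }
    return len;
}
```

In this model the old and new tables produce identical bytes for identical values, even though the stringified names differ. Whether the real stream is equally name-independent is what the paragraph above argues; the sketch does not prove it.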

> Any advice on how to address it, or should I just give up on renaming
> free_page_report_cmd_id and free_page_report_status?
> 
> Looking at this I wonder why we even need to migrate these values? It
> seems like if we are completing a migration the cmd_id should always
> be "DONE" shouldn't it? It isn't as if we are going to migrate the

The *status* should be DONE IIUC. The cmd_id might be relevant, no? It's
always incremented until it wraps.

> hinting from one host to another. We will have to start over which is
> essentially the signal that the "DONE" value provides. Same thing for
> the status. We shouldn't be able to migrate unless both of these are
> already in the "DONE" state so if anything I wonder if we shouldn't
> have that as the initial state for the device and just drop the
> migration info.

We'll have to glue that to a compat machine unfortunately, so we can
just keep migrating it ... :(


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v1] virtio-mem: add memory via add_memory_driver_managed()

2020-06-11 Thread David Hildenbrand
>> I'd like to have this patch in 5.8, with the initial merge of virtio-mem
>> if possible (so the user space representation of virtio-mem added memory
>> resources won't change anymore).
> 
> So my plan is to rebase on top of -rc1 and merge this for rc2 then.
> I don't like rebase on top of tip as the results are sometimes kind of
> random.

Right, I just wanted to get this out early so we can discuss how to proceed.

> And let's add a Fixes: tag as well, this way people will remember to
> pick this.
> Makes sense?

Yes, it's somehow a fix (for kexec). So

Fixes: 5f1f79bbc9e26 ("virtio-mem: Paravirtualized memory hotplug")

I can respin after -rc1 with the commit id fixed as noted by Pankaj.
Just let me know what you prefer.

Thanks!

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v1] virtio-mem: add memory via add_memory_driver_managed()

2020-06-11 Thread David Hildenbrand
On 11.06.20 12:32, Pankaj Gupta wrote:
>> Virtio-mem managed memory is always detected and added by the virtio-mem
>> driver, never using something like the firmware-provided memory map.
>> This is the case after an ordinary system reboot, and has to be guaranteed
>> after kexec. Especially, virtio-mem added memory resources can contain
>> inaccessible parts ("unplugged memory blocks"), blindly forwarding them
>> to a kexec kernel is dangerous, as unplugged memory will get accessed
>> (esp. written).
>>
>> Let's use the new way of adding special driver-managed memory introduced
>> in commit 75ac4c58bc0d ("mm/memory_hotplug: introduce
>> add_memory_driver_managed()").
> 
> Is this commit id correct?

Good point, it's the one from next-20200605.

7b7b27214bba

Is the correct one.

[...]

> 
> Looks good to me.
> Reviewed-by: Pankaj Gupta 
> 

Thanks!

-- 
Thanks,

David / dhildenb





[virtio-dev] [PATCH v1] virtio-mem: add memory via add_memory_driver_managed()

2020-06-11 Thread David Hildenbrand
Virtio-mem managed memory is always detected and added by the virtio-mem
driver, never using something like the firmware-provided memory map.
This is the case after an ordinary system reboot, and has to be guaranteed
after kexec. Especially, virtio-mem added memory resources can contain
inaccessible parts ("unplugged memory blocks"), blindly forwarding them
to a kexec kernel is dangerous, as unplugged memory will get accessed
(esp. written).

Let's use the new way of adding special driver-managed memory introduced
in commit 75ac4c58bc0d ("mm/memory_hotplug: introduce
add_memory_driver_managed()").

This will result in no entries in /sys/firmware/memmap ("raw firmware-
provided memory map"), the memory resource will be flagged
IORESOURCE_MEM_DRIVER_MANAGED (esp., kexec_file_load() will not place
kexec images on this memory), and it is exposed as "System RAM
(virtio_mem)" in /proc/iomem, so esp. kexec-tools can properly handle it.

Example /proc/iomem before this change:
  [...]
  14000-333ff : virtio0
    14000-147ff : System RAM
  33400-533ff : virtio1
    33800-33fff : System RAM
    34000-347ff : System RAM
    34800-34fff : System RAM
  [...]

Example /proc/iomem after this change:
  [...]
  14000-333ff : virtio0
    14000-147ff : System RAM (virtio_mem)
  33400-533ff : virtio1
    33800-33fff : System RAM (virtio_mem)
    34000-347ff : System RAM (virtio_mem)
    34800-34fff : System RAM (virtio_mem)
  [...]
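
As a rough illustration of how a user space tool could tell the two kinds of entries apart, here is a minimal sketch. The helper names are hypothetical and this is not how kexec-tools is actually implemented; it only assumes that the resource name follows " : " on each line and that lines carry no trailing newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return the resource name of one /proc/iomem line (the text after " : "),
 * or NULL if the line does not have that shape. */
static const char *iomem_name(const char *line)
{
    const char *sep = strstr(line, " : ");

    return sep ? sep + 3 : NULL;
}

/* Matches plain "System RAM" only; driver-managed ranges carry a suffix. */
static bool is_plain_system_ram(const char *line)
{
    const char *name = iomem_name(line);

    return name && strcmp(name, "System RAM") == 0;
}

/* Matches the name used for virtio-mem added memory after this change. */
static bool is_virtio_mem_ram(const char *line)
{
    const char *name = iomem_name(line);

    return name && strcmp(name, "System RAM (virtio_mem)") == 0;
}
```

A tool that only collects plain "System RAM" ranges would then skip the virtio-mem ranges without needing any knowledge of the device.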

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Cc: teawater 
Signed-off-by: David Hildenbrand 
---

Based on latest Linus' tree (and not a tag) because
- virtio-mem has just been merged via the vhost tree
- add_memory_driver_managed() has been merged a week ago via the -mm tree

I'd like to have this patch in 5.8, with the initial merge of virtio-mem
if possible (so the user space representation of virtio-mem added memory
resources won't change anymore).

---
 drivers/virtio/virtio_mem.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 50c689f250450..d2eab3558a9e1 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -101,6 +101,11 @@ struct virtio_mem {
 
 	/* The parent resource for all memory added via this device. */
 	struct resource *parent_resource;
+	/*
+	 * Copy of "System RAM (virtio_mem)" to be used for
+	 * add_memory_driver_managed().
+	 */
+	const char *resource_name;
 
 	/* Summary of all memory block states. */
 	unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
@@ -414,8 +419,20 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
 	if (nid == NUMA_NO_NODE)
 		nid = memory_add_physaddr_to_nid(addr);
 
+	/*
+	 * When force-unloading the driver and we still have memory added to
+	 * Linux, the resource name has to stay.
+	 */
+	if (!vm->resource_name) {
+		vm->resource_name = kstrdup_const("System RAM (virtio_mem)",
+						  GFP_KERNEL);
+		if (!vm->resource_name)
+			return -ENOMEM;
+	}
+
 	dev_dbg(&vm->vdev->dev, "adding memory block: %lu\n", mb_id);
-	return add_memory(nid, addr, memory_block_size_bytes());
+	return add_memory_driver_managed(nid, addr, memory_block_size_bytes(),
+					 vm->resource_name);
 }
 
 /*
@@ -1890,10 +1907,12 @@ static void virtio_mem_remove(struct virtio_device *vdev)
 	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL] ||
 	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE] ||
 	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
-	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE])
+	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE]) {
 		dev_warn(&vdev->dev, "device still has system memory added\n");
-	else
+	} else {
 		virtio_mem_delete_resource(vm);
+		kfree_const(vm->resource_name);
+	}
 
 	/* remove all tracking data - no locking needed */
 	vfree(vm->mb_state);
-- 
2.26.2
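
The lifetime rule for the resource name in the patch above can be modelled in user space roughly as follows. This is a sketch only: struct vm_model and both helpers are hypothetical, malloc()/free() stand in for kstrdup_const()/kfree_const(), and a simple counter stands in for the memory block state tracking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the driver state. */
struct vm_model {
    char *resource_name;         /* allocated on first add, shared by all blocks */
    unsigned long blocks_added;  /* memory blocks still added to the system */
};

/* Returns 0 on success, -1 if the name could not be allocated. */
static int model_add_block(struct vm_model *vm)
{
    if (!vm->resource_name) {
        const char *name = "System RAM (virtio_mem)";

        vm->resource_name = malloc(strlen(name) + 1);
        if (!vm->resource_name)
            return -1;
        strcpy(vm->resource_name, name);
    }
    vm->blocks_added++;
    return 0;
}

/* On removal, free the name only if no added memory still refers to it.
 * Returns -1 if the name had to be kept (leaked on purpose). */
static int model_remove(struct vm_model *vm)
{
    if (vm->blocks_added)
        return -1;
    free(vm->resource_name);
    vm->resource_name = NULL;
    return 0;
}
```

The name is allocated once and reused for every block, and it is deliberately kept when memory is still added at removal time, because the resources still point to it.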





[virtio-dev] Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2020-06-05 Thread David Hildenbrand
On 05.06.20 12:46, Alex Shi wrote:
> 
> 
> On 2020/6/5 6:05 PM, David Hildenbrand wrote:
>>> I guess I know what's happening here. In case we only have DMA memory
>>> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
>>> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
>>> map such PFNs (total 0 (slots), used 0 (slots)).
>>>
>>> Can you try with "swiotlb=force" on the kernel cmdline?
>> Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
>> create proper ACPI tables that indicate hotpluggable memory. (I'll have
>> to look into QEMU to figure out how to always indicate hotpluggable memory
>> that way).
>>
> 
> 
> That works too. Yes, better resolved in qemu, maybe. :)
> 

You can check out

g...@github.com:davidhildenbrand/qemu.git virtio-mem-v4

(subject to change before it is officially sent), which will also create
SRAT tables if no "slots" parameter was defined (and no -numa config was
specified).

Your original example should work with that.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2020-06-05 Thread David Hildenbrand
On 05.06.20 11:36, David Hildenbrand wrote:
> On 05.06.20 11:08, David Hildenbrand wrote:
>> On 05.06.20 10:55, Alex Shi wrote:
>>>
>>>
>>> On 2020/1/9 9:48 PM, David Hildenbrand wrote:
>>>> Ping,
>>>>
>>>> I'd love to get some feedback on
>>>>
>>>> a) The remaining MM bits from MM folks (especially, patch #6 and #8).
>>>> b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
>>>> folks.
>>>>
>>>> I'm planning to send a proper v1 (!RFC) once I have all necessary MM
>>>> acks. In the meanwhile, I will do more testing and minor reworks (e.g.,
>>>> fix !CONFIG_NUMA compilation).
>>>
>>>
>>> Hi David,
>>>
>>> Thanks for your work!
>>>
>>> I am trying your https://github.com/davidhildenbrand/linux.git virtio-mem-v5
>>> which works fine for me, but a 'DMA error' happens when a VM starts with
>>> less than 2GB memory. Did I miss something?
>>
>> Please use the virtio-mem-v4 branch for now, v5 is still under
>> construction (and might be scrapped completely if v4 goes upstream as is).
>>
>> Looks like a DMA issue. You're hotplugging 1GB, which should not really
>> eat too much memory. There was a similar issue reported by Hui in [1],
>> which boiled down to wrong usage of the swiotlb parameter.
>>
>> In such cases you should always try to reproduce with hotplug of a
>> same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
>> issue.
>>
>> What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?
>>
>> I'll try to reproduce with v4 briefly.
> 
> I guess I know what's happening here. In case we only have DMA memory
> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
> map such PFNs (total 0 (slots), used 0 (slots)).
> 
> Can you try with "swiotlb=force" on the kernel cmdline?

Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
create proper ACPI tables that indicate hotpluggable memory. (I'll have
to look into QEMU to figure out how to always indicate hotpluggable memory
that way).
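
A simplified model of why both workarounds help, under the assumption that bounce buffers are reserved at boot only if memory may end up above the 32-bit DMA limit (or if swiotlb is forced). The struct and function are hypothetical and this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define DMA32_LIMIT (4 * GiB)  /* addresses reachable by 32-bit DMA */

/* Hypothetical boot-time description of a guest. */
struct boot_mem {
    uint64_t max_boot_addr;     /* end of RAM present at boot */
    uint64_t max_hotplug_addr;  /* end of the hotpluggable range announced
                                 * by firmware tables, 0 if none */
    bool swiotlb_force;         /* "swiotlb=force" on the kernel cmdline */
};

/* Assumed decision: reserve bounce buffers if forced, or if any memory
 * (present or announced as hotpluggable) may sit above the DMA32 limit. */
static bool reserves_swiotlb(const struct boot_mem *b)
{
    uint64_t max_possible = b->max_boot_addr > b->max_hotplug_addr
                          ? b->max_boot_addr : b->max_hotplug_addr;

    return b->swiotlb_force || max_possible > DMA32_LIMIT;
}
```

In this model a 2G guest without an announced hotpluggable range reserves nothing, while either "swiotlb=force" or a hotpluggable range reaching above 4G leads to a reservation.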


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2020-06-05 Thread David Hildenbrand
On 05.06.20 11:08, David Hildenbrand wrote:
> On 05.06.20 10:55, Alex Shi wrote:
>>
>>
>> On 2020/1/9 9:48 PM, David Hildenbrand wrote:
>>> Ping,
>>>
>>> I'd love to get some feedback on
>>>
>>> a) The remaining MM bits from MM folks (especially, patch #6 and #8).
>>> b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
>>> folks.
>>>
>>> I'm planning to send a proper v1 (!RFC) once I have all necessary MM
>>> acks. In the meanwhile, I will do more testing and minor reworks (e.g.,
>>> fix !CONFIG_NUMA compilation).
>>
>>
>> Hi David,
>>
>> Thanks for your work!
>>
>> I am trying your https://github.com/davidhildenbrand/linux.git virtio-mem-v5
>> which works fine for me, but a 'DMA error' happens when a VM starts with
>> less than 2GB memory. Did I miss something?
> 
> Please use the virtio-mem-v4 branch for now, v5 is still under
> construction (and might be scrapped completely if v4 goes upstream as is).
> 
> Looks like a DMA issue. You're hotplugging 1GB, which should not really
> eat too much memory. There was a similar issue reported by Hui in [1],
> which boiled down to wrong usage of the swiotlb parameter.
> 
> In such cases you should always try to reproduce with hotplug of a
> same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
> issue.
> 
> What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?
> 
> I'll try to reproduce with v4 briefly.

I guess I know what's happening here. In case we only have DMA memory
when booting, we don't reserve swiotlb buffers. Once we hotplug memory
and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
map such PFNs (total 0 (slots), used 0 (slots)).

Can you try with "swiotlb=force" on the kernel cmdline?

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2020-06-05 Thread David Hildenbrand
On 05.06.20 10:55, Alex Shi wrote:
> 
> 
> On 2020/1/9 9:48 PM, David Hildenbrand wrote:
>> Ping,
>>
>> I'd love to get some feedback on
>>
>> a) The remaining MM bits from MM folks (especially, patch #6 and #8).
>> b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
>> folks.
>>
>> I'm planning to send a proper v1 (!RFC) once I have all necessary MM
>> acks. In the meanwhile, I will do more testing and minor reworks (e.g.,
>> fix !CONFIG_NUMA compilation).
> 
> 
> Hi David,
> 
> Thanks for your work!
> 
> I am trying your https://github.com/davidhildenbrand/linux.git virtio-mem-v5
> which works fine for me, but a 'DMA error' happens when a VM starts with
> less than 2GB memory. Did I miss something?

Please use the virtio-mem-v4 branch for now, v5 is still under
construction (and might be scrapped completely if v4 goes upstream as is).

Looks like a DMA issue. You're hotplugging 1GB, which should not really
eat too much memory. There was a similar issue reported by Hui in [1],
which boiled down to wrong usage of the swiotlb parameter.

In such cases you should always try to reproduce with hotplug of a
same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
issue.

What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?

I'll try to reproduce with v4 briefly.

[1]
https://lkml.kernel.org/r/9708f43a-9bd2-4377-8ee8-7fb1d95c6...@linux.alibaba.com

> 
> Thanks
> Alex
> 
> 
> (qemu) qom-set vm0 requested-size 1g
> (qemu) [   26.560026] virtio_mem virtio0: plugged size: 0x0
> [   26.560648] virtio_mem virtio0: requested size: 0x4000
> [   26.561730] systemd-journald[167]: no db file to read /run/udev/data/+virtio:virtio0: No such file or directory
> [   26.563138] systemd-journald[167]: no db file to read /run/udev/data/+virtio:virtio0: No such file or directory
> [   26.569122] Built 1 zonelists, mobility grouping on.  Total pages: 513141
> [   26.570039] Policy zone: Normal
> 
> (qemu) [   32.175838] e1000 :00:03.0: swiotlb buffer is full (sz: 81 bytes), total 0 (slots), used 0 (slots)
> [   32.176922] e1000 :00:03.0: TX DMA map failed
> [   32.177488] e1000 :00:03.0: swiotlb buffer is full (sz: 81 bytes), total 0 (slots), used 0 (slots)
> [   32.178535] e1000 :00:03.0: TX DMA map failed
> 
> my qemu command is like this:
> qemu-system-x86_64  --enable-kvm \
>   -m 2G,maxmem=16G -kernel /root/linux-next/$1/arch/x86/boot/bzImage \
>   -smp 4 \
>   -append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug psi=1 nokaslr ignore_loglevel" \
>   -hda /root/CentOS-7-x86_64-Azure-1703.qcow2 \
>   -net user,hostfwd=tcp::-:22 -net nic -s \
>   -object memory-backend-ram,id=mem0,size=3G \
>   -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
>   --nographic
> 
> 


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v4 00/15] virtio-mem: paravirtualized memory

2020-06-02 Thread David Hildenbrand
On 07.05.20 16:01, David Hildenbrand wrote:
> This series is based on v5.7-rc4. The patches are located at:
> https://github.com/davidhildenbrand/linux.git virtio-mem-v4
> 
> This is basically a resend of v3 [1], now based on v5.7-rc4 and retested.
> One patch was reshuffled and two ACKs I had missed were added. The
> rebase did not require any modifications to patches.
> 
> Details about virtio-mem can be found in the cover letter of v2 [2]. A
> basic QEMU implementation was posted yesterday [3].
> 
> [1] https://lkml.kernel.org/r/20200507103119.11219-1-da...@redhat.com
> [2] https://lkml.kernel.org/r/20200311171422.10484-1-da...@redhat.com
> [3] https://lkml.kernel.org/r/20200506094948.76388-1-da...@redhat.com
> 
> v3 -> v4:
> - Move "MAINTAINERS: Add myself as virtio-mem maintainer" to #2
> - Add two ACKs from Andrew (in reply to v2)
> -- "mm: Allow to offline unmovable PageOffline() pages via ..."
> -- "mm/memory_hotplug: Introduce offline_and_remove_memory()"
> 
> v2 -> v3:
> - "virtio-mem: Paravirtualized memory hotplug"
> -- Include "linux/slab.h" to fix build issues
> -- Remember the "region_size", helpful for patch #11
> -- Minor simplification in virtio_mem_overlaps_range()
> -- Use notifier_from_errno() instead of notifier_to_errno() in notifier
> -- More reliable check for added memory when unloading the driver
> - "virtio-mem: Allow to specify an ACPI PXM as nid"
> -- Also print the nid
> - Added patch #11-#15

@MST ping, v5.7 has been released

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH v4 1/3] content: Document balloon feature page poison

2020-05-29 Thread David Hildenbrand
On 29.05.20 18:57, Alexander Duyck wrote:
> On Fri, May 29, 2020 at 1:13 AM David Hildenbrand  wrote:
>>
>> On 27.05.20 06:06, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> Page poison provides a way for the guest to notify the host that it is
>>> initializing or poisoning freed pages with some specific poison value. As a
>>> result of this we can infer a couple traits about the guest:
>>>
>>> 1. Free pages will contain a specific pattern within the guest.
>>> 2. Modifying free pages from this value may cause an error in the guest.
>>> 3. Pages will be immediately written to by the driver when deflated.
>>>
>>> There are currently no existing features that make use of this data. In the
>>> upcoming feature free page reporting we will need to make use of this to
>>> identify if we can evict pages from the guest without causing data
>>> corruption.
>>>
>>> Add documentation for the page poison feature describing the basic
>>> functionality and requirements.
>>>
>>> Signed-off-by: Alexander Duyck 
>>> ---
>>>  conformance.tex |    2 ++
>>>  content.tex     |   59 +++
>>>  2 files changed, 57 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/conformance.tex b/conformance.tex
>>> index b6fdec090383..4ed9d62e8088 100644
>>> --- a/conformance.tex
>>> +++ b/conformance.tex
>>> @@ -149,6 +149,7 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>>>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Feature bits}
>>>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation}
>>>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation / Memory Statistics}
>>> +\item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation / Page Poison}
>>>  \end{itemize}
>>>
>>>  \conformance{\subsection}{SCSI Host Driver Conformance}\label{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}
>>> @@ -331,6 +332,7 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>>>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Feature bits}
>>>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation}
>>>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation / Memory Statistics}
>>> +\item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation / Page Poison}
>>>  \end{itemize}
>>>
>>>  \conformance{\subsection}{SCSI Host Device Conformance}\label{sec:Conformance / Device Conformance / SCSI Host Device Conformance}
>>> diff --git a/content.tex b/content.tex
>>> index 91735e3eb018..4a0ab90260ff 100644
>>> --- a/content.tex
>>> +++ b/content.tex
>>> @@ -5019,6 +5019,9 @@ \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Featu
>>>  memory statistics is present.
>>>  \item[VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) ] Deflate balloon on
>>>  guest out of memory condition.
>>> +\item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] A hint to the device, that the driver
>>> +will immediately write \field{poison_val} to pages after deflating them.
>>> +Configuration field \field{poison_val} is valid.
>>>
>>
>> Here we have "that the driver will immediately" ...
>>
>> But we never document that in form of a normative statement (e.g., "The
>> driver MUST initialize pages with \field{poison_val} after deflating").
> 
> I'm pretty sure we did document that. In the normative statement for
> the driver below we have:
> +The driver MUST initialize the deflated pages with \field{poison_val} when
> +they are reused by the driver.
> 

Doh! I think I missed that somehow

Reviewed-by: David Hildenbrand 

Thanks!

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH v4 1/3] content: Document balloon feature page poison

2020-05-29 Thread David Hildenbrand
On 27.05.20 06:06, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Page poison provides a way for the guest to notify the host that it is
> initializing or poisoning freed pages with some specific poison value. As a
> result of this we can infer a couple traits about the guest:
> 
> 1. Free pages will contain a specific pattern within the guest.
> 2. Modifying free pages from this value may cause an error in the guest.
> 3. Pages will be immediately written to by the driver when deflated.
> 
> There are currently no existing features that make use of this data. In the
> upcoming feature free page reporting we will need to make use of this to
> identify if we can evict pages from the guest without causing data
> corruption.
> 
> Add documentation for the page poison feature describing the basic
> functionality and requirements.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  conformance.tex |    2 ++
>  content.tex     |   59 +++
>  2 files changed, 57 insertions(+), 4 deletions(-)
> 
> diff --git a/conformance.tex b/conformance.tex
> index b6fdec090383..4ed9d62e8088 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -149,6 +149,7 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Feature bits}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation / Memory Statistics}
> +\item \ref{drivernormative:Device Types / Memory Balloon Device / Device Operation / Page Poison}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Driver Conformance}\label{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}
> @@ -331,6 +332,7 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Feature bits}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation / Memory Statistics}
> +\item \ref{devicenormative:Device Types / Memory Balloon Device / Device Operation / Page Poison}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Device Conformance}\label{sec:Conformance / Device Conformance / SCSI Host Device Conformance}
> diff --git a/content.tex b/content.tex
> index 91735e3eb018..4a0ab90260ff 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5019,6 +5019,9 @@ \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Featu
>  memory statistics is present.
>  \item[VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) ] Deflate balloon on
>  guest out of memory condition.
> +\item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] A hint to the device, that the driver
> +will immediately write \field{poison_val} to pages after deflating them.
> +Configuration field \field{poison_val} is valid.
>  

Here we have "that the driver will immediately" ...

But we never document that in the form of a normative statement (e.g., "The
driver MUST initialize pages with \field{poison_val} after deflating").

Just wondering if that is intended (I imagine it will be different with
free page reporting)?
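
To make the missing statement concrete, here is a minimal sketch of what such
a requirement would ask of a driver. The helper names and the 4 KiB page size
are assumptions made for illustration only; they are not taken from the patch
or from any existing driver, and byte-order handling of the le32
\field{poison_val} is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for this sketch: a 4 KiB page viewed as 32-bit words. */
#define WORDS_PER_PAGE 1024u

/* Hypothetical helper: write the poison pattern to a page that has just
 * left the balloon, before anything else in the guest uses it. */
static void poison_deflated_page(uint32_t *page, uint32_t poison_val)
{
    assert(page != NULL);
    for (size_t i = 0; i < WORDS_PER_PAGE; i++)
        page[i] = poison_val;
}

/* Hypothetical helper: check that every word of the page holds the pattern. */
static bool page_is_poisoned(const uint32_t *page, uint32_t poison_val)
{
    for (size_t i = 0; i < WORDS_PER_PAGE; i++)
        if (page[i] != poison_val)
            return false;
    return true;
}
```

With a statement like the one above, the device could rely on the page being
overwritten right after deflation; without it, the feature bit text stays a
hint only.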

Apart from that looks good to me!

-- 
Thanks,

David / dhildenb





Re: [virtio-dev] Re: [virtio-comment] [PATCH v3 2/3] content: Document balloon feature page poison

2020-05-27 Thread David Hildenbrand
On 27.05.20 08:14, Wei Wang wrote:
> On 05/26/2020 11:38 PM, Cornelia Huck wrote:
>> On Tue, 26 May 2020 17:28:00 +0200 David Hildenbrand
>>  wrote:
>> 
>>> On 26.05.20 16:50, Alexander Duyck wrote:
>>>> On Tue, May 26, 2020 at 1:24 AM David Hildenbrand
>>>>  wrote:
>>>>> Still wondering what to do with free page hinting ... in the
>>>>> meantime I'll have a look at free page reporting :)
>>>> The problem is it is already out there so I worry we wouldn't
>>>> ever be able to get rid of it. At most we could deprecate it,
>>>> but we are still stuck with it consuming bit resources and
>>>> such.
>>> Yeah, that's not an issue, they will simply turn to dead bits
>>> with minimal documentation. I just don't see us fixing/supporting
>>> that feature, really. Let's see what @MST things when he has time
>>> to look into this.
>>> 
>> If free page hinting is broken enough that we don't want anybody to
>> try implementing it, we maybe could:
> 
> May I know the issues that you got with FREE_PAGE_HINT?

Did you follow the discussion on the spec updates proposed by Alexander?
We might have identified a couple of issues on the QEMU side while trying to
document the semantics of free page hinting.

For example:

1. When migration fails in the live stage, before stopping the VM, the
guest will not receive a VIRTIO_BALLOON_CMD_ID_DONE.

2. The semantics of what can happen to hinted pages are unclear
(and it is unclear whether the current QEMU behavior is a bug or expected).
While writing to a hinted page results in the page getting migrated
with its value intact, the guest might suddenly observe a change in
the value when only reading the page.

Imagine (just as an example) something in a guest like

/* page was previously hinted and is now getting reused by the guest */
if (!page_filled_with(page, X)) {
        fill_page_with(page, X);
}
/* migration finished, value of page changed */

And Alexander pointed out that the change the guest observes might
not be a change to a zero page. Semantics unclear.
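
The example above can be turned into a small toy model. This is only a sketch
under stated assumptions: migrate_page() is a made-up stand-in for "a hinted
page that the guest did not write is not transferred", the value the guest
sees afterwards (dest_fill) is a free parameter because the semantics are
unclear, and a 4 KiB page is assumed. None of this is QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for this sketch: a 4 KiB page viewed as 32-bit words. */
#define WORDS_PER_PAGE 1024u

static bool page_filled_with(const uint32_t *page, uint32_t x)
{
    for (size_t i = 0; i < WORDS_PER_PAGE; i++)
        if (page[i] != x)
            return false;
    return true;
}

static void fill_page_with(uint32_t *page, uint32_t x)
{
    assert(page != NULL);
    for (size_t i = 0; i < WORDS_PER_PAGE; i++)
        page[i] = x;
}

/* Guest reuse path from the example: only write when the content differs.
 * Returns true if the page was actually written. */
static bool reuse_page(uint32_t *page, uint32_t x)
{
    if (page_filled_with(page, x))
        return false;
    fill_page_with(page, x);
    return true;
}

/* Toy model (assumption): a hinted page that was not written since the hint
 * is skipped by migration, so the guest later sees dest_fill instead. */
static void migrate_page(uint32_t *page, bool written_since_hint,
                         uint32_t dest_fill)
{
    if (!written_since_hint)
        fill_page_with(page, dest_fill);
}
```

If the page already holds X, reuse_page() performs no write, the model skips
the page, and the guest ends up with dest_fill although it only ever read the
page. If reuse_page() had to write, the content survives.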

There seem to be more issues related to the async iothread/reset handling,
plus the other fixes I just recently sent.


It would be good if you could have a look at the matter.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH v3 2/3] content: Document balloon feature page poison

2020-05-26 Thread David Hildenbrand
On 26.05.20 16:50, Alexander Duyck wrote:
> On Tue, May 26, 2020 at 1:24 AM David Hildenbrand  wrote:
>>
>> On 20.05.20 18:25, Alexander Duyck wrote:
>>> On Wed, May 20, 2020 at 2:28 AM David Hildenbrand  wrote:
>>>>
>>>> On 20.05.20 04:02, Alexander Duyck wrote:
>>>>> From: Alexander Duyck 
>>>>>
>>>>> Page poison provides a way for the guest to notify the host of the content
>>>>> expected to be found in pages when they are added back to the guest after
>>>>> being discarded. The feature currently doesn't apply to the existing
>>>>> balloon features, however it will apply to an upcoming feature, free page
>>>>> reporting. Add documentation for the page poison feature describing the
>>>>> basic functionality and requirements.
>>>>>
>>>>
>>>> I would rephrase this, starting what it does *without* free page
>>>> reporting (which is not "provides a way for the guest to notify ..."),
>>>> and then eventually how this feature will also be used in the future as
>>>> well with free page reporting.
>>>
>>> Below is a rewrite on this description. I'm thinking that we can
>>> probably call out the advantage to free page reporting in a different
>>> way. Basically with the page poison feature we know a few things about
>>> the behavior and I have called them out in the new patch description:
>>>
>>> Page poison provides a way for the guest to notify the host that it is
>>> initializing or poisoning freed pages with some specific poison value. As a
>>> result of this we can infer a couple traits about the guest:
>>>
>>> 1. Free pages will contain a specific pattern within the guest.
>>> 2. Modifying free pages from this value may cause an error in the guest.
>>> 3. Pages will be immediately written to by the driver when deflated.
>>>
>>> There are currently no existing features that make use of this data. In the
>>> upcoming feature free page reporting we will need to make use of this to
>>> identify if we can evict pages from the guest without causing data
>>> corruption.
>>>
>>> Add documentation for the page poison feature describing the basic
>>> functionality and requirements.
>>>
> 
> [...]
> 
>>>>> +
>>>>> +If the guest is not initializing or poisoning freed pages it should 
>>>>> reject
>>>>
>>>> Sometimes you use "write to pages after deflating", here you use "freed
>>>> pages"
>>>
>>> So when I am referencing "freed pages" I am talking about all free
>>> memory, while when I refer to "pages after deflating" I am talking
>>> about pages coming out of the balloon.
>>>
>>> My thought is that there maybe be additional uses for "poison_val" be
>>> to feed it into some future use other than just the balloon portion of
>>> the deflation. Basically what this is telling us is that we could look
>>> for a pattern of pages containing nothing but poison_val if we wanted
>>> to do some sort of same page merging, or maybe define something to
>>> optimize migration by defining a poison page similar to a zero page
>>> that could be used to reduce migration overhead in the future.
>>>
>>>>> +the VIRTIO_BALLOON_F_PAGE_POISON feature.
>>>>> +
>>>>> +If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the guest
>>>>> +will place the expected poison value into the \field{poison_val}
>>>>
>>>> again, "expected" is misleading in the context of this patch only.
>>>
>>> I will rewrite this statement at follows:
>>>   If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the driver
>>>   will place the initialization and/or poison value into the 
>>> \field{poison_val}
>>>   configuration field data.
>>>
>>> I think I might strengthen things a bit as well. In the driver
>>> normative section I think I might add the following:
>>>   The driver MUST initialize and/or poison the deflated pages with
>>>   \field{poison_val} when they are reused by the driver.
>>>
>>
>> Maybe simplify that whole "initialize and/or poison " handling across
>> this patch to "initialize with \field{poison_val}" - if the
>> initialization is used for poisoning or initialization doesn't matter
>> from

[virtio-dev] Re: [virtio-comment] [PATCH v3 3/3] content: Document balloon feature free page reporting

2020-05-26 Thread David Hildenbrand
On 20.05.20 04:02, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page reporting is a feature that allows the guest to proactively
> report unused pages to the host. By making use of this feature is is
> possible to reduce the overall memory footprint of the guest in cases where
> some significant portion of the memory is idle. Add documentation for the
> free page reporting feature describing the functionality and requirements.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  conformance.tex |2 +
>  content.tex |   82 
> ++-
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/conformance.tex b/conformance.tex
> index 5038b36324ac..5496a25e93ef 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -151,6 +151,7 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Memory Statistics}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Hinting}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Page Poison}
> +\item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Reporting}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Driver 
> Conformance}\label{sec:Conformance / Driver Conformance / SCSI Host Driver 
> Conformance}
> @@ -335,6 +336,7 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Memory Statistics}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Hinting}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Page Poison}
> +\item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Reporting}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Device 
> Conformance}\label{sec:Conformance / Device Conformance / SCSI Host Device 
> Conformance}
> diff --git a/content.tex b/content.tex
> index 89e9948b7399..acdbcfc81538 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5007,12 +5007,15 @@ \subsection{Virtqueues}\label{sec:Device Types / 
> Memory Balloon Device / Virtque
>  \item[1] deflateq
>  \item[2] statsq
>  \item[3] free_page_vq
> +\item[4] reporting_vq
>  \end{description}
>  
>statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>  
>free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>  
> +  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> +
>  \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / 
> Feature bits}
>  \begin{description}
>  \item[VIRTIO_BALLOON_F_MUST_TELL_HOST (0)] Host has to be told before
> @@ -5029,6 +5032,8 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  \item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] The device has to be notified if
>  the driver is expecting balloon pages to contain a certain value when
>  returned. Configuration field poison_val is valid.
> +\item[ VIRTIO_BALLOON_F_PAGE_REPORTING(5) ] The device has support for free
> +page reporting. A virtqueue for reporting free guest memory is present.
>  
>  \end{description}
>  
> @@ -5039,6 +5044,10 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  The driver SHOULD clear the VIRTIO_BALLOON_F_PAGE_POISON flag if it is not
>  expecting any specific value to be stored in the page.
>  
> +If the driver is expecting the pages to retain some initialized value,

"some" -> the value communicated via poison_val?

> +it MUST NOT accept VIRTIO_BALLOON_F_PAGE_REPORTING unless it also
> +negotiates VIRTIO_BALLOON_F_PAGE_POISON.
> +
>  \devicenormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon 
> Device / Feature bits}
>  If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature
>  bit, and if the driver did not accept this feature bit, the
> @@ -5101,10 +5110,16 @@ \subsection{Device Initialization}\label{sec:Device 
> Types / Memory Balloon Devic
>  \item If the VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated, the
>driver updates the \field{poison_val} configuration field.
>  
> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated the
> +  reporting_vq is identified.
> +
>  \item DRIVER_OK is set: device operation begins.
>  
>  \item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated, then
>notify the device about the stats virtqueue buffer.
> +
> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated then
> +  begin reporting free pages to device.

s/to device/to the device/

>  \end{enumerate}
>  
>  \subsection{Device Operation}\label{sec:Device Types / Memory Balloon Device 
> 

[virtio-dev] Re: [virtio-comment] [PATCH v3 2/3] content: Document balloon feature page poison

2020-05-26 Thread David Hildenbrand
On 20.05.20 18:25, Alexander Duyck wrote:
> On Wed, May 20, 2020 at 2:28 AM David Hildenbrand  wrote:
>>
>> On 20.05.20 04:02, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> Page poison provides a way for the guest to notify the host of the content
>>> expected to be found in pages when they are added back to the guest after
>>> being discarded. The feature currently doesn't apply to the existing
>>> balloon features, however it will apply to an upcoming feature, free page
>>> reporting. Add documentation for the page poison feature describing the
>>> basic functionality and requirements.
>>>
>>
>> I would rephrase this, starting what it does *without* free page
>> reporting (which is not "provides a way for the guest to notify ..."),
>> and then eventually how this feature will also be used in the future as
>> well with free page reporting.
> 
> Below is a rewrite on this description. I'm thinking that we can
> probably call out the advantage to free page reporting in a different
> way. Basically with the page poison feature we know a few things about
> the behavior and I have called them out in the new patch description:
> 
> Page poison provides a way for the guest to notify the host that it is
> initializing or poisoning freed pages with some specific poison value. As a
> result of this we can infer a couple traits about the guest:
> 
> 1. Free pages will contain a specific pattern within the guest.
> 2. Modifying free pages from this value may cause an error in the guest.
> 3. Pages will be immediately written to by the driver when deflated.
> 
> There are currently no existing features that make use of this data. In the
> upcoming feature free page reporting we will need to make use of this to
> identify if we can evict pages from the guest without causing data
> corruption.
> 
> Add documentation for the page poison feature describing the basic
> functionality and requirements.
> 
> [...]
> 
>>> diff --git a/content.tex b/content.tex
>>> index 816b6c1b052e..89e9948b7399 100644
>>> --- a/content.tex
>>> +++ b/content.tex
>>> @@ -5026,6 +5026,9 @@ \subsection{Feature bits}\label{sec:Device Types / 
>>> Memory Balloon Device / Featu
>>>  page hinting. A virtqueue for providing hints as to what memory is
>>>  currently free is present. Configuration field 
>>> \field{free_page_hint_cmd_id}
>>>  is valid.
>>> +\item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] The device has to be notified if
>>> +the driver is expecting balloon pages to contain a certain value when
>>> +returned. Configuration field poison_val is valid.
>>
>> That's not what it does in the context of this feature only, no?
>>
>> "A hint to the device, that the driver might immediately write
>> \field{poison_val} to pages after deflating them. Configuration field
>> \field{poison_val} is valid."
> 
> I'll probably just use this wording with a few slight tweaks. Thinking
> about it though I will get rid of "might" and replace it with "will".

I think that's always guaranteed by Linux as of now, so "will" makes sense.

[...]

>>> +
>>> +If the guest is not initializing or poisoning freed pages it should reject
>>
>> Sometimes you use "write to pages after deflating", here you use "freed
>> pages"
> 
> So when I am referencing "freed pages" I am talking about all free
> memory, while when I refer to "pages after deflating" I am talking
> about pages coming out of the balloon.
> 
> My thought is that there maybe be additional uses for "poison_val" be
> to feed it into some future use other than just the balloon portion of
> the deflation. Basically what this is telling us is that we could look
> for a pattern of pages containing nothing but poison_val if we wanted
> to do some sort of same page merging, or maybe define something to
> optimize migration by defining a poison page similar to a zero page
> that could be used to reduce migration overhead in the future.
> 
>>> +the VIRTIO_BALLOON_F_PAGE_POISON feature.
>>> +
>>> +If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the guest
>>> +will place the expected poison value into the \field{poison_val}
>>
>> again, "expected" is misleading in the context of this patch only.
> 
> I will rewrite this statement at follows:
>   If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the driver
>   will place the initialization and/or poison value into the 
> 

[virtio-dev] Re: [virtio-comment] [PATCH v3 2/3] content: Document balloon feature page poison

2020-05-20 Thread David Hildenbrand
On 20.05.20 04:02, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Page poison provides a way for the guest to notify the host of the content
> expected to be found in pages when they are added back to the guest after
> being discarded. The feature currently doesn't apply to the existing
> balloon features, however it will apply to an upcoming feature, free page
> reporting. Add documentation for the page poison feature describing the
> basic functionality and requirements.
> 

I would rephrase this, starting with what it does *without* free page
reporting (which is not "provides a way for the guest to notify ..."),
and then how this feature will also be used in the future
with free page reporting.

> Signed-off-by: Alexander Duyck 
> ---
>  conformance.tex |2 ++
>  content.tex |   44 
>  2 files changed, 46 insertions(+)
> 
> diff --git a/conformance.tex b/conformance.tex
> index a14e26edfcb2..5038b36324ac 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -150,6 +150,7 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Memory Statistics}
>  \item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Hinting}
> +\item \ref{drivernormative:Device Types / Memory Balloon Device / Device 
> Operation / Page Poison}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Driver 
> Conformance}\label{sec:Conformance / Driver Conformance / SCSI Host Driver 
> Conformance}
> @@ -333,6 +334,7 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Memory Statistics}
>  \item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Free Page Hinting}
> +\item \ref{devicenormative:Device Types / Memory Balloon Device / Device 
> Operation / Page Poison}
>  \end{itemize}
>  
>  \conformance{\subsection}{SCSI Host Device 
> Conformance}\label{sec:Conformance / Device Conformance / SCSI Host Device 
> Conformance}
> diff --git a/content.tex b/content.tex
> index 816b6c1b052e..89e9948b7399 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5026,6 +5026,9 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  page hinting. A virtqueue for providing hints as to what memory is
>  currently free is present. Configuration field 
> \field{free_page_hint_cmd_id}
>  is valid.
> +\item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] The device has to be notified if
> +the driver is expecting balloon pages to contain a certain value when
> +returned. Configuration field poison_val is valid.

That's not what it does in the context of this feature only, no?

"A hint to the device, that the driver might immediately write
\field{poison_val} to pages after deflating them. Configuration field
\field{poison_val} is valid."

>  
>  \end{description}
>  
> @@ -5033,6 +5036,9 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  The driver SHOULD accept the VIRTIO_BALLOON_F_MUST_TELL_HOST
>  feature if offered by the device.
>  
> +The driver SHOULD clear the VIRTIO_BALLOON_F_PAGE_POISON flag if it is not
> +expecting any specific value to be stored in the page.

That's not what it does in the context of this feature only, no?

"The driver SHOULD clear the VIRTIO_BALLOON_F_PAGE_POISON flag if it
will not immediately write \field{poison_val} to deflated pages (e.g., to
initialize them, or fill them with a poison value)." ?
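
As a rough sketch of how a driver could follow that wording during feature
negotiation: select_features() and its boolean parameter are hypothetical
names used for illustration, and only the bit number (4) comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit number as given in the patch. */
#define VIRTIO_BALLOON_F_PAGE_POISON 4

/* Hypothetical helper: return the feature bits the driver accepts.
 * PAGE_POISON is cleared when the driver will not write poison_val
 * to pages right after deflating them. */
static uint64_t select_features(uint64_t offered, bool writes_poison_on_deflate)
{
    uint64_t accepted = offered;

    if (!writes_poison_on_deflate)
        accepted &= ~(UINT64_C(1) << VIRTIO_BALLOON_F_PAGE_POISON);

    /* A driver never accepts a bit the device did not offer. */
    assert((accepted & ~offered) == 0);
    return accepted;
}
```

A driver that neither initializes nor poisons deflated pages would call this
with false and end up without the feature, which matches the SHOULD above.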

> +
>  \devicenormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon 
> Device / Feature bits}
>  If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature
>  bit, and if the driver did not accept this feature bit, the
> @@ -5055,11 +5061,15 @@ \subsection{Device configuration 
> layout}\label{sec:Device Types / Memory Balloon
>  VIRTIO_BALLOON_F_FREE_PAGE_HINT has been negotiated and is read-only by
>  the driver.
>  
> +  \field{poison_val} is available if VIRTIO_BALLOON_F_PAGE_POISON has been
> +negotiated.
> +
>  \begin{lstlisting}
>  struct virtio_balloon_config {
>  le32 num_pages;
>  le32 actual;
>  le32 free_page_hint_cmd_id;
> +le32 poison_val;
>  };
>  \end{lstlisting}
>  
> @@ -5088,6 +5098,9 @@ \subsection{Device Initialization}\label{sec:Device 
> Types / Memory Balloon Devic
>  \item If the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature bit is negotiated, the
>free_page_vq is identified.
>  
> +\item If the VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated, the
> +  driver updates the \field{poison_val} 

[virtio-dev] Re: [virtio-comment] [PATCH v2 1/3] content: Document balloon feature free page hints

2020-05-20 Thread David Hildenbrand
On 19.05.20 23:00, Alexander Duyck wrote:
> On Tue, May 19, 2020 at 9:09 AM David Hildenbrand  wrote:
>>
>>>> I proposed that the driver MUST reinitialize the pages when reusing
>>>> (which is what Linux does), so then this is true. Reuse implies
>>>> initializing, implies modification. It's somewhat simpler than what you
>>>> propose, leaving the case open where the driver would reuse pages by
>>>> only reading them (I don't really see a use case for that ...). But I
>>>> don't care as long as it's consistent and correct :)
>>>
>>> Linux does not reinitialize the pages when it frees them. That only
>>
>> Whoever uses the pages has to initialize. Again, I don't think we should
>> make difference between the guest and the driver. From spec POV, they
>> are one piece. Everything else is implementation detail.
> 
> Right, but the problem is "use". In the case of balloon it was pages
> being pulled out of the balloon. In the case of free pages nobody is
> really using them. They are "free" already. Part of the issue here is
> that unlike the balloon or page reporting we don't really have a good
> definition for where they are. Getting back to the wording I have been
> using for free page hinting I am looking at something like:
>   The driver MUST reinitialize the contents of any previously hinted page
>   released before receiving the command ID VIRTIO_BALLOON_CMD_ID_DONE.
> 
> I might reference that as well as the earlier comment about treating
> the hinted pages as uninitialized memory.
> 
>>> happens if poison or init_on_free are enabled which are rare cases.
>>> When it does reinitialize the pages then I agree that the device
>>> cannot modify the contents.
>>
>> What about a user who relies on the content of uninitialized pages?
>> Like, read it, if it has the value, don't set it to the value. Unlikely
>> but possible, no? We could have data corruption.
>>
>> We should document that in some way, because this is what could happen
>> with the *current* QEMU implementation
> 
> Agreed. This is a problem with the current QEMU/Linux driver
> implementation. What worries me is that I wonder if this might not be
> more possible then we realize. For example I wonder if something like
> KSM could read the page and try merging it with others just for the
> value to eventually change.
> 
> So I was documenting the driver side mostly as-is for the
> specification. What we probably do need to do is update both the
> driver and the specification to address this since if we are pulling
> the page out before we get "DONE" we probably should reinitialize it
> so that the state if fixed going forward and it cannot change.
> 
>>>
>>> The current implementation is assuming QEMU live-migration with the
>>> Linux guest as the only use case. As such I want to make sure we
>>> correctly capture all of the behaviors that are expected based on
>>> those assumptions, but I want to avoid inserting behaviors we would
>>> like to see occur but aren't really a part of this.
>>
>> Exactly that's why I bring this ^ up.
>>
>>>
>>>>>
>>>>> The driver can end up releasing the pages back to the buddy allocator
>>>>> and if they are not poisoned/init_on_free then they will go there and
>>>>> can still potentially change until such time as the guest writes to
>>>>> the page modifying it or the balloon driver switches the cmd ID to
>>>>> VIRTIO_BALLOON_CMD_ID_DONE. That was one of the reasons for trying to
>>>>> frame it the way I did. So what I can do is reword the two statements
>>>>> as follows:
>>>>>
>>>>>   If the content of a previously hinted page has not been modified by the
>>>>>   guest since the device issued the \field{free_page_hint_cmd_id} 
>>>>> associated
>>>>>   with the hint, the device MAY modify the contents of the page.
>>>>>
>>>>>   The device MUST NOT modify the content of a previously hinted page
>>>>> after
>>>>>   \field{free_page_hint_cmd_id} is set to VIRTIO_BALLOON_CMD_ID_DONE.
>>>> Is it really only "DONE" that closes the current window? I think a
>>>> "STOP" from the device will also close the window. DONE is only set at
>>>> the very last iteration during memory migration.
>>>
>>> So the CMD_ID_DONE is issued when the migration has occurred. The
>>> migration is what is actually modifying the memory.
>&

[virtio-dev] Re: [virtio-comment] [PATCH v2 3/3] content: Document balloon feature free page reporting

2020-05-19 Thread David Hildenbrand
> 
>>>  \devicenormative{\subsubsection}{Feature bits}{Device Types / Memory 
>>> Balloon Device / Feature bits}
>>>  If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature
>>>  bit, and if the driver did not accept this feature bit, the
>>> @@ -5101,10 +5110,16 @@ \subsection{Device Initialization}\label{sec:Device 
>>> Types / Memory Balloon Devic
>>>  \item If the VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated, the
>>>driver updates the \field{poison_val} configuration field.
>>>
>>> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated the
>>> +  reporting_vq is identified.
>>> +
>>>  \item DRIVER_OK is set: device operation begins.
>>>
>>>  \item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated, then
>>>notify the device about the stats virtqueue buffer.
>>> +
>>> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated then
>>> +  begin reporting free pages to device.
>>>  \end{enumerate}
>>>
>>>  \subsection{Device Operation}\label{sec:Device Types / Memory Balloon 
>>> Device / Device Operation}
>>> @@ -5478,7 +5493,9 @@ \subsubsection{Page Poison}\label{sec:Device Types / 
>>> Memory Balloon Device / Dev
>>>
>>>  Page Poison provides a way to notify the host of the contents that are
>>>  currently in the balloon pages, and those that are expected to be in the
>>> -pages when they are pulled from the balloon.
>>> +pages when they are pulled from the balloon. It is used for in-place
>>> +reporting of pages without needing to pull them from the memory allocator
>>> +of the guest.
>>
>> Let's see how that looks like after you modify patch #2.
> 
> What I currently have is:
>   Page Poison provides a way to notify the host that the guest is initializing
>   and/or poisoning free pages. When the feature is enabled pages that are
>   deflated will be immediately written to by the guest, and pages indicated by
>   free page reporting will contain the value indicated by \field{poison_val}.

Sounds good! I wonder whether "will be immediately" should be "might be immediately".

> 
>>>
>>>  If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the guest
>>>  will place the expected poison value into the \field{poison_val}
>>> @@ -5504,6 +5521,71 @@ \subsubsection{Page Poison}\label{sec:Device Types / 
>>> Memory Balloon Device / Dev
>>>  page hinting, the device MAY ignore the content of \field{poison_val}
>>>  for those operations.
>>>
>>> +\subsubsection{Free Page Reporting}\label{sec:Device Types / Memory 
>>> Balloon Device / Device Operation / Free Page Reporting}
>>> +
>>> +Free Page Reporting provides a mechanism similar to balloon inflation,
>>> +however it does not provide a deflation queue. The expectation is that the
>>> +device will have a means by which it can detect the guest page access and
>>> +fault in such pages with some initial value, likely a zero page.
>>> +
>>> +The driver will respond to to memory conditions and begin reporting free
>>
>> "to to memory conditions" I don't understand what you are trying to say.
>> The driver will simply report some free pages (e.g., of a guest-specific
>> minimum size) when it feels like the right time has come.
>>
>> This (and below) is too implementation specific. You could just
>> implement a driver that hints a single page every time it is getting
>> freed. Nothing wrong about that. There is just the option do to a bulk
>> report whenever the driver feels like doing it.
>>
>>> +pages when some number of pages are available.
>>> +
> 
> I don't really think it is all that specific. The full wording is:
>   The driver will respond to memory conditions and begin reporting free
>   pages when some number of pages are available.
> 
> So in this case "memory conditions" could be freeing a page and "some
> number" could be 1 if that is what you want to go for. I am not saying
> it has to bulk, but it could. There are a number of ways this could be
> interpreted. Basically there is "some condition" that will trigger us
> reporting pages. If that is 1 free page then I think that is described
> by that sentence, but so is the case where we wait until there are a
> large number of free pages.

If it's not all that specific, why not simplify to

" The driver will begin reporting free pages. When exactly and which
free pages are reported is up to the driver."

?

> 
>>> +\begin{enumerate}
>>> +
>>> +\item The driver determines it has enough pages available to begin
>>> +  reporting pages.
>>> +
>>> +\item The driver gathers pages into a scatter-gather list and adds them to
>>> +  the reporting_vq.
>>> +
>>> +\item The device acknowledges the reporting request by using the
>>> +  reporting_vq descriptor.
>>> +
>>> +\item Once the device has acknowledged the report, the pages can be
>>> +  returned to the location from which they were pulled.
>>> +
>>> +\item The driver can then continue to gather and report pages until it
>>> +  has determined it has reported a sufficient quantity of pages.
>>> +
>>> 

[virtio-dev] Re: [virtio-comment] [PATCH v2 1/3] content: Document balloon feature free page hints

2020-05-19 Thread David Hildenbrand
>> I proposed that the driver MUST reinitialize the pages when reusing
>> (which is what Linux does), so then this is true. Reuse implies
>> initializing, implies modification. It's somewhat simpler than what you
>> propose, leaving the case open where the driver would reuse pages by
>> only reading them (I don't really see a use case for that ...). But I
>> don't care as long as it's consistent and correct :)
> 
> Linux does not reinitialize the pages when it frees them. That only

Whoever uses the pages has to initialize them. Again, I don't think we should
distinguish between the guest and the driver. From spec POV, they
are one piece. Everything else is implementation detail.

> happens if poison or init_on_free are enabled which are rare cases.
> When it does reinitialize the pages then I agree that the device
> cannot modify the contents.

What about a user who relies on the content of uninitialized pages?
Like, read it, if it has the value, don't set it to the value. Unlikely
but possible, no? We could have data corruption.

We should document that in some way, because this is what could happen
with the *current* QEMU implementation.

> 
> The current implementation is assuming QEMU live-migration with the
> Linux guest as the only use case. As such I want to make sure we
> correctly capture all of the behaviors that are expected based on
> those assumptions, but I want to avoid inserting behaviors we would
> like to see occur but aren't really a part of this.

Exactly that's why I bring this ^ up.

> 
>>>
>>> The driver can end up releasing the pages back to the buddy allocator
>>> and if they are not poisoned/init_on_free then they will go there and
>>> can still potentially change until such time as the guest writes to
>>> the page modifying it or the balloon driver switches the cmd ID to
>>> VIRTIO_BALLOON_CMD_ID_DONE. That was one of the reasons for trying to
>>> frame it the way I did. So what I can do is reword the two statements
>>> as follows:
>>>
>>>   If the content of a previously hinted page has not been modified by the
>>>   guest since the device issued the \field{free_page_hint_cmd_id} associated
>>>   with the hint, the device MAY modify the contents of the page.
>>>
>>>   The device MUST NOT modify the content of a previously hinted page
>>> after
>>>   \field{free_page_hint_cmd_id} is set to VIRTIO_BALLOON_CMD_ID_DONE.
>> Is it really only "DONE" that closes the current window? I think a
>> "STOP" from the device will also close the window. DONE is only set at
>> the very last iteration during memory migration.
> 
> So the CMD_ID_DONE is issued when the migration has occurred. The
> migration is what is actually modifying the memory.
> 
>> (virtio_balloon_free_page_report_notify() in QEMU)
>>
>> I consider one window == one iteration == one value of
>> \field{free_page_hint_cmd_id} until either DONE or STOP
> 
> CMD_ID_STOP will close the current window for providing hints, but the
> migration hasn't happened yet. We are still accumulating the hints. We
> don't receive CMD_ID_DONE from the device until the migration has
> occurred. It is the migration that will alter the content of the pages
> by leaving them behind on the previous VM.

I'll have to think again about whether your statements reflect the reality
today. I'll have to dive once again into QEMU code :( Complicated stuff.

> 
>> [...]
>>
>> Let's think this through, what about this scenario:
>>
>> The device sets \field{free_page_hint_cmd_id} = X
>> The driver starts reporting free pages (and reports all pages it has)
>> 1. Sends X to start the window
>> 2. Sends all page hints (\field{free_page_hint_cmd_id} stays X)
>> 3. Sends VIRTIO_BALLOON_CMD_ID_STOP to end the window
>> The driver sets \field{free_page_hint_cmd_id} = DONE or STOP
>>
>> The guest can reuse the pages any time (triggered by the shrinker),
>> especially, during 2, before the hypervisor even processed a hint
>> request. It can happen that the guest reuses a page before the
>> hypervisor processes the request and before
>> \field{free_page_hint_cmd_id} changes.
>>
>> In QEMU, the double-bitmap magic makes sure that this is guaranteed to
>> work IIRC.
>>
>> In that case, the page has to be migrated in that window, the
>> hypervisor must not modify the content.
> 
> If by "reuse" you mean write to or reinitialize then that is correct.
> All that is really happening is that any pages that are hinted have
> the potential to be left behind with the original VM and not migrated
> to the new one. We get the notification that the migration happened
> when CMD_ID_DONE is passed to us. At that point the hinting is
> complete and the device has no use for additional data.
> 
> Instead of CMD_ID_STOP it probably would have made more sense to call
> it something like CMD_ID_PAUSE or CMD_ID_HOLD as that is what it is
> really doing. It is just temporarily holding the hints off while the
> hypervisor synchronizes the dirty bits from the host.

I think if migration fails, 

[virtio-dev] Re: [virtio-comment] [PATCH v2 3/3] content: Document balloon feature free page reporting

2020-05-19 Thread David Hildenbrand
On 15.05.20 19:33, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page reporting is a feature that allows the guest to proactively
> report unused pages to the host. By making use of this feature it is
> possible to reduce the overall memory footprint of the guest in cases where
> some significant portion of the memory is idle. Add documentation for the
> free page reporting feature describing the functionality and requirements.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  content.tex |   84 
> ++-
>  1 file changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/content.tex b/content.tex
> index 3d30fd5bb6fa..3cb38105f794 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5007,12 +5007,15 @@ \subsection{Virtqueues}\label{sec:Device Types / 
> Memory Balloon Device / Virtque
>  \item[1] deflateq
>  \item[2] statsq
>  \item[3] free_page_vq
> +\item[4] reporting_vq
>  \end{description}
>  
>statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>  
>free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>  
> +  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> +
>  \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / 
> Feature bits}
>  \begin{description}
>  \item[VIRTIO_BALLOON_F_MUST_TELL_HOST (0)] Host has to be told before
> @@ -5029,6 +5032,8 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  \item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] The device has to be notified if
>  the driver is expecting balloon pages to contain a certain value when
>  returned. Configuration field poison_val is valid.
> +\item[ VIRTIO_BALLOON_F_PAGE_REPORTING(5) ] The device has support for free
> +page reporting. A virtqueue for reporting free guest memory is present.
>  
>  \end{description}
>  
> @@ -5039,6 +5044,10 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  The driver SHOULD clear the VIRTIO_BALLOON_F_PAGE_POISON flag if it is not
>  expecting any specific value to be stored in the page.
>  
> +If the driver is expecting the pages to retain some initialized value,
> +it MUST NOT accept VIRTIO_BALLOON_F_PAGE_REPORTING unless it also
> +negotiates VIRTIO_BALLOON_F_PAGE_POISON.
> +

Is "accept" really the right word here? Below you use "negotiate", which
makes more sense.

>  \devicenormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon 
> Device / Feature bits}
>  If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature
>  bit, and if the driver did not accept this feature bit, the
> @@ -5101,10 +5110,16 @@ \subsection{Device Initialization}\label{sec:Device 
> Types / Memory Balloon Devic
>  \item If the VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated, the
>driver updates the \field{poison_val} configuration field.
>  
> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated the
> +  reporting_vq is identified.
> +
>  \item DRIVER_OK is set: device operation begins.
>  
>  \item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated, then
>notify the device about the stats virtqueue buffer.
> +
> +\item If the VIRTIO_BALLOON_F_PAGE_REPORTING feature bit is negotiated then
> +  begin reporting free pages to device.
>  \end{enumerate}
>  
>  \subsection{Device Operation}\label{sec:Device Types / Memory Balloon Device 
> / Device Operation}
> @@ -5478,7 +5493,9 @@ \subsubsection{Page Poison}\label{sec:Device Types / 
> Memory Balloon Device / Dev
>  
>  Page Poison provides a way to notify the host of the contents that are
>  currently in the balloon pages, and those that are expected to be in the
> -pages when they are pulled from the balloon.
> +pages when they are pulled from the balloon. It is used for in-place
> +reporting of pages without needing to pull them from the memory allocator
> +of the guest.

Let's see what that looks like after you modify patch #2.

>  
>  If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the guest
>  will place the expected poison value into the \field{poison_val}
> @@ -5504,6 +5521,71 @@ \subsubsection{Page Poison}\label{sec:Device Types / 
> Memory Balloon Device / Dev
>  page hinting, the device MAY ignore the content of \field{poison_val}
>  for those operations.
>  
> +\subsubsection{Free Page Reporting}\label{sec:Device Types / Memory Balloon 
> Device / Device Operation / Free Page Reporting}
> +
> +Free Page Reporting provides a mechanism similar to balloon inflation,
> +however it does not provide a deflation queue. The expectation is that the
> +device will have a means by which it can detect the guest page access and
> +fault in such pages with some initial value, likely a zero page.
> +
> +The driver will respond to to memory conditions and begin reporting free

"to to memory conditions" I don't understand what you are trying to say.
The driver will simply report some free pages 

[virtio-dev] Re: [virtio-comment] [PATCH v2 1/3] content: Document balloon feature free page hints

2020-05-19 Thread David Hildenbrand
[...]
>>> +\begin{description}
>>> +\item[VIRTIO_BALLOON_CMD_ID_STOP (0)] Any command ID previously supplied by
>>> +  the device is invalid. The driver should halt all hinting until a new
>>> +  command ID is supplied.
>>
>> Maybe "The driver should stop hinting free pages, but should not reuse
>> all previously hinted pages."
> 
> The "reuse all previously hinted pages" seems rather unclear to me. I
> would like to make clear that in this case the "use" is the guest
> making use of the memory, not the driver doing something like
> recycling hints. So in the spots where you reference "the driver

IMHO, in terms of use/reuse, I think it does not matter. From spec POV,
whatever happens in the guest in respect to hinting is under driver
control. The driver just has to find a way that the memory won't be reused.

> reusing pages" I think I might prefer to go with something along the
> lines of "releasing pages for use by the guest". The problem is that
> when you have a balloon we were referencing using pages from the
> balloon. Since we cannot reference the balloon I figure I will go with
> language where we "supply" and "release" hinted pages. That way we
> acknowledge that the driver is holding onto pages and not freeing them
> for use by the guest.
> 
> I'll probably go with something like:
>   The driver should stop hinting free pages, but
>   should not release any hinted pages for use by the guest.
> 

I'd say "release any hinted pages" is an implementation detail in the
guest to make sure the pages won't be reused. But I don't have a strong
opinion here as long as it helps to describe what has to be done :)

[...]

>>> +
>>> +The driver SHOULD return pages for use once \field{free_page_hint_cmd_id}
>>> +reports a value of VIRTIO_BALLOON_CMD_ID_DONE.
>>
>> "return pages" -> "start to reuse all previously hinted pages".
> 
> The driver SHOULD release all hinted pages for use by the guest once
> \field{free_page_hint_cmd_id} reports a value of VIRTIO_BALLOON_CMD_ID_DONE.
> 
>> Also,
>>
>> "The driver MUST reinitialize hinted pages before reusing them."
> 
> That isn't quite correct though. It is only necessary to initialize
> the pages if the guest depends on them being initialized.
> 
> Maybe something like:
>   The driver MUST treat the content of all hinted pages as uninitialized memory.
> 

Makes sense.

>>> +
>>> +\devicenormative{\paragraph}{Free Page Hinting}{Device Types / Memory 
>>> Balloon Device / Device Operation / Free Page Hinting}
>>> +
>>> +Normative statements in this section apply if the
>>> +VIRTIO_BALLOON_F_FREE_PAGE_HINT feature has been negotiated.
>>> +
>>> +The device MUST set \field{free_page_hint_cmd_id} to
>>> +VIRTIO_BALLOON_CMD_ID_STOP any time that the dirty pages for the given
>>> +guest are being recorded.
>>> +
>>> +The device MUST NOT reuse a command ID until it has received an output
>>> +descriptor containing VIRTIO_BALLOON_CMD_ID_STOP from the driver.
>>> +
>>> +The device MUST ignore pages that are provided with a command ID that does
>>> +not match the current value in \field{free_page_hint_cmd_id}.
>>> +
>>> +The device MAY modify the contents of the page in the balloon if the page
>>> +has not been modified by the guest since the \field{free_page_hint_cmd_id}
>>> +associated with the hint was issued by the device.
>>
>> "page in the balloon" -> "previously hinted pages"
>>
>> But it's not that easy in respect to the guest reusing the pages.
>>
>> "previously hinted pages and not reused pages" ?
>>
>> Also, something like
>>
>> "The device MUST NOT modify the contents of previously hinted pages in
>> case they are reused by the devices, even if they are reused by the
>> driver before the hinting request is processed."
> 
> That is not quite true.

I proposed that the driver MUST reinitialize the pages when reusing
(which is what Linux does), so then this is true. Reuse implies
initializing, implies modification. It's somewhat simpler than what you
propose, leaving the case open where the driver would reuse pages by
only reading them (I don't really see a use case for that ...). But I
don't care as long as it's consistent and correct :)

> 
> The driver can end up releasing the pages back to the buddy allocator
> and if they are not poisoned/init_on_free then they will go there and
> can still potentially change until such time as the guest writes to
> the page modifying it or the balloon driver switches the cmd ID to
> VIRTIO_BALLOON_CMD_ID_DONE. That was one of the reasons for trying to
> frame it the way I did. So what I can do is reword the two statements
> as follows:
> 
>   If the content of a previously hinted page has not been modified by the
>   guest since the device issued the \field{free_page_hint_cmd_id} associated
>   with the hint, the device MAY modify the contents of the page.
> 
>   The device MUST NOT modify the content of a previously hinted page
> after
>   \field{free_page_hint_cmd_id} is set to VIRTIO_BALLOON_CMD_ID_DONE.
Is it really only 

[virtio-dev] Re: [virtio-comment] [PATCH v2 1/3] content: Document balloon feature free page hints

2020-05-18 Thread David Hildenbrand


>> +
>> +\devicenormative{\paragraph}{Free Page Hinting}{Device Types / Memory 
>> Balloon Device / Device Operation / Free Page Hinting}
>> +
>> +Normative statements in this section apply if the
>> +VIRTIO_BALLOON_F_FREE_PAGE_HINT feature has been negotiated.
>> +
>> +The device MUST set \field{free_page_hint_cmd_id} to
>> +VIRTIO_BALLOON_CMD_ID_STOP any time that the dirty pages for the given
>> +guest are being recorded.
>> +
>> +The device MUST NOT reuse a command ID until it has received an output
>> +descriptor containing VIRTIO_BALLOON_CMD_ID_STOP from the driver.
>> +
>> +The device MUST ignore pages that are provided with a command ID that does
>> +not match the current value in \field{free_page_hint_cmd_id}.
>> +
>> +The device MAY modify the contents of the page in the balloon if the page
>> +has not been modified by the guest since the \field{free_page_hint_cmd_id}
>> +associated with the hint was issued by the device.
> 
> "page in the balloon" -> "previously hinted pages"
> 
> But it's not that easy in respect to the guest reusing the pages.
> 
> "previously hinted pages and not reused pages" ?
> 
> Also, something like
> 
> "The device MUST NOT modify the contents of previously hinted pages in
> case they are reused by the devices, even if they are reused by the
> driver before the hinting request is processed."

"reused by the driver" in the first instance of course.
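
The device-side statements quoted above (ignore hints whose command ID does not match the current \field{free_page_hint_cmd_id}; only touch hinted pages the guest has not modified since) can be modelled with a toy class. This is a sketch under stated assumptions, not QEMU's implementation: the class and method names are hypothetical, and `guest_wrote()` stands in for whatever dirty tracking the hypervisor actually uses.

```python
from dataclasses import dataclass, field

CMD_ID_STOP = 0  # VIRTIO_BALLOON_CMD_ID_STOP
CMD_ID_DONE = 1  # VIRTIO_BALLOON_CMD_ID_DONE


@dataclass
class HintingDevice:
    """Toy model of the device side of free page hinting (hypothetical names)."""

    free_page_hint_cmd_id: int = CMD_ID_STOP
    hinted: set = field(default_factory=set)

    def start_window(self, cmd_id: int) -> None:
        # Only non-reserved values can be used as a command ID.
        if cmd_id in (CMD_ID_STOP, CMD_ID_DONE):
            raise ValueError("reserved command ID")
        self.free_page_hint_cmd_id = cmd_id

    def receive_hint(self, cmd_id: int, pfn: int) -> bool:
        """Record a hint; return False if the hint is ignored."""
        # Hints carrying a stale or mismatching command ID are ignored.
        if cmd_id != self.free_page_hint_cmd_id:
            return False
        self.hinted.add(pfn)
        return True

    def guest_wrote(self, pfn: int) -> None:
        # Assumption: dirty tracking tells the device the page was reused,
        # so its content has to be preserved from now on.
        self.hinted.discard(pfn)

    def may_modify(self, pfn: int) -> bool:
        # The device may only drop content of pages that are still hinted.
        return pfn in self.hinted
```

In this model a page that the driver reuses before the device processes its hint simply needs `guest_wrote()` to be observed after `receive_hint()`; how a real device guarantees that ordering is exactly the open question in this thread.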


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [virtio-comment] [PATCH v2 1/3] content: Document balloon feature free page hints

2020-05-18 Thread David Hildenbrand
On 15.05.20 19:33, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Free page hints allow the balloon driver to provide information on what
> pages are not currently in use so that we can avoid the cost of copying
> them in migration scenarios. Add a feature description for free page hints
> describing basic functioning and requirements.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  content.tex |  128 
> ---
>  1 file changed, 122 insertions(+), 6 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 91735e3eb018..ec0abf177526 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5005,10 +5005,13 @@ \subsection{Virtqueues}\label{sec:Device Types / 
> Memory Balloon Device / Virtque
>  \begin{description}
>  \item[0] inflateq
>  \item[1] deflateq
> -\item[2] statsq.
> +\item[2] statsq
> +\item[3] free_page_vq
>  \end{description}
>  
> -  Virtqueue 2 only exists if VIRTIO_BALLOON_F_STATS_VQ set.
> +  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> +
> +  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>  
>  \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / 
> Feature bits}
>  \begin{description}
> @@ -5019,6 +5022,10 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  memory statistics is present.
>  \item[VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) ] Deflate balloon on
>  guest out of memory condition.
> +\item[ VIRTIO_BALLOON_F_FREE_PAGE_HINT(3) ] The device has support for free
> +page hinting. A virtqueue for providing hints as to what memory is
> +currently free is present. Configuration field free_page_hint_cmd_id
> +is valid.

\field{free_page_hint_cmd_id} ?

>  
>  \end{description}
>  
> @@ -5042,13 +5049,17 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  VIRTIO_BALLOON_F_MUST_TELL_HOST is not negotiated.
>  
>  \subsection{Device configuration layout}\label{sec:Device Types / Memory 
> Balloon Device / Device configuration layout}
> -  Both fields of this configuration
> -  are always available.
> +  \field{num_pages} and \field{actual} are always available.
> +
> +  \field{free_page_hint_cmd_id} is available if
> +VIRTIO_BALLOON_F_FREE_PAGE_HINT has been negotiated and is read-only by
> +the driver.
>  
>  \begin{lstlisting}
>  struct virtio_balloon_config {
>  le32 num_pages;
>  le32 actual;
> +le32 free_page_hint_cmd_id;
>  };
>  \end{lstlisting}
>  
> @@ -5072,9 +5083,15 @@ \subsection{Device Initialization}\label{sec:Device 
> Types / Memory Balloon Devic
>\begin{enumerate}
>\item Identify the stats virtqueue.
>\item Add one empty buffer to the stats virtqueue.
> -  \item DRIVER_OK is set: device operation begins.
> -  \item Notify the device about the stats virtqueue buffer.
>\end{enumerate}
> +
> +\item If the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature bit is negotiated, the
> +  free_page_vq is identified.
> +
> +\item DRIVER_OK is set: device operation begins.
> +
> +\item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated, then
> +  notify the device about the stats virtqueue buffer.
>  \end{enumerate}
>  
>  \subsection{Device Operation}\label{sec:Device Types / Memory Balloon Device 
> / Device Operation}
> @@ -5345,6 +5362,105 @@ \subsubsection{Memory Statistics 
> Tags}\label{sec:Device Types / Memory Balloon D
>allocations in the guest.
>  \end{description}
>  
> +\subsubsection{Free Page Hinting}\label{sec:Device Types / Memory Balloon 
> Device / Device Operation / Free Page Hinting}
> +
> +Free page hinting is designed to be used during migration to determine what
> +pages within the guest are currently unused so that they can be skipped over
> +while migrating the guest. The device will indicate that it is ready to start
> +performing hinting by setting the \field{free_page_hint_cmd_id} to one of the
> +non-reserved values that can be used as a command ID. The following values
> +are reserved:
> +
> +\begin{description}
> +\item[VIRTIO_BALLOON_CMD_ID_STOP (0)] Any command ID previously supplied by
> +  the device is invalid. The driver should halt all hinting until a new
> +  command ID is supplied.

Maybe "The driver should stop hinting free pages, but should not reuse
all previously hinted pages."

> +
> +\item[VIRTIO_BALLOON_CMD_ID_DONE (1)] Any command ID previously supplied by
> +  the device is invalid. The driver should halt all hinting and the hinting
> +  balloon can now be deflated returning all pages to the guest.

I would avoid the terminology "hinting balloon" and "deflation".

"The driver should stop hinting free pages and should reuse all
previously hinted pages.".

> +\end{description}
> +
> +A request for free page hinting proceeds as follows:
> +
> +\begin{enumerate}
> +
> +\item The driver examines the \field{free_page_hint_cmd_id} configuration 
> field.
> +  If it contains a non-reserved 

[virtio-dev] Re: [virtio-comment] [PATCH 2/3] content: Document balloon feature page poison

2020-05-15 Thread David Hildenbrand
On 08.05.20 19:16, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Page poison provides a way for the guest to notify the host of the content
> expected to be found in pages when they are added back to the guest after
> being discarded. The feature currently doesn't apply to the existing
> balloon features, however it will apply to an upcoming feature, free page
> reporting. Add documentation for the page poison feature describing the
> basic functionality and requirements.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  content.tex |   45 +
>  1 file changed, 45 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 7d91604178fd..e154948a9a1a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -5026,6 +5026,9 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  page hinting. A virtqueue for providing hints as to what memory is
>  currently free is present. Configuration field free_page_hint_cmd_id
>  is valid.
> +\item[ VIRTIO_BALLOON_F_PAGE_POISON(4) ] Host has to be notified if guest
> +is expecting reported pages to contain a certain value when returned.
> +Configuration field poison_val is valid.
>  
>  \end{description}
>  
> @@ -5033,6 +5036,9 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Memory Balloon Device / Featu
>  The driver SHOULD accept the VIRTIO_BALLOON_F_MUST_TELL_HOST
>  feature if offered by the device.
>  
> +The driver SHOULD clear the VIRTIO_BALLOON_F_PAGE_POISON flag if it is not
> +expecting any specific value to be stored in the page.
> +
>  \devicenormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon 
> Device / Feature bits}
>  If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature
>  bit, and if the driver did not accept this feature bit, the
> @@ -5055,11 +5061,15 @@ \subsection{Device configuration 
> layout}\label{sec:Device Types / Memory Balloon
>  VIRTIO_BALLOON_F_FREE_PAGE_HINT has been negotiated and is read-only by
>  the driver.
>  
> +  \field{poison_val} is available if VIRTIO_BALLOON_F_PAGE_POISON has been
> +negotiated.
> +
>  \begin{lstlisting}
>  struct virtio_balloon_config {
>  le32 num_pages;
>  le32 actual;
>  le32 free_page_hint_cmd_id;
> +le32 poison_val;
>  };
>  \end{lstlisting}
>  
> @@ -5088,6 +5098,9 @@ \subsection{Device Initialization}\label{sec:Device 
> Types / Memory Balloon Devic
>  \item If the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature bit is negotiated the
>free_page_vq is identified.
>  
> +\item If the VIRTIO_BALLOON_F_PAGE_POISON feature bit is negotiated then
> +  the driver MUST update the poison_val configuration field.
> +
>  \item DRIVER_OK is set: device operation begins.
>  
>  \item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated then
> @@ -5461,6 +5474,38 @@ \subsubsection{Free Page Hinting}\label{sec:Device 
> Types / Memory Balloon Device
>  The device MAY NOT modify the contents of the balloon after
>  \field{free_page_hint_cmd_id} is set to VIRTIO_BALLOON_CMD_ID_DONE.
>  
> +\subsubsection{Page Poison}\label{sec:Device Types / Memory Balloon Device / 
> Device Operation / Page Poison}
> +
> +Page Poison provides a way to notify the host of the contents that are
> +currently in the balloon pages, and those that are expected to be in the
> +pages when they are pulled from the balloon. It is used for in-place

"when they are pulled from the balloon". That's not correct. This only
applies to free page reporting (-> patch #3).

Without free page reporting, poisoning only tells the hypervisor that
pages that are getting deflated might immediately be written by
the hypervisor again.

Or am I missing something?

> +reporting of pages without needing to pull them from the memory allocator
> +of the guest.
> +
> +If VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated, the guest
> +will place the expected poison value in \field{poison_val} configuration
> +data.
> +
> +\drivernormative{\paragraph}{Page Poison}{Device Types / Memory Balloon 
> Device / Device Operation / Page Poison}
> +
> +Normative statements in this section apply if the
> +VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated.
> +
> +The driver MUST populate the \field{poison_val} configuration data if it is
> +expecting the page to contain some fixed value when free.

Again, not correctly phrased I think. Only applies to free page reporting.

> +
> +The driver MAY opt to disable the feature if it will take care of
> +re-initializing pages when first accessing them.
> +
> +\devicenormative{\paragraph}{Page Poison}{Device Types / Memory Balloon 
> Device / Device Operation / Page Poison}
> +
> +Normative statements in this section apply if the
> +VIRTIO_BALLOON_F_PAGE_POISON feature has been negotiated.
> +
> +The device MAY ignore the \field{poison_val} for normal balloon operations 
> and
> +free page hinting as this feature did not exist prior to 

[virtio-dev] Re: [PATCH 0/3] virtio-spec: Add documentation for recently added balloon features

2020-05-15 Thread David Hildenbrand
On 15.05.20 19:06, Alexander Duyck wrote:
> On Mon, May 11, 2020 at 5:44 AM David Hildenbrand  wrote:
>>
>> On 11.05.20 14:38, Cornelia Huck wrote:
>>> On Fri, 08 May 2020 10:16:14 -0700
>>> Alexander Duyck  wrote:
>>>
>>>> This patch set is meant to add documentation for balloon features that have
>>>> been recently added to the Linux kernel[1,2] and that we are currently
>>>> working on adding to QEMU[3].
>>>>
>>>> Changes since RFC:
>>>> Incorporated suggestions from Cornelia Huck
>>>> Fixed a few additional spelling errors
>>>>
>>>> [1]: 
>>>> https://lore.kernel.org/lkml/20200211224416.29318.44077.stgit@localhost.localdomain/
>>>> [2]: 
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0c504f154718904ae49349147e3b7e6ae91ffdc
>>>> [3]: https://lists.oasis-open.org/archives/virtio-dev/202004/msg00180.html
>>>>
>>>> ---
>>>>
>>>> Alexander Duyck (3):
>>>>   content: Document balloon feature free page hints
>>>>   content: Document balloon feature page poison
>>>>   content: Document balloon feature free page reporting
>>>>
>>>>
>>>>  content.tex |  248 
>>>> ++-
>>>>  1 file changed, 242 insertions(+), 6 deletions(-)
>>>>
>>>> --
>>>>
>>>
>>> I think this has moved a lot into the right direction; but the patches
>>> would really benefit from review by someone more familiar with the
>>> balloon than me.
>>
>> On my list, will have a look this week.
> 
> Any ETA on when you might be able to get to that review? I'm just
> considering if I should submit v2 with your tweak and the suggestions
> from Cornelia or if I should wait for your feedback.

I wanted to do it today but got distracted by other stuff :(

Please send a v2, high prio for early next week, thanks!

-- 
Thanks,

David / dhildenb





[virtio-dev] [PATCH v4 16/15] virtio-mem: Don't rely on implicit compiler padding for requests

2020-05-15 Thread David Hildenbrand
The compiler will add padding after the last member; make that explicit.
The size of a request is always 24 bytes. The size of a response is always
10 bytes. Add compile-time checks.

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Cc: teawater 
Signed-off-by: David Hildenbrand 
---

Something I noticed while working on the spec (which proves that writing a
virtio-spec makes sense :) ).

---
 drivers/virtio/virtio_mem.c | 3 +++
 include/uapi/linux/virtio_mem.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 9e523db3bee1..f658fe9149be 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1770,6 +1770,9 @@ static int virtio_mem_probe(struct virtio_device *vdev)
struct virtio_mem *vm;
int rc = -EINVAL;
 
+   BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
+   BUILD_BUG_ON(sizeof(struct virtio_mem_resp) != 10);
+
vdev->priv = vm = kzalloc(sizeof(*vm), GFP_KERNEL);
if (!vm)
return -ENOMEM;
diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h
index e0a9dc7397c3..a455c488a995 100644
--- a/include/uapi/linux/virtio_mem.h
+++ b/include/uapi/linux/virtio_mem.h
@@ -103,16 +103,19 @@
 struct virtio_mem_req_plug {
__virtio64 addr;
__virtio16 nb_blocks;
+   __virtio16 padding[3];
 };
 
 struct virtio_mem_req_unplug {
__virtio64 addr;
__virtio16 nb_blocks;
+   __virtio16 padding[3];
 };
 
 struct virtio_mem_req_state {
__virtio64 addr;
__virtio16 nb_blocks;
+   __virtio16 padding[3];
 };
 
 struct virtio_mem_req {
-- 
2.25.4
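
The size arithmetic behind the two BUILD_BUG_ON() checks can be reproduced outside the kernel with Python's ctypes, which follows the platform C ABI for structure layout. This is a minimal sketch: the class names are made up for the example, and the request/response header layout (a 16-bit type followed by three 16-bit padding words) is an assumption, since the diff above does not show struct virtio_mem_req or struct virtio_mem_resp themselves.

```python
import ctypes


class ReqPlugImplicit(ctypes.Structure):
    """Plug request as it was before the patch: tail padding left to the ABI."""
    _fields_ = [
        ("addr", ctypes.c_uint64),
        ("nb_blocks", ctypes.c_uint16),
    ]


class ReqPlug(ctypes.Structure):
    """Plug request with the tail padding spelled out, as in the patch."""
    _fields_ = [
        ("addr", ctypes.c_uint64),
        ("nb_blocks", ctypes.c_uint16),
        ("padding", ctypes.c_uint16 * 3),
    ]


class ReqUnion(ctypes.Union):
    # The unplug and state requests in the diff have the same layout as plug,
    # so a single member is enough to size the union.
    _fields_ = [("plug", ReqPlug)]


class Req(ctypes.Structure):
    # Assumption: header layout is not shown in the diff above.
    _fields_ = [
        ("type", ctypes.c_uint16),
        ("padding", ctypes.c_uint16 * 3),
        ("u", ReqUnion),
    ]


class Resp(ctypes.Structure):
    # Assumption: same header, followed by a single 16-bit state field.
    _fields_ = [
        ("type", ctypes.c_uint16),
        ("padding", ctypes.c_uint16 * 3),
        ("state", ctypes.c_uint16),
    ]


def layout_sizes():
    """Return the sizes in bytes of the sketched structures."""
    return {
        "plug_implicit": ctypes.sizeof(ReqPlugImplicit),
        "plug_explicit": ctypes.sizeof(ReqPlug),
        "req": ctypes.sizeof(Req),
        "resp": ctypes.sizeof(Resp),
    }
```

The explicit variants add up to 8 + 2 + 6 = 16 bytes per payload, 8 + 16 = 24 bytes per request and 2 + 6 + 2 = 10 bytes per response regardless of ABI, whereas the size of the implicit variant depends on how the platform aligns 64-bit members, which is the point of making the padding explicit.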





Re: [virtio-dev] [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-14 Thread David Hildenbrand
On 14.05.20 13:47, David Hildenbrand wrote:
> On 14.05.20 13:10, David Hildenbrand wrote:
>> On 14.05.20 12:12, David Hildenbrand wrote:
>>> On 14.05.20 12:02, teawater wrote:
>>>>
>>>>
>>>>> On May 14, 2020, at 16:48, David Hildenbrand wrote:
>>>>>
>>>>> On 14.05.20 08:44, teawater wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> I got a kernel warning with v2 and v3.
>>>>>
>>>>> Hi Hui,
>>>>>
>>>>> thanks for playing with the latest versions. Surprisingly, I can
>>>>> reproduce even by hotplugging a DIMM instead as well - that's good, so
>>>>> it's not related to virtio-mem, lol. Seems to be some QEMU setup issue
>>>>> with older machine types.
>>>>>
>>>>> Can you switch to a newer qemu machine version, especially
>>>>> pc-i440fx-5.0? Both hotplugging DIMMs and virtio-mem work for me with
>>>>> that QEMU machine just fine.
>>>>
>>>> I could still reproduce this issue with pc-i440fx-5.0 or pc. Did I miss anything?
>>>>
>>>
>>> Below I don't even see virtio_mem. I had to repair the image (filesystem
>>> fsck) because it was broken, can you try that as well?
>>>
>>> Also, it would be great if you could test with v4.
>>>
>>
>> Correction, something seems to be broken either in QEMU or the kernel. Once I
>> define a DIMM so it's added and online during boot, I get these issues:
>>
>> (I have virtio-mem v4 installed in the guest)
>>
>> #! /bin/bash
>> sudo x86_64-softmmu/qemu-system-x86_64 \
>> -machine pc-i440fx-5.0,accel=kvm,usb=off \
>> -cpu host \
>> -no-reboot \
>> -nographic \
>> -device ide-hd,drive=hd \
>> -drive if=none,id=hd,file=/home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2,format=qcow2 \
>> -m 1g,slots=10,maxmem=2G \
>> -smp 1 \
>> -object memory-backend-ram,id=mem0,size=256m \
>> -device pc-dimm,id=dimm0,memdev=mem0 \
>> -s \
>> -monitor unix:/var/tmp/monitor,server,nowait
>>
>>
>> Without the DIMM it seems to work just fine.
>>
> 
> And another correction. 
> 
> Using QEMU v5.0.0, Linux 5.7-rc5, untouched
> Fedora-Cloud-Base-32-1.6.x86_64.qcow2, I get even without any memory hotplug:
> 
> #! /bin/bash
> sudo x86_64-softmmu/qemu-system-x86_64 \
> -machine pc-i440fx-5.0,accel=kvm,usb=off \
> -cpu host \
> -no-reboot \
> -nographic \
> -device ide-hd,drive=hd \
> -drive if=none,id=hd,file=/home/dhildenb/git/Fedora-Cloud-Base-32-1.6.x86_64.qcow2,format=qcow2 \
> -m 5g,slots=10,maxmem=6G \
> -smp 1 \
> -s \
> -kernel /home/dhildenb/git/linux/arch/x86/boot/bzImage \
> -append "console=ttyS0 rd.shell nokaslr swiotlb=noforce" \
> -monitor unix:/var/tmp/monitor,server,nowait
> 
> 
> Observe how big the initial RAM even is!
> 
> 
> So this is no DIMM/hotplug/virtio_mem issue. With memory hotplug, it seems to
> get more likely to trigger if "swiotlb=noforce" is not specified.
> 
> "swiotlb=noforce" seems to trigger some pre-existing issue here. Without
> "swiotlb=noforce", I was only able to observe this via pc-i440fx-2.1,
> 

(talking to myself :) )

I think I finally understood why using "swiotlb=noforce" with hotplugged
memory is wrong - or with memory > 3GB. Via "swiotlb=noforce" you tell
the system to "Never use bounce buffers (for debugging)". This works as
long as all memory is DMA memory (e.g., < 3GB) AFAIK.

"If specified, trying to map memory that cannot be used with DMA will
fail, and a rate-limited warning will be printed."

Hotplugged memory (under QEMU) is never added below 4GB, because of the
PCI hole. So both, memory from DIMMs and from virtio-mem will end up at
or above 4GB. To make a device use that memory, you need bounce buffers.

Hotplugged memory is never DMA memory.
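
The reasoning above can be sketched as a simple range check. This is a userspace illustration, not the kernel's implementation; the helper name is hypothetical, and the 32-bit mask is an assumption (the mask value in the quoted warning is not legible in this archive).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the buffer [addr, addr + size) can be
 * reached by a device limited to the given DMA mask. Without bounce
 * buffers, a mapping outside the mask has to fail.
 */
static bool dma_range_fits(uint64_t addr, uint64_t size, uint64_t mask)
{
    uint64_t end = addr + size - 1;   /* last byte of the buffer */

    if (size == 0 || end < addr)      /* empty range or wrap-around */
        return false;
    return end <= mask;
}
```

With a 32-bit mask, a buffer at 0x10fc14000 (the address from the reported warning) fails this check, while any buffer that ends below 4GB passes.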

-- 
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-14 Thread David Hildenbrand
On 14.05.20 13:10, David Hildenbrand wrote:
> On 14.05.20 12:12, David Hildenbrand wrote:
>> On 14.05.20 12:02, teawater wrote:
>>>
>>>
>>>> On 14 May 2020, at 16:48, David Hildenbrand wrote:
>>>>
>>>> On 14.05.20 08:44, teawater wrote:
>>>>> Hi David,
>>>>>
>>>>> I got a kernel warning with v2 and v3.
>>>>
>>>> Hi Hui,
>>>>
>>>> thanks for playing with the latest versions. Surprisingly, I can
>>>> reproduce even by hotplugging a DIMM instead as well - that's good, so
>>>> it's not related to virtio-mem, lol. Seems to be some QEMU setup issue
>>>> with older machine types.
>>>>
>>>> Can you switch to a newer qemu machine version, especially
>>>> pc-i440fx-5.0? Both, hotplugging DIMMs and virtio-mem works for me with
>>>> that QEMU machine just fine.
>>>
>>> I still could reproduce this issue with pc-i440fx-5.0 or pc. Did I miss anything?
>>>
>>
>> Below I don't even see virtio_mem. I had to repair the image (filesystem
>> fsck) because it was broken, can you try that as well?
>>
>> Also, it would be great if you could test with v4.
>>
> 
> Correction, something seems to be broken either in QEMU or the kernel. Once I
> define a DIMM so it's added and online during boot, I get these issues:
> 
> (I have virtio-mem v4 installed in the guest)
> 
> #! /bin/bash
> sudo x86_64-softmmu/qemu-system-x86_64 \
> -machine pc-i440fx-5.0,accel=kvm,usb=off \
> -cpu host \
> -no-reboot \
> -nographic \
> -device ide-hd,drive=hd \
> -drive if=none,id=hd,file=/home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2,format=qcow2 \
> -m 1g,slots=10,maxmem=2G \
> -smp 1 \
> -object memory-backend-ram,id=mem0,size=256m \
> -device pc-dimm,id=dimm0,memdev=mem0 \
> -s \
> -monitor unix:/var/tmp/monitor,server,nowait
> 
> 
> Without the DIMM it seems to work just fine.
> 

And another correction. 

Using QEMU v5.0.0, Linux 5.7-rc5, untouched
Fedora-Cloud-Base-32-1.6.x86_64.qcow2, I get even without any memory hotplug:

#! /bin/bash
sudo x86_64-softmmu/qemu-system-x86_64 \
-machine pc-i440fx-5.0,accel=kvm,usb=off \
-cpu host \
-no-reboot \
-nographic \
-device ide-hd,drive=hd \
-drive if=none,id=hd,file=/home/dhildenb/git/Fedora-Cloud-Base-32-1.6.x86_64.qcow2,format=qcow2 \
-m 5g,slots=10,maxmem=6G \
-smp 1 \
-s \
-kernel /home/dhildenb/git/linux/arch/x86/boot/bzImage \
-append "console=ttyS0 rd.shell nokaslr swiotlb=noforce" \
-monitor unix:/var/tmp/monitor,server,nowait


Observe how big the initial RAM even is!


So this is no DIMM/hotplug/virtio_mem issue. With memory hotplug, it seems to
get more likely to trigger if "swiotlb=noforce" is not specified.

"swiotlb=noforce" seems to trigger some pre-existing issue here. Without
"swiotlb=noforce", I was only able to observe this via pc-i440fx-2.1,

-- 
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-14 Thread David Hildenbrand
On 14.05.20 12:02, teawater wrote:
> 
> 
>> On 14 May 2020, at 16:48, David Hildenbrand wrote:
>>
>> On 14.05.20 08:44, teawater wrote:
>>> Hi David,
>>>
>>> I got a kernel warning with v2 and v3.
>>
>> Hi Hui,
>>
>> thanks for playing with the latest versions. Surprisingly, I can
>> reproduce even by hotplugging a DIMM instead as well - that's good, so
>> it's not related to virtio-mem, lol. Seems to be some QEMU setup issue
>> with older machine types.
>>
>> Can you switch to a newer qemu machine version, especially
>> pc-i440fx-5.0? Both, hotplugging DIMMs and virtio-mem works for me with
>> that QEMU machine just fine.
> 
> I still could reproduce this issue with pc-i440fx-5.0 or pc. Did I miss anything?
> 

Below I don't even see virtio_mem. I had to repair the image (filesystem
fsck) because it was broken, can you try that as well?

Also, it would be great if you could test with v4.

-- 
Thanks,

David / dhildenb





Re: [virtio-dev] [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-14 Thread David Hildenbrand
On 14.05.20 08:44, teawater wrote:
> Hi David,
> 
> I got a kernel warning with v2 and v3.

Hi Hui,

thanks for playing with the latest versions. Surprisingly, I can
reproduce even by hotplugging a DIMM instead as well - that's good, so
it's not related to virtio-mem, lol. Seems to be some QEMU setup issue
with older machine types.

Can you switch to a newer qemu machine version, especially
pc-i440fx-5.0? Both, hotplugging DIMMs and virtio-mem works for me with
that QEMU machine just fine.

What also seems to make it work with pc-i440fx-2.1, is giving the
machine 4G of initial memory (-m 4g,slots=10,maxmem=5G).

Cheers!


> // Start a QEMU built from
> // https://github.com/davidhildenbrand/qemu/tree/virtio-mem-v2 and set up a
> // file as an IDE disk.
> /home/teawater/qemu/qemu/x86_64-softmmu/qemu-system-x86_64 \
>   -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu host -no-reboot -nographic \
>   -device ide-hd,drive=hd \
>   -drive if=none,id=hd,file=/home/teawater/old.img,format=raw \
>   -kernel /home/teawater/kernel/bk2/arch/x86/boot/bzImage \
>   -append "console=ttyS0 root=/dev/sda nokaslr swiotlb=noforce" \
>   -m 1g,slots=10,maxmem=2G -smp 1 -s \
>   -monitor unix:/home/teawater/qemu/m,server,nowait
> 
> // Set up virtio-mem and plug 256M of memory in the QEMU monitor:
> object_add memory-backend-ram,id=mem1,size=256m
> device_add virtio-mem-pci,id=vm0,memdev=mem1
> qom-set vm0 requested-size 256M
> 
> // Go back to the terminal; accessing the file system triggers the following
> // kernel warning:
> [   19.515549] pci :00:04.0: [1af4:1015] type 00 class 0x00ff00
> [   19.516227] pci :00:04.0: reg 0x10: [io  0x-0x007f]
> [   19.517196] pci :00:04.0: BAR 0: assigned [io  0x1000-0x107f]
> [   19.517843] virtio-pci :00:04.0: enabling device ( -> 0001)
> [   19.535957] PCI Interrupt Link [LNKD] enabled at IRQ 11
> [   19.536507] virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> [   19.537528] virtio_mem virtio0: start address: 0x1
> [   19.538094] virtio_mem virtio0: region size: 0x1000
> [   19.538621] virtio_mem virtio0: device block size: 0x20
> [   19.539186] virtio_mem virtio0: memory block size: 0x800
> [   19.539752] virtio_mem virtio0: subblock size: 0x40
> [   19.540357] virtio_mem virtio0: plugged size: 0x0
> [   19.540834] virtio_mem virtio0: requested size: 0x0
> [   20.170441] virtio_mem virtio0: plugged size: 0x0
> [   20.170933] virtio_mem virtio0: requested size: 0x1000
> [   20.172247] Built 1 zonelists, mobility grouping on.  Total pages: 266012
> [   20.172955] Policy zone: Normal
> 
> / # ls
> [   26.724565] [ cut here ]
> [   26.725047] ata_piix :00:01.1: DMA addr 0x00010fc14000+49152 
> overflow (mask , bus limit 0).
> [   26.726024] WARNING: CPU: 0 PID: 179 at 
> /home/teawater/kernel/linux2/kernel/dma/direct.c:364 
> dma_direct_map_page+0x118/0x130
> [   26.727141] Modules linked in:
> [   26.727456] CPU: 0 PID: 179 Comm: ls Not tainted 5.6.0-rc5-next-20200311+ 
> #9
> [   26.728163] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [   26.729305] RIP: 0010:dma_direct_map_page+0x118/0x130
> [   26.729825] Code: 8b 1f e8 3b 70 59 00 48 8d 4c 24 08 48 89 c6 4c 89 2c 24 
> 4d 89 e1 49 89 e8 48 89 da 48 c7 c7 08 6c 34 82 31 c0 e8 d8 8e f7 ff <00
> [   26.731683] RSP: :c9213838 EFLAGS: 00010082
> [   26.732205] RAX:  RBX: 88803ebeb1b0 RCX: 
> 82665148
> [   26.732913] RDX: 0001 RSI: 0092 RDI: 
> 0046
> [   26.733621] RBP: c000 R08: 01df R09: 
> 01df
> [   26.734338] R10:  R11: c92135a8 R12: 
> 
> [   26.735054] R13:  R14:  R15: 
> 88803d55f5b0
> [   26.735772] FS:  024e9880() GS:88803ec0() 
> knlGS:
> [   26.736579] CS:  0010 DS:  ES:  CR0: 80050033
> [   26.737162] CR2: 005bfc7f CR3: 000107e12004 CR4: 
> 00360ef0
> [   26.737879] DR0:  DR1:  DR2: 
> 
> [   26.738591] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   26.739307] Call Trace:
> [   26.739564]  dma_direct_map_sg+0x64/0xb0
> [   26.739969]  ? ata_scsi_write_same_xlat+0x350/0x350
> [   26.740461]  ata_qc_issue+0x214/0x260
> [   26.740839]  ata_scsi_queuecmd+0x16a/0x490
> [   26.741255]  scsi_queue_rq+0x679/0xa60
> [   26.741639]  blk_mq_dispatch_rq_list+0x90/0x510
> [   26.742099]  ? elv_rb_del+0x1f/0x30
> [   26.742456]  ? deadline_remove_request+0x6a/0xb0
> [   26.742926]  blk_mq_do_dispatch_sched+0x78/0x100
> [   26.743397]  blk_mq_sched_dispatch_requests+0xf9/0x170
> [   26.743924]  __blk_mq_run_hw_queue+0x7e/0x130
> [   26.744365]  __blk_mq_delay_run_hw_queue+0x107/0x150
> [   26.744874]  blk_mq_run_hw_queue+0x61/0x100
> [   26.745299]  

[virtio-dev] Re: [RFC v3 for QEMU] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT

2020-05-13 Thread David Hildenbrand
On 12.05.20 11:41, Hui Zhu wrote:

This description needs an overhaul, it's hard to parse.

> If the guest kernel has many fragmentation pages, use virtio_balloon
> will split THP of QEMU when it calls MADV_DONTNEED madvise to release
> the balloon pages.

This is very unclear and confusing. You will *always* split THPs when
inflating 4k pages and there are THPs around. This is completely
independent of any fragmentation in the guest. The thing you are trying
to achieve here is trying to *minimize* the number of split THPs in the
hypervisor *after* the balloon was completely inflated.

> Set option cont-pages to on will open flags VIRTIO_BALLOON_VQ_INFLATE_CONT
> and set default continuous pages order to THP order.

... and what you implement here is very x86 specific, hard-coding the
"9" as THP order.

"Set option cont-pages to on" -> 'Once the feature is enabled via
"cont-pages=on"'
"open flags" -> "unlock VIRTIO_BALLOON_VQ_INFLATE_CONT".


> Then It will get continuous pages PFN that its order is current_pages_order
> from VQ ivq use use madvise MADV_DONTNEED release the page.

I fail to parse this sentence. I assume something like

"current_pages_order is set by the guest and configures the size of the
pages communicated via the inflate/deflate queue by the guest. It
defaults to 0, which corresponds to the legacy handling without
VIRTIO_BALLOON_VQ_INFLATE_CONT - 4k pages."
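
To make the two review points concrete (the hard-coded order 9 and the unchecked max_pages_order), here is a small userspace sketch. All three helper names are hypothetical and not part of QEMU; the only value taken as given is the 4k balloon page size (VIRTIO_BALLOON_PFN_SHIFT is 12).

```c
#include <stdbool.h>
#include <stddef.h>

#define BALLOON_PAGE_SIZE 4096u   /* 1 << VIRTIO_BALLOON_PFN_SHIFT */

/*
 * Hypothetical helper: the order for which
 * BALLOON_PAGE_SIZE << order == huge_page_size, or -1 if the size is not a
 * power-of-two multiple of the balloon page size.
 */
static int order_for_huge_page_size(size_t huge_page_size)
{
    size_t pages = huge_page_size / BALLOON_PAGE_SIZE;
    int order = 0;

    if (huge_page_size % BALLOON_PAGE_SIZE || pages == 0 ||
        (pages & (pages - 1)))
        return -1;
    while (pages > 1) {
        pages >>= 1;
        order++;
    }
    return order;
}

/* Hypothetical check that a guest-supplied order respects the device limit. */
static bool pages_order_valid(int current_order, int max_order)
{
    return current_order >= 0 && current_order <= max_order;
}

/* Bytes described by one PFN on the inflate queue; validate the order first. */
static size_t inflate_size_for_order(int order)
{
    return (size_t)BALLOON_PAGE_SIZE << order;
}
```

A 2 MiB huge page yields order 9, which matches the constant in the patch, whereas a 512 MiB huge page (as used on arm64 with 64 KiB base pages) yields order 17. Order 0 gives the legacy 4k inflate size.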

Why is "max_pages_order" necessary *at all*? You never check against that.

I have to say, I really dislike that interface. I would much rather
prefer new inflate/deflate queues that eat variable sizes (not orders!)
- similar to the free page reporting interface - and skip things like
the pbp. Not sure if this interface is really what MST asked for.

> This will handle the THP split issue.
> 
> Signed-off-by: Hui Zhu 
> ---
>  hw/virtio/virtio-balloon.c  | 77 
> +
>  include/hw/virtio/virtio-balloon.h  |  2 +
>  include/standard-headers/linux/virtio_balloon.h |  5 ++
>  3 files changed, 60 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a4729f7..84d47d3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,7 @@
>  #include "hw/virtio/virtio-access.h"
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define CONT_PAGES_ORDER   9
>  
>  typedef struct PartiallyBalloonedPage {
>  ram_addr_t base_gpa;
> @@ -72,6 +73,8 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>  RAMBlock *rb;
>  size_t rb_page_size;
>  int subpages;
> +size_t inflate_size = BALLOON_PAGE_SIZE << balloon->current_pages_order;
> +int pages_num;

reverse christmas tree please. squash same types into a single line if
possible.

>  
>  /* XXX is there a better way to get to the RAMBlock than via a
>   * host address? */
> @@ -81,7 +84,7 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>  if (rb_page_size == BALLOON_PAGE_SIZE) {
>  /* Easy case */
>  
> -ram_block_discard_range(rb, rb_offset, rb_page_size);
> +ram_block_discard_range(rb, rb_offset, inflate_size);
>  /* We ignore errors from ram_block_discard_range(), because it
>   * has already reported them, and failing to discard a balloon
>   * page is not fatal */
> @@ -99,32 +102,38 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>  
>  rb_aligned_offset = QEMU_ALIGN_DOWN(rb_offset, rb_page_size);
>  subpages = rb_page_size / BALLOON_PAGE_SIZE;
> -base_gpa = memory_region_get_ram_addr(mr) + mr_offset -
> -   (rb_offset - rb_aligned_offset);
>  
> -if (pbp->bitmap && !virtio_balloon_pbp_matches(pbp, base_gpa)) {
> -/* We've partially ballooned part of a host page, but now
> - * we're trying to balloon part of a different one.  Too hard,
> - * give up on the old partial page */
> -virtio_balloon_pbp_free(pbp);
> -}
> +for (pages_num = inflate_size / BALLOON_PAGE_SIZE;
> + pages_num > 0; pages_num--) {
> +base_gpa = memory_region_get_ram_addr(mr) + mr_offset -
> +   (rb_offset - rb_aligned_offset);
>  
> -if (!pbp->bitmap) {
> -virtio_balloon_pbp_alloc(pbp, base_gpa, subpages);
> -}
> +if (pbp->bitmap && !virtio_balloon_pbp_matches(pbp, base_gpa)) {
> +/* We've partially ballooned part of a host page, but now
> +* we're trying to balloon part of a different one.  Too hard,
> +* give up on the old partial page */
> +virtio_balloon_pbp_free(pbp);
> +}
>  
> -set_bit((rb_offset - rb_aligned_offset) / BALLOON_PAGE_SIZE,
> -pbp->bitmap);
> +if (!pbp->bitmap) {
> +virtio_balloon_pbp_alloc(pbp, base_gpa, subpages);
> +}
>  
> -if (bitmap_full(pbp->bitmap, subpages)) {
> -/* We've accumulated a full 

[virtio-dev] Re: [PATCH 0/3] virtio-spec: Add documentation for recently added balloon features

2020-05-11 Thread David Hildenbrand
On 11.05.20 14:38, Cornelia Huck wrote:
> On Fri, 08 May 2020 10:16:14 -0700
> Alexander Duyck  wrote:
> 
>> This patch set is meant to add documentation for balloon features that have
>> been recently added to the Linux kernel[1,2] and that we are currently
>> working on adding to QEMU[3].
>>
>> Changes since RFC:
>> Incorporated suggestions from Cornelia Huck
>> Fixed a few additional spelling errors
>>
>> [1]: https://lore.kernel.org/lkml/20200211224416.29318.44077.stgit@localhost.localdomain/
>> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0c504f154718904ae49349147e3b7e6ae91ffdc
>> [3]: https://lists.oasis-open.org/archives/virtio-dev/202004/msg00180.html
>>
>> ---
>>
>> Alexander Duyck (3):
>>   content: Document balloon feature free page hints
>>   content: Document balloon feature page poison
>>   content: Document balloon feature free page reporting
>>
>>
>>  content.tex |  248 
>> ++-
>>  1 file changed, 242 insertions(+), 6 deletions(-)
>>
>> --
>>
> 
> I think this has moved a lot into the right direction; but the patches
> would really benefit from review by someone more familiar with the
> balloon than me.

On my list, will have a look this week.

Minor nit I spotted: Patch #2 should not document things (e.g., how
poisoning interacts with reported pages), before the free reporting
feature is actually introduced in patch #3.

BTW: Thanks Alex for tackling this!

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v23 QEMU 0/5] virtio-balloon: add support for page poison reporting and free page reporting

2020-05-08 Thread David Hildenbrand



> On 08.05.2020 at 19:31, Alexander Duyck wrote:
> 
> I just wanted to follow up since it has been a little over a week
> since I submitted this and I haven't heard anything back. It looks
> like the linux-headers patches can be dropped since the headers appear
> to have been synced. I was wondering if I should resubmit with just
> the 3 patches that are adding the functionality, or if this patch-set
> is good as-is?

Should be good as-is. However, if the new compat machines are already upstream, 
you might want to tackle that right away.

Cheers and have a nice weekend!

> 
> Thanks.
> 
> - Alex
> 
>> On Mon, Apr 27, 2020 at 5:53 PM Alexander Duyck
>>  wrote:
>> 
>> This series provides an asynchronous means of reporting free guest pages
>> to QEMU through virtio-balloon so that the memory associated with those
>> pages can be dropped and reused by other processes and/or guests on the
>> host. Using this it is possible to avoid unnecessary I/O to disk and
>> greatly improve performance in the case of memory overcommit on the host.
>> 
>> I originally submitted this patch series back on February 11th 2020[1],
>> but at that time I was focused primarily on the kernel portion of this
>> patch set. However as of April 7th those patches are now included in
>> Linus's kernel tree[2] and so I am submitting the QEMU pieces for
>> inclusion.
>> 
>> [1]: https://lore.kernel.org/lkml/20200211224416.29318.44077.stgit@localhost.localdomain/
>> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0c504f154718904ae49349147e3b7e6ae91ffdc
>> 
>> Changes from v17:
>> Fixed typo in patch 1 title
>> Addressed white-space issues reported via checkpatch
>> Added braces {} for two if statements to match expected coding style
>> 
>> Changes from v18:
>> Updated patches 2 and 3 based on input from dhildenb
>> Added comment to patch 2 describing what keeps us from reporting a bad page
>> Added patch to address issue with ROM devices being directly writable
>> 
>> Changes from v19:
>> Added std-headers change to match changes pushed for linux kernel headers
>> Added patch to remove "report" from page hinting code paths
>> Updated comment to better explain why we disable hints w/ page poisoning
>> Removed code that was modifying config size for poison vs hinting
>> Dropped x-page-poison property
>> Added code to bounds check the reported region vs the RAM block
>> Dropped patch for ROM devices as that was already pulled in by Paolo
>> 
>> Changes from v20:
>> Rearranged patches to push Linux header sync patches to front
>> Removed association between free page hinting and VIRTIO_BALLOON_F_PAGE_POISON
>> Added code to enable VIRTIO_BALLOON_F_PAGE_POISON if page reporting is enabled
>> Fixed possible resource leak if poison or qemu_balloon_is_inhibited return true
>> 
>> Changes from v21:
>> Added ack for patch 3
>> Rewrote patch description for page poison reporting feature
>> Made page-poison independent property and set to enabled by default
>> Added logic to migrate poison_val
>> Added several comments in code to better explain features
>> Switched free-page-reporting property to disabled by default
>> 
>> Changes from v22:
>> Added ack for patches 4 & 5
>> Added additional comment fixes in patch 3 to remove "reporting" reference
>> Renamed rvq in patch 5 to reporting_vq to improve readability
>> Moved call adding reporting_vq to after free_page_vq to fix VQ ordering
>> 
>> ---
>> 
>> Alexander Duyck (5):
>>  linux-headers: Update to allow renaming of free_page_report_cmd_id
>>  linux-headers: update to contain virito-balloon free page reporting
>>  virtio-balloon: Replace free page hinting references to 'report' with 'hint'
>>  virtio-balloon: Implement support for page poison reporting feature
>>  virtio-balloon: Provide an interface for free page reporting
>> 
>> 
>> hw/virtio/virtio-balloon.c  |  176 
>> ++-
>> include/hw/virtio/virtio-balloon.h  |   23 ++-
>> include/standard-headers/linux/virtio_balloon.h |   12 +-
>> 3 files changed, 159 insertions(+), 52 deletions(-)
>> 
>> --
> 





[virtio-dev] [PATCH v4 09/15] virtio-mem: Offline and remove completely unplugged memory blocks

2020-05-07 Thread David Hildenbrand
Let's offline+remove memory blocks once all subblocks are unplugged. We
can use the new Linux MM interface for that. As no memory is in use
anymore, this shouldn't take a long time and shouldn't fail. There might
be corner cases where the offlining could still fail (especially, if
another notifier NACKs the offlining request).

Acked-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 47 +
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index b0b41c73ce89..a2edb87e5ed8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -446,6 +446,28 @@ static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id)
return remove_memory(nid, addr, memory_block_size_bytes());
 }
 
+/*
+ * Try to offline and remove a memory block from Linux.
+ *
+ * Must not be called with the vm->hotplug_mutex held (possible deadlock with
+ * onlining code).
+ *
+ * Will not modify the state of the memory block.
+ */
+static int virtio_mem_mb_offline_and_remove(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
+
+   dev_dbg(&vm->vdev->dev, "offlining and removing memory block: %lu\n",
+   mb_id);
+   return offline_and_remove_memory(nid, addr, memory_block_size_bytes());
+}
+
 /*
  * Trigger the workqueue so the device can perform its magic.
  */
@@ -537,7 +559,13 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm,
break;
}
 
-   /* trigger the workqueue, maybe we can now unplug memory. */
+   /*
+* Trigger the workqueue, maybe we can now unplug memory. Also,
+* when we offline and remove a memory block, this will re-trigger
+* us immediately - which is often nice because the removal of
+* the memory block (e.g., memmap) might have freed up memory
+* on other memory blocks we manage.
+*/
virtio_mem_retry(vm);
 }
 
@@ -1284,7 +1312,8 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
  * Unplug the desired number of plugged subblocks of an online memory block.
  * Will skip subblock that are busy.
  *
- * Will modify the state of the memory block.
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
  *
  * Note: Can fail after some subblocks were successfully unplugged. Can
  *   return 0 even if subblocks were busy and could not get unplugged.
@@ -1340,9 +1369,19 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
}
 
/*
-* TODO: Once all subblocks of a memory block were unplugged, we want
-* to offline the memory block and remove it.
+* Once all subblocks of a memory block were unplugged, offline and
+* remove it. This will usually not fail, as no memory is in use
+* anymore - however some other notifiers might NACK the request.
 */
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   mutex_unlock(&vm->hotplug_mutex);
+   rc = virtio_mem_mb_offline_and_remove(vm, mb_id);
+   mutex_lock(&vm->hotplug_mutex);
+   if (!rc)
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+   }
+
return 0;
 }
 
-- 
2.25.3
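
The condition that triggers the offline-and-remove step in the patch above ("all subblocks of the memory block are unplugged") can be modelled in userspace as a bitmap test. This is only a sketch: the struct, the helper names, and the fixed count of 32 subblocks per block are illustrative assumptions, not the driver's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

#define NB_SB_PER_MB 32   /* illustrative; the driver computes this at runtime */

/* One bit per subblock: 1 = plugged, 0 = unplugged. */
struct demo_mem_block {
    uint32_t sb_plugged;
};

/* True if subblocks [sb_id, sb_id + count) are all unplugged. */
static bool demo_mb_test_sb_unplugged(const struct demo_mem_block *mb,
                                      int sb_id, int count)
{
    for (int i = sb_id; i < sb_id + count; i++)
        if (mb->sb_plugged & (UINT32_C(1) << i))
            return false;
    return true;
}

/*
 * Unplug one subblock and report whether the whole block is now empty,
 * i.e. whether the caller could go on to offline and remove it.
 */
static bool demo_mb_unplug_sb(struct demo_mem_block *mb, int sb_id)
{
    mb->sb_plugged &= ~(UINT32_C(1) << sb_id);
    return demo_mb_test_sb_unplugged(mb, 0, NB_SB_PER_MB);
}
```

In the driver, the equivalent of a `true` result is followed by dropping the hotplug mutex, calling the offline-and-remove helper, and re-taking the mutex, as the hunk shows.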





[virtio-dev] [PATCH v4 15/15] virtio-mem: Try to unplug the complete online memory block first

2020-05-07 Thread David Hildenbrand
Right now, we always try to unplug single subblocks when processing an
online memory block. Let's try to unplug the complete online memory block
first, in case it is fully plugged and the unplug request is large
enough. Fallback to single subblocks in case the memory block cannot get
unplugged as a whole.

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 88 -
 1 file changed, 57 insertions(+), 31 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index abd93b778a26..9e523db3bee1 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1307,6 +1307,46 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
return 0;
 }
 
+/*
+ * Unplug the given plugged subblocks of an online memory block.
+ *
+ * Will modify the state of the memory block.
+ */
+static int virtio_mem_mb_unplug_sb_online(struct virtio_mem *vm,
+ unsigned long mb_id, int sb_id,
+ int count)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size) * count;
+   unsigned long start_pfn;
+   int rc;
+
+   start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+sb_id * vm->subblock_size);
+   rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
+   MIGRATE_MOVABLE, GFP_KERNEL);
+   if (rc == -ENOMEM)
+   /* whoops, out of memory */
+   return rc;
+   if (rc)
+   return -EBUSY;
+
+   /* Mark it as fake-offline before unplugging it */
+   virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
+   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+
+   /* Try to unplug the allocated memory */
+   rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, count);
+   if (rc) {
+   /* Return the memory to the buddy. */
+   virtio_mem_fake_online(start_pfn, nr_pages);
+   return rc;
+   }
+
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL);
+   return 0;
+}
+
 /*
  * Unplug the desired number of plugged subblocks of an online memory block.
  * Will skip subblock that are busy.
@@ -1321,16 +1361,21 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
  unsigned long mb_id,
  uint64_t *nb_sb)
 {
-   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
-   unsigned long start_pfn;
int rc, sb_id;
 
-   /*
-* TODO: To increase the performance we want to try bigger, consecutive
-* subblocks first before falling back to single subblocks. Also,
-* we should sense via something like is_mem_section_removable()
-* first if it makes sense to go ahead any try to allocate.
-*/
+   /* If possible, try to unplug the complete block in one shot. */
+   if (*nb_sb >= vm->nb_sb_per_mb &&
+   virtio_mem_mb_test_sb_plugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   rc = virtio_mem_mb_unplug_sb_online(vm, mb_id, 0,
+   vm->nb_sb_per_mb);
+   if (!rc) {
+   *nb_sb -= vm->nb_sb_per_mb;
+   goto unplugged;
+   } else if (rc != -EBUSY)
+   return rc;
+   }
+
+   /* Fallback to single subblocks. */
for (sb_id = vm->nb_sb_per_mb - 1; sb_id >= 0 && *nb_sb; sb_id--) {
/* Find the next candidate subblock */
while (sb_id >= 0 &&
@@ -1339,34 +1384,15 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
if (sb_id < 0)
break;
 
-   start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-sb_id * vm->subblock_size);
-   rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
-   MIGRATE_MOVABLE, GFP_KERNEL);
-   if (rc == -ENOMEM)
-   /* whoops, out of memory */
-   return rc;
-   if (rc)
-   /* memory busy, we can't unplug this chunk */
+   rc = virtio_mem_mb_unplug_sb_online(vm, mb_id, sb_id, 1);
+   if (rc == -EBUSY)
continue;
-
-   /* Mark it as fake-offline before unplugging it */
-   virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
-   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-
-   /* Try to unplug the allocated memory */
-   
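
The message above is cut off in this archive. As a rough userspace model of the strategy the commit message describes (try the whole memory block first, fall back to single subblocks, skip busy ones), under the assumption of 8 subblocks per block and with hypothetical names throughout:

```c
#include <errno.h>
#include <stdint.h>

#define NB_SB_PER_MB 8   /* illustrative */

struct demo_mem_block {
    uint8_t plugged;   /* bit i set: subblock i is plugged */
    uint8_t busy;      /* bit i set: subblock i cannot be unplugged right now */
};

/* Stand-in for the per-range unplug helper: all-or-nothing for the range. */
static int demo_unplug_range(struct demo_mem_block *mb, int sb_id, int count)
{
    uint8_t mask = (uint8_t)(((1u << count) - 1) << sb_id);

    if (mb->busy & mask)
        return -EBUSY;
    mb->plugged &= (uint8_t)~mask;
    return 0;
}

/* Unplug up to *nb_sb subblocks: whole block first, then one at a time. */
static int demo_unplug_any(struct demo_mem_block *mb, uint64_t *nb_sb)
{
    int rc, sb_id;

    /* If possible, try to unplug the complete block in one shot. */
    if (*nb_sb >= NB_SB_PER_MB && mb->plugged == 0xff) {
        rc = demo_unplug_range(mb, 0, NB_SB_PER_MB);
        if (!rc) {
            *nb_sb -= NB_SB_PER_MB;
            return 0;
        }
        if (rc != -EBUSY)
            return rc;
    }

    /* Fall back to single subblocks, highest first. */
    for (sb_id = NB_SB_PER_MB - 1; sb_id >= 0 && *nb_sb; sb_id--) {
        if (!(mb->plugged & (1u << sb_id)))
            continue;           /* not plugged, nothing to do */
        rc = demo_unplug_range(mb, sb_id, 1);
        if (rc == -EBUSY)
            continue;           /* skip busy subblocks */
        if (rc)
            return rc;
        (*nb_sb)--;
    }
    return 0;
}
```

With one busy subblock, the one-shot attempt fails with -EBUSY and the fallback loop still unplugs the other seven, leaving the request one subblock short.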

[virtio-dev] [PATCH v4 12/15] virtio-mem: Drop manual check for already present memory

2020-05-07 Thread David Hildenbrand
Registering our parent resource will fail if any memory is still present
(e.g., because somebody unloaded the driver and tries to reload it). No
need for the manual check.

Move our "unplug all" handling to after registering the resource.

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 55 -
 1 file changed, 12 insertions(+), 43 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 80cdb9e6b3c4..8dd57b61b09b 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1616,23 +1616,6 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
return 0;
 }
 
-/*
- * Test if any memory in the range is present in Linux.
- */
-static bool virtio_mem_any_memory_present(unsigned long start,
- unsigned long size)
-{
-   const unsigned long start_pfn = PFN_DOWN(start);
-   const unsigned long end_pfn = PFN_UP(start + size);
-   unsigned long pfn;
-
-   for (pfn = start_pfn; pfn != end_pfn; pfn++)
-   if (present_section_nr(pfn_to_section_nr(pfn)))
-   return true;
-
-   return false;
-}
-
 static int virtio_mem_init(struct virtio_mem *vm)
 {
const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
@@ -1664,32 +1647,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
virtio_cread(vm->vdev, struct virtio_mem_config, region_size,
 &vm->region_size);
 
-   /*
-* If we still have memory plugged, we might have to unplug all
-* memory first. However, if somebody simply unloaded the driver
-* we would have to reinitialize the old state - something we don't
-* support yet. Detect if we have any memory in the area present.
-*/
-   if (vm->plugged_size) {
-   uint64_t usable_region_size;
-
-   virtio_cread(vm->vdev, struct virtio_mem_config,
-usable_region_size, &usable_region_size);
-
-   if (virtio_mem_any_memory_present(vm->addr,
- usable_region_size)) {
-   dev_err(&vm->vdev->dev,
-   "reloading the driver is not supported\n");
-   return -EINVAL;
-   }
-   /*
-* Note: it might happen that the device is busy and
-* unplugging all memory might take some time.
-*/
-   dev_info(&vm->vdev->dev, "unplugging all memory required\n");
-   vm->unplug_all_required = 1;
-   }
-
/*
 * We always hotplug memory in memory block granularity. This way,
 * we have to wait for exactly one memory block to online.
@@ -1760,6 +1717,8 @@ static int virtio_mem_create_resource(struct virtio_mem *vm)
if (!vm->parent_resource) {
kfree(name);
dev_warn(&vm->vdev->dev, "could not reserve device region\n");
+   dev_info(&vm->vdev->dev,
+"reloading the driver is not supported\n");
return -EBUSY;
}
 
@@ -1816,6 +1775,16 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
+   /*
+* If we still have memory plugged, we have to unplug all memory first.
+* Registering our parent resource makes sure that this memory isn't
+* actually in use (e.g., trying to reload the driver).
+*/
+   if (vm->plugged_size) {
+   vm->unplug_all_required = 1;
+   dev_info(&vm->vdev->dev, "unplugging all memory is required\n");
+   }
+
/* register callbacks */
vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
rc = register_memory_notifier(&vm->memory_notifier);
-- 
2.25.3





[virtio-dev] [PATCH v4 11/15] virtio-mem: Add parent resource for all added "System RAM"

2020-05-07 Thread David Hildenbrand
Let's add a parent resource, named after the virtio device (inspired by
drivers/dax/kmem.c). This allows user space to identify which memory
belongs to which virtio-mem device.

With this change and two virtio-mem devices:
:/# cat /proc/iomem
-0fff : Reserved
1000-0009fbff : System RAM
[...]
14000-333ff : virtio0
  14000-147ff : System RAM
  14800-14fff : System RAM
  15000-157ff : System RAM
[...]
33400-3033ff : virtio1
  33800-33fff : System RAM
  34000-347ff : System RAM
  34800-34fff : System RAM
[...]

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 52 -
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index eb4c16d634e0..80cdb9e6b3c4 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -99,6 +99,9 @@ struct virtio_mem {
/* Id of the next memory bock to prepare when needed. */
unsigned long next_mb_id;
 
+   /* The parent resource for all memory added via this device. */
+   struct resource *parent_resource;
+
/* Summary of all memory block states. */
unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
 #define VIRTIO_MEM_NB_OFFLINE_THRESHOLD10
@@ -1741,6 +1744,44 @@ static int virtio_mem_init(struct virtio_mem *vm)
return 0;
 }
 
+static int virtio_mem_create_resource(struct virtio_mem *vm)
+{
+   /*
+* When force-unloading the driver and removing the device, we
+* could have a garbage pointer. Duplicate the string.
+*/
+   const char *name = kstrdup(dev_name(&vm->vdev->dev), GFP_KERNEL);
+
+   if (!name)
+   return -ENOMEM;
+
+   vm->parent_resource = __request_mem_region(vm->addr, vm->region_size,
+  name, IORESOURCE_SYSTEM_RAM);
+   if (!vm->parent_resource) {
+   kfree(name);
+   dev_warn(&vm->vdev->dev, "could not reserve device region\n");
+   return -EBUSY;
+   }
+
+   /* The memory is not actually busy - make add_memory() work. */
+   vm->parent_resource->flags &= ~IORESOURCE_BUSY;
+   return 0;
+}
+
+static void virtio_mem_delete_resource(struct virtio_mem *vm)
+{
+   const char *name;
+
+   if (!vm->parent_resource)
+   return;
+
+   name = vm->parent_resource->name;
+   release_resource(vm->parent_resource);
+   kfree(vm->parent_resource);
+   kfree(name);
+   vm->parent_resource = NULL;
+}
+
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
struct virtio_mem *vm;
@@ -1770,11 +1811,16 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
+   /* create the parent resource for all memory */
+   rc = virtio_mem_create_resource(vm);
+   if (rc)
+   goto out_del_vq;
+
/* register callbacks */
vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
rc = register_memory_notifier(&vm->memory_notifier);
if (rc)
-   goto out_del_vq;
+   goto out_del_resource;
rc = register_virtio_mem_device(vm);
if (rc)
goto out_unreg_mem;
@@ -1788,6 +1834,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return 0;
 out_unreg_mem:
unregister_memory_notifier(&vm->memory_notifier);
+out_del_resource:
+   virtio_mem_delete_resource(vm);
 out_del_vq:
vdev->config->del_vqs(vdev);
 out_free_vm:
@@ -1848,6 +1896,8 @@ static void virtio_mem_remove(struct virtio_device *vdev)
vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE])
dev_warn(&vdev->dev, "device still has system memory added\n");
+   else
+   virtio_mem_delete_resource(vm);
 
/* remove all tracking data - no locking needed */
vfree(vm->mb_state);
-- 
2.25.3





[virtio-dev] [PATCH v4 10/15] virtio-mem: Better retry handling

2020-05-07 Thread David Hildenbrand
Let's start with a retry interval of 5 seconds and double the time until
we reach 5 minutes, in case we keep getting errors. Reset the retry
interval in case we succeeded.

The two main reasons for having to retry are
- The hypervisor is busy and cannot process our request
- We cannot reach the desired requested_size (esp., not enough memory can
  get unplugged because we can't allocate any subblocks).
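The doubling-and-reset policy described above can be sketched in plain C
outside the kernel. This is only an illustration: the macro and function
names are hypothetical, and the bounds are the 5 seconds and 5 minutes
named in the text, expressed in milliseconds.

```c
#include <assert.h>

/* Hypothetical bounds in milliseconds (5 seconds and 5 minutes). */
#define RETRY_MIN_MS 5000u
#define RETRY_MAX_MS 300000u

/* Double the interval after a failed attempt, saturating at the maximum. */
static unsigned int retry_next_interval(unsigned int cur_ms)
{
	assert(cur_ms >= RETRY_MIN_MS && cur_ms <= RETRY_MAX_MS);
	return cur_ms * 2 > RETRY_MAX_MS ? RETRY_MAX_MS : cur_ms * 2;
}

/* Go back to the minimum interval after a successful attempt. */
static unsigned int retry_reset_interval(void)
{
	return RETRY_MIN_MS;
}
```

With these bounds, repeated failures step through 5, 10, 20, 40, 80 and 160
seconds before the interval is capped at 5 minutes.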

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a2edb87e5ed8..eb4c16d634e0 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -141,7 +141,9 @@ struct virtio_mem {
 
/* Timer for retrying to plug/unplug memory. */
struct hrtimer retry_timer;
-#define VIRTIO_MEM_RETRY_TIMER_MS  3
+   unsigned int retry_timer_ms;
+#define VIRTIO_MEM_RETRY_TIMER_MIN_MS  5
+#define VIRTIO_MEM_RETRY_TIMER_MAX_MS  30
 
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
@@ -1550,6 +1552,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 
switch (rc) {
case 0:
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
break;
case -ENOSPC:
/*
@@ -1565,8 +1568,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 */
case -ENOMEM:
/* Out of memory, try again later. */
-   hrtimer_start(&vm->retry_timer,
- ms_to_ktime(VIRTIO_MEM_RETRY_TIMER_MS),
+   hrtimer_start(&vm->retry_timer, ms_to_ktime(vm->retry_timer_ms),
  HRTIMER_MODE_REL);
break;
case -EAGAIN:
@@ -1586,6 +1588,8 @@ static enum hrtimer_restart virtio_mem_timer_expired(struct hrtimer *timer)
 retry_timer);
 
virtio_mem_retry(vm);
+   vm->retry_timer_ms = min_t(unsigned int, vm->retry_timer_ms * 2,
+  VIRTIO_MEM_RETRY_TIMER_MAX_MS);
return HRTIMER_NORESTART;
 }
 
@@ -1754,6 +1758,7 @@ static int virtio_mem_probe(struct virtio_device *vdev)
spin_lock_init(&vm->removal_lock);
hrtimer_init(&vm->retry_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
vm->retry_timer.function = virtio_mem_timer_expired;
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
 
/* register the virtqueue */
rc = virtio_mem_init_vq(vm);
-- 
2.25.3





[virtio-dev] [PATCH v4 13/15] virtio-mem: Unplug subblocks right-to-left

2020-05-07 Thread David Hildenbrand
We already unplug blocks right-to-left; let's also unplug subblocks within
a block right-to-left.
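As a rough user-space sketch of the right-to-left scan, assuming the
subblock state is modelled as a plain bool array instead of the driver's
bitmap (the helper name and signature are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: scan plugged[0..from] from right to left and find
 * the right-most run of plugged subblocks, limited to max_count entries.
 * Returns the run length (0 if there is none) and stores the lowest index
 * of the run in *first.
 */
static int find_plugged_run_rtl(const bool *plugged, int from, int max_count,
				int *first)
{
	int sb_id = from;
	int count;

	assert(plugged && first);

	/* Skip unplugged subblocks at the right end. */
	while (sb_id >= 0 && !plugged[sb_id])
		sb_id--;
	if (sb_id < 0 || max_count <= 0)
		return 0;

	/* Extend the run to the left while the neighbour is plugged. */
	count = 1;
	while (count < max_count && sb_id > 0 && plugged[sb_id - 1]) {
		count++;
		sb_id--;
	}
	*first = sb_id;
	return count;
}
```

A caller would unplug the returned run and then continue the scan from
*first - 1, so that subblocks at higher addresses are always handled first.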

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 38 -
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8dd57b61b09b..a719e1a04ac7 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -353,18 +353,6 @@ static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
return find_next_bit(vm->sb_bitmap, bit + count, bit) >= bit + count;
 }
 
-/*
- * Find the first plugged subblock. Returns vm->nb_sb_per_mb in case there is
- * none.
- */
-static int virtio_mem_mb_first_plugged_sb(struct virtio_mem *vm,
- unsigned long mb_id)
-{
-   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb;
-
-   return find_next_bit(vm->sb_bitmap, bit + vm->nb_sb_per_mb, bit) - bit;
-}
-
 /*
  * Find the first unplugged subblock. Returns vm->nb_sb_per_mb in case there is
  * none.
@@ -1016,21 +1004,27 @@ static int virtio_mem_mb_unplug_any_sb(struct virtio_mem *vm,
int sb_id, count;
int rc;
 
+   sb_id = vm->nb_sb_per_mb - 1;
while (*nb_sb) {
-   sb_id = virtio_mem_mb_first_plugged_sb(vm, mb_id);
-   if (sb_id >= vm->nb_sb_per_mb)
+   /* Find the next candidate subblock */
+   while (sb_id >= 0 &&
+  virtio_mem_mb_test_sb_unplugged(vm, mb_id, sb_id, 1))
+   sb_id--;
+   if (sb_id < 0)
break;
+   /* Try to unplug multiple subblocks at a time */
count = 1;
-   while (count < *nb_sb &&
-  sb_id + count  < vm->nb_sb_per_mb &&
-  virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id + count,
-1))
+   while (count < *nb_sb && sb_id > 0 &&
+  virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id - 1, 1)) {
count++;
+   sb_id--;
+   }
 
rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, count);
if (rc)
return rc;
*nb_sb -= count;
+   sb_id--;
}
 
return 0;
@@ -1337,12 +1331,12 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
 * we should sense via something like is_mem_section_removable()
 * first if it makes sense to go ahead any try to allocate.
 */
-   for (sb_id = 0; sb_id < vm->nb_sb_per_mb && *nb_sb; sb_id++) {
+   for (sb_id = vm->nb_sb_per_mb - 1; sb_id >= 0 && *nb_sb; sb_id--) {
/* Find the next candidate subblock */
-   while (sb_id < vm->nb_sb_per_mb &&
+   while (sb_id >= 0 &&
   !virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
-   sb_id++;
-   if (sb_id >= vm->nb_sb_per_mb)
+   sb_id--;
+   if (sb_id < 0)
break;
 
start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-- 
2.25.3





[virtio-dev] [PATCH v4 14/15] virtio-mem: Use -ETXTBSY as error code if the device is busy

2020-05-07 Thread David Hildenbrand
Let's make it possible to distinguish whether the device or memory is busy.
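A minimal stand-alone sketch of the idea, assuming a hypothetical enum in
place of the device's response codes: a busy device maps to -ETXTBSY, so a
caller can tell it apart from -EBUSY, which remains reserved for busy
memory.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical response codes standing in for the device's responses. */
enum resp { RESP_ACK, RESP_NACK, RESP_BUSY, RESP_ERROR };

/* Translate a device response into a negative errno value. */
static int resp_to_errno(enum resp r)
{
	switch (r) {
	case RESP_ACK:
		return 0;
	case RESP_NACK:
		return -EAGAIN;
	case RESP_BUSY:
		return -ETXTBSY;	/* the device is busy */
	case RESP_ERROR:
		return -EINVAL;
	}
	return -ENOMEM;			/* unknown response */
}

/* True only if the error says the device, not the memory, is busy. */
static int is_device_busy(int rc)
{
	return rc == -ETXTBSY;
}
```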

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a719e1a04ac7..abd93b778a26 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -893,7 +893,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
case VIRTIO_MEM_RESP_NACK:
return -EAGAIN;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
case VIRTIO_MEM_RESP_ERROR:
return -EINVAL;
default:
@@ -919,7 +919,7 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
vm->plugged_size -= size;
return 0;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
case VIRTIO_MEM_RESP_ERROR:
return -EINVAL;
default:
@@ -941,7 +941,7 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
atomic_set(&vm->config_changed, 1);
return 0;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
default:
return -ENOMEM;
}
@@ -1557,11 +1557,15 @@ static void virtio_mem_run_wq(struct work_struct *work)
 * or we have too many offline memory blocks.
 */
break;
-   case -EBUSY:
+   case -ETXTBSY:
/*
 * The hypervisor cannot process our request right now
-* (e.g., out of memory, migrating) or we cannot free up
-* any memory to unplug it (all plugged memory is busy).
+* (e.g., out of memory, migrating);
+*/
+   case -EBUSY:
+   /*
+* We cannot free up any memory to unplug it (all plugged memory
+* is busy).
 */
case -ENOMEM:
/* Out of memory, try again later. */
-- 
2.25.3





[virtio-dev] [PATCH v4 04/15] virtio-mem: Paravirtualized memory hotunplug part 1

2020-05-07 Thread David Hildenbrand
Unplugging subblocks of memory blocks that are offline is easy. All we
have to do is watch out for concurrent onlining activity.
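The bookkeeping after unplugging from an offline block can be sketched as
follows. This is a simplified model, not the driver code: the subblock
bitmap is a bool array, and the enum is a hypothetical stand-in for the
driver's memory block states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the states of an offline memory block. */
enum mb_state { MB_OFFLINE, MB_OFFLINE_PARTIAL, MB_UNUSED };

/*
 * Derive the state of an offline memory block from its subblocks:
 * all plugged -> offline, some plugged -> partially plugged,
 * none plugged -> unused (the block can be removed).
 */
static enum mb_state offline_mb_state(const bool *plugged, int nb_sb)
{
	int i, nb_plugged = 0;

	assert(plugged && nb_sb > 0);
	for (i = 0; i < nb_sb; i++)
		nb_plugged += plugged[i] ? 1 : 0;

	if (nb_plugged == nb_sb)
		return MB_OFFLINE;
	return nb_plugged ? MB_OFFLINE_PARTIAL : MB_UNUSED;
}
```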

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 116 +++-
 1 file changed, 114 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 270ddeaec059..a3ec795be8be 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -123,7 +123,7 @@ struct virtio_mem {
 *
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
-* plugged.
+* plugged/unplugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -280,6 +280,12 @@ static int virtio_mem_mb_state_prepare_next_mb(struct virtio_mem *vm)
 _mb_id++) \
if (virtio_mem_mb_get_state(_vm, _mb_id) == _state)
 
+#define virtio_mem_for_each_mb_state_rev(_vm, _mb_id, _state) \
+   for (_mb_id = _vm->next_mb_id - 1; \
+_mb_id >= _vm->first_mb_id && _vm->nb_mb_state[_state]; \
+_mb_id--) \
+   if (virtio_mem_mb_get_state(_vm, _mb_id) == _state)
+
 /*
  * Mark all selected subblocks plugged.
  *
@@ -325,6 +331,19 @@ static bool virtio_mem_mb_test_sb_plugged(struct virtio_mem *vm,
   bit + count;
 }
 
+/*
+ * Test if all selected subblocks are unplugged.
+ */
+static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
+   unsigned long mb_id, int sb_id,
+   int count)
+{
+   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
+
+   /* TODO: Helper similar to bitmap_set() */
+   return find_next_bit(vm->sb_bitmap, bit + count, bit) >= bit + count;
+}
+
 /*
  * Find the first plugged subblock. Returns vm->nb_sb_per_mb in case there is
  * none.
@@ -513,6 +532,9 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm,
BUG();
break;
}
+
+   /* trigger the workqueue, maybe we can now unplug memory. */
+   virtio_mem_retry(vm);
 }
 
 static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
@@ -1122,6 +1144,94 @@ static int virtio_mem_plug_request(struct virtio_mem *vm, uint64_t diff)
return rc;
 }
 
+/*
+ * Unplug the desired number of plugged subblocks of an offline memory block.
+ * Will fail if any subblock cannot get unplugged (instead of skipping it).
+ *
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
+ *
+ * Note: Can fail after some subblocks were successfully unplugged.
+ */
+static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
+  unsigned long mb_id,
+  uint64_t *nb_sb)
+{
+   int rc;
+
+   rc = virtio_mem_mb_unplug_any_sb(vm, mb_id, nb_sb);
+
+   /* some subblocks might have been unplugged even on failure */
+   if (!virtio_mem_mb_test_sb_plugged(vm, mb_id, 0, vm->nb_sb_per_mb))
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL);
+   if (rc)
+   return rc;
+
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   /*
+* Remove the block from Linux - this should never fail.
+* Hinder the block from getting onlined by marking it
+* unplugged. Temporarily drop the mutex, so
+* any pending GOING_ONLINE requests can be serviced/rejected.
+*/
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+
+   mutex_unlock(&vm->hotplug_mutex);
+   rc = virtio_mem_mb_remove(vm, mb_id);
+   BUG_ON(rc);
+   mutex_lock(&vm->hotplug_mutex);
+   }
+   return 0;
+}
+
+/*
+ * Try to unplug the requested amount of memory.
+ */
+static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
+{
+   uint64_t nb_sb = diff / vm->subblock_size;
+   unsigned long mb_id;
+   int rc;
+
+   if (!nb_sb)
+   return 0;
+
+   /*
+* We'll drop the mutex a couple of times when it is safe to do so.
+* This might result in some blocks switching the state (online/offline)
+* and we could miss them in this run - we will retry again later.
+*/
+   mutex_lock(&vm->hotplug_mutex);
+

[virtio-dev] [PATCH v4 07/15] virtio-mem: Allow to offline partially unplugged memory blocks

2020-05-07 Thread David Hildenbrand
Dropping the reference count of PageOffline() pages during MEM_GOING_OFFLINE
allows offlining code to skip them. However, we also have to clear
PG_reserved, because PG_reserved pages get detected as unmovable right
away. Take care of restoring the reference count when offlining is
canceled.

Clarify why we don't have to perform any action when unloading the
driver. Also, let's add a warning if anybody is still holding a
reference to unplugged pages when offlining.

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 68 -
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 74f0d3cb1d22..b0b41c73ce89 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -572,6 +572,57 @@ static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
virtio_mem_retry(vm);
 }
 
+static void virtio_mem_notify_going_offline(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   struct page *page;
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Drop our reference to the pages so the memory can get
+* offlined and add the unplugged pages to the managed
+* page counters (so offlining code can correctly subtract
+* them again).
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), nr_pages);
+   for (i = 0; i < nr_pages; i++) {
+   page = pfn_to_page(pfn + i);
+   if (WARN_ON(!page_ref_dec_and_test(page)))
+   dump_page(page, "unplugged page referenced");
+   }
+   }
+}
+
+static void virtio_mem_notify_cancel_offline(struct virtio_mem *vm,
+unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Get the reference we dropped when going offline and
+* subtract the unplugged pages from the managed page
+* counters.
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
+   for (i = 0; i < nr_pages; i++)
+   page_ref_inc(pfn_to_page(pfn + i));
+   }
+}
+
 /*
  * This callback will either be called synchronously from add_memory() or
  * asynchronously (e.g., triggered via user space). We have to be careful
@@ -618,6 +669,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
break;
}
vm->hotplug_active = true;
+   virtio_mem_notify_going_offline(vm, mb_id);
break;
case MEM_GOING_ONLINE:
mutex_lock(&vm->hotplug_mutex);
@@ -642,6 +694,12 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
mutex_unlock(&vm->hotplug_mutex);
break;
case MEM_CANCEL_OFFLINE:
+   if (!vm->hotplug_active)
+   break;
+   virtio_mem_notify_cancel_offline(vm, mb_id);
+   vm->hotplug_active = false;
+   mutex_unlock(&vm->hotplug_mutex);
+   break;
case MEM_CANCEL_ONLINE:
if (!vm->hotplug_active)
break;
@@ -668,8 +726,11 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
struct page *page = pfn_to_page(pfn);
 
__SetPageOffline(page);
-   if (!onlined)
+   if (!onlined) {
SetPageDirty(page);
+   /* FIXME: remove after cleanups */
+   ClearPageReserved(page);
+   }
}
 }
 
@@ -1722,6 +1783,11 @@ static void virtio_mem_remove(struct virtio_device *vdev)
  

[virtio-dev] [PATCH v4 05/15] virtio-mem: Paravirtualized memory hotunplug part 2

2020-05-07 Thread David Hildenbrand
We also want to unplug online memory (contained in online memory blocks
and, therefore, managed by the buddy), and eventually replug it later.

When requested to unplug memory, we use alloc_contig_range() to allocate
subblocks in online memory blocks (so we are the owner) and send them to
our hypervisor. When requested to plug memory, we can replug such memory
using free_contig_range() after asking our hypervisor.

We also want to mark all allocated pages PG_offline, so nobody will
touch them. To differentiate pages that were never onlined when
onlining the memory block from pages allocated via alloc_contig_range(), we
use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
online the pages for the first time or use free_contig_range().
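The role of the dirty flag can be illustrated with a small model. It is a
sketch under simplifying assumptions: struct sb_page and the two booleans
are hypothetical stand-ins for struct page with PageOffline() and
PageDirty(), and the returned enum only names which path would be taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Which path onlining would take for a fake-offline page. */
enum online_action { ONLINE_FIRST_TIME, FREE_TO_BUDDY };

/* Hypothetical, simplified page with only the two flags discussed above. */
struct sb_page {
	bool offline;	/* stands in for PageOffline() */
	bool dirty;	/* stands in for PageDirty() */
};

/* Mark a page offline; remember via "dirty" that it was never onlined. */
static void set_fake_offline(struct sb_page *p, bool onlined)
{
	p->offline = true;
	if (!onlined)
		p->dirty = true;
}

/* Clear both flags and report which onlining path applies. */
static enum online_action fake_online(struct sb_page *p)
{
	enum online_action action;

	assert(p->offline);
	action = p->dirty ? ONLINE_FIRST_TIME : FREE_TO_BUDDY;
	p->offline = false;
	p->dirty = false;
	return action;
}
```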

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might completely be
fragmented with unmovable data, such that no subblock can get unplugged.

We are not touching the ZONE_MOVABLE. If memory is onlined to the
ZONE_MOVABLE, it can only get unplugged after that memory was offlined
manually by user space. In normal operation, virtio-mem memory is
suggested to be onlined to ZONE_NORMAL. In the future, we will try to
make unplug more likely to succeed.

Add a module parameter to control if online memory shall be touched.

As we want to access alloc_contig_range()/free_contig_range() from
kernel module context, export the symbols.

Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
are on the same node, in the same zone, and contain no holes.

Acked-by: Michal Hocko  # to export contig range allocator API
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Alexander Duyck 
Cc: Alexander Potapenko 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   1 +
 drivers/virtio/virtio_mem.c | 157 
 mm/page_alloc.c |   2 +
 3 files changed, 146 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index d6dde7d2cf76..4c1e14615001 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -85,6 +85,7 @@ config VIRTIO_MEM
depends on VIRTIO
depends on MEMORY_HOTPLUG_SPARSE
depends on MEMORY_HOTREMOVE
+   select CONTIG_ALLOC
help
 This driver provides access to virtio-mem paravirtualized memory
 devices, allowing to hotplug and hotunplug memory.
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a3ec795be8be..74f0d3cb1d22 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -23,6 +23,10 @@
 
 #include 
 
+static bool unplug_online = true;
+module_param(unplug_online, bool, 0644);
+MODULE_PARM_DESC(unplug_online, "Try to unplug online memory");
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -654,23 +658,35 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 }
 
 /*
- * Set a range of pages PG_offline.
+ * Set a range of pages PG_offline. Remember pages that were never onlined
+ * (via generic_online_page()) using PageDirty().
  */
 static void virtio_mem_set_fake_offline(unsigned long pfn,
-   unsigned int nr_pages)
+   unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __SetPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __SetPageOffline(page);
+   if (!onlined)
+   SetPageDirty(page);
+   }
 }
 
 /*
- * Clear PG_offline from a range of pages.
+ * Clear PG_offline from a range of pages. If the pages were never onlined,
+ * (via generic_online_page()), clear PageDirty().
  */
 static void virtio_mem_clear_fake_offline(unsigned long pfn,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __ClearPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __ClearPageOffline(page);
+   if (!onlined)
+   ClearPageDirty(page);
+   }
 }
 
 /*
@@ -686,10 +702,26 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned int nr_pages)
 * We are always called with subblock granularity, which is at least
 * aligned to MAX_ORDER - 1.
 */
-   virtio_mem_clear_fake_offline(pfn, nr_pages);
+   for (i = 0; 

[virtio-dev] [PATCH v4 08/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
virtio-mem wants to offline and remove a memory block once it has unplugged
all subblocks (e.g., using alloc_contig_range()). Let's provide
an interface to do that from a driver. virtio-mem already supports
offlining partially unplugged memory blocks. Offlining a fully unplugged
memory block will not require migrating any pages. All unplugged
subblocks are PageOffline() and have a reference count of 0, so
offlining code will simply skip them.

All we need is an interface to offline and remove the memory from kernel
module context, where we don't have access to the memory block devices
(esp. find_memory_block() and device_offline()) and the device hotplug
lock.

To keep things simple, only allow working on a single memory block.
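The single-block restriction amounts to an alignment and size check. A
stand-alone sketch, assuming the memory block size is a power of two and is
passed in explicitly (the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: a request is acceptable only if it covers exactly
 * one memory block, i.e. start is aligned to the block size and size
 * equals the block size. block_size is assumed to be a power of two.
 */
static bool covers_single_block(uint64_t start, uint64_t size,
				uint64_t block_size)
{
	assert(block_size && (block_size & (block_size - 1)) == 0);
	return (start & (block_size - 1)) == 0 && size == block_size;
}
```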

Acked-by: Michal Hocko 
Tested-by: Pankaj Gupta 
Acked-by: Andrew Morton 
Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Wei Yang 
Cc: Dan Williams 
Cc: Qian Cai 
Signed-off-by: David Hildenbrand 
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c| 37 ++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 93d9ada74ddd..cb7499843f5c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -319,6 +319,7 @@ extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
+extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 008e4a7ed8bc..4acb99aa9bf4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1821,4 +1821,41 @@ int remove_memory(int nid, u64 start, u64 size)
return rc;
 }
 EXPORT_SYMBOL_GPL(remove_memory);
+
+/*
+ * Try to offline and remove a memory block. Might take a long time to
+ * finish in case memory is still in use. Primarily useful for memory devices
+ * that logically unplugged all memory (so it's no longer in use) and want to
+ * offline + remove the memory block.
+ */
+int offline_and_remove_memory(int nid, u64 start, u64 size)
+{
+   struct memory_block *mem;
+   int rc = -EINVAL;
+
+   if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
+   size != memory_block_size_bytes())
+   return rc;
+
+   lock_device_hotplug();
+   mem = find_memory_block(__pfn_to_section(PFN_DOWN(start)));
+   if (mem)
+   rc = device_offline(&mem->dev);
+   /* Ignore if the device is already offline. */
+   if (rc > 0)
+   rc = 0;
+
+   /*
+* In case we succeeded to offline the memory block, remove it.
+* This cannot fail as it cannot get onlined in the meantime.
+*/
+   if (!rc) {
+   rc = try_remove_memory(nid, start, size);
+   WARN_ON_ONCE(rc);
+   }
+   unlock_device_hotplug();
+
+   return rc;
+}
+EXPORT_SYMBOL_GPL(offline_and_remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.25.3





[virtio-dev] [PATCH v4 06/15] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2020-05-07 Thread David Hildenbrand
virtio-mem wants to allow offlining memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's make that possible by allowing any PageOffline() page to be isolated
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (as if they were free, although they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMware balloon for 2MB pages) will not
decrement the reference count, so offlining will fail when trying to
migrate such an unmovable page. So there should be no observable change.
The same applies to balloon compaction users (movable PageOffline() pages):
the pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
the memory block has to be handled by hooking into onlining code
(online_page_callback_t), setting the pages PageOffline() again and
not giving them to the buddy.
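The reference count protocol from the notes above can be modelled in a few
lines of user-space C. This is only a sketch: struct fake_page and its
fields are hypothetical stand-ins for struct page, PageOffline() and the
page reference count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified page with the two properties discussed above. */
struct fake_page {
	bool offline;	/* stands in for PageOffline() */
	int refcount;	/* stands in for the page reference count */
};

/* MEM_GOING_OFFLINE: the owner agrees to offlining by dropping its reference. */
static void going_offline(struct fake_page *p)
{
	assert(p->offline && p->refcount == 1);
	p->refcount--;
}

/* MEM_CANCEL_OFFLINE: offlining failed, so the owner takes the reference back. */
static void cancel_offline(struct fake_page *p)
{
	assert(p->offline && p->refcount == 0);
	p->refcount++;
}

/* Offlining may skip a page only if it is offline with a reference count of 0. */
static bool can_skip_when_offlining(const struct fake_page *p)
{
	return p->offline && p->refcount == 0;
}
```

A driver that never drops its reference keeps can_skip_when_offlining()
false, which corresponds to the unchanged behaviour for other PageOffline()
users described above.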

Reviewed-by: Alexander Duyck 
Acked-by: Michal Hocko 
Tested-by: Pankaj Gupta 
Acked-by: Andrew Morton 
Cc: Andrew Morton 
Cc: Juergen Gross 
Cc: Konrad Rzeszutek Wilk 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Qian Cai 
Cc: Pingfan Liu 
Signed-off-by: David Hildenbrand 
---
 include/linux/page-flags.h | 10 +
 mm/memory_hotplug.c| 44 +-
 mm/page_alloc.c| 24 +
 mm/page_isolation.c|  9 
 4 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..6be1aa559b1e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -777,6 +777,16 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * not onlined when onlining the section).
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
+ *
+ * If a driver wants to allow to offline unmovable PageOffline() pages without
+ * putting them back to the buddy, it can do so via the memory notifier by
+ * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
+ * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
+ * pages (now with a reference count of zero) are treated like free pages,
+ * allowing the containing memory block to get offlined. A driver that
+ * relies on this feature is aware that re-onlining the memory block will
+ * require to re-set the pages PageOffline() and not giving them to the
+ * buddy via online_page_callback_t.
  */
 PAGE_TYPE_OPS(Offline, offline)
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc0aad0bc1f5..008e4a7ed8bc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1224,11 +1224,17 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
- * non-lru movable pages and hugepages). We scan pfn because it's much
- * easier than scanning over linked list. This function returns the pfn
- * of the first found movable page if it's found, otherwise 0.
+ * non-lru movable pages and hugepages). Will skip over most unmovable
+ * pages (esp., pages that can be skipped when offlining), but bail out on
+ * definitely unmovable pages.
+ *
+ * Returns:
+ * 0 in case a movable page is found and movable_pfn was updated.
+ * -ENOENT in case no movable page was found.
+ * -EBUSY in case a definitely unmovable page was found.
  */
-static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
+static int scan_movable_pages(unsigned long start, unsigned long end,
+ unsigned long *movable_pfn)
 {
unsigned long pfn;
 
@@ -1240,18 +1246,30 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
continue;
page = pfn_to_page(pfn);
if (PageLRU(page))
-   return pfn;
+   goto found;
if (__PageMovable(page

[virtio-dev] [PATCH v4 02/15] MAINTAINERS: Add myself as virtio-mem maintainer

2020-05-07 Thread David Hildenbrand
Let's make sure patches/bug reports find the right person.

Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2926327e4976..014bbf5897c2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17972,6 +17972,13 @@ S: Maintained
 F: drivers/iommu/virtio-iommu.c
 F: include/uapi/linux/virtio_iommu.h
 
+VIRTIO MEM DRIVER
+M: David Hildenbrand 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/virtio/virtio_mem.c
+F: include/uapi/linux/virtio_mem.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
-- 
2.25.3





[virtio-dev] [PATCH v4 01/15] virtio-mem: Paravirtualized memory hotplug

2020-05-07 Thread David Hildenbrand
Each virtio-mem device owns exactly one memory region. It is responsible
for adding/removing memory from that memory region on request.

When the device driver starts up, the requested amount of memory is
queried and then plugged to Linux. On request, further memory can be
plugged or unplugged. This patch only implements the plugging part.

On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
When required, a new memory block will be added (e.g., usually 128MB on
x86-64) in order to plug more subblocks. Only x86-64 was tested for now.
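
As a worked example of the granularity described above, the following sketch
computes how many subblocks fit into one memory block and maps a physical
address to a memory block id and a subblock id. The sizes are the x86-64
values named in the text; the helper names are hypothetical and are not taken
from the driver.

```c
#include <stdint.h>

/* Sizes from the description above (x86-64); other setups may differ. */
#define SUBBLOCK_SIZE     (4ULL << 20)   /* 4 MB */
#define MEMORY_BLOCK_SIZE (128ULL << 20) /* 128 MB */

/* Number of subblocks per memory block. */
static uint64_t nb_sb_per_mb(void)
{
	return MEMORY_BLOCK_SIZE / SUBBLOCK_SIZE;
}

/* Memory block that contains the given physical address. */
static uint64_t phys_to_mb_id(uint64_t addr)
{
	return addr / MEMORY_BLOCK_SIZE;
}

/* Subblock index of the address within its memory block. */
static uint64_t phys_to_sb_id(uint64_t addr)
{
	return (addr % MEMORY_BLOCK_SIZE) / SUBBLOCK_SIZE;
}
```

With these sizes, a memory block holds 32 subblocks, so one bit per subblock
is enough to track the plug state of a whole block.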

The online_page callback is used to keep unplugged subblocks offline
when onlining memory - similar to the Hyper-V balloon driver. Unplugged
pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
skip them.

User space is usually responsible for onlining the added memory. The
memory hotplug notifier is used to synchronize virtio-mem activity
against memory onlining/offlining.

Each virtio-mem device can belong to a NUMA node, which allows us to
easily add/remove small chunks of memory to/from a specific NUMA node by
using multiple virtio-mem devices. This works even when the guest has no
idea about the NUMA topology.

One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
"sub-DIMMS".

This patch directly introduces the basic infrastructure to implement memory
unplug. Especially the memory block states and subblock bitmaps will be
heavily used there.

Notes:
- In case memory is to be onlined by user space, we limit the amount of
  offline memory blocks, to not run out of memory. This is esp. an
  issue if memory is added faster than it is getting onlined.
- Suspend/Hibernate is not supported due to the way virtio-mem devices
  behave. Limited support might be possible in the future.
- Reloading the device driver is not supported.

Reviewed-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   16 +
 drivers/virtio/Makefile |1 +
 drivers/virtio/virtio_mem.c | 1533 +++
 include/uapi/linux/virtio_ids.h |1 +
 include/uapi/linux/virtio_mem.h |  200 
 5 files changed, 1751 insertions(+)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 69a32dfc318a..d6dde7d2cf76 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -78,6 +78,22 @@ config VIRTIO_BALLOON
 
 If unsure, say M.
 
+config VIRTIO_MEM
+   tristate "Virtio mem driver"
+   default m
+   depends on X86_64
+   depends on VIRTIO
+   depends on MEMORY_HOTPLUG_SPARSE
+   depends on MEMORY_HOTREMOVE
+   help
+This driver provides access to virtio-mem paravirtualized memory
+devices, allowing to hotplug and hotunplug memory.
+
+This driver was only tested under x86-64, but should theoretically
+work on all architectures that support memory hotplug and hotremove.
+
+If unsure, say M.
+
 config VIRTIO_INPUT
tristate "Virtio input driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 29a1386ecc03..4d993791f2d7 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -7,3 +7,4 @@ virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
 obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
new file mode 100644
index 000000000000..5d1dcaa6fc42
--- /dev/null
+++ b/drivers/virtio/virtio_mem.c
@@ -0,0 +1,1533 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtio-mem device driver.
+ *
+ * Copyright Red Hat, Inc. 2020
+ *
+ * Author(s): David Hildenbrand 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum virtio_mem_mb_state {
+   /* Unplugged, not added to Linux. Can be reused later. */
+   VIRTIO_MEM_MB_STATE_UNUSED = 0,
+   /* (Partially) plugged, not added to Linux. Error on add_memory(). */
+   VIRTIO_MEM_MB_STATE_PLUGGED,
+   /* Fully plugged, fully added to Linux, offline. */
+   VIRTIO_MEM_MB_STATE_OFFLINE,
+   /* Partially plugged, fully added to Linux, offline. */
+   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL,
+   /* Fully plugged, fully added to Linux, online (!ZONE_MOVABLE). */
+

[virtio-dev] [PATCH v4 00/15] virtio-mem: paravirtualized memory

2020-05-07 Thread David Hildenbrand
This series is based on v5.7-rc4. The patches are located at:
https://github.com/davidhildenbrand/linux.git virtio-mem-v4

This is basically a resend of v3 [1], now based on v5.7-rc4 and retested.
One patch was reshuffled and two ACKs that I had missed were added. The
rebase did not require any modifications to the patches.

Details about virtio-mem can be found in the cover letter of v2 [2]. A
basic QEMU implementation was posted yesterday [3].

[1] https://lkml.kernel.org/r/20200507103119.11219-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20200311171422.10484-1-da...@redhat.com
[3] https://lkml.kernel.org/r/20200506094948.76388-1-da...@redhat.com

v3 -> v4:
- Move "MAINTAINERS: Add myself as virtio-mem maintainer" to #2
- Add two ACKs from Andrew (in reply to v2)
-- "mm: Allow to offline unmovable PageOffline() pages via ..."
-- "mm/memory_hotplug: Introduce offline_and_remove_memory()"

v2 -> v3:
- "virtio-mem: Paravirtualized memory hotplug"
-- Include "linux/slab.h" to fix build issues
-- Remember the "region_size", helpful for patch #11
-- Minor simplification in virtio_mem_overlaps_range()
-- Use notifier_from_errno() instead of notifier_to_errno() in notifier
-- More reliable check for added memory when unloading the driver
- "virtio-mem: Allow to specify an ACPI PXM as nid"
-- Also print the nid
- Added patch #11-#15

David Hildenbrand (15):
  virtio-mem: Paravirtualized memory hotplug
  MAINTAINERS: Add myself as virtio-mem maintainer
  virtio-mem: Allow to specify an ACPI PXM as nid
  virtio-mem: Paravirtualized memory hotunplug part 1
  virtio-mem: Paravirtualized memory hotunplug part 2
  mm: Allow to offline unmovable PageOffline() pages via
    MEM_GOING_OFFLINE
  virtio-mem: Allow to offline partially unplugged memory blocks
  mm/memory_hotplug: Introduce offline_and_remove_memory()
  virtio-mem: Offline and remove completely unplugged memory blocks
  virtio-mem: Better retry handling
  virtio-mem: Add parent resource for all added "System RAM"
  virtio-mem: Drop manual check for already present memory
  virtio-mem: Unplug subblocks right-to-left
  virtio-mem: Use -ETXTBSY as error code if the device is busy
  virtio-mem: Try to unplug the complete online memory block first

 MAINTAINERS |7 +
 drivers/acpi/numa/srat.c|1 +
 drivers/virtio/Kconfig  |   17 +
 drivers/virtio/Makefile |1 +
 drivers/virtio/virtio_mem.c | 1962 +++
 include/linux/memory_hotplug.h  |1 +
 include/linux/page-flags.h  |   10 +
 include/uapi/linux/virtio_ids.h |1 +
 include/uapi/linux/virtio_mem.h |  208 
 mm/memory_hotplug.c |   81 +-
 mm/page_alloc.c |   26 +
 mm/page_isolation.c |9 +
 12 files changed, 2314 insertions(+), 10 deletions(-)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

-- 
2.25.3





[virtio-dev] [PATCH v4 03/15] virtio-mem: Allow to specify an ACPI PXM as nid

2020-05-07 Thread David Hildenbrand
We want to allow specifying (as for a DIMM) which node a virtio-mem
device (and, therefore, its memory) belongs to. Add a new virtio-mem
feature flag and export pxm_to_node, so it can be used in kernel module
context.
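
The node selection this patch introduces can be sketched in userspace as
follows. This is only an illustration: the PXM table below is invented (the
kernel derives its own from the ACPI SRAT), and translate_node_id() and
pick_nid() are hypothetical names that merely mirror the fallback logic in
the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)
#define MAX_PXM 4

/* Invented PXM-to-node table, for illustration only. */
static const int fake_pxm_to_node_map[MAX_PXM] = { 0, 1, NUMA_NO_NODE, 2 };

/* Translate the device's node id only if the PXM feature was negotiated. */
static int translate_node_id(bool has_acpi_pxm_feature, uint16_t node_id)
{
	if (!has_acpi_pxm_feature || node_id >= MAX_PXM)
		return NUMA_NO_NODE;
	return fake_pxm_to_node_map[node_id];
}

/* Prefer the device's node; fall back to the node derived from the address. */
static int pick_nid(int device_nid, int nid_from_addr)
{
	return device_nid == NUMA_NO_NODE ? nid_from_addr : device_nid;
}
```

The point of the fallback is that a device without the feature (or with an
unmapped PXM) behaves exactly as before the patch.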

Acked-by: Michal Hocko  # for the export
Acked-by: "Rafael J. Wysocki"  # for the export
Acked-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/acpi/numa/srat.c|  1 +
 drivers/virtio/virtio_mem.c | 39 +++--
 include/uapi/linux/virtio_mem.h | 10 -
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 47b4969d9b93..5be5a977da1b 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -35,6 +35,7 @@ int pxm_to_node(int pxm)
return NUMA_NO_NODE;
return pxm_to_node_map[pxm];
 }
+EXPORT_SYMBOL(pxm_to_node);
 
 int node_to_pxm(int node)
 {
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 5d1dcaa6fc42..270ddeaec059 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 
+#include 
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -72,6 +74,8 @@ struct virtio_mem {
 
/* The device block size (for communicating with the device). */
uint32_t device_block_size;
+   /* The translated node id. NUMA_NO_NODE in case not specified. */
+   int nid;
/* Physical start address of the memory region. */
uint64_t addr;
/* Maximum region size in bytes. */
@@ -389,7 +393,10 @@ static int virtio_mem_sb_bitmap_prepare_next_mb(struct virtio_mem *vm)
 static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
 {
const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
-   int nid = memory_add_physaddr_to_nid(addr);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
 
dev_dbg(&vm->vdev->dev, "adding memory block: %lu\n", mb_id);
return add_memory(nid, addr, memory_block_size_bytes());
@@ -407,7 +414,10 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
 static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id)
 {
const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
-   int nid = memory_add_physaddr_to_nid(addr);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
 
dev_dbg(&vm->vdev->dev, "removing memory block: %lu\n", mb_id);
return remove_memory(nid, addr, memory_block_size_bytes());
@@ -426,6 +436,17 @@ static void virtio_mem_retry(struct virtio_mem *vm)
spin_unlock_irqrestore(&vm->removal_lock, flags);
 }
 
+static int virtio_mem_translate_node_id(struct virtio_mem *vm, uint16_t node_id)
+{
+   int node = NUMA_NO_NODE;
+
+#if defined(CONFIG_ACPI_NUMA)
+   if (virtio_has_feature(vm->vdev, VIRTIO_MEM_F_ACPI_PXM))
+   node = pxm_to_node(node_id);
+#endif
+   return node;
+}
+
 /*
  * Test if a virtio-mem device overlaps with the given range. Can be called
  * from (notifier) callbacks lockless.
@@ -1267,6 +1288,7 @@ static bool virtio_mem_any_memory_present(unsigned long start,
 static int virtio_mem_init(struct virtio_mem *vm)
 {
const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
+   uint16_t node_id;
 
if (!vm->vdev->config->get) {
dev_err(&vm->vdev->dev, "config access disabled\n");
@@ -1287,6 +1309,9 @@ static int virtio_mem_init(struct virtio_mem *vm)
 &vm->plugged_size);
virtio_cread(vm->vdev, struct virtio_mem_config, block_size,
 &vm->device_block_size);
+   virtio_cread(vm->vdev, struct virtio_mem_config, node_id,
+            &node_id);
+   vm->nid = virtio_mem_translate_node_id(vm, node_id);
virtio_cread(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
virtio_cread(vm->vdev, struct virtio_mem_config, region_size,
 &vm->region_size);
@@ -1365,6 +1390,8 @@ static int virtio_mem_init(struct virtio_mem *vm)
 memory_block_size_bytes());
dev_info(&vm->vdev->dev, "subblock size: 0x%x",
 vm->subblock_size);
+   if (vm->nid != NUMA_NO_NODE)
+   dev_info(&vm->vdev->dev, "nid: %d", vm->nid);
 
return 0;
 }
@@ -1508,12 +1535,20 @@ static in

[virtio-dev] Re: [PATCH v3 07/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
On 07.05.20 14:11, Michael S. Tsirkin wrote:
> On Thu, May 07, 2020 at 01:37:30PM +0200, David Hildenbrand wrote:
>> On 07.05.20 13:34, Michael S. Tsirkin wrote:
>>> On Thu, May 07, 2020 at 01:33:23PM +0200, David Hildenbrand wrote:
>>>>>> I get:
>>>>>>
>>>>>> error: sha1 information is lacking or useless (mm/memory_hotplug.c).
>>>>>> error: could not build fake ancestor
>>>>>>
>>>>>> which version is this against? Pls post patches on top of some tag
>>>>>> in Linus' tree if possible.
>>>>>
>>>>> As the cover states, latest linux-next. To be precise
>>>>>
>>>>> commit 6b43f715b6379433e8eb30aa9bcc99bd6a585f77 (tag: next-20200507,
>>>>> next/master)
>>>>> Author: Stephen Rothwell 
>>>>> Date:   Thu May 7 18:11:31 2020 +1000
>>>>>
>>>>> Add linux-next specific files for 20200507
>>>>>
>>>>
>>>> The patches seem to apply cleanly on top of
>>>>
>>>> commit a811c1fa0a02c062555b54651065899437bacdbe (linus/master)
>>>> Merge: b9388959ba50 16f8036086a9
>>>> Author: Linus Torvalds 
>>>> Date:   Wed May 6 20:53:22 2020 -0700
>>>>
>>>> Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
>>>
>>> Because you have the relevant hashes in your git tree not pruned yet.
>>> Do a new clone and they won't apply.
>>>
>>
>> Yeah, most probably, it knows how to merge. I'm used to sending all my
>> -mm stuff based on -next, so this here is different.
> 
> 
> Documentation/process/5.Posting.rst addresses this:
> 

Thanks for the info.

> 
> Patches must be prepared against a specific version of the kernel.  As a
> general rule, a patch should be based on the current mainline as found in
> Linus's git tree.  When basing on mainline, start with a well-known release
> point - a stable or -rc release - rather than branching off the mainline at
> an arbitrary spot.
> 
> It may become necessary to make versions against -mm, linux-next, or a
> subsystem tree, though, to facilitate wider testing and review.  Depending
> on the area of your patch and what is going on elsewhere, basing a patch
> against these other trees can require a significant amount of work
> resolving conflicts and dealing with API changes.

Yeah, but with -mm patches it is completely impractical to base them
against Linus's git tree.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v3 07/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
On 07.05.20 13:34, Michael S. Tsirkin wrote:
> On Thu, May 07, 2020 at 01:33:23PM +0200, David Hildenbrand wrote:
>>>> I get:
>>>>
>>>> error: sha1 information is lacking or useless (mm/memory_hotplug.c).
>>>> error: could not build fake ancestor
>>>>
>>>> which version is this against? Pls post patches on top of some tag
>>>> in Linus' tree if possible.
>>>
>>> As the cover states, latest linux-next. To be precise
>>>
>>> commit 6b43f715b6379433e8eb30aa9bcc99bd6a585f77 (tag: next-20200507,
>>> next/master)
>>> Author: Stephen Rothwell 
>>> Date:   Thu May 7 18:11:31 2020 +1000
>>>
>>> Add linux-next specific files for 20200507
>>>
>>
>> The patches seem to apply cleanly on top of
>>
>> commit a811c1fa0a02c062555b54651065899437bacdbe (linus/master)
>> Merge: b9388959ba50 16f8036086a9
>> Author: Linus Torvalds 
>> Date:   Wed May 6 20:53:22 2020 -0700
>>
>> Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> 
> Because you have the relevant hashes in your git tree not pruned yet.
> Do a new clone and they won't apply.
> 

Yeah, most probably, it knows how to merge. I'm used to sending all my
-mm stuff based on -next, so this here is different.

I'll wait a bit and then send v4 based on latest linus/master, adding
the two acks and reshuffling the MAINTAINERS patch. Thanks.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v3 07/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
>> I get:
>>
>> error: sha1 information is lacking or useless (mm/memory_hotplug.c).
>> error: could not build fake ancestor
>>
>> which version is this against? Pls post patches on top of some tag
>> in Linus' tree if possible.
> 
> As the cover states, latest linux-next. To be precise
> 
> commit 6b43f715b6379433e8eb30aa9bcc99bd6a585f77 (tag: next-20200507,
> next/master)
> Author: Stephen Rothwell 
> Date:   Thu May 7 18:11:31 2020 +1000
> 
> Add linux-next specific files for 20200507
> 

The patches seem to apply cleanly on top of

commit a811c1fa0a02c062555b54651065899437bacdbe (linus/master)
Merge: b9388959ba50 16f8036086a9
Author: Linus Torvalds 
Date:   Wed May 6 20:53:22 2020 -0700

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net


I can resend based on that, after giving it a short test.

-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v3 07/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
On 07.05.20 12:46, Michael S. Tsirkin wrote:
> On Thu, May 07, 2020 at 12:31:11PM +0200, David Hildenbrand wrote:
>> virtio-mem wants to offline and remove a memory block once it unplugged
>> all subblocks (e.g., using alloc_contig_range()). Let's provide
>> an interface to do that from a driver. virtio-mem already supports to
>> offline partially unplugged memory blocks. Offlining a fully unplugged
>> memory block will not require to migrate any pages. All unplugged
>> subblocks are PageOffline() and have a reference count of 0 - so
>> offlining code will simply skip them.
>>
>> All we need is an interface to offline and remove the memory from kernel
>> module context, where we don't have access to the memory block devices
>> (esp. find_memory_block() and device_offline()) and the device hotplug
>> lock.
>>
>> To keep things simple, allow to only work on a single memory block.
>>
>> Acked-by: Michal Hocko 
>> Tested-by: Pankaj Gupta 
>> Cc: Andrew Morton 
>> Cc: David Hildenbrand 
>> Cc: Oscar Salvador 
>> Cc: Michal Hocko 
>> Cc: Pavel Tatashin 
>> Cc: Wei Yang 
>> Cc: Dan Williams 
>> Cc: Qian Cai 
>> Signed-off-by: David Hildenbrand 
> 
> 
> didn't you lose Andrew Morton's ack here?

Yeah, thanks for noticing.

> 
>> ---
>>  include/linux/memory_hotplug.h |  1 +
>>  mm/memory_hotplug.c| 37 ++
>>  2 files changed, 38 insertions(+)
> 
> I get:
> 
> error: sha1 information is lacking or useless (mm/memory_hotplug.c).
> error: could not build fake ancestor
> 
> which version is this against? Pls post patches on top of some tag
> in Linus' tree if possible.

As the cover states, latest linux-next. To be precise

commit 6b43f715b6379433e8eb30aa9bcc99bd6a585f77 (tag: next-20200507,
next/master)
Author: Stephen Rothwell 
Date:   Thu May 7 18:11:31 2020 +1000

Add linux-next specific files for 20200507


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v3 10/15] MAINTAINERS: Add myself as virtio-mem maintainer

2020-05-07 Thread David Hildenbrand
On 07.05.20 12:47, Michael S. Tsirkin wrote:
> On Thu, May 07, 2020 at 12:31:14PM +0200, David Hildenbrand wrote:
>> Let's make sure patches/bug reports find the right person.
>>
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: David Hildenbrand 
> 
> Make this patch 2 in the series, or even squash into patch 1.

I'll move it to #2. If there are strong feelings, I can squash. Thanks!


-- 
Thanks,

David / dhildenb





[virtio-dev] Re: [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-07 Thread David Hildenbrand
On 07.05.20 12:48, Michael S. Tsirkin wrote:
> On Thu, May 07, 2020 at 12:31:04PM +0200, David Hildenbrand wrote:
>> This series is based on latest linux-next. The patches are located at:
>> https://github.com/davidhildenbrand/linux.git virtio-mem-v3
>>
>> Patch #1 - #10 were contained in v2 and only contain minor modifications
>> (mostly smaller fixes). The remaining patches are new and contain smaller
>> optimizations.
> 
> 
> Looks like you lost some acks, in particular I'd like to preserve
> Andrew Morton's ack.

Yeah, seems like I only picked up Pankaj's acks. I can resend.


-- 
Thanks,

David / dhildenb





[virtio-dev] [PATCH v3 13/15] virtio-mem: Unplug subblocks right-to-left

2020-05-07 Thread David Hildenbrand
We unplug blocks right-to-left, let's also unplug subblocks within a block
right-to-left.
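
The right-to-left scan can be sketched in userspace like this. The function
follows the shape of the loop in the diff below, but it is a simplified
model: a plain bool array stands in for the subblock bitmap, clearing an
array entry stands in for the unplug request, and unplug_right_to_left() is
a hypothetical name.

```c
#include <stdbool.h>

/*
 * Unplug up to *nb_sb plugged subblocks, starting at the highest index.
 * Consecutive plugged subblocks are grouped into a single request.
 * Returns the number of (simulated) unplug requests issued.
 */
static int unplug_right_to_left(bool *plugged, int nb_sb_per_mb, int *nb_sb)
{
	int requests = 0;
	int sb_id = nb_sb_per_mb - 1;

	while (*nb_sb > 0) {
		/* Find the next candidate subblock, moving left. */
		while (sb_id >= 0 && !plugged[sb_id])
			sb_id--;
		if (sb_id < 0)
			break;

		/* Extend the run to the left while more is requested. */
		int count = 1;
		while (count < *nb_sb && sb_id > 0 && plugged[sb_id - 1]) {
			count++;
			sb_id--;
		}

		/* Simulated unplug of [sb_id, sb_id + count). */
		for (int i = 0; i < count; i++)
			plugged[sb_id + i] = false;
		requests++;
		*nb_sb -= count;
		sb_id--;
	}
	return requests;
}
```

Starting from the right keeps the plugged subblocks clustered at the start of
a block, which matches the order in which blocks themselves are unplugged.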

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 38 -
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8dd57b61b09b..a719e1a04ac7 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -353,18 +353,6 @@ static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
return find_next_bit(vm->sb_bitmap, bit + count, bit) >= bit + count;
 }
 
-/*
- * Find the first plugged subblock. Returns vm->nb_sb_per_mb in case there is
- * none.
- */
-static int virtio_mem_mb_first_plugged_sb(struct virtio_mem *vm,
- unsigned long mb_id)
-{
-   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb;
-
-   return find_next_bit(vm->sb_bitmap, bit + vm->nb_sb_per_mb, bit) - bit;
-}
-
 /*
  * Find the first unplugged subblock. Returns vm->nb_sb_per_mb in case there is
  * none.
@@ -1016,21 +1004,27 @@ static int virtio_mem_mb_unplug_any_sb(struct virtio_mem *vm,
int sb_id, count;
int rc;
 
+   sb_id = vm->nb_sb_per_mb - 1;
while (*nb_sb) {
-   sb_id = virtio_mem_mb_first_plugged_sb(vm, mb_id);
-   if (sb_id >= vm->nb_sb_per_mb)
+   /* Find the next candidate subblock */
+   while (sb_id >= 0 &&
+  virtio_mem_mb_test_sb_unplugged(vm, mb_id, sb_id, 1))
+   sb_id--;
+   if (sb_id < 0)
break;
+   /* Try to unplug multiple subblocks at a time */
count = 1;
-   while (count < *nb_sb &&
-  sb_id + count  < vm->nb_sb_per_mb &&
-  virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id + count,
-1))
+   while (count < *nb_sb && sb_id > 0 &&
+  virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id - 1, 1)) {
count++;
+   sb_id--;
+   }
 
rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, count);
if (rc)
return rc;
*nb_sb -= count;
+   sb_id--;
}
 
return 0;
@@ -1337,12 +1331,12 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
 * we should sense via something like is_mem_section_removable()
 * first if it makes sense to go ahead any try to allocate.
 */
-   for (sb_id = 0; sb_id < vm->nb_sb_per_mb && *nb_sb; sb_id++) {
+   for (sb_id = vm->nb_sb_per_mb - 1; sb_id >= 0 && *nb_sb; sb_id--) {
/* Find the next candidate subblock */
-   while (sb_id < vm->nb_sb_per_mb &&
+   while (sb_id >= 0 &&
   !virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
-   sb_id++;
-   if (sb_id >= vm->nb_sb_per_mb)
+   sb_id--;
+   if (sb_id < 0)
break;
 
start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-- 
2.25.3





[virtio-dev] [PATCH v3 15/15] virtio-mem: Try to unplug the complete online memory block first

2020-05-07 Thread David Hildenbrand
Right now, we always try to unplug single subblocks when processing an
online memory block. Let's try to unplug the complete online memory block
first, in case it is fully plugged and the unplug request is large
enough. Fall back to single subblocks in case the memory block cannot get
unplugged as a whole.
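
The "whole block first, then single subblocks" strategy can be sketched in
userspace as below. This is a simplified model under stated assumptions:
struct fake_mb and both functions are hypothetical, a per-subblock busy flag
stands in for a failing alloc_contig_range(), and the fallback loop handles
one subblock per request.

```c
#include <errno.h>
#include <stdbool.h>

#define NB_SB_PER_MB 8

/* Hypothetical model of one online memory block. */
struct fake_mb {
	bool plugged[NB_SB_PER_MB];
	bool busy[NB_SB_PER_MB]; /* models memory that cannot be isolated */
};

/* Unplug count subblocks starting at sb_id; -EBUSY if any of them is busy. */
static int unplug_sb_online(struct fake_mb *mb, int sb_id, int count)
{
	for (int i = sb_id; i < sb_id + count; i++)
		if (mb->busy[i])
			return -EBUSY;
	for (int i = sb_id; i < sb_id + count; i++)
		mb->plugged[i] = false;
	return 0;
}

static bool all_plugged(const struct fake_mb *mb)
{
	for (int i = 0; i < NB_SB_PER_MB; i++)
		if (!mb->plugged[i])
			return false;
	return true;
}

/* Try the complete block in one shot, then fall back to single subblocks. */
static int unplug_any_sb_online(struct fake_mb *mb, int *nb_sb)
{
	int rc;

	if (*nb_sb >= NB_SB_PER_MB && all_plugged(mb)) {
		rc = unplug_sb_online(mb, 0, NB_SB_PER_MB);
		if (!rc) {
			*nb_sb -= NB_SB_PER_MB;
			return 0;
		}
		if (rc != -EBUSY)
			return rc;
	}

	for (int sb_id = NB_SB_PER_MB - 1; sb_id >= 0 && *nb_sb > 0; sb_id--) {
		if (!mb->plugged[sb_id])
			continue;
		rc = unplug_sb_online(mb, sb_id, 1);
		if (rc == -EBUSY)
			continue; /* skip busy subblocks */
		if (rc)
			return rc;
		(*nb_sb)--;
	}
	return 0;
}
```

A single busy subblock makes the one-shot attempt fail, but the fallback
still unplugs every other subblock in the block.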

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 88 -
 1 file changed, 57 insertions(+), 31 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index abd93b778a26..9e523db3bee1 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1307,6 +1307,46 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
return 0;
 }
 
+/*
+ * Unplug the given plugged subblocks of an online memory block.
+ *
+ * Will modify the state of the memory block.
+ */
+static int virtio_mem_mb_unplug_sb_online(struct virtio_mem *vm,
+ unsigned long mb_id, int sb_id,
+ int count)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size) * count;
+   unsigned long start_pfn;
+   int rc;
+
+   start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+sb_id * vm->subblock_size);
+   rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
+   MIGRATE_MOVABLE, GFP_KERNEL);
+   if (rc == -ENOMEM)
+   /* whoops, out of memory */
+   return rc;
+   if (rc)
+   return -EBUSY;
+
+   /* Mark it as fake-offline before unplugging it */
+   virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
+   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+
+   /* Try to unplug the allocated memory */
+   rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, count);
+   if (rc) {
+   /* Return the memory to the buddy. */
+   virtio_mem_fake_online(start_pfn, nr_pages);
+   return rc;
+   }
+
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL);
+   return 0;
+}
+
 /*
  * Unplug the desired number of plugged subblocks of an online memory block.
  * Will skip subblock that are busy.
@@ -1321,16 +1361,21 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
  unsigned long mb_id,
  uint64_t *nb_sb)
 {
-   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
-   unsigned long start_pfn;
int rc, sb_id;
 
-   /*
-* TODO: To increase the performance we want to try bigger, consecutive
-* subblocks first before falling back to single subblocks. Also,
-* we should sense via something like is_mem_section_removable()
-* first if it makes sense to go ahead any try to allocate.
-*/
+   /* If possible, try to unplug the complete block in one shot. */
+   if (*nb_sb >= vm->nb_sb_per_mb &&
+   virtio_mem_mb_test_sb_plugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   rc = virtio_mem_mb_unplug_sb_online(vm, mb_id, 0,
+   vm->nb_sb_per_mb);
+   if (!rc) {
+   *nb_sb -= vm->nb_sb_per_mb;
+   goto unplugged;
+   } else if (rc != -EBUSY)
+   return rc;
+   }
+
+   /* Fallback to single subblocks. */
for (sb_id = vm->nb_sb_per_mb - 1; sb_id >= 0 && *nb_sb; sb_id--) {
/* Find the next candidate subblock */
while (sb_id >= 0 &&
@@ -1339,34 +1384,15 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
if (sb_id < 0)
break;
 
-   start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-sb_id * vm->subblock_size);
-   rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
-   MIGRATE_MOVABLE, GFP_KERNEL);
-   if (rc == -ENOMEM)
-   /* whoops, out of memory */
-   return rc;
-   if (rc)
-   /* memory busy, we can't unplug this chunk */
+   rc = virtio_mem_mb_unplug_sb_online(vm, mb_id, sb_id, 1);
+   if (rc == -EBUSY)
continue;
-
-   /* Mark it as fake-offline before unplugging it */
-   virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
-   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-
-   /* Try to unplug the allocated memory */
-   

[virtio-dev] [PATCH v3 14/15] virtio-mem: Use -ETXTBSY as error code if the device is busy

2020-05-07 Thread David Hildenbrand
Let's make it possible to distinguish whether the device or the memory is busy.
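
A minimal userspace sketch of the resulting error-code split follows. The
enum and helper names are hypothetical, the numeric response values are not
the ones from virtio_mem.h, and the -ENOMEM default is an assumption carried
over from the "unplug all" path in the diff.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical response codes; the real ones are defined in virtio_mem.h. */
enum fake_resp { RESP_ACK, RESP_NACK, RESP_BUSY, RESP_ERROR };

/* Map a device response to an errno value, as the patch does for requests. */
static int resp_to_errno(enum fake_resp resp)
{
	switch (resp) {
	case RESP_ACK:
		return 0;
	case RESP_NACK:
		return -EAGAIN;
	case RESP_BUSY:
		return -ETXTBSY; /* the device (hypervisor) is busy */
	case RESP_ERROR:
		return -EINVAL;
	default:
		return -ENOMEM; /* assumption, see lead-in */
	}
}

/* The device could not process the request right now. */
static bool device_is_busy(int rc)
{
	return rc == -ETXTBSY;
}

/* Guest memory could not be freed up for unplugging. */
static bool memory_is_busy(int rc)
{
	return rc == -EBUSY;
}
```

With separate codes, a caller can tell a busy hypervisor apart from guest
memory that could not be freed, even if both currently lead to a retry.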

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a719e1a04ac7..abd93b778a26 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -893,7 +893,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
case VIRTIO_MEM_RESP_NACK:
return -EAGAIN;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
case VIRTIO_MEM_RESP_ERROR:
return -EINVAL;
default:
@@ -919,7 +919,7 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
vm->plugged_size -= size;
return 0;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
case VIRTIO_MEM_RESP_ERROR:
return -EINVAL;
default:
@@ -941,7 +941,7 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
atomic_set(&vm->config_changed, 1);
return 0;
case VIRTIO_MEM_RESP_BUSY:
-   return -EBUSY;
+   return -ETXTBSY;
default:
return -ENOMEM;
}
@@ -1557,11 +1557,15 @@ static void virtio_mem_run_wq(struct work_struct *work)
 * or we have too many offline memory blocks.
 */
break;
-   case -EBUSY:
+   case -ETXTBSY:
/*
 * The hypervisor cannot process our request right now
-* (e.g., out of memory, migrating) or we cannot free up
-* any memory to unplug it (all plugged memory is busy).
+* (e.g., out of memory, migrating);
+*/
+   case -EBUSY:
+   /*
+* We cannot free up any memory to unplug it (all plugged memory
+* is busy).
 */
case -ENOMEM:
/* Out of memory, try again later. */
-- 
2.25.3





[virtio-dev] [PATCH v3 12/15] virtio-mem: Drop manual check for already present memory

2020-05-07 Thread David Hildenbrand
Registering our parent resource will fail if any memory is still present
(e.g., because somebody unloaded the driver and tries to reload it). No
need for the manual check.

Move our "unplug all" handling to after registering the resource.

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 55 -
 1 file changed, 12 insertions(+), 43 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 80cdb9e6b3c4..8dd57b61b09b 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1616,23 +1616,6 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
return 0;
 }
 
-/*
- * Test if any memory in the range is present in Linux.
- */
-static bool virtio_mem_any_memory_present(unsigned long start,
- unsigned long size)
-{
-   const unsigned long start_pfn = PFN_DOWN(start);
-   const unsigned long end_pfn = PFN_UP(start + size);
-   unsigned long pfn;
-
-   for (pfn = start_pfn; pfn != end_pfn; pfn++)
-   if (present_section_nr(pfn_to_section_nr(pfn)))
-   return true;
-
-   return false;
-}
-
 static int virtio_mem_init(struct virtio_mem *vm)
 {
const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
@@ -1664,32 +1647,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
virtio_cread(vm->vdev, struct virtio_mem_config, region_size,
 &vm->region_size);
 
-   /*
-* If we still have memory plugged, we might have to unplug all
-* memory first. However, if somebody simply unloaded the driver
-* we would have to reinitialize the old state - something we don't
-* support yet. Detect if we have any memory in the area present.
-*/
-   if (vm->plugged_size) {
-   uint64_t usable_region_size;
-
-   virtio_cread(vm->vdev, struct virtio_mem_config,
-usable_region_size, &usable_region_size);
-
-   if (virtio_mem_any_memory_present(vm->addr,
- usable_region_size)) {
-   dev_err(&vm->vdev->dev,
-   "reloading the driver is not supported\n");
-   return -EINVAL;
-   }
-   /*
-* Note: it might happen that the device is busy and
-* unplugging all memory might take some time.
-*/
-   dev_info(&vm->vdev->dev, "unplugging all memory required\n");
-   vm->unplug_all_required = 1;
-   }
-
/*
 * We always hotplug memory in memory block granularity. This way,
 * we have to wait for exactly one memory block to online.
@@ -1760,6 +1717,8 @@ static int virtio_mem_create_resource(struct virtio_mem 
*vm)
if (!vm->parent_resource) {
kfree(name);
dev_warn(&vm->vdev->dev, "could not reserve device region\n");
+   dev_info(&vm->vdev->dev,
+"reloading the driver is not supported\n");
return -EBUSY;
}
 
@@ -1816,6 +1775,16 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
+   /*
+* If we still have memory plugged, we have to unplug all memory first.
+* Registering our parent resource makes sure that this memory isn't
+* actually in use (e.g., trying to reload the driver).
+*/
+   if (vm->plugged_size) {
+   vm->unplug_all_required = 1;
+   dev_info(&vm->vdev->dev, "unplugging all memory is required\n");
+   }
+
/* register callbacks */
vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
rc = register_memory_notifier(&vm->memory_notifier);
-- 
2.25.3





[virtio-dev] [PATCH v3 10/15] MAINTAINERS: Add myself as virtio-mem maintainer

2020-05-07 Thread David Hildenbrand
Let's make sure patches/bug reports find the right person.

Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4d43ea5468b5..ad2b34f4dd66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18037,6 +18037,13 @@ S: Maintained
 F: drivers/iommu/virtio-iommu.c
 F: include/uapi/linux/virtio_iommu.h
 
+VIRTIO MEM DRIVER
+M: David Hildenbrand 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/virtio/virtio_mem.c
+F: include/uapi/linux/virtio_mem.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
-- 
2.25.3





[virtio-dev] [PATCH v3 11/15] virtio-mem: Add parent resource for all added "System RAM"

2020-05-07 Thread David Hildenbrand
Let's add a parent resource, named after the virtio device (inspired by
drivers/dax/kmem.c). This allows user space to identify which memory
belongs to which virtio-mem device.

With this change and two virtio-mem devices:
:/# cat /proc/iomem
-0fff : Reserved
1000-0009fbff : System RAM
[...]
14000-333ff : virtio0
  14000-147ff : System RAM
  14800-14fff : System RAM
  15000-157ff : System RAM
[...]
33400-3033ff : virtio1
  33800-33fff : System RAM
  34000-347ff : System RAM
  34800-34fff : System RAM
[...]

Cc: "Michael S. Tsirkin" 
Cc: Pankaj Gupta 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 52 -
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index eb4c16d634e0..80cdb9e6b3c4 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -99,6 +99,9 @@ struct virtio_mem {
/* Id of the next memory bock to prepare when needed. */
unsigned long next_mb_id;
 
+   /* The parent resource for all memory added via this device. */
+   struct resource *parent_resource;
+
/* Summary of all memory block states. */
unsigned long nb_mb_state[VIRTIO_MEM_MB_STATE_COUNT];
 #define VIRTIO_MEM_NB_OFFLINE_THRESHOLD10
@@ -1741,6 +1744,44 @@ static int virtio_mem_init(struct virtio_mem *vm)
return 0;
 }
 
+static int virtio_mem_create_resource(struct virtio_mem *vm)
+{
+   /*
+* When force-unloading the driver and removing the device, we
+* could have a garbage pointer. Duplicate the string.
+*/
+   const char *name = kstrdup(dev_name(&vm->vdev->dev), GFP_KERNEL);
+
+   if (!name)
+   return -ENOMEM;
+
+   vm->parent_resource = __request_mem_region(vm->addr, vm->region_size,
+  name, IORESOURCE_SYSTEM_RAM);
+   if (!vm->parent_resource) {
+   kfree(name);
+   dev_warn(&vm->vdev->dev, "could not reserve device region\n");
+   return -EBUSY;
+   }
+
+   /* The memory is not actually busy - make add_memory() work. */
+   vm->parent_resource->flags &= ~IORESOURCE_BUSY;
+   return 0;
+}
+
+static void virtio_mem_delete_resource(struct virtio_mem *vm)
+{
+   const char *name;
+
+   if (!vm->parent_resource)
+   return;
+
+   name = vm->parent_resource->name;
+   release_resource(vm->parent_resource);
+   kfree(vm->parent_resource);
+   kfree(name);
+   vm->parent_resource = NULL;
+}
+
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
struct virtio_mem *vm;
@@ -1770,11 +1811,16 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
+   /* create the parent resource for all memory */
+   rc = virtio_mem_create_resource(vm);
+   if (rc)
+   goto out_del_vq;
+
/* register callbacks */
vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
rc = register_memory_notifier(&vm->memory_notifier);
if (rc)
-   goto out_del_vq;
+   goto out_del_resource;
rc = register_virtio_mem_device(vm);
if (rc)
goto out_unreg_mem;
@@ -1788,6 +1834,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return 0;
 out_unreg_mem:
unregister_memory_notifier(&vm->memory_notifier);
+out_del_resource:
+   virtio_mem_delete_resource(vm);
 out_del_vq:
vdev->config->del_vqs(vdev);
 out_free_vm:
@@ -1848,6 +1896,8 @@ static void virtio_mem_remove(struct virtio_device *vdev)
vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE])
dev_warn(&vdev->dev, "device still has system memory added\n");
+   else
+   virtio_mem_delete_resource(vm);
 
/* remove all tracking data - no locking needed */
vfree(vm->mb_state);
-- 
2.25.3





[virtio-dev] [PATCH v3 09/15] virtio-mem: Better retry handling

2020-05-07 Thread David Hildenbrand
Let's start with a retry interval of 5 seconds and double the time until
we reach 5 minutes, in case we keep getting errors. Reset the retry
interval in case we succeeded.

The two main reasons for having to retry are
- The hypervisor is busy and cannot process our request
- We cannot reach the desired requested_size (esp., not enough memory can
  get unplugged because we can't allocate any subblocks).
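
The retry policy described above can be sketched in plain C as follows. This is a minimal userspace illustration, not the driver code: the macro and function names are hypothetical, and the bounds are taken from the prose of this commit message (5 seconds and 5 minutes), not from the constants in the diff.

```c
#include <assert.h>

/* Assumed bounds, taken from the commit message text (5 s and 5 min). */
#define RETRY_MIN_MS	5000u
#define RETRY_MAX_MS	300000u

/* Interval to use after another failed attempt: double it, capped at the max. */
static unsigned int retry_next_ms(unsigned int cur_ms)
{
	unsigned int next;

	assert(cur_ms >= RETRY_MIN_MS && cur_ms <= RETRY_MAX_MS);
	next = cur_ms * 2;
	return next > RETRY_MAX_MS ? RETRY_MAX_MS : next;
}

/* Interval to use after a request succeeded. */
static unsigned int retry_reset_ms(void)
{
	return RETRY_MIN_MS;
}
```

Under these assumed bounds, repeated failures step through 5 s, 10 s, 20 s, 40 s, 80 s, 160 s and then stay at 300 s until a success resets the interval.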

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a2edb87e5ed8..eb4c16d634e0 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -141,7 +141,9 @@ struct virtio_mem {
 
/* Timer for retrying to plug/unplug memory. */
struct hrtimer retry_timer;
-#define VIRTIO_MEM_RETRY_TIMER_MS  3
+   unsigned int retry_timer_ms;
+#define VIRTIO_MEM_RETRY_TIMER_MIN_MS  5
+#define VIRTIO_MEM_RETRY_TIMER_MAX_MS  30
 
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
@@ -1550,6 +1552,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 
switch (rc) {
case 0:
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
break;
case -ENOSPC:
/*
@@ -1565,8 +1568,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 */
case -ENOMEM:
/* Out of memory, try again later. */
-   hrtimer_start(&vm->retry_timer,
- ms_to_ktime(VIRTIO_MEM_RETRY_TIMER_MS),
+   hrtimer_start(&vm->retry_timer, ms_to_ktime(vm->retry_timer_ms),
  HRTIMER_MODE_REL);
break;
case -EAGAIN:
@@ -1586,6 +1588,8 @@ static enum hrtimer_restart 
virtio_mem_timer_expired(struct hrtimer *timer)
 retry_timer);
 
virtio_mem_retry(vm);
+   vm->retry_timer_ms = min_t(unsigned int, vm->retry_timer_ms * 2,
+  VIRTIO_MEM_RETRY_TIMER_MAX_MS);
return HRTIMER_NORESTART;
 }
 
@@ -1754,6 +1758,7 @@ static int virtio_mem_probe(struct virtio_device *vdev)
spin_lock_init(&vm->removal_lock);
hrtimer_init(&vm->retry_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
vm->retry_timer.function = virtio_mem_timer_expired;
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
 
/* register the virtqueue */
rc = virtio_mem_init_vq(vm);
-- 
2.25.3





[virtio-dev] [PATCH v3 06/15] virtio-mem: Allow to offline partially unplugged memory blocks

2020-05-07 Thread David Hildenbrand
Dropping the reference count of PageOffline() pages during MEM_GOING_OFFLINE
allows offlining code to skip them. However, we also have to clear
PG_reserved, because PG_reserved pages get detected as unmovable right
away. Take care of restoring the reference count when offlining is
canceled.

Clarify why we don't have to perform any action when unloading the
driver. Also, let's add a warning if anybody is still holding a
reference to unplugged pages when offlining.

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 68 -
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 74f0d3cb1d22..b0b41c73ce89 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -572,6 +572,57 @@ static void virtio_mem_notify_online(struct virtio_mem 
*vm, unsigned long mb_id,
virtio_mem_retry(vm);
 }
 
+static void virtio_mem_notify_going_offline(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   struct page *page;
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Drop our reference to the pages so the memory can get
+* offlined and add the unplugged pages to the managed
+* page counters (so offlining code can correctly subtract
+* them again).
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), nr_pages);
+   for (i = 0; i < nr_pages; i++) {
+   page = pfn_to_page(pfn + i);
+   if (WARN_ON(!page_ref_dec_and_test(page)))
+   dump_page(page, "unplugged page referenced");
+   }
+   }
+}
+
+static void virtio_mem_notify_cancel_offline(struct virtio_mem *vm,
+unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Get the reference we dropped when going offline and
+* subtract the unplugged pages from the managed page
+* counters.
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
+   for (i = 0; i < nr_pages; i++)
+   page_ref_inc(pfn_to_page(pfn + i));
+   }
+}
+
 /*
  * This callback will either be called synchronously from add_memory() or
  * asynchronously (e.g., triggered via user space). We have to be careful
@@ -618,6 +669,7 @@ static int virtio_mem_memory_notifier_cb(struct 
notifier_block *nb,
break;
}
vm->hotplug_active = true;
+   virtio_mem_notify_going_offline(vm, mb_id);
break;
case MEM_GOING_ONLINE:
mutex_lock(&vm->hotplug_mutex);
@@ -642,6 +694,12 @@ static int virtio_mem_memory_notifier_cb(struct 
notifier_block *nb,
mutex_unlock(&vm->hotplug_mutex);
break;
case MEM_CANCEL_OFFLINE:
+   if (!vm->hotplug_active)
+   break;
+   virtio_mem_notify_cancel_offline(vm, mb_id);
+   vm->hotplug_active = false;
+   mutex_unlock(&vm->hotplug_mutex);
+   break;
case MEM_CANCEL_ONLINE:
if (!vm->hotplug_active)
break;
@@ -668,8 +726,11 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
struct page *page = pfn_to_page(pfn);
 
__SetPageOffline(page);
-   if (!onlined)
+   if (!onlined) {
SetPageDirty(page);
+   /* FIXME: remove after cleanups */
+   ClearPageReserved(page);
+   }
}
 }
 
@@ -1722,6 +1783,11 @@ static void virtio_mem_remove(struct virtio_device *vdev)
  

[virtio-dev] [PATCH v3 07/15] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-05-07 Thread David Hildenbrand
virtio-mem wants to offline and remove a memory block once it has unplugged
all subblocks (e.g., using alloc_contig_range()). Let's provide
an interface to do that from a driver. virtio-mem already supports
offlining partially unplugged memory blocks. Offlining a fully unplugged
memory block will not require migrating any pages. All unplugged
subblocks are PageOffline() and have a reference count of 0, so
offlining code will simply skip them.

All we need is an interface to offline and remove the memory from kernel
module context, where we don't have access to the memory block devices
(esp. find_memory_block() and device_offline()) and the device hotplug
lock.

To keep things simple, only allow working on a single memory block.
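
The single-block restriction amounts to an alignment and size check on the requested range. A minimal userspace sketch, assuming a fixed 128 MiB memory block size (in the kernel the size comes from memory_block_size_bytes()); BLOCK_SIZE and check_single_block() are hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed memory block size for illustration (128 MiB). */
#define BLOCK_SIZE	(128ull << 20)

/* Return 0 if [start, start + size) is exactly one aligned memory block. */
static int check_single_block(uint64_t start, uint64_t size)
{
	/* The mask trick below relies on a power-of-two block size. */
	assert((BLOCK_SIZE & (BLOCK_SIZE - 1)) == 0);

	if (start & (BLOCK_SIZE - 1))
		return -EINVAL;		/* not aligned to a memory block */
	if (size != BLOCK_SIZE)
		return -EINVAL;		/* more or less than one block */
	return 0;
}
```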

Acked-by: Michal Hocko 
Tested-by: Pankaj Gupta 
Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Wei Yang 
Cc: Dan Williams 
Cc: Qian Cai 
Signed-off-by: David Hildenbrand 
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c| 37 ++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7dca9cd6076b..d641828e5596 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -318,6 +318,7 @@ extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
+extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline void try_offline_node(int nid) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 936bfe208a6e..bf1941f02a60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1748,4 +1748,41 @@ int remove_memory(int nid, u64 start, u64 size)
return rc;
 }
 EXPORT_SYMBOL_GPL(remove_memory);
+
+/*
+ * Try to offline and remove a memory block. Might take a long time to
+ * finish in case memory is still in use. Primarily useful for memory devices
+ * that logically unplugged all memory (so it's no longer in use) and want to
+ * offline + remove the memory block.
+ */
+int offline_and_remove_memory(int nid, u64 start, u64 size)
+{
+   struct memory_block *mem;
+   int rc = -EINVAL;
+
+   if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
+   size != memory_block_size_bytes())
+   return rc;
+
+   lock_device_hotplug();
+   mem = find_memory_block(__pfn_to_section(PFN_DOWN(start)));
+   if (mem)
+   rc = device_offline(&mem->dev);
+   /* Ignore if the device is already offline. */
+   if (rc > 0)
+   rc = 0;
+
+   /*
+* In case we succeeded to offline the memory block, remove it.
+* This cannot fail as it cannot get onlined in the meantime.
+*/
+   if (!rc) {
+   rc = try_remove_memory(nid, start, size);
+   WARN_ON_ONCE(rc);
+   }
+   unlock_device_hotplug();
+
+   return rc;
+}
+EXPORT_SYMBOL_GPL(offline_and_remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.25.3





[virtio-dev] [PATCH v3 08/15] virtio-mem: Offline and remove completely unplugged memory blocks

2020-05-07 Thread David Hildenbrand
Let's offline+remove memory blocks once all subblocks are unplugged. We
can use the new Linux MM interface for that. As no memory is in use
anymore, this shouldn't take a long time and shouldn't fail. There might
be corner cases where the offlining could still fail (especially, if
another notifier NACKs the offlining request).

Acked-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 47 +
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index b0b41c73ce89..a2edb87e5ed8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -446,6 +446,28 @@ static int virtio_mem_mb_remove(struct virtio_mem *vm, 
unsigned long mb_id)
return remove_memory(nid, addr, memory_block_size_bytes());
 }
 
+/*
+ * Try to offline and remove a memory block from Linux.
+ *
+ * Must not be called with the vm->hotplug_mutex held (possible deadlock with
+ * onlining code).
+ *
+ * Will not modify the state of the memory block.
+ */
+static int virtio_mem_mb_offline_and_remove(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
+
+   dev_dbg(&vm->vdev->dev, "offlining and removing memory block: %lu\n",
+   mb_id);
+   return offline_and_remove_memory(nid, addr, memory_block_size_bytes());
+}
+
 /*
  * Trigger the workqueue so the device can perform its magic.
  */
@@ -537,7 +559,13 @@ static void virtio_mem_notify_offline(struct virtio_mem 
*vm,
break;
}
 
-   /* trigger the workqueue, maybe we can now unplug memory. */
+   /*
+* Trigger the workqueue, maybe we can now unplug memory. Also,
+* when we offline and remove a memory block, this will re-trigger
+* us immediately - which is often nice because the removal of
+* the memory block (e.g., memmap) might have freed up memory
+* on other memory blocks we manage.
+*/
virtio_mem_retry(vm);
 }
 
@@ -1284,7 +1312,8 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct 
virtio_mem *vm,
  * Unplug the desired number of plugged subblocks of an online memory block.
  * Will skip subblock that are busy.
  *
- * Will modify the state of the memory block.
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
  *
  * Note: Can fail after some subblocks were successfully unplugged. Can
  *   return 0 even if subblocks were busy and could not get unplugged.
@@ -1340,9 +1369,19 @@ static int virtio_mem_mb_unplug_any_sb_online(struct 
virtio_mem *vm,
}
 
/*
-* TODO: Once all subblocks of a memory block were unplugged, we want
-* to offline the memory block and remove it.
+* Once all subblocks of a memory block were unplugged, offline and
+* remove it. This will usually not fail, as no memory is in use
+* anymore - however some other notifiers might NACK the request.
 */
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   mutex_unlock(&vm->hotplug_mutex);
+   rc = virtio_mem_mb_offline_and_remove(vm, mb_id);
+   mutex_lock(&vm->hotplug_mutex);
+   if (!rc)
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+   }
+
return 0;
 }
 
-- 
2.25.3





[virtio-dev] [PATCH v3 05/15] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2020-05-07 Thread David Hildenbrand
virtio-mem wants to allow to offline memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially, to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's make that possible by allowing any PageOffline() page to be isolated
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (as if they were free, although they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count and make offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
the memory block has to be handled by hooking into onlining code
(online_page_callback_t), setting PageOffline() on the pages again and
not giving them to the buddy.
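
The reference-count protocol from the text and Note 1 can be modeled in userspace as follows. This is only a toy model under stated assumptions: struct fake_page, the event enum, and both functions are hypothetical stand-ins and not kernel APIs.

```c
#include <assert.h>

/* Toy stand-ins for the two memory notifier events discussed above. */
enum mem_event { GOING_OFFLINE, CANCEL_OFFLINE };

/* Toy page: a reference count plus a "PageOffline()" marker. */
struct fake_page {
	int refcount;
	int offline;
};

/* What a driver that agrees to offlining does in its notifier. */
static void driver_notify(struct fake_page *pages, int n, enum mem_event ev)
{
	for (int i = 0; i < n; i++) {
		if (!pages[i].offline)
			continue;
		if (ev == GOING_OFFLINE) {
			/* Only the driver itself may hold a reference. */
			assert(pages[i].refcount == 1);
			pages[i].refcount--;
		} else {
			/* Offlining failed: take the reference back. */
			pages[i].refcount++;
		}
	}
}

/* Offlining code may skip offline pages with a reference count of 0. */
static int can_skip_when_offlining(const struct fake_page *p)
{
	return p->offline && p->refcount == 0;
}
```

A driver that does not agree to offlining simply leaves the reference count alone, so can_skip_when_offlining() stays false for its pages and offlining fails as before.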

Reviewed-by: Alexander Duyck 
Acked-by: Michal Hocko 
Tested-by: Pankaj Gupta 
Cc: Andrew Morton 
Cc: Juergen Gross 
Cc: Konrad Rzeszutek Wilk 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Qian Cai 
Cc: Pingfan Liu 
Signed-off-by: David Hildenbrand 
---
 include/linux/page-flags.h | 10 +
 mm/memory_hotplug.c| 44 +-
 mm/page_alloc.c| 24 +
 mm/page_isolation.c|  9 
 4 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..6be1aa559b1e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -777,6 +777,16 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * not onlined when onlining the section).
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
+ *
+ * If a driver wants to allow to offline unmovable PageOffline() pages without
+ * putting them back to the buddy, it can do so via the memory notifier by
+ * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
+ * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
+ * pages (now with a reference count of zero) are treated like free pages,
+ * allowing the containing memory block to get offlined. A driver that
+ * relies on this feature is aware that re-onlining the memory block will
+ * require to re-set the pages PageOffline() and not giving them to the
+ * buddy via online_page_callback_t.
  */
 PAGE_TYPE_OPS(Offline, offline)
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 555137bd0882..936bfe208a6e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1151,11 +1151,17 @@ struct zone *test_pages_in_a_zone(unsigned long 
start_pfn,
 
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
- * non-lru movable pages and hugepages). We scan pfn because it's much
- * easier than scanning over linked list. This function returns the pfn
- * of the first found movable page if it's found, otherwise 0.
+ * non-lru movable pages and hugepages). Will skip over most unmovable
+ * pages (esp., pages that can be skipped when offlining), but bail out on
+ * definitely unmovable pages.
+ *
+ * Returns:
+ * 0 in case a movable page is found and movable_pfn was updated.
+ * -ENOENT in case no movable page was found.
+ * -EBUSY in case a definitely unmovable page was found.
  */
-static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
+static int scan_movable_pages(unsigned long start, unsigned long end,
+ unsigned long *movable_pfn)
 {
unsigned long pfn;
 
@@ -1167,18 +1173,30 @@ static unsigned long scan_movable_pages(unsigned long 
start, unsigned long end)
continue;
page = pfn_to_page(pfn);
if (PageLRU(page))
-   return pfn;
+   goto found;
if (__PageMovable(page

[virtio-dev] [PATCH v3 02/15] virtio-mem: Allow to specify an ACPI PXM as nid

2020-05-07 Thread David Hildenbrand
We want to allow specifying (similar to a DIMM) which node a
virtio-mem device (and, therefore, its memory) belongs to. Add a new
virtio-mem feature flag and export pxm_to_node, so it can be used in kernel
module context.

Acked-by: Michal Hocko  # for the export
Acked-by: "Rafael J. Wysocki"  # for the export
Acked-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/acpi/numa/srat.c|  1 +
 drivers/virtio/virtio_mem.c | 39 +++--
 include/uapi/linux/virtio_mem.h | 10 -
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 47b4969d9b93..5be5a977da1b 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -35,6 +35,7 @@ int pxm_to_node(int pxm)
return NUMA_NO_NODE;
return pxm_to_node_map[pxm];
 }
+EXPORT_SYMBOL(pxm_to_node);
 
 int node_to_pxm(int node)
 {
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 5d1dcaa6fc42..270ddeaec059 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 
+#include 
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -72,6 +74,8 @@ struct virtio_mem {
 
/* The device block size (for communicating with the device). */
uint32_t device_block_size;
+   /* The translated node id. NUMA_NO_NODE in case not specified. */
+   int nid;
/* Physical start address of the memory region. */
uint64_t addr;
/* Maximum region size in bytes. */
@@ -389,7 +393,10 @@ static int virtio_mem_sb_bitmap_prepare_next_mb(struct 
virtio_mem *vm)
 static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
 {
const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
-   int nid = memory_add_physaddr_to_nid(addr);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
 
dev_dbg(&vm->vdev->dev, "adding memory block: %lu\n", mb_id);
return add_memory(nid, addr, memory_block_size_bytes());
@@ -407,7 +414,10 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, 
unsigned long mb_id)
 static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id)
 {
const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
-   int nid = memory_add_physaddr_to_nid(addr);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
 
dev_dbg(&vm->vdev->dev, "removing memory block: %lu\n", mb_id);
return remove_memory(nid, addr, memory_block_size_bytes());
@@ -426,6 +436,17 @@ static void virtio_mem_retry(struct virtio_mem *vm)
spin_unlock_irqrestore(&vm->removal_lock, flags);
 }
 
+static int virtio_mem_translate_node_id(struct virtio_mem *vm, uint16_t 
node_id)
+{
+   int node = NUMA_NO_NODE;
+
+#if defined(CONFIG_ACPI_NUMA)
+   if (virtio_has_feature(vm->vdev, VIRTIO_MEM_F_ACPI_PXM))
+   node = pxm_to_node(node_id);
+#endif
+   return node;
+}
+
 /*
  * Test if a virtio-mem device overlaps with the given range. Can be called
  * from (notifier) callbacks lockless.
@@ -1267,6 +1288,7 @@ static bool virtio_mem_any_memory_present(unsigned long 
start,
 static int virtio_mem_init(struct virtio_mem *vm)
 {
const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
+   uint16_t node_id;
 
if (!vm->vdev->config->get) {
dev_err(&vm->vdev->dev, "config access disabled\n");
@@ -1287,6 +1309,9 @@ static int virtio_mem_init(struct virtio_mem *vm)
 &vm->plugged_size);
virtio_cread(vm->vdev, struct virtio_mem_config, block_size,
 &vm->device_block_size);
+   virtio_cread(vm->vdev, struct virtio_mem_config, node_id,
+&node_id);
+   vm->nid = virtio_mem_translate_node_id(vm, node_id);
virtio_cread(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
virtio_cread(vm->vdev, struct virtio_mem_config, region_size,
 &vm->region_size);
@@ -1365,6 +1390,8 @@ static int virtio_mem_init(struct virtio_mem *vm)
 memory_block_size_bytes());
dev_info(&vm->vdev->dev, "subblock size: 0x%x",
 vm->subblock_size);
+   if (vm->nid != NUMA_NO_NODE)
+   dev_info(&vm->vdev->dev, "nid: %d", vm->nid);
 
return 0;
 }
@@ -1508,12 +1535,20 @@ static in

[virtio-dev] [PATCH v3 01/15] virtio-mem: Paravirtualized memory hotplug

2020-05-07 Thread David Hildenbrand
Each virtio-mem device owns exactly one memory region. It is responsible
for adding/removing memory from that memory region on request.

When the device driver starts up, the requested amount of memory is
queried and then plugged to Linux. On request, further memory can be
plugged or unplugged. This patch only implements the plugging part.

On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
When required, a new memory block will be added (e.g., usually 128MB on
x86-64) in order to plug more subblocks. Only x86-64 was tested for now.
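
To make the granularity concrete, here is a small user-space sketch of the arithmetic implied above. It is not code from the patch: the helper and constant names are made up for illustration, and the rounding helper assumes that no already-added memory block has free subblocks left.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes quoted in the text above; the constant names are illustrative only. */
#define EXAMPLE_SUBBLOCK_SIZE     (4ULL << 20)    /* 4MB subblock */
#define EXAMPLE_MEMORY_BLOCK_SIZE (128ULL << 20)  /* 128MB memory block */

/* Number of subblocks that make up one memory block. */
static inline uint64_t example_nb_sb_per_mb(uint64_t mb_size, uint64_t sb_size)
{
	assert(sb_size != 0 && mb_size % sb_size == 0);
	return mb_size / sb_size;
}

/*
 * Number of new memory blocks needed to hold `request` bytes, rounding up.
 * Assumes (hypothetically) that no existing block has unplugged subblocks.
 */
static inline uint64_t example_nb_mb_for_request(uint64_t request,
						 uint64_t mb_size)
{
	assert(mb_size != 0);
	return (request + mb_size - 1) / mb_size;
}
```

With the sizes above, one memory block holds 32 subblocks, and a 300MB request would need three new memory blocks under the stated assumption.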

The online_page callback is used to keep unplugged subblocks offline
when onlining memory - similar to the Hyper-V balloon driver. Unplugged
pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
skip them.

User space is usually responsible for onlining the added memory. The
memory hotplug notifier is used to synchronize virtio-mem activity
against memory onlining/offlining.

Each virtio-mem device can belong to a NUMA node, which allows us to
easily add/remove small chunks of memory to/from a specific NUMA node by
using multiple virtio-mem devices. This works even when the guest has no
idea about the NUMA topology.

One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
"sub-DIMMS".

This patch directly introduces the basic infrastructure to implement memory
unplug. Especially the memory block states and subblock bitmaps will be
heavily used there.

Notes:
- In case memory is to be onlined by user space, we limit the amount of
  offline memory blocks, to not run out of memory. This is esp. an
  issue if memory is added faster than it is getting onlined.
- Suspend/Hibernate is not supported due to the way virtio-mem devices
  behave. Limited support might be possible in the future.
- Reloading the device driver is not supported.

Reviewed-by: Pankaj Gupta 
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   16 +
 drivers/virtio/Makefile |1 +
 drivers/virtio/virtio_mem.c | 1533 +++
 include/uapi/linux/virtio_ids.h |1 +
 include/uapi/linux/virtio_mem.h |  200 
 5 files changed, 1751 insertions(+)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 69a32dfc318a..d6dde7d2cf76 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -78,6 +78,22 @@ config VIRTIO_BALLOON
 
 If unsure, say M.
 
+config VIRTIO_MEM
+   tristate "Virtio mem driver"
+   default m
+   depends on X86_64
+   depends on VIRTIO
+   depends on MEMORY_HOTPLUG_SPARSE
+   depends on MEMORY_HOTREMOVE
+   help
+This driver provides access to virtio-mem paravirtualized memory
+devices, allowing to hotplug and hotunplug memory.
+
+This driver was only tested under x86-64, but should theoretically
+work on all architectures that support memory hotplug and hotremove.
+
+If unsure, say M.
+
 config VIRTIO_INPUT
tristate "Virtio input driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 29a1386ecc03..4d993791f2d7 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -7,3 +7,4 @@ virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
 obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
new file mode 100644
index ..5d1dcaa6fc42
--- /dev/null
+++ b/drivers/virtio/virtio_mem.c
@@ -0,0 +1,1533 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtio-mem device driver.
+ *
+ * Copyright Red Hat, Inc. 2020
+ *
+ * Author(s): David Hildenbrand 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum virtio_mem_mb_state {
+   /* Unplugged, not added to Linux. Can be reused later. */
+   VIRTIO_MEM_MB_STATE_UNUSED = 0,
+   /* (Partially) plugged, not added to Linux. Error on add_memory(). */
+   VIRTIO_MEM_MB_STATE_PLUGGED,
+   /* Fully plugged, fully added to Linux, offline. */
+   VIRTIO_MEM_MB_STATE_OFFLINE,
+   /* Partially plugged, fully added to Linux, offline. */
+   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL,
+   /* Fully plugged, fully added to Linux, online (!ZONE_MOVABLE). */
+

[virtio-dev] [PATCH v3 03/15] virtio-mem: Paravirtualized memory hotunplug part 1

2020-05-07 Thread David Hildenbrand
Unplugging subblocks of memory blocks that are offline is easy. All we
have to do is watch out for concurrent onlining activity.
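
As a rough illustration of the bookkeeping involved, the following user-space sketch checks whether a range of subblocks is entirely unplugged in a plugged-subblock bitmap. The patch itself uses the kernel's find_next_bit() for this; the plain loop and all example_* names below are assumptions made for the sake of a self-contained example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return true if `bit` is set, i.e. the subblock is plugged. */
static bool example_test_bit(const unsigned long *bitmap, size_t bit)
{
	return (bitmap[bit / EXAMPLE_BITS_PER_WORD] >>
		(bit % EXAMPLE_BITS_PER_WORD)) & 1UL;
}

/* Mark the subblock at `bit` as plugged. */
static void example_set_bit(unsigned long *bitmap, size_t bit)
{
	bitmap[bit / EXAMPLE_BITS_PER_WORD] |=
		1UL << (bit % EXAMPLE_BITS_PER_WORD);
}

/* True if none of the `count` subblocks starting at `first` is plugged. */
static bool example_range_unplugged(const unsigned long *bitmap,
				    size_t first, size_t count)
{
	for (size_t bit = first; bit < first + count; bit++)
		if (example_test_bit(bitmap, bit))
			return false;
	return true;
}
```

Once every subblock of an offline memory block tests as unplugged, the block itself is a candidate for removal from Linux.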

Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 116 +++-
 1 file changed, 114 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 270ddeaec059..a3ec795be8be 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -123,7 +123,7 @@ struct virtio_mem {
 *
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
-* plugged.
+* plugged/unplugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -280,6 +280,12 @@ static int virtio_mem_mb_state_prepare_next_mb(struct virtio_mem *vm)
 _mb_id++) \
if (virtio_mem_mb_get_state(_vm, _mb_id) == _state)
 
+#define virtio_mem_for_each_mb_state_rev(_vm, _mb_id, _state) \
+   for (_mb_id = _vm->next_mb_id - 1; \
+_mb_id >= _vm->first_mb_id && _vm->nb_mb_state[_state]; \
+_mb_id--) \
+   if (virtio_mem_mb_get_state(_vm, _mb_id) == _state)
+
 /*
  * Mark all selected subblocks plugged.
  *
@@ -325,6 +331,19 @@ static bool virtio_mem_mb_test_sb_plugged(struct virtio_mem *vm,
   bit + count;
 }
 
+/*
+ * Test if all selected subblocks are unplugged.
+ */
+static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
+   unsigned long mb_id, int sb_id,
+   int count)
+{
+   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
+
+   /* TODO: Helper similar to bitmap_set() */
+   return find_next_bit(vm->sb_bitmap, bit + count, bit) >= bit + count;
+}
+
 /*
  * Find the first plugged subblock. Returns vm->nb_sb_per_mb in case there is
  * none.
@@ -513,6 +532,9 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm,
BUG();
break;
}
+
+   /* trigger the workqueue, maybe we can now unplug memory. */
+   virtio_mem_retry(vm);
 }
 
 static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
@@ -1122,6 +1144,94 @@ static int virtio_mem_plug_request(struct virtio_mem *vm, uint64_t diff)
return rc;
 }
 
+/*
+ * Unplug the desired number of plugged subblocks of an offline memory block.
+ * Will fail if any subblock cannot get unplugged (instead of skipping it).
+ *
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
+ *
+ * Note: Can fail after some subblocks were successfully unplugged.
+ */
+static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
+  unsigned long mb_id,
+  uint64_t *nb_sb)
+{
+   int rc;
+
+   rc = virtio_mem_mb_unplug_any_sb(vm, mb_id, nb_sb);
+
+   /* some subblocks might have been unplugged even on failure */
+   if (!virtio_mem_mb_test_sb_plugged(vm, mb_id, 0, vm->nb_sb_per_mb))
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL);
+   if (rc)
+   return rc;
+
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   /*
+* Remove the block from Linux - this should never fail.
+* Hinder the block from getting onlined by marking it
+* unplugged. Temporarily drop the mutex, so
+* any pending GOING_ONLINE requests can be serviced/rejected.
+*/
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+
+   mutex_unlock(&vm->hotplug_mutex);
+   rc = virtio_mem_mb_remove(vm, mb_id);
+   BUG_ON(rc);
+   mutex_lock(&vm->hotplug_mutex);
+   }
+   return 0;
+}
+
+/*
+ * Try to unplug the requested amount of memory.
+ */
+static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
+{
+   uint64_t nb_sb = diff / vm->subblock_size;
+   unsigned long mb_id;
+   int rc;
+
+   if (!nb_sb)
+   return 0;
+
+   /*
+* We'll drop the mutex a couple of times when it is safe to do so.
+* This might result in some blocks switching the state (online/offline)
+* and we could miss them in this run - we will retry again later.
+*/
+   mutex_lock(&vm->hotplug_mutex);
+

[virtio-dev] [PATCH v3 00/15] virtio-mem: paravirtualized memory

2020-05-07 Thread David Hildenbrand
This series is based on latest linux-next. The patches are located at:
https://github.com/davidhildenbrand/linux.git virtio-mem-v3

Patch #1 - #10 were contained in v2 and only contain minor modifications
(mostly smaller fixes). The remaining patches are new and contain smaller
optimizations.

Details about virtio-mem can be found in the cover letter of v2 [1]. A
basic QEMU implementation was posted yesterday [2].

[1] https://lkml.kernel.org/r/20200311171422.10484-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20200506094948.76388-1-da...@redhat.com

v2 -> v3:
- "virtio-mem: Paravirtualized memory hotplug"
-- Include "linux/slab.h" to fix build issues
-- Remember the "region_size", helpful for patch #11
-- Minor simplification in virtio_mem_overlaps_range()
-- Use notifier_from_errno() instead of notifier_to_errno() in notifier
-- More reliable check for added memory when unloading the driver
- "virtio-mem: Allow to specify an ACPI PXM as nid"
-- Also print the nid
- Added patch #11-#15

Cc: Sebastien Boeuf 
Cc: Samuel Ortiz 
Cc: Robert Bradford 
Cc: Luiz Capitulino 
Cc: Pankaj Gupta 
Cc: teawater 
Cc: Igor Mammedov 
Cc: Dr. David Alan Gilbert 

David Hildenbrand (15):
  virtio-mem: Paravirtualized memory hotplug
  virtio-mem: Allow to specify an ACPI PXM as nid
  virtio-mem: Paravirtualized memory hotunplug part 1
  virtio-mem: Paravirtualized memory hotunplug part 2
  mm: Allow to offline unmovable PageOffline() pages via
MEM_GOING_OFFLINE
  virtio-mem: Allow to offline partially unplugged memory blocks
  mm/memory_hotplug: Introduce offline_and_remove_memory()
  virtio-mem: Offline and remove completely unplugged memory blocks
  virtio-mem: Better retry handling
  MAINTAINERS: Add myself as virtio-mem maintainer
  virtio-mem: Add parent resource for all added "System RAM"
  virtio-mem: Drop manual check for already present memory
  virtio-mem: Unplug subblocks right-to-left
  virtio-mem: Use -ETXTBSY as error code if the device is busy
  virtio-mem: Try to unplug the complete online memory block first

 MAINTAINERS |7 +
 drivers/acpi/numa/srat.c|1 +
 drivers/virtio/Kconfig  |   17 +
 drivers/virtio/Makefile |1 +
 drivers/virtio/virtio_mem.c | 1962 +++
 include/linux/memory_hotplug.h  |1 +
 include/linux/page-flags.h  |   10 +
 include/uapi/linux/virtio_ids.h |1 +
 include/uapi/linux/virtio_mem.h |  208 
 mm/memory_hotplug.c |   81 +-
 mm/page_alloc.c |   26 +
 mm/page_isolation.c |9 +
 12 files changed, 2314 insertions(+), 10 deletions(-)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

-- 
2.25.3





[virtio-dev] [PATCH v3 04/15] virtio-mem: Paravirtualized memory hotunplug part 2

2020-05-07 Thread David Hildenbrand
We also want to unplug online memory (contained in online memory blocks
and, therefore, managed by the buddy), and eventually replug it later.

When requested to unplug memory, we use alloc_contig_range() to allocate
subblocks in online memory blocks (so we are the owner) and send them to
our hypervisor. When requested to plug memory, we can replug such memory
using free_contig_range() after asking our hypervisor.

We also want to mark all allocated pages PG_offline, so nobody will
touch them. To differentiate pages that were never onlined when
onlining the memory block from pages allocated via alloc_contig_range(), we
use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
online the pages for the first time or use free_contig_range().
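
The flag handling described above can be modelled in user space roughly as follows. This is only a sketch under simplifying assumptions: struct example_page, the enum, and the example_* helpers are invented here, and the two boolean fields stand in for PG_offline and PageDirty() on a real struct page.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct page; the fields model PG_offline and PageDirty(). */
struct example_page {
	bool offline;
	bool dirty;
};

enum example_online_action {
	EXAMPLE_ONLINE_FIRST_TIME,	/* page was never handed to the buddy */
	EXAMPLE_FREE_CONTIG_RANGE,	/* page came from alloc_contig_range() */
};

/* Mark a page fake-offline; remember via "dirty" if it was never onlined. */
static void example_set_fake_offline(struct example_page *page, bool onlined)
{
	page->offline = true;
	if (!onlined)
		page->dirty = true;
}

/* Clear the markers and report which onlining path would apply. */
static enum example_online_action
example_fake_online(struct example_page *page)
{
	assert(page->offline);

	bool never_onlined = page->dirty;

	page->offline = false;
	page->dirty = false;
	return never_onlined ? EXAMPLE_ONLINE_FIRST_TIME
			     : EXAMPLE_FREE_CONTIG_RANGE;
}
```

The point of the second flag is that both kinds of pages look identical once they are PG_offline, so the distinction has to be recorded at the time the page is marked.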

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might completely be
fragmented with unmovable data, such that no subblock can get unplugged.

We are not touching ZONE_MOVABLE. If memory is onlined to ZONE_MOVABLE,
it can only get unplugged after that memory was offlined manually by user
space. In normal operation, we suggest onlining virtio-mem memory to
ZONE_NORMAL. In the future, we will try to make unplug more likely to
succeed.

Add a module parameter to control if online memory shall be touched.

As we want to access alloc_contig_range()/free_contig_range() from
kernel module context, export the symbols.

Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
are on the same node, in the same zone, and contain no holes.

Acked-by: Michal Hocko  # to export contig range allocator API
Tested-by: Pankaj Gupta 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Alexander Duyck 
Cc: Alexander Potapenko 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   1 +
 drivers/virtio/virtio_mem.c | 157 
 mm/page_alloc.c |   2 +
 3 files changed, 146 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index d6dde7d2cf76..4c1e14615001 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -85,6 +85,7 @@ config VIRTIO_MEM
depends on VIRTIO
depends on MEMORY_HOTPLUG_SPARSE
depends on MEMORY_HOTREMOVE
+   select CONTIG_ALLOC
help
 This driver provides access to virtio-mem paravirtualized memory
 devices, allowing to hotplug and hotunplug memory.
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a3ec795be8be..74f0d3cb1d22 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -23,6 +23,10 @@
 
 #include 
 
+static bool unplug_online = true;
+module_param(unplug_online, bool, 0644);
+MODULE_PARM_DESC(unplug_online, "Try to unplug online memory");
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -654,23 +658,35 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 }
 
 /*
- * Set a range of pages PG_offline.
+ * Set a range of pages PG_offline. Remember pages that were never onlined
+ * (via generic_online_page()) using PageDirty().
  */
 static void virtio_mem_set_fake_offline(unsigned long pfn,
-   unsigned int nr_pages)
+   unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __SetPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __SetPageOffline(page);
+   if (!onlined)
+   SetPageDirty(page);
+   }
 }
 
 /*
- * Clear PG_offline from a range of pages.
+ * Clear PG_offline from a range of pages. If the pages were never onlined,
+ * (via generic_online_page()), clear PageDirty().
  */
 static void virtio_mem_clear_fake_offline(unsigned long pfn,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __ClearPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __ClearPageOffline(page);
+   if (!onlined)
+   ClearPageDirty(page);
+   }
 }
 
 /*
@@ -686,10 +702,26 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned int nr_pages)
 * We are always called with subblock granularity, which is at least
 * aligned to MAX_ORDER - 1.
 */
-   virtio_mem_clear_fake_offline(pfn, nr_pages);
+   for (i = 0; 
