Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-30 Thread David Gibson
On Mon, Mar 29, 2021 at 02:53:47PM +0530, Shivaprasad G Bhat wrote:
> 
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat
> > An overall question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> Discussed in other threads..
> 
> 
> 
> > >   };
> > > @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
> > >   }
> > >   qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> > > +qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> > Do you actually need an extra mutex, or can you rely on the BQL?
> 
> I verified BQL is held at all places where it matters in the context of this
> patch.
> 
> Safe to get rid of this extra mutex.
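For concreteness, a minimal sketch of relying on the BQL instead,
assuming the lists are only ever touched from BQL context (this is an
illustration, not the posted code):

    /* qemu/main-loop.h; asserts the locking rule the mutex enforced */
    g_assert(qemu_mutex_iothread_locked());
    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);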
> 
> ...
> 
> > 
> > > +{
> > > + SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > + return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> > > + !QLIST_EMPTY(&spapr->completed_flush_states));
> > > +}
> > > +
> > > +static int spapr_nvdimm_pre_save(void *opaque)
> > > +{
> > > +SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> > > +aio_poll(qemu_get_aio_context(), true);
> > Hmm... how long could waiting for all the pending flushes to complete
> > take?  This could add substantially to the guest's migration downtime,
> > couldn't it?
> 
> 
> The time taken depends on the number of dirtied pages and the disk I/O write
> speed. The number of dirty pages on the host is configurable with tunables

Well, sure, I'm just trying to get an order of magnitude here.

> vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 20.04),
> vm.dirty_ratio (20%) of host memory and/or vm.dirty_expire_centisecs (30
> seconds). So, the host itself would be flushing the mmapped file on its own
> from time to time. For guests using the nvdimms with a filesystem, the flushes
> would come frequently and the number of dirty pages might be
> less. The

I'm not sure that necessarily follows.

> pmem applications can use the nvdimms without a filesystem. For such
> guests, the chances that a flush request comes from pmem applications at
> the time of migration are low, or random. But the host would have flushed
> the pagecache on its own when vm.dirty_background_ratio is crossed or
> vm.dirty_expire_centisecs expired. So, the worst case would stand at disk
> I/O latency for writing the pages dirtied in the last
> vm.dirty_expire_centisecs on the host, OR the latency for writing the maximum
> vm.dirty_background_ratio (10%) of host RAM. If you want me to calibrate any
> particular size or scenario and get the numbers, please let me know.

Hmm.  I feel like this could still easily be 10s, maybe 100s of
milliseconds, yes?  Given that typical migration downtime caps are
also in the 100s of milliseconds, this still seems troublesome.  Since
this is part of the device migration itself, this flushing will all
happen during the downtime, but won't be factored into the downtime
estimations.
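
To put rough, purely illustrative numbers on it: with 32 GiB of host
RAM and a disk sustaining ~1 GB/s, a full vm.dirty_background_ratio of
10% is ~3.2 GiB, i.e. on the order of 3 seconds; even 1% dirty would be
~300 ms, already beyond a typical downtime cap.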

> ...
> > > +
> > > +/*
> > > + * Acquire a unique token and reserve it for the new flush state.
> > > + */
> > > +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> > > +{
> > > +Error *err = NULL;
> > > +uint64_t token;
> > > +SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> > > +
> > > +state = g_malloc0(sizeof(*state));
> > > +
> > > +qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> > > +retry:
> > > +if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> > Using getrandom seems like overkill, why not just use a counter?
> 
> I didn't want a rogue guest to abuse the interface by consuming the
> return value, guessing a valid counter value, and causing the real
> driver's subsequent request to fail.

Why would a guessable value be bad?  The counter would be per-guest,
so AFAICT all a malicious guest could do is mess itself up.

> Also, across guest migrations, carrying the global counter to the
> destination is another thing to ponder.

I don't think you need to: if there are pending flushes on migration
you can set the dest counter to the max id of those, if not you can
reset it to 1 without harm.

Actually.. come to think of it, can't you just use the current max id
of pending flushes + 1 as a new id.
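
Something like this, as a minimal sketch (my reading of the idea, using
the list and field names from this patch):

    static uint64_t spapr_nvdimm_next_token(SpaprMachineState *spapr)
    {
        SpaprNVDIMMDeviceFlushState *state;
        uint64_t max = 0;

        /* scan both lists so completed-but-unreported tokens stay unique */
        QLIST_FOREACH(state, &spapr->pending_flush_states, node) {
            max = MAX(max, state->continue_token);
        }
        QLIST_FOREACH(state, &spapr->completed_flush_states, node) {
            max = MAX(max, state->continue_token);
        }
        return max + 1;    /* 0 stays reserved as "no continue-token" */
    }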


> Let me know if you want me to reconsider using a counter.
> 
> ...
> 
> > > mm_flush_states_lock);
> > > +
> > > +return state;
> > > +}
> > > +
> > > +/*
> > > + * spapr_nvdimm_finish_flushes
> > > + *  Waits for all pending flush requests to complete
> > > + *  their execution and free the states
> > > + */
> > > +void spapr_nvdimm_finish_flushes(void)
> > > +{
> > > +SpaprNVDIMMDeviceFlushState *state, *next;
> > > +SpaprMachineState *spapr = 

Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-29 Thread Shivaprasad G Bhat


On 3/24/21 8:37 AM, David Gibson wrote:

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:

machine vmstate.

Signed-off-by: Shivaprasad G Bhat

An overall question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?


Discussed in other threads..




  };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
  }
  
  qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);

+qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);

Do you actually need an extra mutex, or can you rely on the BQL?


I verified BQL is held at all places where it matters in the context of 
this patch.


Safe to get rid of this extra mutex.

...




+{
+ SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+ return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+ !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_pre_save(void *opaque)
+{
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+aio_poll(qemu_get_aio_context(), true);

Hmm... how long could waiting for all the pending flushes to complete
take?  This could add substantially to the guest's migration downtime,
couldn't it?



The time taken depends on the number of dirtied pages and the disk I/O
write speed. The number of dirty pages on the host is configurable with
tunables vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu
20.04), vm.dirty_ratio (20%) of host memory and/or
vm.dirty_expire_centisecs (30 seconds). So, the host itself would be
flushing the mmapped file on its own from time to time. For guests using
the nvdimms with a filesystem, the flushes would come frequently and
the number of dirty pages might be less. The pmem applications can use
the nvdimms without a filesystem. For such guests, the chances that a
flush request comes from pmem applications at the time of migration
are low, or random. But the host would have flushed the pagecache on
its own when vm.dirty_background_ratio is crossed or
vm.dirty_expire_centisecs expired. So, the worst case would stand at
disk I/O latency for writing the pages dirtied in the last
vm.dirty_expire_centisecs on the host, OR the latency for writing the
maximum vm.dirty_background_ratio (10%) of host RAM. If you want me to
calibrate any particular size or scenario and get the numbers, please
let me know.


...

+
+/*
+ * Acquire a unique token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
+{
+Error *err = NULL;
+uint64_t token;
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
+
+state = g_malloc0(sizeof(*state));
+
+qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+retry:
+if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {

Using getrandom seems like overkill, why not just use a counter?


I didn't want a rogue guest to abuse the interface by consuming the
return value, guessing a valid counter value, and causing the real
driver's subsequent request to fail. Also, across guest migrations,
carrying the global counter to the destination is another thing to
ponder.



Let me know if you want me to reconsider using a counter.

...


mm_flush_states_lock);
+
+return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *  Waits for all pending flush requests to complete
+ *  their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+SpaprNVDIMMDeviceFlushState *state, *next;
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The caller has natural access to the machine, so pass it in rather
than using the global.


okay
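
Something like this for v4 (just a sketch of the shape):

    void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr);

    /* caller, in spapr_machine_reset(): */
    spapr_nvdimm_finish_flushes(spapr);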

...


+
+/*
+ * spapr_nvdimm_get_flush_status
+ *  Fetches the status of the hcall worker and returns H_BUSY
+ *  if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(uint64_t token)
+{
+int ret = H_LONG_BUSY_ORDER_10_MSEC;
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The callers have natural access to spapr, so pass it in rather than
using the global.


Okay

...


+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
+ *
+ * Given a DRC index, flush the data to the backend NVDIMM device.
+ * The hcall returns H_BUSY when the flush takes a longer time and the hcall

It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
it?


Yes. I thought it was okay to call it just H_BUSY in a generic way. Will
fix it.




+ * needs to be issued multiple times in order to be completely serviced.
+}
+
+return ret;
+}
+
+dimm = PC_DIMM(drc->dev);
+backend = MEMORY_BACKEND(dimm->hostmem);
+
+state = spapr_nvdimm_init_new_flush_state();
+if (!state) {
+return H_P2;

AFAICT the only way init_new_flush_state() fails is 

Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-26 Thread Shivaprasad G Bhat

On 3/25/21 7:21 AM, David Gibson wrote:

On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:

On 3/24/21 8:37 AM, David Gibson wrote:

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:

The patch adds support for the SCM flush hcall for the nvdimm devices.

...

collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat 

An overall question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?

On x86 we have different ways nvdimm can be discovered: ACPI NFIT, e820 map
and virtio_pmem. Among these, virtio_pmem always operates with synchronous dax
disabled, and both ACPI and e820 don't have the ability to differentiate
support for synchronous dax.

Ok.  And for the virtio-pmem case, how are the extra flushes actually
done on x86?



The virtio-pmem device has a virtqueue with virtio_pmem_flush() as the
handler, which gets called for all flush requests from the guest.
virtio_pmem_flush() offloads the flush to the thread pool, with a worker
doing fsync() and the completion callback notifying the guest with the
response.
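
Condensed, from hw/virtio/virtio-pmem.c as I recall it (field names may
differ slightly from the actual code):

    /* worker runs on a thread-pool thread */
    static int worker_cb(void *opaque)
    {
        VirtIODeviceRequest *req = opaque;

        /* flush the raw backing file to stable storage */
        req->resp.ret = cpu_to_le32(fsync(req->fd) != 0 ? 1 : 0);
        return 0;
    }

    /* completion runs back in the main loop and notifies the guest */
    static void done_cb(void *opaque, int ret)
    {
        VirtIODeviceRequest *req = opaque;
        int len = iov_from_buf(req->elem->in_sg, req->elem->in_num, 0,
                               &req->resp, sizeof(req->resp));

        virtqueue_push(req->pmem->rq_vq, req->elem, len);
        virtio_notify(VIRTIO_DEVICE(req->pmem), req->pmem->rq_vq);
        g_free(req->elem);
        g_free(req);
    }

    /* from the virtqueue kick handler: */
    thread_pool_submit_aio(aio_get_thread_pool(qemu_get_aio_context()),
                           worker_cb, req, done_cb, req);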



With that I would expect users to use virtio_pmem when using file
backed NVDIMMs

So... should we prevent advertising an NVDIMM through ACPI or e820 if
it doesn't have sync-dax enabled?



Is it possible to have different defaults for sync-dax based on
architecture?


The behaviour on x86 is sync-dax=on for nvdimms. So, it would be correct
to have the default as "on" for x86, and "off" for new pseries machines.
Looking at the code, I didn't find many ways to achieve this. Can you
suggest what can be done?
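
Would a per-machine default via compat/default properties be an
acceptable direction? A sketch of what I have in mind (an untested
assumption, not settled code):

    /* hypothetical: in spapr_machine_class_init(); x86 machine types
     * would simply leave the nvdimm device default at "on" */
    static GlobalProperty spapr_sync_dax_default[] = {
        { "nvdimm", "sync-dax", "off" },
    };
    compat_props_add(mc->compat_props, spapr_sync_dax_default,
                     G_N_ELEMENTS(spapr_sync_dax_default));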




Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-24 Thread David Gibson
On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > The patch adds support for the SCM flush hcall for the nvdimm devices.
> > > It is exposed to the guest through the next patch.
> > > 
> > > The hcall semantics are such that the flush returns H_BUSY along
> > > with a continue_token when the operation is expected to take a longer
> > > time. The hcall is then to be called again with the continue_token to
> > > get the status. So, all fresh requests are put into
> > > a 'pending' list and flush worker is submitted to the thread pool.
> > > The thread pool completion callbacks move the requests to 'completed'
> > > list, which are cleaned up after reporting to guest in subsequent
> > > hcalls to get the status.
> > > 
> > > These semantics make it necessary to preserve the continue_tokens
> > > and their return status even across migrations. So, the pre_save
> > > handler for the device waits for the flush worker to complete and
> > > collects all the hcall states from 'completed' list. The necessary
> > > nvdimm flush specific vmstate structures are added to the spapr
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat 
> > 
> > An overall question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> On x86 we have different ways nvdimm can be discovered: ACPI NFIT, e820 map
> and virtio_pmem. Among these, virtio_pmem always operates with synchronous dax
> disabled, and both ACPI and e820 don't have the ability to differentiate
> support for synchronous dax.

Ok.  And for the virtio-pmem case, how are the extra flushes actually
done on x86?

> With that I would expect users to use virtio_pmem when using file
> backed NVDIMMs

So... should we prevent advertising an NVDIMM through ACPI or e820 if
it doesn't have sync-dax enabled?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-24 Thread Aneesh Kumar K.V

On 3/24/21 8:37 AM, David Gibson wrote:

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch.

The hcall expects the semantics such that the flush to return
with H_BUSY when the operation is expected to take longer time along
with a continue_token. The hcall to be called again providing the
continue_token to get the status. So, all fresh requsts are put into
a 'pending' list and flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to 'completed'
list, which are cleaned up after reporting to guest in subsequent
hcalls to get the status.

The semantics makes it necessary to preserve the continue_tokens
and their return status even across migrations. So, the pre_save
handler for the device waits for the flush worker to complete and
collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat 


An overall question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?


On x86 we have different ways nvdimm can be discovered: ACPI NFIT, e820
map and virtio_pmem. Among these, virtio_pmem always operates with
synchronous dax disabled, and both ACPI and e820 don't have the ability
to differentiate support for synchronous dax.


With that I would expect users to use virtio_pmem when using file
backed NVDIMMs


-aneesh



Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-23 Thread David Gibson
On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> It is exposed to the guest through the next patch.
> 
> The hcall semantics are such that the flush returns H_BUSY along
> with a continue_token when the operation is expected to take a longer
> time. The hcall is then to be called again with the continue_token to
> get the status. So, all fresh requests are put into
> a 'pending' list and flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to 'completed'
> list, which are cleaned up after reporting to guest in subsequent
> hcalls to get the status.
> 
> These semantics make it necessary to preserve the continue_tokens
> and their return status even across migrations. So, the pre_save
> handler for the device waits for the flush worker to complete and
> collects all the hcall states from 'completed' list. The necessary
> nvdimm flush specific vmstate structures are added to the spapr
> machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat 

An overall question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?

> ---
>  hw/ppc/spapr.c|6 +
>  hw/ppc/spapr_nvdimm.c |  240 +
>  include/hw/ppc/spapr.h|   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   12 ++
>  4 files changed, 268 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d56418ca29..fdb0c73a2c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
>  spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>  }
>  
> +spapr_nvdimm_finish_flushes();
> +
>  /* DRC reset may cause a device to be unplugged. This will cause troubles
>   * if this device is used by another device (eg, a running vhost backend
>   * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
>  &vmstate_spapr_cap_ccf_assist,
>  &vmstate_spapr_cap_fwnmi,
>  &vmstate_spapr_fwnmi,
> +&vmstate_spapr_nvdimm_flush_states,
>  NULL
>  }
>  };
> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>  }
>  
>  qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);

Do you actually need an extra mutex, or can you rely on the BQL?

> +QLIST_INIT(&spapr->pending_flush_states);
> +QLIST_INIT(&spapr->completed_flush_states);
>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 8cf3fb2ffb..883317c1ed 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,14 +22,17 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/guest-random.h"
>  #include "qemu/nvdimm-utils.h"
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /*
>   * The nvdimm size should be aligned to SCM block size.
> @@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>  return H_SUCCESS;
>  }
>  
> +static const VMStateDescription vmstate_spapr_nvdimm_entry = {
> + .name = "spapr_nvdimm_states",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> + VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> + VMSTATE_END_OF_LIST()
> + },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> + SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> + return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> + !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_pre_save(void *opaque)
> +{
> +SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +aio_poll(qemu_get_aio_context(), true);

Hmm... how long could waiting for all the pending flushes to complete
take?  This could add substantially to the guest's migration downtime,
couldn't it?

> +}
> +
> +return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
> +.name = "spapr_nvdimm_hcall_states",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.needed = spapr_nvdimm_states_needed,
> +.pre_save = spapr_nvdimm_pre_save,
> +.fields = (VMStateField[]) {
> +VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +

[PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall

2021-03-23 Thread Shivaprasad G Bhat
The patch adds support for the SCM flush hcall for the nvdimm devices.
It is exposed to the guest through the next patch.

The hcall semantics are such that the flush returns H_BUSY along
with a continue_token when the operation is expected to take a longer
time. The hcall is then to be called again with the continue_token to
get the status. So, all fresh requests are put into
a 'pending' list and flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to 'completed'
list, which are cleaned up after reporting to guest in subsequent
hcalls to get the status.
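
Illustratively, from the guest side the semantics look like this
(modelled loosely on the papr_scm driver; this sketch is not part of
the patch):

    static int papr_scm_flush_sketch(u32 drc_index)
    {
        unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
        u64 token = 0;
        long rc;

        do {
            rc = plpar_hcall(H_SCM_FLUSH, ret_buf, drc_index, token);
            token = ret_buf[0];          /* continue-token for the retry */
            if (H_IS_LONG_BUSY(rc)) {
                msleep(get_longbusy_msecs(rc));
                rc = H_BUSY;
            }
        } while (rc == H_BUSY);

        return rc == H_SUCCESS ? 0 : -EIO;
    }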

These semantics make it necessary to preserve the continue_tokens
and their return status even across migrations. So, the pre_save
handler for the device waits for the flush worker to complete and
collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat 
---
 hw/ppc/spapr.c|6 +
 hw/ppc/spapr_nvdimm.c |  240 +
 include/hw/ppc/spapr.h|   11 ++
 include/hw/ppc/spapr_nvdimm.h |   12 ++
 4 files changed, 268 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..fdb0c73a2c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
 spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
 }
 
+spapr_nvdimm_finish_flushes();
+
 /* DRC reset may cause a device to be unplugged. This will cause troubles
  * if this device is used by another device (eg, a running vhost backend
  * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
 &vmstate_spapr_cap_ccf_assist,
 &vmstate_spapr_cap_fwnmi,
 &vmstate_spapr_fwnmi,
+&vmstate_spapr_nvdimm_flush_states,
 NULL
 }
 };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
 }
 
 qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
+QLIST_INIT(&spapr->pending_flush_states);
+QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 8cf3fb2ffb..883317c1ed 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,14 +22,17 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/guest-random.h"
 #include "qemu/nvdimm-utils.h"
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /*
  * The nvdimm size should be aligned to SCM block size.
@@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
 return H_SUCCESS;
 }
 
+static const VMStateDescription vmstate_spapr_nvdimm_entry = {
+ .name = "spapr_nvdimm_states",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+ VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+ SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+ return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+ !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_pre_save(void *opaque)
+{
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+aio_poll(qemu_get_aio_context(), true);
+}
+
+return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
+.name = "spapr_nvdimm_hcall_states",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = spapr_nvdimm_states_needed,
+.pre_save = spapr_nvdimm_pre_save,
+.fields = (VMStateField[]) {
+VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+vmstate_spapr_nvdimm_entry,
+SpaprNVDIMMDeviceFlushState, node),
+VMSTATE_END_OF_LIST()
+},
+};
+
+/*
+ * Acquire a unique token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
+{
+Error *err = NULL;
+uint64_t token;
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
+
+state = g_malloc0(sizeof(*state));
+
+qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+retry:
+if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
+error_report_err(err);
+g_free(state);
+