Re: [PATCH] xio3130_downstream: Set the maximum link width and speed

2021-06-02 Thread Haozhong Zhang
On 05/28/21 01:06, Haozhong Zhang wrote:
> The current implementation leaves 0 in the maximum link width (MLW)
> and speed (MLS) fields of the PCI_EXP_LNKCAP register of a xio3130
> downstream port device. As a consequence, when that downstream port
> negotiates the link width and speed with its downstream device, 0 will
> be used and filled in the MLW and MLS fields of the PCI_EXP_LNKSTA
> register of that downstream port.
> 
> Normally, such zero MLS and MLW values in the PCI_EXP_LNKSTA register
> only make the guest lspci output look weird (like "speed unknown" and
> "x0 width"). However, they also break hot-plug of a device to the
> xio3130 downstream port. The guest Linux kernel complains:
> 
> pcieport :01:00.0: pciehp: Slot(0): Cannot train link: status 0x2000
> 
> because the pciehp_hpc driver expects a read of valid (non-zero) MLW
> from PCI_EXP_LNKSTA register of that downstream port.
> 
> This patch addresses the above issue by setting MLW and MLS in
> PCI_EXP_LNKCAP of the xio3130 downstream port to values defined in its
> data manual, i.e., x1 and 2.5 GT/s respectively.
> 
> Signed-off-by: Haozhong Zhang 
> ---
>  hw/pci-bridge/xio3130_downstream.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/hw/pci-bridge/xio3130_downstream.c 
> b/hw/pci-bridge/xio3130_downstream.c
> index 04aae72cd6..fbf9868ad7 100644
> --- a/hw/pci-bridge/xio3130_downstream.c
> +++ b/hw/pci-bridge/xio3130_downstream.c
> @@ -87,6 +87,13 @@ static void xio3130_downstream_realize(PCIDevice *d, Error 
> **errp)
>  goto err_bridge;
>  }
>  
> +/*
> + * The following two fields must be set before calling pcie_cap_init(),
> + * which copies them into the MLS and MLW fields of PCI_EXP_LNKCAP.
> + */
> +s->speed = QEMU_PCI_EXP_LNK_2_5GT;
> +s->width = QEMU_PCI_EXP_LNK_X1;
> +
>  rc = pcie_cap_init(d, XIO3130_EXP_OFFSET, PCI_EXP_TYPE_DOWNSTREAM,
> p->port, errp);
>  if (rc < 0) {
> -- 
> 2.31.1
> 
>

Forgot to cc Marcel






[PATCH] xio3130_downstream: Set the maximum link width and speed

2021-05-27 Thread Haozhong Zhang
The current implementation leaves 0 in the maximum link width (MLW)
and speed (MLS) fields of the PCI_EXP_LNKCAP register of a xio3130
downstream port device. As a consequence, when that downstream port
negotiates the link width and speed with its downstream device, 0 will
be used and filled in the MLW and MLS fields of the PCI_EXP_LNKSTA
register of that downstream port.

Normally, such zero MLS and MLW values in the PCI_EXP_LNKSTA register
only make the guest lspci output look weird (like "speed unknown" and
"x0 width"). However, they also break hot-plug of a device to the
xio3130 downstream port. The guest Linux kernel complains:

pcieport :01:00.0: pciehp: Slot(0): Cannot train link: status 0x2000

because the pciehp_hpc driver expects a read of valid (non-zero) MLW
from PCI_EXP_LNKSTA register of that downstream port.

This patch addresses the above issue by setting MLW and MLS in
PCI_EXP_LNKCAP of the xio3130 downstream port to values defined in its
data manual, i.e., x1 and 2.5 GT/s respectively.

Signed-off-by: Haozhong Zhang 
---
 hw/pci-bridge/xio3130_downstream.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/pci-bridge/xio3130_downstream.c 
b/hw/pci-bridge/xio3130_downstream.c
index 04aae72cd6..fbf9868ad7 100644
--- a/hw/pci-bridge/xio3130_downstream.c
+++ b/hw/pci-bridge/xio3130_downstream.c
@@ -87,6 +87,13 @@ static void xio3130_downstream_realize(PCIDevice *d, Error 
**errp)
 goto err_bridge;
 }
 
+/*
+ * The following two fields must be set before calling pcie_cap_init(),
+ * which copies them into the MLS and MLW fields of PCI_EXP_LNKCAP.
+ */
+s->speed = QEMU_PCI_EXP_LNK_2_5GT;
+s->width = QEMU_PCI_EXP_LNK_X1;
+
 rc = pcie_cap_init(d, XIO3130_EXP_OFFSET, PCI_EXP_TYPE_DOWNSTREAM,
p->port, errp);
 if (rc < 0) {
-- 
2.31.1
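
As background, a minimal sketch of how these two values end up encoded in
PCI_EXP_LNKCAP (per the PCIe spec, Max Link Speed occupies bits 3:0 and
Maximum Link Width bits 9:4; the macro names below are illustrative, not
QEMU's):

#include <stdint.h>
#include <stdio.h>

#define LNKCAP_MLS_2_5GT 0x1   /* Max Link Speed encoding for 2.5 GT/s */
#define LNKCAP_MLW_X1    0x1   /* Maximum Link Width encoding for x1   */

int main(void)
{
    uint32_t lnkcap = 0;

    lnkcap |= LNKCAP_MLS_2_5GT;        /* bits 3:0 */
    lnkcap |= LNKCAP_MLW_X1 << 4;      /* bits 9:4 */

    /* With the fix, the negotiated link reported in PCI_EXP_LNKSTA should
     * mirror these values instead of 0/0. */
    printf("PCI_EXP_LNKCAP = 0x%04x\n", (unsigned)lnkcap);
    return 0;
}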




Re: [Qemu-devel] [PATCH v3 11/13] nvdimm: allow setting the label-size to 0

2018-06-15 Thread Haozhong Zhang
On 06/15/18 16:04, David Hildenbrand wrote:
> It is initially 0, so setting it to 0 should be allowed, too.

I'm fine with this change and believe nothing is broken in practice,
but what does a user who explicitly sets a zero label size expect?

Look at nvdimm_dsm_device(), which enables label DSMs only if the label
size is at least 128 KB. If a user sets a zero label size explicitly,
does he/she expect those label DSMs to be available in the guest?
(According to the Intel spec, the minimal label size is 128 KB.)

I think if a zero label-size is allowed, it would be better to document
how it differs from non-zero values in docs/nvdimm.txt.
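
For illustration, a minimal sketch of the semantics being discussed,
assuming MIN_NAMESPACE_LABEL_SIZE is 128 KiB as in QEMU (the helper name
is hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the check below: 0 would mean "no label
 * area, so no label DSMs in the guest"; any other value must be at
 * least the 128 KiB minimum. */
static bool nvdimm_label_size_is_valid(uint64_t value)
{
    const uint64_t min_label_size = 128 * 1024; /* MIN_NAMESPACE_LABEL_SIZE */

    return value == 0 || value >= min_label_size;
}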

Thanks,
Haozhong

> 
> Signed-off-by: David Hildenbrand 
> ---
>  hw/mem/nvdimm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index db7d8c3050..df7646488b 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -52,9 +52,9 @@ static void nvdimm_set_label_size(Object *obj, Visitor *v, 
> const char *name,
>  if (local_err) {
>  goto out;
>  }
> -if (value < MIN_NAMESPACE_LABEL_SIZE) {
> +if (value && value < MIN_NAMESPACE_LABEL_SIZE) {
> error_setg(&local_err, "Property '%s.%s' (0x%" PRIx64 ") is required"
> -   " at least 0x%lx", object_get_typename(obj),
> +   " either 0 or at least 0x%lx", object_get_typename(obj),
> name, value, MIN_NAMESPACE_LABEL_SIZE);
>  goto out;
>  }
> -- 
> 2.17.1
> 
> 





Re: [Qemu-devel] [RFC PATCH 1/1] nvdimm: let qemu requiring section alignment of pmem resource.

2018-06-12 Thread Haozhong Zhang
On 06/11/18 19:55, Dan Williams wrote:
> On Mon, Jun 11, 2018 at 9:26 AM, Stefan Hajnoczi  wrote:
> > On Mon, Jun 11, 2018 at 06:54:25PM +0800, Zhang Yi wrote:
> >> The nvdimm driver uses memory hot-plug APIs to map its pmem resource,
> >> which works at a section granularity.
> >>
> >> When QEMU emulates the vNVDIMM device, the label storage area reduces
> >> the guest-visible size, and QEMU will put the vNVDIMMs directly next
> >> to one another in physical address space, which means that the
> >> boundary between them won't align to the 128 MB memory section size.
> >
> > I'm having a hard time parsing this.
> >
> > Where does the "128 MB memory section size" come from?  ACPI?
> > A chipset-specific value?
> >
> 
> The devm_memremap_pages() implementation use the memory hotplug core
> to allocate the 'struct page' array/map for persistent memory. Memory
> hotplug can only be performed in terms of sections, 128MB on x86_64.

IIUC, it also affects normal RAM hotplug to a Linux VM on QEMU. If
that is the case, it would be helpful to lift this option to pc-dimm.
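
As a rough illustration of the constraint (a sketch only, not QEMU code;
128 MiB is the x86_64 section size Dan mentions):

#include <stdint.h>
#include <stdio.h>

#define MEM_SECTION_SIZE (128ULL << 20)   /* x86_64 memory-hotplug section */

/* Hypothetical helper: round a device size (or start address) up to the
 * section size so that two pmem ranges never share a section. */
static uint64_t align_up_to_section(uint64_t x)
{
    return (x + MEM_SECTION_SIZE - 1) & ~(MEM_SECTION_SIZE - 1);
}

int main(void)
{
    /* e.g. a 100 MiB backend would be padded out to the 128 MiB boundary */
    printf("%llu MiB\n",
           (unsigned long long)(align_up_to_section(100ULL << 20) >> 20));
    return 0;
}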

Thanks,
Haozhong

> There is some limited support for allowing devm_memremap_pages() to
> overlap 'System RAM' within a given section, but it does not currently
> support multiple devm_memremap_pages() calls overlapping within the
> same section. There is currently a kernel bug where we do not handle
> this unsupported configuration gracefully. The fix will cause
> configurations that try to overlap 2 persistent memory
> ranges in the same section to fail.
> 
> The proposed fix is trying to make sure that QEMU does not run afoul
> of this constraint.
> 
> There is currently no line of sight to reduce the minimum memory
> hotplug alignment size to less than 128M. Also, as other architectures
> outside of x86_64 add devm_memremap_pages() support, the minimum
> section alignment constraint might change and is a property of a guest
> OS. My understanding is that some guest OSes might expect an even
> larger persistent memory minimum alignment.
> 




Re: [Qemu-devel] [PATCH v4 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-04-01 Thread Haozhong Zhang
On 03/29/18 19:59 +0100, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > When loading a zero page, check whether it will be loaded to
> > persistent memory. If yes, load it with the libpmem function
> > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > end of RAM loading, we can guarantee all those zero pages are
> > persistently loaded.
> > 
> > Depending on the host HW/SW configurations, pmem_drain() can be
> > "sfence".  Therefore, we do not call pmem_drain() after each
> > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > overhead.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> 
> I'm still thinking this is way too invasive;  especially the next patch
> that touches qemu_file.
> 
> One thing that would help a little, but not really enough, would be
> to define a:
> 
> struct MemOps {
>   void (*copy)(like a memcpy);
>   void (*set)(like a memset);
> }
> 
> then you could have:
> 
> struct MemOps normalops = {memcpy, memset};
> struct MemOps pmem_nodrain_ops = { pmem_memcpy_nodrain, pmem_memset_nodrain 
> };
> 
> then things like ram_handle_compressed would be:
> 
> void ram_handle_compressed(void *host, uint8_t ch, uint64_t size, const 
> struct MemOps *mem)
> {
> if (ch != 0 || !is_zero_range(host, size)) {
> mem->set(host, ch,size);
> }
> }
> 
> which means the change is pretty tiny to each function.

This looks much better than mine.

I'm also considering Stefan's suggestion of flushing at the end of all
memory migration rather than invasively changing every type of copy in
the migration stream. We are going to run some microbenchmarks on real
hardware and then decide which way to take.
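
For reference, a self-contained sketch of the MemOps idea above, assuming
libpmem is available (is_zero_range() is reimplemented trivially here; in
QEMU it is an existing helper):

#include <stdint.h>
#include <string.h>
#include <libpmem.h>  /* pmem_memcpy_nodrain(), pmem_memset_nodrain(), pmem_drain() */

struct MemOps {
    void *(*copy)(void *dst, const void *src, size_t len);
    void *(*set)(void *dst, int c, size_t len);
};

static const struct MemOps normal_ops       = { memcpy, memset };
static const struct MemOps pmem_nodrain_ops = { pmem_memcpy_nodrain,
                                                pmem_memset_nodrain };

static int is_zero_range(const uint8_t *p, uint64_t size)
{
    while (size--) {
        if (*p++) {
            return 0;
        }
    }
    return 1;
}

/* Per the outline above: each load-side function only grows an extra
 * MemOps argument selecting plain or pmem (no-drain) operations. */
static void ram_handle_compressed(void *host, uint8_t ch, uint64_t size,
                                  const struct MemOps *mem)
{
    if (ch != 0 || !is_zero_range(host, size)) {
        mem->set(host, ch, size);
    }
}

The caller would pass &pmem_nodrain_ops for RAM blocks backed by persistent
memory and issue a single pmem_drain() once the whole stream has been loaded,
matching the batching described above.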

> 
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index da474fc19f..573bcd2cb0 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -3229,7 +3229,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, 
> > void *opaque)
> >  host_addr = block->local_host_addr +
> >  (comp->offset - block->offset);
> >  
> > -ram_handle_compressed(host_addr, comp->value, comp->length);
> > +ram_handle_compressed(host_addr, comp->value, comp->length, 
> > false);
> 
> Is that right? Is RDMA not allowed to work on PMEM?
> (and anyway this call is a normal clear rather than an actual RDMA op).
>

Well, this patch excludes the RDMA case intentionally. Once it's clear
how to guarantee the persistence of remote PMEM writes over RDMA, we
will propose an additional patch to add support in QEMU.

Thanks,
Haozhong

> Dave
> 
> >  break;
> >  
> >  case RDMA_CONTROL_REGISTER_FINISHED:
> > diff --git a/stubs/pmem.c b/stubs/pmem.c
> > index 03d990e571..a65b3bfc6b 100644
> > --- a/stubs/pmem.c
> > +++ b/stubs/pmem.c
> > @@ -17,3 +17,12 @@ void *pmem_memcpy_persist(void *pmemdest, const void 
> > *src, size_t len)
> >  {
> >  return memcpy(pmemdest, src, len);
> >  }
> > +
> > +void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > +{
> > +return memset(pmemdest, c, len);
> > +}
> > +
> > +void pmem_drain(void)
> > +{
> > +}
> > -- 
> > 2.14.1
> > 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-04-01 Thread Haozhong Zhang
On 03/29/18 20:12 +0100, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> 
> 
> 
> > Post-copy with NVDIMM currently fails with message "Postcopy on shared
> > RAM (...) is not yet supported". Is it enough?
> 
> What does it say now that postcopy-shared support is in?
> 

I'll check it later.

Haozhong



Re: [Qemu-devel] [PATCH v4 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-03-12 Thread Haozhong Zhang
On 03/12/18 15:39 +, Stefan Hajnoczi wrote:
> On Wed, Feb 28, 2018 at 03:25:50PM +0800, Haozhong Zhang wrote:
> > QEMU writes to vNVDIMM backends during vNVDIMM label emulation and
> > live migration. If the backend is on persistent memory, QEMU needs
> > to take proper measures to ensure its writes are persistent on the
> > persistent memory. Otherwise, a host power failure may result in the
> > loss of guest data on the persistent memory.
> > 
> > This patch series is based on Marcel's patch "mem: add share
> > parameter to memory-backend-ram" [1] because of the changes in patch 1.
> > 
> > [1] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03858.html
> > 
> > Previous versions can be found at
> > v3: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04365.html
> > v2: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg01579.html
> > v1: https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html
> > 
> > Changes in v4:
> >  * (Patch 2) Fix compilation errors found by patchew.
> > 
> > Changes in v3:
> >  * (Patch 5) Add a is_pmem flag to ram_handle_compressed() and handle
> >PMEM writes in it, so we don't need the _common function.
> >  * (Patch 6) Expose qemu_get_buffer_common so we can remove the
> >unnecessary qemu_get_buffer_to_pmem wrapper.
> >  * (Patch 8) Add a is_pmem flag to xbzrle_decode_buffer() and handle
> >PMEM writes in it, so we can remove the unnecessary
> >xbzrle_decode_buffer_{common, to_pmem}.
> >  * Move libpmem stubs to stubs/pmem.c and fix the compilation failures
> >of test-{xbzrle,vmstate}.c.
> > 
> > Changes in v2:
> >  * (Patch 1) Use a flags parameter in file ram allocation functions.
> >  * (Patch 2) Add a new option 'pmem' to hostmem-file.
> >  * (Patch 3) Use libpmem to operate on the persistent memory, rather
> >than re-implementing those operations in QEMU.
> >  * (Patch 5-8) Consider the write persistence in the migration path.
> > 
> > Haozhong Zhang (8):
> >   [1/8] memory, exec: switch file ram allocation functions to 'flags' 
> > parameters
> >   [2/8] hostmem-file: add the 'pmem' option
> >   [3/8] configure: add libpmem support
> >   [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
> >   [5/8] migration/ram: ensure write persistence on loading zero pages to 
> > PMEM
> >   [6/8] migration/ram: ensure write persistence on loading normal pages to 
> > PMEM
> >   [7/8] migration/ram: ensure write persistence on loading compressed pages 
> > to PMEM
> >   [8/8] migration/ram: ensure write persistence on loading xbzrle pages to 
> > PMEM
> > 
> >  backends/hostmem-file.c | 27 +++-
> >  configure   | 35 ++
> >  docs/nvdimm.txt | 14 +++
> >  exec.c  | 20 ---
> >  hw/mem/nvdimm.c |  9 ++-
> >  include/exec/memory.h   | 12 +++--
> >  include/exec/ram_addr.h | 28 +++--
> >  include/migration/qemu-file-types.h |  2 ++
> >  include/qemu/pmem.h | 27 
> >  memory.c|  8 +++---
> >  migration/qemu-file.c   | 29 ++
> >  migration/ram.c | 49 
> > +++--
> >  migration/ram.h |  2 +-
> >  migration/rdma.c|  2 +-
> >  migration/xbzrle.c  |  8 --
> >  migration/xbzrle.h  |  3 ++-
> >  numa.c  |  2 +-
> >  qemu-options.hx |  9 ++-
> >  stubs/Makefile.objs |  1 +
> >  stubs/pmem.c| 37 
> >  tests/Makefile.include  |  4 +--
> >  tests/test-xbzrle.c |  4 +--
> >  22 files changed, 285 insertions(+), 47 deletions(-)
> >  create mode 100644 include/qemu/pmem.h
> >  create mode 100644 stubs/pmem.c
> 
> A few thoughts:
> 
> 1. Can you use pmem_is_pmem() to auto-detect the pmem=on|off value?

The manpage [1] of pmem_is_pmem says:

 "The result of pmem_is_pmem() query is only valid for the mappings
  created using pmem_map_file().  For other memory regions, in
  particular those created by a direct call to mmap(2), pmem_is_pmem()
  always returns false, even if the queried range is entirely
  persistent memory."

QEMU is using mmap for NVDIMM backends, so pmem_is_pmem() would not give
a reliable answer here.
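
A minimal sketch of that distinction, assuming libpmem (not QEMU code):

#include <stdio.h>
#include <libpmem.h>

/* pmem_is_pmem() is only meaningful for mappings created with
 * pmem_map_file(); for a region obtained via a plain mmap(2), as QEMU
 * currently does for its backends, it always reports "not pmem". */
int main(int argc, char **argv)
{
    size_t mapped_len;
    int is_pmem;
    void *addr;

    if (argc < 2) {
        return 1;
    }
    addr = pmem_map_file(argv[1], 0 /* map the whole existing file */, 0, 0,
                         &mapped_len, &is_pmem);
    if (!addr) {
        perror("pmem_map_file");
        return 1;
    }
    printf("len=%zu is_pmem=%d\n", mapped_len, is_pmem);
    pmem_unmap(addr, mapped_len);
    return 0;
}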

[Qemu-devel] [PATCH v6 5/5] test/acpi-test-data: add ACPI tables for dimmpxm test

2018-03-10 Thread Haozhong Zhang
Reviewers can use the ACPI tables in this patch to run the
test_acpi_{piix4,q35}_tcg_dimm_pxm cases.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 10 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

diff --git a/tests/acpi-test-data/pc/APIC.dimmpxm 
b/tests/acpi-test-data/pc/APIC.dimmpxm
new file mode 100644
index 
..427bb08248e6a029c1c988f74f5e48f93ee4ebe0
GIT binary patch (literal 144; binary data omitted)

diff --git a/tests/acpi-test-data/pc/DSDT.dimmpxm 
b/tests/acpi-test-data/pc/DSDT.dimmpxm
new file mode 100644
index 
..38661cb13ee348718ab45bfc69452cd642cf9bb9
GIT binary patch (literal 6803; binary data omitted)

[Qemu-devel] [PATCH v6 0/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-10 Thread Haozhong Zhang
(Patch 5 is only for reviewers to run test cases in patch 4)

ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each DIMM device present at
boot time, including both PC-DIMM and NVDIMM, with the proximity
domain specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.
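
A simplified sketch of the resulting layout (illustration only, not the
patch code; it assumes the DIMM list is already sorted by address, which
patch 1 guarantees):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct dimm { uint64_t addr, size; int node; bool nvdimm; };

static void emit(uint64_t start, uint64_t size, int node, bool nv)
{
    printf("SRAT entry: 0x%llx + 0x%llx, node %d%s\n",
           (unsigned long long)start, (unsigned long long)size,
           node, nv ? ", non-volatile" : "");
}

/* Split the hot-pluggable area so that every boot-time DIMM/NVDIMM gets
 * its own entry with its own proximity domain; gaps and the tail keep
 * the default (last) node as before. */
static void split_hotplug_area(uint64_t base, uint64_t len, int default_node,
                               const struct dimm *d, int n)
{
    uint64_t cur = base, end = base + len;
    int i;

    for (i = 0; i < n && cur < end; i++) {
        if (cur < d[i].addr) {
            emit(cur, d[i].addr - cur, default_node, false);
        }
        emit(d[i].addr, d[i].size, d[i].node, d[i].nvdimm);
        cur = d[i].addr + d[i].size;
    }
    if (cur < end) {
        emit(cur, end - cur, default_node, false);
    }
}

int main(void)
{
    struct dimm dimms[] = {
        { 0x100000000ULL, 0x8000000, 1, false },  /* PC-DIMM on node 1 */
        { 0x108000000ULL, 0x8000000, 2, true  },  /* NVDIMM on node 2  */
    };

    split_hotplug_area(0x100000000ULL, 0x40000000, 3, dimms, 2);
    return 0;
}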

Changes in v6:
 * (Patch 2) Fix the commit message.

Changes in v5:
 * (Patch 2) Inline qmp nvdimm info in MemoryDeviceInfo.

Changes in v4:
 * (Patch 1) Update the commit message and add R-b from Igor Mammedov.
 * (Patch 2) Rebase on misc.json and update the commit message.
 * (Patch 3) Directly use di-addr and di-node.
 * (Patch 4) Drop the previous v3 patch 3 and add '-machine nvdimm=on'
   to parameters of test_acpi_one().
 * (Patch 4) Put PC-DIMM and NVDIMM to different numa nodes.
 * (Patch 4&5) Move binary blobs of ACPI tables to DO-NOT-APPLY patch 5.

Changes in v3:
 * (Patch 1&2) Use qmp_pc_dimm_device_list to get information of DIMM
   devices and move it to separate patches.
 * (Patch 3) Replace while loop by a more readable for loop.
 * (Patch 3) Refactor the flag setting code.
 * (Patch 3) s/'static-plugged'/'present at boot time' in commit message.

Changes in v2:
 * Build SRAT memory affinity structures of PC-DIMM devices as well.
 * Add test cases.

Haozhong Zhang (5):
  pc-dimm: make qmp_pc_dimm_device_list() sort devices by address
  qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList
  hw/acpi-build: build SRAT memory affinity structures for DIMM devices
  tests/bios-tables-test: add test cases for DIMM proximity
  test/acpi-test-data: add ACPI tables for dimmpxm test

 hmp.c |  14 --
 hw/i386/acpi-build.c  |  56 +++--
 hw/mem/pc-dimm.c  |  91 +++---
 hw/ppc/spapr.c|   3 +-
 include/hw/mem/pc-dimm.h  |   2 +-
 numa.c|  23 +
 qapi/misc.json|   6 ++-
 qmp.c |   7 +--
 stubs/qmp_pc_dimm.c   |   4 +-
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  38 ++
 20 files changed, 177 insertions(+), 67 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

-- 
2.14.1




[Qemu-devel] [PATCH v6 4/5] tests/bios-tables-test: add test cases for DIMM proximity

2018-03-10 Thread Haozhong Zhang
QEMU now builds one SRAT memory affinity structure for each PC-DIMM
and NVDIMM device present at boot time with the proximity domain
specified in the device option 'node', rather than only one SRAT
memory affinity structure covering the entire hotpluggable address
space with the proximity domain of the last node.

Add test cases on PC and Q35 machines with 4 proximity domains, and
one PC-DIMM and one NVDIMM attached to the 2nd and 3rd proximity
domains respectively. Check whether the QEMU-built SRAT tables match
the expected ones.

The following ACPI tables need to be added for this test:
  tests/acpi-test-data/pc/APIC.dimmpxm
  tests/acpi-test-data/pc/DSDT.dimmpxm
  tests/acpi-test-data/pc/NFIT.dimmpxm
  tests/acpi-test-data/pc/SRAT.dimmpxm
  tests/acpi-test-data/pc/SSDT.dimmpxm
  tests/acpi-test-data/q35/APIC.dimmpxm
  tests/acpi-test-data/q35/DSDT.dimmpxm
  tests/acpi-test-data/q35/NFIT.dimmpxm
  tests/acpi-test-data/q35/SRAT.dimmpxm
  tests/acpi-test-data/q35/SSDT.dimmpxm
New APIC and DSDT are needed because of the multiple processors
configuration. New NFIT and SSDT are needed because of NVDIMM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Igor Mammedov <imamm...@redhat.com>
---
 tests/bios-tables-test.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 65b271a173..34b55ff812 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -869,6 +869,42 @@ static void test_acpi_piix4_tcg_numamem(void)
 free_test_data();
 }
 
+static void test_acpi_tcg_dimm_pxm(const char *machine)
+{
+test_data data;
+
+memset(&data, 0, sizeof(data));
+data.machine = machine;
+data.variant = ".dimmpxm";
+test_acpi_one(" -machine nvdimm=on"
+  " -smp 4,sockets=4"
+  " -m 128M,slots=3,maxmem=1G"
+  " -numa node,mem=32M,nodeid=0"
+  " -numa node,mem=32M,nodeid=1"
+  " -numa node,mem=32M,nodeid=2"
+  " -numa node,mem=32M,nodeid=3"
+  " -numa cpu,node-id=0,socket-id=0"
+  " -numa cpu,node-id=1,socket-id=1"
+  " -numa cpu,node-id=2,socket-id=2"
+  " -numa cpu,node-id=3,socket-id=3"
+  " -object memory-backend-ram,id=ram0,size=128M"
+  " -object memory-backend-ram,id=nvm0,size=128M"
+  " -device pc-dimm,id=dimm0,memdev=ram0,node=1"
+  " -device nvdimm,id=dimm1,memdev=nvm0,node=2",
+  &data);
+free_test_data();
+}
+
+static void test_acpi_q35_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_Q35);
+}
+
+static void test_acpi_piix4_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_PC);
+}
+
 int main(int argc, char *argv[])
 {
 const char *arch = qtest_get_arch();
@@ -893,6 +929,8 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/q35/memhp", test_acpi_q35_tcg_memhp);
 qtest_add_func("acpi/piix4/numamem", test_acpi_piix4_tcg_numamem);
 qtest_add_func("acpi/q35/numamem", test_acpi_q35_tcg_numamem);
+qtest_add_func("acpi/piix4/dimmpxm", test_acpi_piix4_tcg_dimm_pxm);
+qtest_add_func("acpi/q35/dimmpxm", test_acpi_q35_tcg_dimm_pxm);
 }
 ret = g_test_run();
 boot_sector_cleanup(disk);
-- 
2.14.1




[Qemu-devel] [PATCH v6 3/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-10 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each DIMM device present at
boot time, including both PC-DIMM and NVDIMM, with the proximity
domain specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/i386/acpi-build.c | 56 
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..2c1f694da4 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,55 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+MemoryDeviceInfoList *info;
+MemoryDeviceInfo *mi;
+PCDIMMDeviceInfo *di;
+uint64_t end = base + len, cur, size;
+bool is_nvdimm;
+AcpiSratMemoryAffinity *numamem;
+MemoryAffinityFlags flags;
+
+for (cur = base, info = info_list;
+ cur < end;
+ cur += size, info = info->next) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!info) {
+build_srat_memory(numamem, cur, end - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+mi = info->value;
+is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
+di = !is_nvdimm ? mi->u.dimm.data : mi->u.nvdimm.data;
+
+if (cur < di->addr) {
+build_srat_memory(numamem, cur, di->addr - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+
+size = di->size;
+
+flags = MEM_AFFINITY_ENABLED;
+if (di->hotpluggable) {
+flags |= MEM_AFFINITY_HOTPLUGGABLE;
+}
+if (is_nvdimm) {
+flags |= MEM_AFFINITY_NON_VOLATILE;
+}
+
+build_srat_memory(numamem, di->addr, size, di->node, flags);
+}
+
+qapi_free_MemoryDeviceInfoList(info_list);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2483,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+   pcms->numa_nodes - 1);
 }
 
 build_header(linker, table_data,
-- 
2.14.1




[Qemu-devel] [PATCH v6 1/5] pc-dimm: make qmp_pc_dimm_device_list() sort devices by address

2018-03-10 Thread Haozhong Zhang
Make qmp_pc_dimm_device_list() return a list of devices sorted by start
address so that it can be reused in places that need a sorted list*.
Reuse the existing pc_dimm_built_list() to get the sorted list.

While at it, hide the recursive callbacks from callers, so that:

  qmp_pc_dimm_device_list(qdev_get_machine(), &list);

could be replaced with the simpler:

  list = qmp_pc_dimm_device_list();

* a follow-up patch will use it in build_srat()

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>
Acked-by: David Gibson <da...@gibson.dropbear.id.au> for ppc part
Reviewed-by: Bharata B Rao <bhar...@linux.vnet.ibm.com>
---
 hw/mem/pc-dimm.c | 83 +---
 hw/ppc/spapr.c   |  3 +-
 include/hw/mem/pc-dimm.h |  2 +-
 numa.c   |  4 +--
 qmp.c|  7 +---
 stubs/qmp_pc_dimm.c  |  4 +--
 6 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 6e74b61cb6..4d050fe2cd 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -162,45 +162,6 @@ uint64_t get_plugged_memory_size(void)
 return pc_existing_dimms_capacity(&error_abort);
 }
 
-int qmp_pc_dimm_device_list(Object *obj, void *opaque)
-{
-MemoryDeviceInfoList ***prev = opaque;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
-MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
-DeviceClass *dc = DEVICE_GET_CLASS(obj);
-PCDIMMDevice *dimm = PC_DIMM(obj);
-
-if (dev->id) {
-di->has_id = true;
-di->id = g_strdup(dev->id);
-}
-di->hotplugged = dev->hotplugged;
-di->hotpluggable = dc->hotpluggable;
-di->addr = dimm->addr;
-di->slot = dimm->slot;
-di->node = dimm->node;
-di->size = object_property_get_uint(OBJECT(dimm), 
PC_DIMM_SIZE_PROP,
-NULL);
-di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
-
-info->u.dimm.data = di;
-elem->value = info;
-elem->next = NULL;
-**prev = elem;
-*prev = &elem->next;
-}
-}
-
-object_child_foreach(obj, qmp_pc_dimm_device_list, opaque);
-return 0;
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
 unsigned long *bitmap = opaque;
@@ -276,6 +237,50 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
 return 0;
 }
 
+MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
+{
+GSList *dimms = NULL, *item;
+MemoryDeviceInfoList *list = NULL, *prev = NULL;
+
+object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
+
+for (item = dimms; item; item = g_slist_next(item)) {
+PCDIMMDevice *dimm = PC_DIMM(item->data);
+Object *obj = OBJECT(dimm);
+MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
+MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
+PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+DeviceClass *dc = DEVICE_GET_CLASS(obj);
+DeviceState *dev = DEVICE(obj);
+
+if (dev->id) {
+di->has_id = true;
+di->id = g_strdup(dev->id);
+}
+di->hotplugged = dev->hotplugged;
+di->hotpluggable = dc->hotpluggable;
+di->addr = dimm->addr;
+di->slot = dimm->slot;
+di->node = dimm->node;
+di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
+di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
+
+info->u.dimm.data = di;
+elem->value = info;
+elem->next = NULL;
+if (prev) {
+prev->next = elem;
+} else {
+list = elem;
+}
+prev = elem;
+}
+
+g_slist_free(dimms);
+
+return list;
+}
+
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
uint64_t address_space_size,
uint64_t *hint, uint64_t align, uint64_t size,
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7e1c858566..44a0670d11 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -722,8 +722,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState 
*spapr, void *fdt)
 }
 
 if (hotplug_lmb_start) {
-MemoryDeviceInfoList **prev = &dimms;
-qmp_pc_dimm_device_list(qdev_get_machine(), &prev);
+dimms = qmp_pc_dimm_device_list();
 }
 
 /* ibm,dynamic-memory */
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h

Re: [Qemu-devel] [PATCH v5 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-10 Thread Haozhong Zhang
On 03/10/18 20:31 -0600, Eric Blake wrote:
> On 03/10/2018 07:34 PM, Haozhong Zhang wrote:
> > It may need to treat PC-DIMM and NVDIMM differently, e.g., when
> > deciding the necessity of non-volatile flag bit in SRAT memory
> > affinity structures.
> > 
> > NVDIMMDeviceInfo, which inherits from PCDIMMDeviceInfo, is added to
> > union type MemoryDeviceInfo to record information of NVDIMM devices.
> > The NVDIMM-specific data is currently left empty and will be filled
> > when necessary in the future.
> 
> Stale comment.

Oops, my stupid miss. I'll send another version soon. 

Thanks,
Haozhong

> 
> > 
> > It also fixes "info memory-devices"/query-memory-devices which
> > currently show nvdimm devices as dimm devices since
> > object_dynamic_cast(obj, TYPE_PC_DIMM) happily cast nvdimm to
> > TYPE_PC_DIMM which it's been inherited from.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >   hmp.c| 14 +++---
> >   hw/mem/pc-dimm.c | 10 +-
> >   numa.c   | 19 +--
> >   qapi/misc.json   |  6 +-
> >   4 files changed, 38 insertions(+), 11 deletions(-)
> > 
> 
> > +++ b/qapi/misc.json
> > @@ -2852,7 +2852,11 @@
> >   #
> >   # Since: 2.1
> 
> Perhaps this could somehow use a '(since 2.12)' tag; but as this is a
> "simple union" (which is anything but simple in the QAPI generator), and
> we're trying to avoid introducing new ones where possible, I'm fine
> overlooking it for now.
> 
> >   ##
> > -{ 'union': 'MemoryDeviceInfo', 'data': {'dimm': 'PCDIMMDeviceInfo'} }
> > +{ 'union': 'MemoryDeviceInfo',
> > +  'data': { 'dimm': 'PCDIMMDeviceInfo',
> > +'nvdimm': 'PCDIMMDeviceInfo'
> > +  }
> > +}
> 
> Reviewed-by: Eric Blake <ebl...@redhat.com>
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org



[Qemu-devel] [PATCH v6 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-10 Thread Haozhong Zhang
We may need to treat PC-DIMM and NVDIMM differently, e.g., when
deciding whether the non-volatile flag bit is needed in SRAT memory
affinity structures.

A new field 'nvdimm' is added to the union type MemoryDeviceInfo for
such purpose. Its type is currently PCDIMMDeviceInfo and will be
updated when necessary in the future.

It also fixes "info memory-devices"/query-memory-devices, which
currently show nvdimm devices as dimm devices, since
object_dynamic_cast(obj, TYPE_PC_DIMM) happily casts an nvdimm to
TYPE_PC_DIMM, which it inherits from.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 hmp.c| 14 +++---
 hw/mem/pc-dimm.c | 10 +-
 numa.c   | 19 +--
 qapi/misc.json   |  6 +-
 4 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/hmp.c b/hmp.c
index 016cb5c4f1..011a7c6f35 100644
--- a/hmp.c
+++ b/hmp.c
@@ -2421,7 +2421,18 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 switch (value->type) {
 case MEMORY_DEVICE_INFO_KIND_DIMM:
 di = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+di = value->u.nvdimm.data;
+break;
+
+default:
+di = NULL;
+break;
+}
 
+if (di) {
 monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
MemoryDeviceInfoKind_str(value->type),
di->id ? di->id : "");
@@ -2434,9 +2445,6 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
di->hotplugged ? "true" : "false");
 monitor_printf(mon, "  hotpluggable: %s\n",
di->hotpluggable ? "true" : "false");
-break;
-default:
-break;
 }
 }
 }
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 4d050fe2cd..51350d9c2d 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
@@ -250,6 +251,7 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
 PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
 DeviceClass *dc = DEVICE_GET_CLASS(obj);
 DeviceState *dev = DEVICE(obj);
 
@@ -265,7 +267,13 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
 di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
 
-info->u.dimm.data = di;
+if (!is_nvdimm) {
+info->u.dimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+} else {
+info->u.nvdimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
+}
 elem->value = info;
 elem->next = NULL;
 if (prev) {
diff --git a/numa.c b/numa.c
index 94427046ec..1116c90af9 100644
--- a/numa.c
+++ b/numa.c
@@ -529,18 +529,25 @@ static void numa_stat_memory_devices(NumaNodeMem 
node_mem[])
 
 if (value) {
 switch (value->type) {
-case MEMORY_DEVICE_INFO_KIND_DIMM: {
+case MEMORY_DEVICE_INFO_KIND_DIMM:
 pcdimm_info = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+pcdimm_info = value->u.nvdimm.data;
+break;
+
+default:
+pcdimm_info = NULL;
+break;
+}
+
+if (pcdimm_info) {
 node_mem[pcdimm_info->node].node_mem += pcdimm_info->size;
 if (pcdimm_info->hotpluggable && pcdimm_info->hotplugged) {
 node_mem[pcdimm_info->node].node_plugged_mem +=
 pcdimm_info->size;
 }
-break;
-}
-
-default:
-break;
 }
 }
 }
diff --git a/qapi/misc.json b/qapi/misc.json
index bcd5d10778..6bf082f612 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2852,7 +2852,11 @@
 #
 # Since: 2.1
 ##
-{ 'union': 'MemoryDeviceInfo', 'data': {'dimm': 'PCDIMMDeviceInfo'} }
+{ 'union': 'MemoryDeviceInfo',
+  'data': { 'dimm': 'PCDIMMDeviceInfo',
+'nvdimm': 'PCDIMMDeviceInfo'
+  }
+}
 
 ##
 # @query-memory-devices:
-- 
2.14.1




[Qemu-devel] [PATCH v5 5/5][DO NOT APPLY] test/acpi-test-data: add ACPI tables for dimmpxm test

2018-03-10 Thread Haozhong Zhang
Reviewers can use the ACPI tables in this patch to run the
test_acpi_{piix4,q35}_tcg_dimm_pxm cases.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 10 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

diff --git a/tests/acpi-test-data/pc/APIC.dimmpxm 
b/tests/acpi-test-data/pc/APIC.dimmpxm
new file mode 100644
index 
..427bb08248e6a029c1c988f74f5e48f93ee4ebe0
GIT binary patch (literal 144; binary data omitted)

diff --git a/tests/acpi-test-data/pc/DSDT.dimmpxm 
b/tests/acpi-test-data/pc/DSDT.dimmpxm
new file mode 100644
index 
..38661cb13ee348718ab45bfc69452cd642cf9bb9
GIT binary patch (literal 6803; binary data omitted)

[Qemu-devel] [PATCH v5 4/5] tests/bios-tables-test: add test cases for DIMM proximity

2018-03-10 Thread Haozhong Zhang
QEMU now builds one SRAT memory affinity structure for each PC-DIMM
and NVDIMM device present at boot time with the proximity domain
specified in the device option 'node', rather than only one SRAT
memory affinity structure covering the entire hotpluggable address
space with the proximity domain of the last node.

Add test cases on PC and Q35 machines with 4 proximity domains, and
one PC-DIMM and one NVDIMM attached to the 2nd and 3rd proximity
domains respectively. Check whether the QEMU-built SRAT tables match
the expected ones.

The following ACPI tables need to be added for this test:
  tests/acpi-test-data/pc/APIC.dimmpxm
  tests/acpi-test-data/pc/DSDT.dimmpxm
  tests/acpi-test-data/pc/NFIT.dimmpxm
  tests/acpi-test-data/pc/SRAT.dimmpxm
  tests/acpi-test-data/pc/SSDT.dimmpxm
  tests/acpi-test-data/q35/APIC.dimmpxm
  tests/acpi-test-data/q35/DSDT.dimmpxm
  tests/acpi-test-data/q35/NFIT.dimmpxm
  tests/acpi-test-data/q35/SRAT.dimmpxm
  tests/acpi-test-data/q35/SSDT.dimmpxm
New APIC and DSDT are needed because of the multiple processors
configuration. New NFIT and SSDT are needed because of NVDIMM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Igor Mammedov <imamm...@redhat.com>
---
 tests/bios-tables-test.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 65b271a173..34b55ff812 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -869,6 +869,42 @@ static void test_acpi_piix4_tcg_numamem(void)
 free_test_data();
 }
 
+static void test_acpi_tcg_dimm_pxm(const char *machine)
+{
+test_data data;
+
+memset(&data, 0, sizeof(data));
+data.machine = machine;
+data.variant = ".dimmpxm";
+test_acpi_one(" -machine nvdimm=on"
+  " -smp 4,sockets=4"
+  " -m 128M,slots=3,maxmem=1G"
+  " -numa node,mem=32M,nodeid=0"
+  " -numa node,mem=32M,nodeid=1"
+  " -numa node,mem=32M,nodeid=2"
+  " -numa node,mem=32M,nodeid=3"
+  " -numa cpu,node-id=0,socket-id=0"
+  " -numa cpu,node-id=1,socket-id=1"
+  " -numa cpu,node-id=2,socket-id=2"
+  " -numa cpu,node-id=3,socket-id=3"
+  " -object memory-backend-ram,id=ram0,size=128M"
+  " -object memory-backend-ram,id=nvm0,size=128M"
+  " -device pc-dimm,id=dimm0,memdev=ram0,node=1"
+  " -device nvdimm,id=dimm1,memdev=nvm0,node=2",
+  &data);
+free_test_data();
+}
+
+static void test_acpi_q35_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_Q35);
+}
+
+static void test_acpi_piix4_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_PC);
+}
+
 int main(int argc, char *argv[])
 {
 const char *arch = qtest_get_arch();
@@ -893,6 +929,8 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/q35/memhp", test_acpi_q35_tcg_memhp);
 qtest_add_func("acpi/piix4/numamem", test_acpi_piix4_tcg_numamem);
 qtest_add_func("acpi/q35/numamem", test_acpi_q35_tcg_numamem);
+qtest_add_func("acpi/piix4/dimmpxm", test_acpi_piix4_tcg_dimm_pxm);
+qtest_add_func("acpi/q35/dimmpxm", test_acpi_q35_tcg_dimm_pxm);
 }
 ret = g_test_run();
 boot_sector_cleanup(disk);
-- 
2.14.1




[Qemu-devel] [PATCH v5 3/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-10 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each DIMM device present at
boot time, including both PC-DIMM and NVDIMM, with the proximity
domain specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/i386/acpi-build.c | 56 
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..2c1f694da4 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,55 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+MemoryDeviceInfoList *info;
+MemoryDeviceInfo *mi;
+PCDIMMDeviceInfo *di;
+uint64_t end = base + len, cur, size;
+bool is_nvdimm;
+AcpiSratMemoryAffinity *numamem;
+MemoryAffinityFlags flags;
+
+for (cur = base, info = info_list;
+ cur < end;
+ cur += size, info = info->next) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!info) {
+build_srat_memory(numamem, cur, end - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+mi = info->value;
+is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
+di = !is_nvdimm ? mi->u.dimm.data : mi->u.nvdimm.data;
+
+if (cur < di->addr) {
+build_srat_memory(numamem, cur, di->addr - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+
+size = di->size;
+
+flags = MEM_AFFINITY_ENABLED;
+if (di->hotpluggable) {
+flags |= MEM_AFFINITY_HOTPLUGGABLE;
+}
+if (is_nvdimm) {
+flags |= MEM_AFFINITY_NON_VOLATILE;
+}
+
+build_srat_memory(numamem, di->addr, size, di->node, flags);
+}
+
+qapi_free_MemoryDeviceInfoList(info_list);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2483,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+   pcms->numa_nodes - 1);
 }
 
 build_header(linker, table_data,
-- 
2.14.1




[Qemu-devel] [PATCH v5 1/5] pc-dimm: make qmp_pc_dimm_device_list() sort devices by address

2018-03-10 Thread Haozhong Zhang
Make qmp_pc_dimm_device_list() return a list of devices sorted by start
address so that it can be reused in places that need a sorted list*.
Reuse the existing pc_dimm_built_list() to get the sorted list.

While at it, hide the recursive callbacks from callers, so that:

  qmp_pc_dimm_device_list(qdev_get_machine(), &list);

could be replaced with the simpler:

  list = qmp_pc_dimm_device_list();

* a follow-up patch will use it in build_srat()

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>
Acked-by: David Gibson <da...@gibson.dropbear.id.au> for ppc part
Reviewed-by: Bharata B Rao <bhar...@linux.vnet.ibm.com>
---
 hw/mem/pc-dimm.c | 83 +---
 hw/ppc/spapr.c   |  3 +-
 include/hw/mem/pc-dimm.h |  2 +-
 numa.c   |  4 +--
 qmp.c|  7 +---
 stubs/qmp_pc_dimm.c  |  4 +--
 6 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 6e74b61cb6..4d050fe2cd 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -162,45 +162,6 @@ uint64_t get_plugged_memory_size(void)
 return pc_existing_dimms_capacity(&error_abort);
 }
 
-int qmp_pc_dimm_device_list(Object *obj, void *opaque)
-{
-MemoryDeviceInfoList ***prev = opaque;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
-MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
-DeviceClass *dc = DEVICE_GET_CLASS(obj);
-PCDIMMDevice *dimm = PC_DIMM(obj);
-
-if (dev->id) {
-di->has_id = true;
-di->id = g_strdup(dev->id);
-}
-di->hotplugged = dev->hotplugged;
-di->hotpluggable = dc->hotpluggable;
-di->addr = dimm->addr;
-di->slot = dimm->slot;
-di->node = dimm->node;
-di->size = object_property_get_uint(OBJECT(dimm), 
PC_DIMM_SIZE_PROP,
-NULL);
-di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
-
-info->u.dimm.data = di;
-elem->value = info;
-elem->next = NULL;
-**prev = elem;
-*prev = &elem->next;
-}
-}
-
-object_child_foreach(obj, qmp_pc_dimm_device_list, opaque);
-return 0;
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
 unsigned long *bitmap = opaque;
@@ -276,6 +237,50 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
 return 0;
 }
 
+MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
+{
+GSList *dimms = NULL, *item;
+MemoryDeviceInfoList *list = NULL, *prev = NULL;
+
+object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
+
+for (item = dimms; item; item = g_slist_next(item)) {
+PCDIMMDevice *dimm = PC_DIMM(item->data);
+Object *obj = OBJECT(dimm);
+MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
+MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
+PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+DeviceClass *dc = DEVICE_GET_CLASS(obj);
+DeviceState *dev = DEVICE(obj);
+
+if (dev->id) {
+di->has_id = true;
+di->id = g_strdup(dev->id);
+}
+di->hotplugged = dev->hotplugged;
+di->hotpluggable = dc->hotpluggable;
+di->addr = dimm->addr;
+di->slot = dimm->slot;
+di->node = dimm->node;
+di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
+di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
+
+info->u.dimm.data = di;
+elem->value = info;
+elem->next = NULL;
+if (prev) {
+prev->next = elem;
+} else {
+list = elem;
+}
+prev = elem;
+}
+
+g_slist_free(dimms);
+
+return list;
+}
+
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
uint64_t address_space_size,
uint64_t *hint, uint64_t align, uint64_t size,
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7e1c858566..44a0670d11 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -722,8 +722,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState 
*spapr, void *fdt)
 }
 
 if (hotplug_lmb_start) {
-MemoryDeviceInfoList **prev = &dimms;
-qmp_pc_dimm_device_list(qdev_get_machine(), &prev);
+dimms = qmp_pc_dimm_device_list();
 }
 
 /* ibm,dynamic-memory */
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h

[Qemu-devel] [PATCH v5 0/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-10 Thread Haozhong Zhang
(Patch 5 is only for reviewers to run test cases in patch 4)

ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each DIMM device present at
boot time, including both PC-DIMM and NVDIMM, with the proximity
domain specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.
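
For intuition, the partitioning can be sketched as the standalone toy
below (made-up addresses, sizes and node numbers; the real code is
build_srat_hotpluggable_memory() in patch 3): each device gets a range
with its own node, and gaps plus the tail of the hotpluggable area fall
back to the default (last) node.

/* Toy sketch of the partitioning in patch 3; all values are made up. */
#include <stdio.h>
#include <inttypes.h>
#include <stdbool.h>

typedef struct {
    uint64_t addr, size;
    int node;
    bool nvdimm;
} ToyDimm;

static void emit(uint64_t base, uint64_t len, int node, bool nv)
{
    printf("SRAT: 0x%" PRIx64 "..0x%" PRIx64 " -> node %d%s\n",
           base, base + len, node, nv ? " (non-volatile)" : "");
}

int main(void)
{
    /* hotpluggable area and devices sorted by start address */
    uint64_t cur = 0x100000000ULL, end = 0x140000000ULL;
    ToyDimm dimms[] = {
        { 0x100000000ULL, 0x8000000ULL, 1, false },   /* pc-dimm, node 1 */
        { 0x110000000ULL, 0x8000000ULL, 2, true  },   /* nvdimm,  node 2 */
    };
    int i = 0, n = 2, default_node = 3;

    while (cur < end) {
        if (i == n) {                      /* no more devices: cover the rest */
            emit(cur, end - cur, default_node, false);
            break;
        }
        if (cur < dimms[i].addr) {         /* gap before the next device */
            emit(cur, dimms[i].addr - cur, default_node, false);
        }
        emit(dimms[i].addr, dimms[i].size, dimms[i].node, dimms[i].nvdimm);
        cur = dimms[i].addr + dimms[i].size;
        i++;
    }
    return 0;
}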

Changes in v5:
 * (Patch 2) Inline qmp nvdimm info in MemoryDeviceInfo.

Changes in v4:
 * (Patch 1) Update the commit message and add R-b from Igor Mammedov.
 * (Patch 2) Rebase on misc.json and update the commit message.
 * (Patch 3) Directly use di->addr and di->node.
 * (Patch 4) Drop the previous v3 patch 4 and add '-machine nvdimm=on'
   to parameters of test_acpi_one().
 * (Patch 4) Put PC-DIMM and NVDIMM to different numa nodes.
 * (Patch 4&5) Move binary blobs of ACPI tables to DO-NOT-APPLY patch 5.

Changes in v3:
 * (Patch 1&2) Use qmp_pc_dimm_device_list to get information of DIMM
   devices and move it to separate patches.
 * (Patch 3) Replace while loop by a more readable for loop.
 * (Patch 3) Refactor the flag setting code.
 * (Patch 3) s/'static-plugged'/'present at boot time' in commit message.

Changes in v2:
 * Build SRAT memory affinity structures of PC-DIMM devices as well.
 * Add test cases.

Haozhong Zhang (5):
  pc-dimm: make qmp_pc_dimm_device_list() sort devices by address
  qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList
  hw/acpi-build: build SRAT memory affinity structures for DIMM devices
  tests/bios-tables-test: add test cases for DIMM proximity
  [DO NOT APPLY] test/acpi-test-data: add ACPI tables for dimmpxm test

 hmp.c |  14 --
 hw/i386/acpi-build.c  |  56 +++--
 hw/mem/pc-dimm.c  |  91 +++---
 hw/ppc/spapr.c|   3 +-
 include/hw/mem/pc-dimm.h  |   2 +-
 numa.c|  23 +
 qapi/misc.json|   6 ++-
 qmp.c |   7 +--
 stubs/qmp_pc_dimm.c   |   4 +-
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  38 ++
 20 files changed, 177 insertions(+), 67 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

-- 
2.14.1




[Qemu-devel] [PATCH v5 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-10 Thread Haozhong Zhang
QEMU may need to treat PC-DIMM and NVDIMM devices differently, e.g.,
when deciding whether the non-volatile flag bit is needed in SRAT
memory affinity structures.

An 'nvdimm' branch is added to the union type MemoryDeviceInfo to
record information of NVDIMM devices. It currently reuses
PCDIMMDeviceInfo; NVDIMM-specific data can be added with a dedicated
subtype when necessary in the future.

It also fixes "info memory-devices"/query-memory-devices, which
currently show nvdimm devices as dimm devices, because
object_dynamic_cast(obj, TYPE_PC_DIMM) happily casts an nvdimm to
TYPE_PC_DIMM, from which it is inherited.
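
In generated C the new branch shows up alongside 'dimm' in the
MemoryDeviceInfo union. The sketch below is a hand-written
approximation of that shape (stand-in declarations, not the
QAPI-generated header) to make the u.dimm.data / u.nvdimm.data
accesses in the hunks easier to follow:

/* Shape sketch only -- NOT the generated header.  Stand-in declarations
 * keep this fragment self-contained; the real definitions live in the
 * QAPI-generated type headers in the build tree. */
typedef struct PCDIMMDeviceInfo PCDIMMDeviceInfo;     /* stand-in */

typedef enum {
    MEMORY_DEVICE_INFO_KIND_DIMM,
    MEMORY_DEVICE_INFO_KIND_NVDIMM,
} MemoryDeviceInfoKind;

typedef struct MemoryDeviceInfo {
    MemoryDeviceInfoKind type;             /* discriminator */
    union {                                /* branch selected by 'type' */
        struct { PCDIMMDeviceInfo *data; } dimm;
        struct { PCDIMMDeviceInfo *data; } nvdimm;
    } u;
} MemoryDeviceInfo;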

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hmp.c| 14 +++---
 hw/mem/pc-dimm.c | 10 +-
 numa.c   | 19 +--
 qapi/misc.json   |  6 +-
 4 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/hmp.c b/hmp.c
index 016cb5c4f1..011a7c6f35 100644
--- a/hmp.c
+++ b/hmp.c
@@ -2421,7 +2421,18 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 switch (value->type) {
 case MEMORY_DEVICE_INFO_KIND_DIMM:
 di = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+di = value->u.nvdimm.data;
+break;
+
+default:
+di = NULL;
+break;
+}
 
+if (di) {
 monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
MemoryDeviceInfoKind_str(value->type),
di->id ? di->id : "");
@@ -2434,9 +2445,6 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
di->hotplugged ? "true" : "false");
 monitor_printf(mon, "  hotpluggable: %s\n",
di->hotpluggable ? "true" : "false");
-break;
-default:
-break;
 }
 }
 }
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 4d050fe2cd..51350d9c2d 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
@@ -250,6 +251,7 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
 PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
 DeviceClass *dc = DEVICE_GET_CLASS(obj);
 DeviceState *dev = DEVICE(obj);
 
@@ -265,7 +267,13 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
 di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
 
-info->u.dimm.data = di;
+if (!is_nvdimm) {
+info->u.dimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+} else {
+info->u.nvdimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
+}
 elem->value = info;
 elem->next = NULL;
 if (prev) {
diff --git a/numa.c b/numa.c
index 94427046ec..1116c90af9 100644
--- a/numa.c
+++ b/numa.c
@@ -529,18 +529,25 @@ static void numa_stat_memory_devices(NumaNodeMem 
node_mem[])
 
 if (value) {
 switch (value->type) {
-case MEMORY_DEVICE_INFO_KIND_DIMM: {
+case MEMORY_DEVICE_INFO_KIND_DIMM:
 pcdimm_info = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+pcdimm_info = value->u.nvdimm.data;
+break;
+
+default:
+pcdimm_info = NULL;
+break;
+}
+
+if (pcdimm_info) {
 node_mem[pcdimm_info->node].node_mem += pcdimm_info->size;
 if (pcdimm_info->hotpluggable && pcdimm_info->hotplugged) {
 node_mem[pcdimm_info->node].node_plugged_mem +=
 pcdimm_info->size;
 }
-break;
-}
-
-default:
-break;
 }
 }
 }
diff --git a/qapi/misc.json b/qapi/misc.json
index bcd5d10778..6bf082f612 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2852,7 +2852,11 @@
 #
 # Since: 2.1
 ##
-{ 'union': 'MemoryDeviceInfo', 'data': {'dimm': 'PCDIMMDeviceInfo'} }
+{ 'union': 'MemoryDeviceInfo',
+  'data': { 'dimm': 'PCDIMMDeviceInfo',
+'nvdimm': 'PCDIMMDeviceInfo'
+  }
+}
 
 ##
 # @query-memory-devices:
-- 
2.14.1




Re: [Qemu-devel] [PATCH v4 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-08 Thread Haozhong Zhang
On 03/08/18 11:22 -0600, Eric Blake wrote:
> On 03/07/2018 08:33 PM, Haozhong Zhang wrote:
> > It may need to treat PC-DIMM and NVDIMM differently, e.g., when
> > deciding the necessity of non-volatile flag bit in SRAT memory
> > affinity structures.
> > 
> > NVDIMMDeviceInfo, which inherits from PCDIMMDeviceInfo, is added to
> > union type MemoryDeviceInfo to record information of NVDIMM devices.
> > The NVDIMM-specific data is currently left empty and will be filled
> > when necessary in the future.
> > 
> > It also fixes "info memory-devices"/query-memory-devices which
> > currently show nvdimm devices as dimm devices since
> > object_dynamic_cast(obj, TYPE_PC_DIMM) happily cast nvdimm to
> > TYPE_PC_DIMM which it's been inherited from.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> 
> > +++ b/qapi/misc.json
> > @@ -2830,6 +2830,18 @@
> > }
> >   }
> > +##
> > +# @NVDIMMDeviceInfo:
> > +#
> > +# NVDIMMDevice state information
> > +#
> > +# Since: 2.12
> > +##
> > +{ 'struct': 'NVDIMMDeviceInfo',
> > +  'base': 'PCDIMMDeviceInfo',
> > +  'data': {}
> > +}
> > +
> 
> As long as you don't have any data members to add, you could omit this
> type...

Sure, I'll change it in the next version.

Haozhong

> 
> >   ##
> >   # @MemoryDeviceInfo:
> >   #
> > @@ -2837,7 +2849,11 @@
> >   #
> >   # Since: 2.1
> >   ##
> > -{ 'union': 'MemoryDeviceInfo', 'data': {'dimm': 'PCDIMMDeviceInfo'} }
> > +{ 'union': 'MemoryDeviceInfo',
> > +  'data': { 'dimm': 'PCDIMMDeviceInfo',
> > +'nvdimm': 'NVDIMMDeviceInfo'
> > +  }
> 
> and just write this as
> 
>  'data': { 'dimm': 'PCDIMMDeviceInfo',
>'nvdimm': 'PCDIMMDeviceInfo' }
> 
> If, down the road, you want to add data members to one but not both of the
> branches, we can add a new (sub-)type at that time, and it won't break
> backwards compatibility.
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH v4 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-03-07 Thread Haozhong Zhang
Ping?

On 02/28/18 15:25 +0800, Haozhong Zhang wrote:
> QEMU writes to vNVDIMM backends in the vNVDIMM label emulation and
> live migration. If the backend is on the persistent memory, QEMU needs
> to take proper operations to ensure its writes persistent on the
> persistent memory. Otherwise, a host power failure may result in the
> loss the guest data on the persistent memory.
> 
> This v3 patch series is based on Marcel's patch "mem: add share
> parameter to memory-backend-ram" [1] because of the changes in patch 1.
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03858.html
> 
> Previous versions can be found at
> v3: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04365.html
> v2: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg01579.html
> v1: https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html
> 
> Changes in v4:
>  * (Patch 2) Fix compilation errors found by patchew.
> 
> Changes in v3:
>  * (Patch 5) Add a is_pmem flag to ram_handle_compressed() and handle
>PMEM writes in it, so we don't need the _common function.
>  * (Patch 6) Expose qemu_get_buffer_common so we can remove the
>unnecessary qemu_get_buffer_to_pmem wrapper.
>  * (Patch 8) Add a is_pmem flag to xbzrle_decode_buffer() and handle
>PMEM writes in it, so we can remove the unnecessary
>xbzrle_decode_buffer_{common, to_pmem}.
>  * Move libpmem stubs to stubs/pmem.c and fix the compilation failures
>of test-{xbzrle,vmstate}.c.
> 
> Changes in v2:
>  * (Patch 1) Use a flags parameter in file ram allocation functions.
>  * (Patch 2) Add a new option 'pmem' to hostmem-file.
>  * (Patch 3) Use libpmem to operate on the persistent memory, rather
>than re-implementing those operations in QEMU.
>  * (Patch 5-8) Consider the write persistence in the migration path.
> 
> Haozhong Zhang (8):
>   [1/8] memory, exec: switch file ram allocation functions to 'flags' 
> parameters
>   [2/8] hostmem-file: add the 'pmem' option
>   [3/8] configure: add libpmem support
>   [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
>   [5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
>   [6/8] migration/ram: ensure write persistence on loading normal pages to 
> PMEM
>   [7/8] migration/ram: ensure write persistence on loading compressed pages 
> to PMEM
>   [8/8] migration/ram: ensure write persistence on loading xbzrle pages to 
> PMEM
> 
>  backends/hostmem-file.c | 27 +++-
>  configure   | 35 ++
>  docs/nvdimm.txt | 14 +++
>  exec.c  | 20 ---
>  hw/mem/nvdimm.c |  9 ++-
>  include/exec/memory.h   | 12 +++--
>  include/exec/ram_addr.h | 28 +++--
>  include/migration/qemu-file-types.h |  2 ++
>  include/qemu/pmem.h | 27 
>  memory.c|  8 +++---
>  migration/qemu-file.c   | 29 ++
>  migration/ram.c | 49 
> +++--
>  migration/ram.h |  2 +-
>  migration/rdma.c|  2 +-
>  migration/xbzrle.c  |  8 --
>  migration/xbzrle.h  |  3 ++-
>  numa.c  |  2 +-
>  qemu-options.hx |  9 ++-
>  stubs/Makefile.objs |  1 +
>  stubs/pmem.c| 37 
>  tests/Makefile.include  |  4 +--
>  tests/test-xbzrle.c |  4 +--
>  22 files changed, 285 insertions(+), 47 deletions(-)
>  create mode 100644 include/qemu/pmem.h
>  create mode 100644 stubs/pmem.c
> 
> -- 
> 2.14.1
> 
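
For context, the write persistence discussed above boils down to
flushing the written range out of the CPU caches before reporting
completion. A minimal standalone sketch of the libpmem calls involved
(illustrative only; it assumes libpmem is installed and that the path
below lives on a DAX-capable filesystem) looks like this:

/* Minimal libpmem sketch; path and sizes are made up.
 * Build with: cc demo.c -lpmem */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;
    char *buf = pmem_map_file("/mnt/pmem0/label-demo", 4096,
                              PMEM_FILE_CREATE, 0600,
                              &mapped_len, &is_pmem);
    if (!buf) {
        perror("pmem_map_file");
        return 1;
    }

    const char msg[] = "guest label data";
    memcpy(buf, msg, sizeof(msg));

    if (is_pmem) {
        pmem_persist(buf, sizeof(msg));    /* flush cache lines + fence */
    } else {
        pmem_msync(buf, sizeof(msg));      /* regular msync fallback */
    }

    pmem_unmap(buf, mapped_len);
    return 0;
}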



Re: [Qemu-devel] [PATCH v4 0/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-07 Thread Haozhong Zhang
On 03/08/18 10:33 +0800, Haozhong Zhang wrote:
> (Patch 5 is only for reviewers to run test cases in patch 4)
> 
> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
> domain of a NVDIMM SPA range must match with corresponding entry in
> SRAT table.
> 
> The address ranges of vNVDIMM in QEMU are allocated from the
> hot-pluggable address space, which is entirely covered by one SRAT
> memory affinity structure. However, users can set the vNVDIMM
> proximity domain in NFIT SPA range structure by the 'node' property of
> '-device nvdimm' to a value different than the one in the above SRAT
> memory affinity structure.
> 
> In order to solve such proximity domain mismatch, this patch builds
> one SRAT memory affinity structure for each DIMM device present at
> boot time, including both PC-DIMM and NVDIMM, with the proximity
> domain specified in '-device pc-dimm' or '-device nvdimm'.
> 
> The remaining hot-pluggable address space is covered by one or multiple
> SRAT memory affinity structures with the proximity domain of the last
> node as before.
> 
> Changes in v4:
>  * (Patch 1) Update the commit message and add R-b from Igor Mammedov.
>  * (Patch 2) Rebase on misc.json and update the commit message.
>  * (Patch 3) Directly use di-addr and di-node.
>  * (Patch 4) Drop the previous v3 patch 3 and add '-machine nvdimm=on'
 ^^ should be 'v3 patch 4'
 
>to parameters of test_acpi_one().
>  * (Patch 4) Put PC-DIMM and NVDIMM to different numa nodes.
>  * (Patch 4&5) Move binary blobs of ACPI tables to DO-NOT-APPLY patch 5.
> 
> Changes in v3:
>  * (Patch 1&2) Use qmp_pc_dimm_device_list to get information of DIMM
>devices and move it to separate patches.
>  * (Patch 3) Replace while loop by a more readable for loop.
>  * (Patch 3) Refactor the flag setting code.
>  * (Patch 3) s/'static-plugged'/'present at boot time' in commit message.
> 
> Changes in v2:
>  * Build SRAT memory affinity structures of PC-DIMM devices as well.
>  * Add test cases.
> 
> 
> Haozhong Zhang (5):
>   pc-dimm: make qmp_pc_dimm_device_list() sort devices by address
>   qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList
>   hw/acpi-build: build SRAT memory affinity structures for DIMM devices
>   tests/bios-tables-test: add test cases for DIMM proximity
>   [DO NOT APPLY] test/acpi-test-data: add ACPI tables for dimmpxm test
> 
>  hmp.c |  14 +++--
>  hw/i386/acpi-build.c  |  57 ++--
>  hw/mem/pc-dimm.c  |  99 
> --
>  hw/ppc/spapr.c|   3 +-
>  include/hw/mem/pc-dimm.h  |   2 +-
>  numa.c|  23 
>  qapi/misc.json|  18 ++-
>  qmp.c |   7 +--
>  stubs/qmp_pc_dimm.c   |   4 +-
>  tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
>  tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
>  tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
>  tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
>  tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
>  tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
>  tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
>  tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
>  tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
>  tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
>  tests/bios-tables-test.c  |  38 +
>  20 files changed, 198 insertions(+), 67 deletions(-)
>  create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
>  create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
>  create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
>  create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
>  create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
>  create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
>  create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
>  create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
>  create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
>  create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm
> 
> -- 
> 2.14.1
> 



[Qemu-devel] [PATCH v4 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-07 Thread Haozhong Zhang
QEMU may need to treat PC-DIMM and NVDIMM devices differently, e.g.,
when deciding whether the non-volatile flag bit is needed in SRAT
memory affinity structures.

NVDIMMDeviceInfo, which inherits from PCDIMMDeviceInfo, is added to
the union type MemoryDeviceInfo to record information of NVDIMM
devices. The NVDIMM-specific data is currently left empty and will be
filled when necessary in the future.

It also fixes "info memory-devices"/query-memory-devices, which
currently show nvdimm devices as dimm devices, because
object_dynamic_cast(obj, TYPE_PC_DIMM) happily casts an nvdimm to
TYPE_PC_DIMM, from which it is inherited.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hmp.c| 14 +++---
 hw/mem/pc-dimm.c | 20 ++--
 numa.c   | 19 +--
 qapi/misc.json   | 18 +-
 4 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/hmp.c b/hmp.c
index 016cb5c4f1..692cb81868 100644
--- a/hmp.c
+++ b/hmp.c
@@ -2421,7 +2421,18 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 switch (value->type) {
 case MEMORY_DEVICE_INFO_KIND_DIMM:
 di = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+di = qapi_NVDIMMDeviceInfo_base(value->u.nvdimm.data);
+break;
+
+default:
+di = NULL;
+break;
+}
 
+if (di) {
 monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
MemoryDeviceInfoKind_str(value->type),
di->id ? di->id : "");
@@ -2434,9 +2445,6 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
di->hotplugged ? "true" : "false");
 monitor_printf(mon, "  hotpluggable: %s\n",
di->hotpluggable ? "true" : "false");
-break;
-default:
-break;
 }
 }
 }
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 4d050fe2cd..866ecc699a 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
@@ -249,10 +250,19 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 Object *obj = OBJECT(dimm);
 MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+PCDIMMDeviceInfo *di;
+NVDIMMDeviceInfo *ndi;
+bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
 DeviceClass *dc = DEVICE_GET_CLASS(obj);
 DeviceState *dev = DEVICE(obj);
 
+if (!is_nvdimm) {
+di = g_new0(PCDIMMDeviceInfo, 1);
+} else {
+ndi = g_new0(NVDIMMDeviceInfo, 1);
+di = qapi_NVDIMMDeviceInfo_base(ndi);
+}
+
 if (dev->id) {
 di->has_id = true;
 di->id = g_strdup(dev->id);
@@ -265,7 +275,13 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
 di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
 
-info->u.dimm.data = di;
+if (!is_nvdimm) {
+info->u.dimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+} else {
+info->u.nvdimm.data = ndi;
+info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
+}
 elem->value = info;
 elem->next = NULL;
 if (prev) {
diff --git a/numa.c b/numa.c
index 7ca2bef63f..5f291fc919 100644
--- a/numa.c
+++ b/numa.c
@@ -529,18 +529,25 @@ static void numa_stat_memory_devices(NumaNodeMem 
node_mem[])
 
 if (value) {
 switch (value->type) {
-case MEMORY_DEVICE_INFO_KIND_DIMM: {
+case MEMORY_DEVICE_INFO_KIND_DIMM:
 pcdimm_info = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+pcdimm_info = qapi_NVDIMMDeviceInfo_base(value->u.nvdimm.data);
+break;
+
+default:
+pcdimm_info = NULL;
+break;
+}
+
+if (pcdimm_info) {
 node_mem[pcdimm_info->node].node_mem += pcdimm_info->size;
 if (pcdimm_info->hotpluggable && pcdimm_info->hotplugged) {
 node_mem[pcdimm_info->node].node_plugged_mem +=
 pcdimm_info

[Qemu-devel] [PATCH v4 3/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-07 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in the NFIT SPA range structure via the 'node'
property of '-device nvdimm' to a value different from the one in the
above SRAT memory affinity structure.

To resolve such a proximity domain mismatch, this patch builds one
SRAT memory affinity structure for each DIMM device present at boot
time, including both PC-DIMM and NVDIMM, with the proximity domain
specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or more
SRAT memory affinity structures with the proximity domain of the last
node, as before.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/i386/acpi-build.c | 57 
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..cb99c63fcf 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,56 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+MemoryDeviceInfoList *info;
+MemoryDeviceInfo *mi;
+PCDIMMDeviceInfo *di;
+uint64_t end = base + len, cur, size;
+bool is_nvdimm;
+AcpiSratMemoryAffinity *numamem;
+MemoryAffinityFlags flags;
+
+for (cur = base, info = info_list;
+ cur < end;
+ cur += size, info = info->next) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!info) {
+build_srat_memory(numamem, cur, end - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+mi = info->value;
+is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
+di = !is_nvdimm ? mi->u.dimm.data :
+  qapi_NVDIMMDeviceInfo_base(mi->u.nvdimm.data);
+
+if (cur < di->addr) {
+build_srat_memory(numamem, cur, di->addr - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+
+size = di->size;
+
+flags = MEM_AFFINITY_ENABLED;
+if (di->hotpluggable) {
+flags |= MEM_AFFINITY_HOTPLUGGABLE;
+}
+if (is_nvdimm) {
+flags |= MEM_AFFINITY_NON_VOLATILE;
+}
+
+build_srat_memory(numamem, di->addr, size, di->node, flags);
+}
+
+qapi_free_MemoryDeviceInfoList(info_list);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2484,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+   pcms->numa_nodes - 1);
 }
 
 build_header(linker, table_data,
-- 
2.14.1




[Qemu-devel] [PATCH v4 4/5] tests/bios-tables-test: add test cases for DIMM proximity

2018-03-07 Thread Haozhong Zhang
QEMU now builds one SRAT memory affinity structure for each PC-DIMM
and NVDIMM device present at boot time, with the proximity domain
specified in the device option 'node', rather than only one SRAT
memory affinity structure covering the entire hotpluggable address
space with the proximity domain of the last node.

Add test cases on PC and Q35 machines with 4 proximity domains, and
one PC-DIMM and one NVDIMM attached to the 2nd and 3rd proximity
domains respectively. Check whether the QEMU-built SRAT tables match
the expected ones.

The following ACPI tables need to be added for this test:
  tests/acpi-test-data/pc/APIC.dimmpxm
  tests/acpi-test-data/pc/DSDT.dimmpxm
  tests/acpi-test-data/pc/NFIT.dimmpxm
  tests/acpi-test-data/pc/SRAT.dimmpxm
  tests/acpi-test-data/pc/SSDT.dimmpxm
  tests/acpi-test-data/q35/APIC.dimmpxm
  tests/acpi-test-data/q35/DSDT.dimmpxm
  tests/acpi-test-data/q35/NFIT.dimmpxm
  tests/acpi-test-data/q35/SRAT.dimmpxm
  tests/acpi-test-data/q35/SSDT.dimmpxm
New APIC and DSDT tables are needed because of the multi-processor
configuration. New NFIT and SSDT tables are needed because of NVDIMM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Igor Mammedov <imamm...@redhat.com>
---
 tests/bios-tables-test.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 65b271a173..34b55ff812 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -869,6 +869,42 @@ static void test_acpi_piix4_tcg_numamem(void)
 free_test_data();
 }
 
+static void test_acpi_tcg_dimm_pxm(const char *machine)
+{
+test_data data;
+
+memset(&data, 0, sizeof(data));
+data.machine = machine;
+data.variant = ".dimmpxm";
+test_acpi_one(" -machine nvdimm=on"
+  " -smp 4,sockets=4"
+  " -m 128M,slots=3,maxmem=1G"
+  " -numa node,mem=32M,nodeid=0"
+  " -numa node,mem=32M,nodeid=1"
+  " -numa node,mem=32M,nodeid=2"
+  " -numa node,mem=32M,nodeid=3"
+  " -numa cpu,node-id=0,socket-id=0"
+  " -numa cpu,node-id=1,socket-id=1"
+  " -numa cpu,node-id=2,socket-id=2"
+  " -numa cpu,node-id=3,socket-id=3"
+  " -object memory-backend-ram,id=ram0,size=128M"
+  " -object memory-backend-ram,id=nvm0,size=128M"
+  " -device pc-dimm,id=dimm0,memdev=ram0,node=1"
+  " -device nvdimm,id=dimm1,memdev=nvm0,node=2",
+  &data);
+free_test_data();
+}
+
+static void test_acpi_q35_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_Q35);
+}
+
+static void test_acpi_piix4_tcg_dimm_pxm(void)
+{
+test_acpi_tcg_dimm_pxm(MACHINE_PC);
+}
+
 int main(int argc, char *argv[])
 {
 const char *arch = qtest_get_arch();
@@ -893,6 +929,8 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/q35/memhp", test_acpi_q35_tcg_memhp);
 qtest_add_func("acpi/piix4/numamem", test_acpi_piix4_tcg_numamem);
 qtest_add_func("acpi/q35/numamem", test_acpi_q35_tcg_numamem);
+qtest_add_func("acpi/piix4/dimmpxm", test_acpi_piix4_tcg_dimm_pxm);
+qtest_add_func("acpi/q35/dimmpxm", test_acpi_q35_tcg_dimm_pxm);
 }
 ret = g_test_run();
 boot_sector_cleanup(disk);
-- 
2.14.1




[Qemu-devel] [PATCH v4 5/5] [DO NOT APPLY] test/acpi-test-data: add ACPI tables for dimmpxm test

2018-03-07 Thread Haozhong Zhang
Reviewers can use ACPI tables in this patch to run
test_acpi_{piix4,q35}_tcg_dimm_pxm cases.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 10 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

diff --git a/tests/acpi-test-data/pc/APIC.dimmpxm 
b/tests/acpi-test-data/pc/APIC.dimmpxm
new file mode 100644
index 
..427bb08248e6a029c1c988f74f5e48f93ee4ebe0
(GIT binary patch data for APIC.dimmpxm omitted)

diff --git a/tests/acpi-test-data/pc/DSDT.dimmpxm 
b/tests/acpi-test-data/pc/DSDT.dimmpxm
new file mode 100644
index 
..38661cb13ee348718ab45bfc69452cd642cf9bb9
(GIT binary patch data for DSDT.dimmpxm omitted)

[Qemu-devel] [PATCH v4 0/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-07 Thread Haozhong Zhang
(Patch 5 is only for reviewers to run test cases in patch 4)

ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in the NFIT SPA range structure via the 'node'
property of '-device nvdimm' to a value different from the one in the
above SRAT memory affinity structure.

To resolve such a proximity domain mismatch, this series builds one
SRAT memory affinity structure for each DIMM device present at boot
time, including both PC-DIMM and NVDIMM, with the proximity domain
specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or more
SRAT memory affinity structures with the proximity domain of the last
node, as before.

Changes in v4:
 * (Patch 1) Update the commit message and add R-b from Igor Mammedov.
 * (Patch 2) Rebase on misc.json and update the commit message.
 * (Patch 3) Directly use di->addr and di->node.
 * (Patch 4) Drop the previous v3 patch 3 and add '-machine nvdimm=on'
   to parameters of test_acpi_one().
 * (Patch 4) Put PC-DIMM and NVDIMM to different numa nodes.
 * (Patch 4&5) Move binary blobs of ACPI tables to DO-NOT-APPLY patch 5.

Changes in v3:
 * (Patch 1&2) Use qmp_pc_dimm_device_list to get information of DIMM
   devices and move it to separate patches.
 * (Patch 3) Replace while loop by a more readable for loop.
 * (Patch 3) Refactor the flag setting code.
 * (Patch 3) s/'static-plugged'/'present at boot time' in commit message.

Changes in v2:
 * Build SRAT memory affinity structures of PC-DIMM devices as well.
 * Add test cases.


Haozhong Zhang (5):
  pc-dimm: make qmp_pc_dimm_device_list() sort devices by address
  qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList
  hw/acpi-build: build SRAT memory affinity structures for DIMM devices
  tests/bios-tables-test: add test cases for DIMM proximity
  [DO NOT APPLY] test/acpi-test-data: add ACPI tables for dimmpxm test

 hmp.c |  14 +++--
 hw/i386/acpi-build.c  |  57 ++--
 hw/mem/pc-dimm.c  |  99 --
 hw/ppc/spapr.c|   3 +-
 include/hw/mem/pc-dimm.h  |   2 +-
 numa.c|  23 
 qapi/misc.json|  18 ++-
 qmp.c |   7 +--
 stubs/qmp_pc_dimm.c   |   4 +-
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 144 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6803 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 472 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 144 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9487 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 472 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  38 +
 20 files changed, 198 insertions(+), 67 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

-- 
2.14.1




[Qemu-devel] [PATCH v4 1/5] pc-dimm: make qmp_pc_dimm_device_list() sort devices by address

2018-03-07 Thread Haozhong Zhang
Make qmp_pc_dimm_device_list() return a list of devices sorted by
start address so that it can be reused in places that need a sorted
list*. Reuse the existing pc_dimm_built_list() to get the sorted list.

While at it, hide the recursive callback from callers, so that:

  qmp_pc_dimm_device_list(qdev_get_machine(), &list);

can be replaced with the simpler:

  list = qmp_pc_dimm_device_list();

* follow up patch will use it in build_srat()
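
For illustration, pc_dimm_built_list() (not shown in this hunk) keeps
the list ordered, if memory serves by inserting each realized DIMM
with GLib's g_slist_insert_sorted(). The toy program below shows that
mechanism with made-up types and addresses, not the QEMU code:

/* Toy illustration of an address-sorted GSList built with
 * g_slist_insert_sorted().  Build with:
 *   cc demo.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

typedef struct {
    guint64 addr;
    const char *id;
} ToyDimm;

static gint dimm_addr_cmp(gconstpointer a, gconstpointer b)
{
    const ToyDimm *x = a, *y = b;

    return x->addr < y->addr ? -1 : (x->addr > y->addr ? 1 : 0);
}

int main(void)
{
    ToyDimm d1 = { 0x120000000ULL, "dimm1" };
    ToyDimm d2 = { 0x100000000ULL, "dimm0" };
    ToyDimm d3 = { 0x140000000ULL, "nvdimm0" };
    GSList *list = NULL, *item;

    /* insertion order is arbitrary; the comparator keeps the list sorted */
    list = g_slist_insert_sorted(list, &d1, dimm_addr_cmp);
    list = g_slist_insert_sorted(list, &d2, dimm_addr_cmp);
    list = g_slist_insert_sorted(list, &d3, dimm_addr_cmp);

    for (item = list; item; item = g_slist_next(item)) {
        const ToyDimm *d = item->data;
        printf("%s at 0x%" G_GINT64_MODIFIER "x\n", d->id, d->addr);
    }

    g_slist_free(list);
    return 0;
}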

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>
---
 hw/mem/pc-dimm.c | 83 +---
 hw/ppc/spapr.c   |  3 +-
 include/hw/mem/pc-dimm.h |  2 +-
 numa.c   |  4 +--
 qmp.c|  7 +---
 stubs/qmp_pc_dimm.c  |  4 +--
 6 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 6e74b61cb6..4d050fe2cd 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -162,45 +162,6 @@ uint64_t get_plugged_memory_size(void)
 return pc_existing_dimms_capacity(&error_abort);
 }
 
-int qmp_pc_dimm_device_list(Object *obj, void *opaque)
-{
-MemoryDeviceInfoList ***prev = opaque;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
-MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
-DeviceClass *dc = DEVICE_GET_CLASS(obj);
-PCDIMMDevice *dimm = PC_DIMM(obj);
-
-if (dev->id) {
-di->has_id = true;
-di->id = g_strdup(dev->id);
-}
-di->hotplugged = dev->hotplugged;
-di->hotpluggable = dc->hotpluggable;
-di->addr = dimm->addr;
-di->slot = dimm->slot;
-di->node = dimm->node;
-di->size = object_property_get_uint(OBJECT(dimm), 
PC_DIMM_SIZE_PROP,
-NULL);
-di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
-
-info->u.dimm.data = di;
-elem->value = info;
-elem->next = NULL;
-**prev = elem;
-*prev = &elem->next;
-}
-}
-
-object_child_foreach(obj, qmp_pc_dimm_device_list, opaque);
-return 0;
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
 unsigned long *bitmap = opaque;
@@ -276,6 +237,50 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
 return 0;
 }
 
+MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
+{
+GSList *dimms = NULL, *item;
+MemoryDeviceInfoList *list = NULL, *prev = NULL;
+
+object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
+
+for (item = dimms; item; item = g_slist_next(item)) {
+PCDIMMDevice *dimm = PC_DIMM(item->data);
+Object *obj = OBJECT(dimm);
+MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
+MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
+PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+DeviceClass *dc = DEVICE_GET_CLASS(obj);
+DeviceState *dev = DEVICE(obj);
+
+if (dev->id) {
+di->has_id = true;
+di->id = g_strdup(dev->id);
+}
+di->hotplugged = dev->hotplugged;
+di->hotpluggable = dc->hotpluggable;
+di->addr = dimm->addr;
+di->slot = dimm->slot;
+di->node = dimm->node;
+di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
+di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
+
+info->u.dimm.data = di;
+elem->value = info;
+elem->next = NULL;
+if (prev) {
+prev->next = elem;
+} else {
+list = elem;
+}
+prev = elem;
+}
+
+g_slist_free(dimms);
+
+return list;
+}
+
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
uint64_t address_space_size,
uint64_t *hint, uint64_t align, uint64_t size,
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7e1c858566..44a0670d11 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -722,8 +722,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState 
*spapr, void *fdt)
 }
 
 if (hotplug_lmb_start) {
-MemoryDeviceInfoList **prev = &dimms;
-qmp_pc_dimm_device_list(qdev_get_machine(), &prev);
+dimms = qmp_pc_dimm_device_list();
 }
 
 /* ibm,dynamic-memory */
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index d83b957829..1fc479281c 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -93,7 +93,7 @@

Re: [Qemu-devel] [Xen-devel] [RFC QEMU PATCH v4 00/10] Implement vNVDIMM for Xen HVM guest

2018-03-05 Thread Haozhong Zhang
On 03/02/18 12:03 +, Anthony PERARD wrote:
> On Wed, Feb 28, 2018 at 05:36:59PM +0800, Haozhong Zhang wrote:
> > On 02/27/18 17:22 +, Anthony PERARD wrote:
> > > On Thu, Dec 07, 2017 at 06:18:02PM +0800, Haozhong Zhang wrote:
> > > > This is the QEMU part patches that works with the associated Xen
> > > > patches to enable vNVDIMM support for Xen HVM domains. Xen relies on
> > > > QEMU to build guest NFIT and NVDIMM namespace devices, and allocate
> > > > guest address space for vNVDIMM devices.
> > > 
> > > I've got other question, and maybe possible improvements.
> > > 
> > > When QEMU build the ACPI tables, it also initialize some MemoryRegion,
> > > which use more guest memory. Do you know if those regions are used with
> > > your patch series on Xen?
> > 
> > Yes, that's why dm_acpi_size is introduced.
> > 
> > > Otherwise, we could try to avoid their
> > > creation with this:
> > > In xenfv_machine_options()
> > > m->rom_file_has_mr = false;
> > > (setting this in xen_hvm_init() would probably be better, but I havn't
> > > try)
> > 
> > If my memory is correct, simply setting rom_file_has_mr to false does
> > not work (though I cannot remind the exact reason). I'll have a look
> > as the code to refresh my memory.
> 
> I've played a bit with this idea, but without a proper NVDIMM available
> for the guest, so I don't know if it's going to work properly without
> the mr.
> 
> To make it work, I had to disable some code in acpi_build_update() that
> make use of the MemoryRegions, as well as an assert in acpi_setup().
> After those small hacks, I could boot the guest, and I've check that the
> expected ACPI tables where there, and they looked correct to my eyes.
> And least `ndctl list` works and showed the nvdimm (that I have
> configured on QEMU's cmdline).
> 
> But I may not have been far enough with my tests, and maybe something
> later relies on the MRs, especially the _DSM method that I don't know if
> it was working properly.
> 
> Anyway, that why I proposed the idea, and if we can avoid more
> uncertainty about how much guest memory QEMU is going to use, that would
> be good.
> 

Yes, I also tested some non-trivial _DSM methods, and it looks like
rom files without memory regions can work with Xen after some
modifications. I'll apply this idea in the next version if no other
issues are found.

Thanks,
Haozhong



Re: [Qemu-devel] [PATCH v3 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-05 Thread Haozhong Zhang
On 03/05/18 13:14 -0600, Eric Blake wrote:
> On 03/05/2018 12:57 AM, Haozhong Zhang wrote:
> > It may need to treat PC-DIMM and NVDIMM differently, e.g., when
> > deciding the necessity of non-volatile flag bit in SRAT memory
> > affinity structures.
> > 
> > NVDIMMDeviceInfo, which inherits from PCDIMMDeviceInfo, is added to
> > union type MemoryDeviceInfo to record information of NVDIMM devices.
> > The NVDIMM-specific data is currently left empty and will be filled
> > when necessary in the future.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >   hmp.c| 14 +++---
> >   hw/mem/pc-dimm.c | 20 ++--
> >   numa.c   | 19 +--
> >   qapi-schema.json | 18 +-
> 
> Will need rebasing now that the contents live in qapi/misc.json.

will do

> 
> > +++ b/qapi-schema.json
> > @@ -2920,6 +2920,18 @@
> > }
> >   }
> > +##
> > +# @NVDIMMDeviceInfo:
> > +#
> > +# NVDIMMDevice state information
> > +#
> > +# Since: 2.12
> > +##
> > +{ 'struct': 'NVDIMMDeviceInfo',
> > +  'base': 'PCDIMMDeviceInfo',
> > +  'data': {}
> > +}
> 
> You added no data, so why did you need the type?
> 
> > +
> >   ##
> >   # @MemoryDeviceInfo:
> >   #
> > @@ -2927,7 +2939,11 @@
> >   #
> >   # Since: 2.1
> >   ##
> > -{ 'union': 'MemoryDeviceInfo', 'data': {'dimm': 'PCDIMMDeviceInfo'} }
> > +{ 'union': 'MemoryDeviceInfo',
> > +  'data': { 'dimm': 'PCDIMMDeviceInfo',
> > +'nvdimm': 'NVDIMMDeviceInfo'
> 
> Names aren't part of the interface; would it be better to rename
> PCDIMMDeviceInfo into something that can be generically shared between both
> the 'dimm' and 'nvdimm' branches without having to create a pointless
> subtype?
>

The purpose of this NVDIMMDeviceInfo is to introduce
MEMORY_DEVICE_INFO_KIND_NVDIMM, which can be used to distinguish
NVDIMM from PC-DIMM in the list returned by query-memory-devices.

If the 'data' of NVDIMMDeviceInfo were filled with NVDIMM-specific
information (there is some), would that make this type less
pointless?

Thanks,
Haozhong



Re: [Qemu-devel] [RFC QEMU PATCH v4 03/10] hostmem-xen: add a host memory backend for Xen

2018-03-04 Thread Haozhong Zhang
On 03/02/18 11:50 +, Anthony PERARD wrote:
> On Wed, Feb 28, 2018 at 03:56:54PM +0800, Haozhong Zhang wrote:
> > On 02/27/18 16:41 +, Anthony PERARD wrote:
> > > On Thu, Dec 07, 2017 at 06:18:05PM +0800, Haozhong Zhang wrote:
> > > > @@ -108,7 +109,10 @@ void pc_dimm_memory_plug(DeviceState *dev, 
> > > > MemoryHotplugState *hpms,
> > > >  }
> > > >  
> > > >  memory_region_add_subregion(>mr, addr - hpms->base, mr);
> > > > -vmstate_register_ram(vmstate_mr, dev);
> > > > +/* memory-backend-xen is not backed by RAM. */
> > > > +if (!xen_enabled()) {
> > > 
> > > Is it possible to have the same condition as the one used in
> > > host_memory_backend_memory_complete? i.e. base on whether the memory
> > > region is mapped or not (backend->mr.ram_block).
> > 
> > Like "if (!xen_enabled() || backend->mr.ram_block))"? No, it will mute
> > the abortion (vmstate_register_ram --> qemu_ram_set_idstr ) caused by
> > the case that !backend->mr.ram_block in the non-xen environment.
> 
> In non-xen environment, vmstate_register_ram() will be called, because
> !xen_enabled() is true, it would not matter if there is a ram_block or
> not.

Sorry, I really meant 'if (backend->mr.ram_block)', which may mask the
abort in a non-Xen environment. 'if (!xen_enabled())' keeps the
original semantics in a non-Xen environment, so it's unlikely to break
the non-Xen usage.

Haozhong

> 
> But if there is a memory-backend that can run in a xen environment that
> have a ram_block, vmstate_register_ram would not be called in the
> origial patch, but if we use (!xen_enabled() || vmstate_mr->ram_block)
> as condition then vmstate_register_ram will be called.
> 
> Is this make sense?
> 
> > > > +vmstate_register_ram(vmstate_mr, dev);
> > > > +}
> > > >  numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
> > > >  
> > > >  out:
> 
> -- 
> Anthony PERARD
> 



[Qemu-devel] [PATCH v3 1/5] pc-dimm: refactor qmp_pc_dimm_device_list

2018-03-04 Thread Haozhong Zhang
Use pc_dimm_built_list to hide recursive callbacks from callers.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/mem/pc-dimm.c | 83 +---
 hw/ppc/spapr.c   |  3 +-
 include/hw/mem/pc-dimm.h |  2 +-
 numa.c   |  4 +--
 qmp.c|  7 +---
 stubs/qmp_pc_dimm.c  |  4 +--
 6 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 6e74b61cb6..4d050fe2cd 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -162,45 +162,6 @@ uint64_t get_plugged_memory_size(void)
 return pc_existing_dimms_capacity(&error_abort);
 }
 
-int qmp_pc_dimm_device_list(Object *obj, void *opaque)
-{
-MemoryDeviceInfoList ***prev = opaque;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
-MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
-DeviceClass *dc = DEVICE_GET_CLASS(obj);
-PCDIMMDevice *dimm = PC_DIMM(obj);
-
-if (dev->id) {
-di->has_id = true;
-di->id = g_strdup(dev->id);
-}
-di->hotplugged = dev->hotplugged;
-di->hotpluggable = dc->hotpluggable;
-di->addr = dimm->addr;
-di->slot = dimm->slot;
-di->node = dimm->node;
-di->size = object_property_get_uint(OBJECT(dimm), 
PC_DIMM_SIZE_PROP,
-NULL);
-di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
-
-info->u.dimm.data = di;
-elem->value = info;
-elem->next = NULL;
-**prev = elem;
-*prev = &elem->next;
-}
-}
-
-object_child_foreach(obj, qmp_pc_dimm_device_list, opaque);
-return 0;
-}
-
 static int pc_dimm_slot2bitmap(Object *obj, void *opaque)
 {
 unsigned long *bitmap = opaque;
@@ -276,6 +237,50 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
 return 0;
 }
 
+MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
+{
+GSList *dimms = NULL, *item;
+MemoryDeviceInfoList *list = NULL, *prev = NULL;
+
+object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &dimms);
+
+for (item = dimms; item; item = g_slist_next(item)) {
+PCDIMMDevice *dimm = PC_DIMM(item->data);
+Object *obj = OBJECT(dimm);
+MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
+MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
+PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+DeviceClass *dc = DEVICE_GET_CLASS(obj);
+DeviceState *dev = DEVICE(obj);
+
+if (dev->id) {
+di->has_id = true;
+di->id = g_strdup(dev->id);
+}
+di->hotplugged = dev->hotplugged;
+di->hotpluggable = dc->hotpluggable;
+di->addr = dimm->addr;
+di->slot = dimm->slot;
+di->node = dimm->node;
+di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
+di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
+
+info->u.dimm.data = di;
+elem->value = info;
+elem->next = NULL;
+if (prev) {
+prev->next = elem;
+} else {
+list = elem;
+}
+prev = elem;
+}
+
+g_slist_free(dimms);
+
+return list;
+}
+
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
uint64_t address_space_size,
uint64_t *hint, uint64_t align, uint64_t size,
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 83c9d66dd5..68a81e47d2 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -731,8 +731,7 @@ static int spapr_populate_drconf_memory(sPAPRMachineState 
*spapr, void *fdt)
 }
 
 if (hotplug_lmb_start) {
-MemoryDeviceInfoList **prev = &dimms;
-qmp_pc_dimm_device_list(qdev_get_machine(), &prev);
+dimms = qmp_pc_dimm_device_list();
 }
 
 /* ibm,dynamic-memory */
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index d83b957829..1fc479281c 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -93,7 +93,7 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
 
 int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
 
-int qmp_pc_dimm_device_list(Object *obj, void *opaque);
+MemoryDeviceInfoList *qmp_pc_dimm_device_list(void);
 uint64_t pc_existing_dimms_capacity(Error **errp);
 uint64_t get_plugged_memory_size(void);
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
diff --git a/num

[Qemu-devel] [PATCH v3 5/5] tests/bios-tables-test: add test cases for DIMM proximity

2018-03-04 Thread Haozhong Zhang
QEMU now builds one SRAT memory affinity structure for each statically
plugged PC-DIMM and NVDIMM device, with the proximity domain specified
in the device option 'node', rather than only one SRAT memory affinity
structure covering the entire hotpluggable address space with the
proximity domain of the last node.

Add test cases on PC and Q35 machines with 3 proximity domains, and
one PC-DIMM and one NVDIMM attached to the second proximity domain.
Check whether the QEMU-built SRAT tables match the expected ones.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Igor Mammedov <imamm...@redhat.com>
---
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 136 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6710 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 416 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 136 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9394 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 416 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  33 +
 11 files changed, 33 insertions(+)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

diff --git a/tests/acpi-test-data/pc/APIC.dimmpxm 
b/tests/acpi-test-data/pc/APIC.dimmpxm
new file mode 100644
index 
..658d7e748e37540ff85a02f4391efc7eaae3c8b4
(GIT binary patch data for APIC.dimmpxm omitted)

diff --git a/tests/acpi-test-data/pc/DSDT.dimmpxm 
b/tests/acpi-test-data/pc/DSDT.dimmpxm
new file mode 100644
index 
..20e6433725bb3e70085cf6227f981106772bdaea
(GIT binary patch data for DSDT.dimmpxm omitted)

[Qemu-devel] [PATCH v3 3/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-04 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in the NFIT SPA range structure via the 'node'
property of '-device nvdimm' to a value different from the one in the
above SRAT memory affinity structure.

To resolve such a proximity domain mismatch, this patch builds one
SRAT memory affinity structure for each DIMM device present at boot
time, including both PC-DIMM and NVDIMM, with the proximity domain
specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or more
SRAT memory affinity structures with the proximity domain of the last
node, as before.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/i386/acpi-build.c | 60 
 1 file changed, 56 insertions(+), 4 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..2ca0317386 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,59 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+MemoryDeviceInfoList *info_list = qmp_pc_dimm_device_list();
+MemoryDeviceInfoList *info;
+MemoryDeviceInfo *mi;
+PCDIMMDeviceInfo *di;
+uint64_t end = base + len, cur, addr, size;
+int node;
+bool is_nvdimm;
+AcpiSratMemoryAffinity *numamem;
+MemoryAffinityFlags flags;
+
+for (cur = base, info = info_list;
+ cur < end;
+ cur += size, info = info->next) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!info) {
+build_srat_memory(numamem, cur, end - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+mi = info->value;
+is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
+di = !is_nvdimm ? mi->u.dimm.data :
+  qapi_NVDIMMDeviceInfo_base(mi->u.nvdimm.data);
+
+addr = di->addr;
+if (cur < addr) {
+build_srat_memory(numamem, cur, addr - cur, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+
+size = di->size;
+node = di->node;
+
+flags = MEM_AFFINITY_ENABLED;
+if (di->hotpluggable) {
+flags |= MEM_AFFINITY_HOTPLUGGABLE;
+}
+if (is_nvdimm) {
+flags |= MEM_AFFINITY_NON_VOLATILE;
+}
+
+build_srat_memory(numamem, addr, size, node, flags);
+}
+
+qapi_free_MemoryDeviceInfoList(info_list);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2487,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+   pcms->numa_nodes - 1);
 }
 
 build_header(linker, table_data,
-- 
2.14.1




[Qemu-devel] [PATCH v3 0/5] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-04 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in the NFIT SPA range structure via the 'node' property of
'-device nvdimm' to a value different from the one in the above SRAT
memory affinity structure.

To resolve this proximity domain mismatch, this series builds
one SRAT memory affinity structure for each DIMM device present at
boot time, including both PC-DIMM and NVDIMM, with the proximity
domain specified in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.


Changes in v3:
 * (Patch 1&2) Use qmp_pc_dimm_device_list to get information of DIMM
   devices and move it to separate patches.
 * (Patch 3) Replace while loop by a more readable for loop.
 * (Patch 3) Refactor the flag setting code.
 * (Patch 3) s/'static-plugged'/'present at boot time'/ in commit message.

Changes in v2:
 * Build SRAT memory affinity structures of PC-DIMM devices as well.
 * Add test cases.


Haozhong Zhang (5):
  pc-dimm: refactor qmp_pc_dimm_device_list
  qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList
  hw/acpi-build: build SRAT memory affinity structures for DIMM devices
  tests/bios-tables-test: allow setting extra machine options
  tests/bios-tables-test: add test cases for DIMM proximity

 hmp.c |  14 +++--
 hw/i386/acpi-build.c  |  60 +++--
 hw/mem/pc-dimm.c  |  99 --
 hw/ppc/spapr.c|   3 +-
 include/hw/mem/pc-dimm.h  |   2 +-
 numa.c|  23 
 qapi-schema.json  |  18 ++-
 qmp.c |   7 +--
 stubs/qmp_pc_dimm.c   |   4 +-
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 136 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6710 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 416 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 136 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9394 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 416 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  78 +--
 20 files changed, 225 insertions(+), 83 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

-- 
2.14.1




[Qemu-devel] [PATCH v3 4/5] tests/bios-tables-test: allow setting extra machine options

2018-03-04 Thread Haozhong Zhang
Some test cases may require machine options beyond those used in
the current test_acpi_one(), e.g., nvdimm test cases require the
machine option 'nvdimm=on'.
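
A rough sketch of how the two argument slots differ (plain printf instead of
g_strdup_printf; the nvdimm option strings are only examples, not the actual
test case added later in the series):

    #include <stdio.h>

    int main(void)
    {
        const char *machine = "q35";
        /* Machine options must be glued onto "-machine ..." with commas, so
         * they cannot simply go through the existing 'params' argument;
         * hence the separate 'extra_machine_opts' parameter. */
        const char *extra_machine_opts = "nvdimm=on";
        const char *params = "-object memory-backend-ram,id=mem0,size=128M "
                             "-device nvdimm,id=nv0,memdev=mem0,node=1";

        printf("-machine %s,accel=kvm:tcg,kernel-irqchip=off", machine);
        if (extra_machine_opts) {
            printf(",%s", extra_machine_opts);
        }
        printf(" -net none -display none %s\n", params);
        return 0;
    }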

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 tests/bios-tables-test.c | 45 +
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 65b271a173..d45181aa51 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -654,17 +654,22 @@ static void test_smbios_structs(test_data *data)
 }
 }
 
-static void test_acpi_one(const char *params, test_data *data)
+static void test_acpi_one(const char *extra_machine_opts,
+  const char *params, test_data *data)
 {
 char *args;
 
 /* Disable kernel irqchip to be able to override apic irq0. */
-args = g_strdup_printf("-machine %s,accel=%s,kernel-irqchip=off "
+args = g_strdup_printf("-machine %s,accel=%s,kernel-irqchip=off",
+   data->machine, "kvm:tcg");
+if (extra_machine_opts) {
+args = g_strdup_printf("%s,%s", args, extra_machine_opts);
+}
+args = g_strdup_printf("%s "
"-net none -display none %s "
"-drive id=hd0,if=none,file=%s,format=raw "
"-device ide-hd,drive=hd0 ",
-   data->machine, "kvm:tcg",
-   params ? params : "", disk);
+   args, params ? params : "", disk);
 
 qtest_start(args);
 
@@ -711,7 +716,7 @@ static void test_acpi_piix4_tcg(void)
 data.machine = MACHINE_PC;
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one(NULL, &data);
+test_acpi_one(NULL, NULL, &data);
 free_test_data();
 }
 
@@ -724,7 +729,7 @@ static void test_acpi_piix4_tcg_bridge(void)
 data.variant = ".bridge";
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one("-device pci-bridge,chassis_nr=1", );
+test_acpi_one(NULL, "-device pci-bridge,chassis_nr=1", );
 free_test_data();
 }
 
@@ -736,7 +741,7 @@ static void test_acpi_q35_tcg(void)
 data.machine = MACHINE_Q35;
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one(NULL, &data);
+test_acpi_one(NULL, NULL, &data);
 free_test_data();
 }
 
@@ -749,7 +754,7 @@ static void test_acpi_q35_tcg_bridge(void)
 data.variant = ".bridge";
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one("-device pci-bridge,chassis_nr=1",
+test_acpi_one(NULL, "-device pci-bridge,chassis_nr=1",
  &data);
 free_test_data();
 }
@@ -761,7 +766,8 @@ static void test_acpi_piix4_tcg_cphp(void)
 memset(&data, 0, sizeof(data));
 data.machine = MACHINE_PC;
 data.variant = ".cphp";
-test_acpi_one("-smp 2,cores=3,sockets=2,maxcpus=6"
+test_acpi_one(NULL,
+  "-smp 2,cores=3,sockets=2,maxcpus=6"
   " -numa node -numa node"
   " -numa dist,src=0,dst=1,val=21",
  &data);
@@ -775,7 +781,8 @@ static void test_acpi_q35_tcg_cphp(void)
 memset(&data, 0, sizeof(data));
 data.machine = MACHINE_Q35;
 data.variant = ".cphp";
-test_acpi_one(" -smp 2,cores=3,sockets=2,maxcpus=6"
+test_acpi_one(NULL,
+  " -smp 2,cores=3,sockets=2,maxcpus=6"
   " -numa node -numa node"
   " -numa dist,src=0,dst=1,val=21",
  &data);
@@ -795,7 +802,8 @@ static void test_acpi_q35_tcg_ipmi(void)
 data.variant = ".ipmibt";
 data.required_struct_types = ipmi_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(ipmi_required_struct_types);
-test_acpi_one("-device ipmi-bmc-sim,id=bmc0"
+test_acpi_one(NULL,
+  "-device ipmi-bmc-sim,id=bmc0"
   " -device isa-ipmi-bt,bmc=bmc0",
  &data);
 free_test_data();
@@ -813,7 +821,8 @@ static void test_acpi_piix4_tcg_ipmi(void)
 data.variant = ".ipmikcs";
 data.required_struct_types = ipmi_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(ipmi_required_struct_types);
-test_acpi_one("-device ipmi-bmc-sim,id=bmc0"
+test_acpi_one(NULL,
+  "-device ipmi-b

[Qemu-devel] [PATCH v3 2/5] qmp: distinguish PC-DIMM and NVDIMM in MemoryDeviceInfoList

2018-03-04 Thread Haozhong Zhang
QEMU may need to treat PC-DIMM and NVDIMM devices differently, e.g., when
deciding whether the non-volatile flag bit is needed in SRAT memory
affinity structures.

NVDIMMDeviceInfo, which inherits from PCDIMMDeviceInfo, is added to
union type MemoryDeviceInfo to record information of NVDIMM devices.
The NVDIMM-specific data is currently left empty and will be filled
when necessary in the future.
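
The resulting access pattern can be sketched standalone with hand-rolled
structs that only mimic the shape of the generated QAPI types (field names
and values are illustrative):

    #include <stdio.h>

    typedef struct {
        const char *id;
        int node;
        unsigned long long size;
    } PCDIMMDeviceInfo;

    typedef struct {
        PCDIMMDeviceInfo base;          /* NVDIMM-specific fields would follow */
    } NVDIMMDeviceInfo;

    typedef enum { KIND_DIMM, KIND_NVDIMM } MemoryDeviceKind;

    typedef struct {
        MemoryDeviceKind type;
        union {
            PCDIMMDeviceInfo *dimm;
            NVDIMMDeviceInfo *nvdimm;
        } u;
    } MemoryDeviceInfo;

    /* Plays the role of qapi_NVDIMMDeviceInfo_base() for the generated types. */
    static PCDIMMDeviceInfo *memory_device_info_base(MemoryDeviceInfo *mi)
    {
        return mi->type == KIND_NVDIMM ? &mi->u.nvdimm->base : mi->u.dimm;
    }

    int main(void)
    {
        PCDIMMDeviceInfo d = { "dimm0", 0, 1ULL << 30 };
        NVDIMMDeviceInfo n = { { "nvdimm0", 1, 2ULL << 30 } };
        MemoryDeviceInfo list[] = {
            { KIND_DIMM,   { .dimm = &d } },
            { KIND_NVDIMM, { .nvdimm = &n } },
        };

        for (unsigned i = 0; i < 2; i++) {
            PCDIMMDeviceInfo *di = memory_device_info_base(&list[i]);
            printf("%s: node %d, size %llu, non-volatile: %s\n", di->id,
                   di->node, di->size,
                   list[i].type == KIND_NVDIMM ? "yes" : "no");
        }
        return 0;
    }

Consumers that only need the common fields keep one code path, while the
kind still tells them whether the non-volatile flag applies.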

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hmp.c| 14 +++---
 hw/mem/pc-dimm.c | 20 ++--
 numa.c   | 19 +--
 qapi-schema.json | 18 +-
 4 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/hmp.c b/hmp.c
index ae86bfbade..3f06407c5e 100644
--- a/hmp.c
+++ b/hmp.c
@@ -2413,7 +2413,18 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 switch (value->type) {
 case MEMORY_DEVICE_INFO_KIND_DIMM:
 di = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+di = qapi_NVDIMMDeviceInfo_base(value->u.nvdimm.data);
+break;
+
+default:
+di = NULL;
+break;
+}
 
+if (di) {
 monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
MemoryDeviceInfoKind_str(value->type),
di->id ? di->id : "");
@@ -2426,9 +2437,6 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
di->hotplugged ? "true" : "false");
 monitor_printf(mon, "  hotpluggable: %s\n",
di->hotpluggable ? "true" : "false");
-break;
-default:
-break;
 }
 }
 }
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 4d050fe2cd..866ecc699a 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "qapi/error.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
@@ -249,10 +250,19 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 Object *obj = OBJECT(dimm);
 MemoryDeviceInfoList *elem = g_new0(MemoryDeviceInfoList, 1);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
-PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+PCDIMMDeviceInfo *di;
+NVDIMMDeviceInfo *ndi;
+bool is_nvdimm = object_dynamic_cast(obj, TYPE_NVDIMM);
 DeviceClass *dc = DEVICE_GET_CLASS(obj);
 DeviceState *dev = DEVICE(obj);
 
+if (!is_nvdimm) {
+di = g_new0(PCDIMMDeviceInfo, 1);
+} else {
+ndi = g_new0(NVDIMMDeviceInfo, 1);
+di = qapi_NVDIMMDeviceInfo_base(ndi);
+}
+
 if (dev->id) {
 di->has_id = true;
 di->id = g_strdup(dev->id);
@@ -265,7 +275,13 @@ MemoryDeviceInfoList *qmp_pc_dimm_device_list(void)
 di->size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
 di->memdev = object_get_canonical_path(OBJECT(dimm->hostmem));
 
-info->u.dimm.data = di;
+if (!is_nvdimm) {
+info->u.dimm.data = di;
+info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+} else {
+info->u.nvdimm.data = ndi;
+info->type = MEMORY_DEVICE_INFO_KIND_NVDIMM;
+}
 elem->value = info;
 elem->next = NULL;
 if (prev) {
diff --git a/numa.c b/numa.c
index c6734ceb8c..23c4371e51 100644
--- a/numa.c
+++ b/numa.c
@@ -529,18 +529,25 @@ static void numa_stat_memory_devices(NumaNodeMem 
node_mem[])
 
 if (value) {
 switch (value->type) {
-case MEMORY_DEVICE_INFO_KIND_DIMM: {
+case MEMORY_DEVICE_INFO_KIND_DIMM:
 pcdimm_info = value->u.dimm.data;
+break;
+
+case MEMORY_DEVICE_INFO_KIND_NVDIMM:
+pcdimm_info = qapi_NVDIMMDeviceInfo_base(value->u.nvdimm.data);
+break;
+
+default:
+pcdimm_info = NULL;
+break;
+}
+
+if (pcdimm_info) {
 node_mem[pcdimm_info->node].node_mem += pcdimm_info->size;
 if (pcdimm_info->hotpluggable && pcdimm_info->hotplugged) {
 node_mem[pcdimm_info->node].node_plugged_mem +=
 pcdimm_info->size;
 }
-break;
-}
-
-default:
-break;
 }
 }
 }
diff --git a/qapi-schema.json b/qapi-schema.json
index cd98a94388..1c2d281749 1006

Re: [Qemu-devel] [PATCH v2 1/3] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-01 Thread Haozhong Zhang
On 03/01/18 14:01 +0100, Igor Mammedov wrote:
> On Thu, 1 Mar 2018 19:56:51 +0800
> Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> 
> > On 03/01/18 11:42 +0100, Igor Mammedov wrote:
> > > On Wed, 28 Feb 2018 12:02:58 +0800
> > > Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> > >   
> > > > ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
> > > > domain of a NVDIMM SPA range must match with corresponding entry in
> > > > SRAT table.
> > > > 
> > > > The address ranges of vNVDIMM in QEMU are allocated from the
> > > > hot-pluggable address space, which is entirely covered by one SRAT
> > > > memory affinity structure. However, users can set the vNVDIMM
> > > > proximity domain in NFIT SPA range structure by the 'node' property of
> > > > '-device nvdimm' to a value different than the one in the above SRAT
> > > > memory affinity structure.
> > > > 
> > > > In order to solve such proximity domain mismatch, this patch builds
> > > > one SRAT memory affinity structure for each static-plugged DIMM device, 
> > > >  
> > > s/static-plugged/present at boot/
> > > since after hotplug and following reset SRAT will be recreated
> > > and include hotplugged DIMMs as well.  
> > 
> > Ah yes, I'll fix the message in the next version.
> > 
> > >   
> > > > including both PC-DIMM and NVDIMM, with the proximity domain specified
> > > > in '-device pc-dimm' or '-device nvdimm'.
> > > > 
> > > > The remaining hot-pluggable address space is covered by one or multiple
> > > > SRAT memory affinity structures with the proximity domain of the last
> > > > node as before.
> > > > 
> > > > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > > > ---
> > > >  hw/i386/acpi-build.c | 50 
> > > > 
> > > >  hw/mem/pc-dimm.c |  8 
> > > >  include/hw/mem/pc-dimm.h | 10 ++
> > > >  3 files changed, 64 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > > > index deb440f286..a88de06d8f 100644
> > > > --- a/hw/i386/acpi-build.c
> > > > +++ b/hw/i386/acpi-build.c
> > > > @@ -2323,6 +2323,49 @@ build_tpm2(GArray *table_data, BIOSLinker 
> > > > *linker, GArray *tcpalog)
> > > >  #define HOLE_640K_START  (640 * 1024)
> > > >  #define HOLE_640K_END   (1024 * 1024)
> > > >  
> > > > +static void build_srat_hotpluggable_memory(GArray *table_data, 
> > > > uint64_t base,
> > > > +   uint64_t len, int 
> > > > default_node)
> > > > +{
> > > > +GSList *dimms = pc_dimm_get_device_list();
> > > > +GSList *ent = dimms;
> > > > +PCDIMMDevice *dev;
> > > > +Object *obj;
> > > > +uint64_t end = base + len, addr, size;
> > > > +int node;
> > > > +AcpiSratMemoryAffinity *numamem;
> > > > +
> > > > +while (base < end) {  
> > > It's just matter of taste but wouldn't 'for' loop be better here?
> > > One can see start, end and next step from the begging.  
> > 
> > will switch to a for loop
> > 
> > >   
> > > > +numamem = acpi_data_push(table_data, sizeof *numamem);
> > > > +
> > > > +if (!ent) {
> > > > +build_srat_memory(numamem, base, end - base, default_node,
> > > > +  MEM_AFFINITY_HOTPLUGGABLE | 
> > > > MEM_AFFINITY_ENABLED);
> > > > +break;
> > > > +}
> > > > +
> > > > +dev = PC_DIMM(ent->data);
> > > > +obj = OBJECT(dev);
> > > > +addr = object_property_get_uint(obj, PC_DIMM_ADDR_PROP, NULL);
> > > > +size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
> > > > +node = object_property_get_uint(obj, PC_DIMM_NODE_PROP, NULL);
> > > > +
> > > > +if (base < addr) {
> > > > +build_srat_memory(numamem, base, addr - base, default_node,
> > > > +  MEM_AFFINITY_HOTPLUGGABLE | 
> > > > MEM_AFFINITY_ENABLED);
> > > >

Re: [Qemu-devel] [PATCH v2 1/3] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-03-01 Thread Haozhong Zhang
On 03/01/18 11:42 +0100, Igor Mammedov wrote:
> On Wed, 28 Feb 2018 12:02:58 +0800
> Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> 
> > ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
> > domain of a NVDIMM SPA range must match with corresponding entry in
> > SRAT table.
> > 
> > The address ranges of vNVDIMM in QEMU are allocated from the
> > hot-pluggable address space, which is entirely covered by one SRAT
> > memory affinity structure. However, users can set the vNVDIMM
> > proximity domain in NFIT SPA range structure by the 'node' property of
> > '-device nvdimm' to a value different than the one in the above SRAT
> > memory affinity structure.
> > 
> > In order to solve such proximity domain mismatch, this patch builds
> > one SRAT memory affinity structure for each static-plugged DIMM device,
> s/static-plugged/present at boot/
> since after hotplug and following reset SRAT will be recreated
> and include hotplugged DIMMs as well.

Ah yes, I'll fix the message in the next version.

> 
> > including both PC-DIMM and NVDIMM, with the proximity domain specified
> > in '-device pc-dimm' or '-device nvdimm'.
> > 
> > The remaining hot-pluggable address space is covered by one or multiple
> > SRAT memory affinity structures with the proximity domain of the last
> > node as before.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  hw/i386/acpi-build.c | 50 
> > 
> >  hw/mem/pc-dimm.c |  8 
> >  include/hw/mem/pc-dimm.h | 10 ++
> >  3 files changed, 64 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index deb440f286..a88de06d8f 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -2323,6 +2323,49 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
> > GArray *tcpalog)
> >  #define HOLE_640K_START  (640 * 1024)
> >  #define HOLE_640K_END   (1024 * 1024)
> >  
> > +static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t 
> > base,
> > +   uint64_t len, int default_node)
> > +{
> > +GSList *dimms = pc_dimm_get_device_list();
> > +GSList *ent = dimms;
> > +PCDIMMDevice *dev;
> > +Object *obj;
> > +uint64_t end = base + len, addr, size;
> > +int node;
> > +AcpiSratMemoryAffinity *numamem;
> > +
> > +while (base < end) {
> It's just matter of taste but wouldn't 'for' loop be better here?
> One can see start, end and next step from the begging.

will switch to a for loop

> 
> > +numamem = acpi_data_push(table_data, sizeof *numamem);
> > +
> > +if (!ent) {
> > +build_srat_memory(numamem, base, end - base, default_node,
> > +  MEM_AFFINITY_HOTPLUGGABLE | 
> > MEM_AFFINITY_ENABLED);
> > +break;
> > +}
> > +
> > +dev = PC_DIMM(ent->data);
> > +obj = OBJECT(dev);
> > +addr = object_property_get_uint(obj, PC_DIMM_ADDR_PROP, NULL);
> > +size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
> > +node = object_property_get_uint(obj, PC_DIMM_NODE_PROP, NULL);
> > +
> > +if (base < addr) {
> > +build_srat_memory(numamem, base, addr - base, default_node,
> > +  MEM_AFFINITY_HOTPLUGGABLE | 
> > MEM_AFFINITY_ENABLED);
> > +numamem = acpi_data_push(table_data, sizeof *numamem);
> > +}
> > +build_srat_memory(numamem, addr, size, node,
> > +  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED 
> > |
> Is NVDIMM hotplug supported in QEMU?
> If not we might need make MEM_AFFINITY_HOTPLUGGABLE conditional too.

Yes, it's supported.

> 
> > +  (object_dynamic_cast(obj, TYPE_NVDIMM) ?
> > +   MEM_AFFINITY_NON_VOLATILE : 0));
> it might be cleaner without inline flags duplication
> 
>   flags = MEM_AFFINITY_ENABLED;
>   ...
>   if (!ent) {
>   flags |= MEM_AFFINITY_HOTPLUGGABLE
>   }
>   ...
>   if (PCDIMMDeviceInfo::hotpluggable) { // see ***
>   flags |= MEM_AFFINITY_HOTPLUGGABLE
>   }
>   ...
>   if (object_dynamic_cast(obj, TYPE_NVDIMM))
>   flags |= MEM_AFFINITY_NON_VOLATILE
>   }

I'm fine with such changes, except ***

[..]
> > diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> > index 

Re: [Qemu-devel] [RFC QEMU PATCH v4 00/10] Implement vNVDIMM for Xen HVM guest

2018-02-28 Thread Haozhong Zhang
On 02/27/18 17:22 +, Anthony PERARD wrote:
> On Thu, Dec 07, 2017 at 06:18:02PM +0800, Haozhong Zhang wrote:
> > This is the QEMU part patches that works with the associated Xen
> > patches to enable vNVDIMM support for Xen HVM domains. Xen relies on
> > QEMU to build guest NFIT and NVDIMM namespace devices, and allocate
> > guest address space for vNVDIMM devices.
> 
> I've got other question, and maybe possible improvements.
> 
> When QEMU build the ACPI tables, it also initialize some MemoryRegion,
> which use more guest memory. Do you know if those regions are used with
> your patch series on Xen?

Yes, that's why dm_acpi_size is introduced.

> Otherwise, we could try to avoid their
> creation with this:
> In xenfv_machine_options()
> m->rom_file_has_mr = false;
> (setting this in xen_hvm_init() would probably be better, but I havn't
> try)

If my memory is correct, simply setting rom_file_has_mr to false does
not work (though I cannot recall the exact reason). I'll have a look
at the code to refresh my memory.

Haozhong

> 
> If this is possible, libxl would not need to allocate more memory for
> the guest (dm_acpi_size).
> 
> -- 
> Anthony PERARD



Re: [Qemu-devel] [RFC QEMU PATCH v4 05/10] xen-hvm: initialize fw_cfg interface

2018-02-28 Thread Haozhong Zhang
On 02/27/18 16:46 +, Anthony PERARD wrote:
> On Thu, Dec 07, 2017 at 06:18:07PM +0800, Haozhong Zhang wrote:
> > Xen is going to reuse QEMU to build ACPI of some devices (e.g., NFIT
> > and SSDT for NVDIMM) for HVM domains. The existing QEMU ACPI build
> > code requires a fw_cfg interface which will also be used to pass QEMU
> > built ACPI to Xen. Therefore, we need to initialize fw_cfg when any
> > ACPI is going to be built by QEMU.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> > Cc: Stefano Stabellini <sstabell...@kernel.org>
> > Cc: Anthony Perard <anthony.per...@citrix.com>
> > Cc: "Michael S. Tsirkin" <m...@redhat.com>
> > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > Cc: Richard Henderson <r...@twiddle.net>
> > Cc: Eduardo Habkost <ehabk...@redhat.com>
> > ---
> >  hw/i386/xen/xen-hvm.c | 12 
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
> > index fe01b7a025..4b29f4052b 100644
> > --- a/hw/i386/xen/xen-hvm.c
> > +++ b/hw/i386/xen/xen-hvm.c
> > @@ -14,6 +14,7 @@
> >  #include "hw/pci/pci.h"
> >  #include "hw/i386/pc.h"
> >  #include "hw/i386/apic-msidef.h"
> > +#include "hw/loader.h"
> >  #include "hw/xen/xen_common.h"
> >  #include "hw/xen/xen_backend.h"
> >  #include "qmp-commands.h"
> > @@ -1234,6 +1235,14 @@ static void xen_wakeup_notifier(Notifier *notifier, 
> > void *data)
> >  xc_set_hvm_param(xen_xc, xen_domid, HVM_PARAM_ACPI_S_STATE, 0);
> >  }
> >  
> > +static void xen_fw_cfg_init(PCMachineState *pcms)
> > +{
> 
> The fw_cfg interface might already be initialized, it is used for
> "direct kernel boot" on hvm. It is initialized in xen_load_linux().
>

xen_hvm_init() --> xen_fw_cfg_init() are called before
xen_load_linux(). I'll add a check in xen_load_linux() to avoid
redoing fw_cfg_init_io and rom_set_fw.

Haozhong

> > +FWCfgState *fw_cfg = fw_cfg_init_io(FW_CFG_IO_BASE);
> > +
> > +rom_set_fw(fw_cfg);
> > +pcms->fw_cfg = fw_cfg;
> > +}
> > +
> >  void xen_hvm_init(PCMachineState *pcms, MemoryRegion **ram_memory)
> >  {
> >  int i, rc;
> > @@ -1384,6 +1393,9 @@ void xen_hvm_init(PCMachineState *pcms, MemoryRegion 
> > **ram_memory)
> >  
> >  /* Disable ACPI build because Xen handles it */
> >  pcms->acpi_build_enabled = false;
> > +if (pcms->acpi_build_enabled) {
> > +xen_fw_cfg_init(pcms);
> > +}
> >  
> >  return;
> >  
> > -- 
> > 2.15.1
> > 
> 
> -- 
> Anthony PERARD



Re: [Qemu-devel] [RFC QEMU PATCH v4 03/10] hostmem-xen: add a host memory backend for Xen

2018-02-27 Thread Haozhong Zhang
On 02/27/18 16:41 +, Anthony PERARD wrote:
> On Thu, Dec 07, 2017 at 06:18:05PM +0800, Haozhong Zhang wrote:
> > diff --git a/backends/hostmem.c b/backends/hostmem.c
> > index ee2c2d5bfd..ba13a52994 100644
> > --- a/backends/hostmem.c
> > +++ b/backends/hostmem.c
> > @@ -12,6 +12,7 @@
> >  #include "qemu/osdep.h"
> >  #include "sysemu/hostmem.h"
> >  #include "hw/boards.h"
> > +#include "hw/xen/xen.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> >  #include "qapi-types.h"
> > @@ -277,6 +278,14 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
> > Error **errp)
> >  goto out;
> >  }
> >  
> > +/*
> > + * The backend storage of MEMORY_BACKEND_XEN is managed by Xen,
> > + * so no further work in this function is needed.
> > + */
> > +if (xen_enabled() && !backend->mr.ram_block) {
> > +goto out;
> > +}
> > +
> >  ptr = memory_region_get_ram_ptr(&backend->mr);
> >  sz = memory_region_size(&backend->mr);
> >  
> > diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> > index 66eace5a5c..dcbfce33d5 100644
> > --- a/hw/mem/pc-dimm.c
> > +++ b/hw/mem/pc-dimm.c
> > @@ -28,6 +28,7 @@
> >  #include "sysemu/kvm.h"
> >  #include "trace.h"
> >  #include "hw/virtio/vhost.h"
> > +#include "hw/xen/xen.h"
> >  
> >  typedef struct pc_dimms_capacity {
> >   uint64_t size;
> > @@ -108,7 +109,10 @@ void pc_dimm_memory_plug(DeviceState *dev, 
> > MemoryHotplugState *hpms,
> >  }
> >  
> >  memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> > -vmstate_register_ram(vmstate_mr, dev);
> > +/* memory-backend-xen is not backed by RAM. */
> > +if (!xen_enabled()) {
> 
> Is it possible to have the same condition as the one used in
> host_memory_backend_memory_complete? i.e. base on whether the memory
> region is mapped or not (backend->mr.ram_block).

Like "if (!xen_enabled() || backend->mr.ram_block))"? No, it will mute
the abortion (vmstate_register_ram --> qemu_ram_set_idstr ) caused by
the case that !backend->mr.ram_block in the non-xen environment.

Haozhong

> 
> > +vmstate_register_ram(vmstate_mr, dev);
> > +}
> >  numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
> >  
> >  out:
> > -- 
> > 2.15.1
> > 
> 
> -- 
> Anthony PERARD



Re: [Qemu-devel] [RFC QEMU PATCH v4 02/10] xen-hvm: create the hotplug memory region on Xen

2018-02-27 Thread Haozhong Zhang
On 02/27/18 16:37 +, Anthony PERARD wrote:
> On Thu, Dec 07, 2017 at 06:18:04PM +0800, Haozhong Zhang wrote:
> > The guest physical address of vNVDIMM is allocated from the hotplug
> > memory region, which is not created when QEMU is used as Xen device
> > model. In order to use vNVDIMM for Xen HVM domains, this commit reuses
> > the code for pc machine type to create the hotplug memory region for
> > Xen HVM domains.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> > Cc: "Michael S. Tsirkin" <m...@redhat.com>
> > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > Cc: Richard Henderson <r...@twiddle.net>
> > Cc: Eduardo Habkost <ehabk...@redhat.com>
> > Cc: Stefano Stabellini <sstabell...@kernel.org>
> > Cc: Anthony Perard <anthony.per...@citrix.com>
> > ---
> >  hw/i386/pc.c  | 86 
> > ---
> >  hw/i386/xen/xen-hvm.c |  2 ++
> >  include/hw/i386/pc.h  |  1 +
> >  3 files changed, 51 insertions(+), 38 deletions(-)
> > 
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index 186545d2a4..9f46c8df79 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -1315,6 +1315,53 @@ void xen_load_linux(PCMachineState *pcms)
> >  pcms->fw_cfg = fw_cfg;
> >  }
> >  
> > +void pc_memory_hotplug_init(PCMachineState *pcms, MemoryRegion 
> > *system_memory)
> 
> It might be better to have a separate patch which move the code into a 
> function.

will move it to a separate patch

> 
> > +{
> > +MachineState *machine = MACHINE(pcms);
> > +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> > +ram_addr_t hotplug_mem_size = machine->maxram_size - machine->ram_size;
> > +
> > +if (!pcmc->has_reserved_memory || machine->ram_size >= 
> > machine->maxram_size)
> > +return;
> > +
> > +if (memory_region_size(&pcms->hotplug_memory.mr)) {
> 
> This new check looks like to catch programming error, rather than user
> error. Would it be better to be an assert instead?

Well, this was a debugging check and I forgot to remove it before
sending the patch. I'll drop it in the next version.

Thanks,
Haozhong

> 
> > +error_report("hotplug memory region has been initialized");
> > +exit(EXIT_FAILURE);
> > +}
> > +
> 
> -- 
> Anthony PERARD



[Qemu-devel] [PATCH v4 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM

2018-02-27 Thread Haozhong Zhang
When loading a normal page to persistent memory, load its data via the
libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
with a call to pmem_drain() at the end of memory loading, we can
guarantee all those normal pages are persistently loaded to PMEM.
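
For reference, the nodrain-copies-plus-one-drain pattern looks roughly like
this as a standalone libpmem program (the file path, sizes and data are made
up; link with -lpmem):

    #include <stdio.h>
    #include <string.h>
    #include <libpmem.h>

    #define PAGE 4096

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;
        /* Map (or create) a 64-page file; the path is illustrative only. */
        void *dst = pmem_map_file("/mnt/pmem/test", 64 * PAGE,
                                  PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
        if (!dst) {
            perror("pmem_map_file");
            return 1;
        }

        char page[PAGE];
        memset(page, 0xab, sizeof(page));

        for (size_t i = 0; i < 64; i++) {
            if (is_pmem) {
                /* copy without draining after each page ... */
                pmem_memcpy_nodrain((char *)dst + i * PAGE, page, PAGE);
            } else {
                memcpy((char *)dst + i * PAGE, page, PAGE);
            }
        }
        if (is_pmem) {
            pmem_drain();                   /* ... one drain makes them all durable */
        } else {
            pmem_msync(dst, mapped_len);    /* msync fallback for non-pmem files */
        }

        pmem_unmap(dst, mapped_len);
        return 0;
    }

Hoisting the single drain out of the loop is exactly what makes the nodrain
variant cheaper than per-page persist calls.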

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/migration/qemu-file-types.h |  2 ++
 include/qemu/pmem.h |  1 +
 migration/qemu-file.c   | 29 +++--
 migration/ram.c |  2 +-
 stubs/pmem.c|  5 +
 tests/Makefile.include  |  2 +-
 6 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/migration/qemu-file-types.h 
b/include/migration/qemu-file-types.h
index bd6d7dd7f9..c7c3f665f9 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -33,6 +33,8 @@ void qemu_put_byte(QEMUFile *f, int v);
 void qemu_put_be16(QEMUFile *f, unsigned int v);
 void qemu_put_be32(QEMUFile *f, unsigned int v);
 void qemu_put_be64(QEMUFile *f, uint64_t v);
+size_t qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size,
+  bool is_pmem);
 size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
 
 int qemu_get_byte(QEMUFile *f);
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index ce96379f3c..127b87c326 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -16,6 +16,7 @@
 #include <libpmem.h>
 #else  /* !CONFIG_LIBPMEM */
 
+void *pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len);
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
 void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
 void pmem_drain(void);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..d19f677796 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -26,6 +26,7 @@
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
+#include "qemu/pmem.h"
 #include "migration.h"
 #include "qemu-file.h"
 #include "trace.h"
@@ -471,18 +472,13 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, 
size_t size, size_t offset)
 return size;
 }
 
-/*
- * Read 'size' bytes of data from the file into buf.
- * 'size' can be larger than the internal buffer.
- *
- * It will return size bytes unless there was an error, in which case it will
- * return as many as it managed to read (assuming blocking fd's which
- * all current QEMUFile are)
- */
-size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+size_t qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size,
+  bool is_pmem)
 {
 size_t pending = size;
 size_t done = 0;
+void *(*memcpy_func)(void *d, const void *s, size_t n) =
+is_pmem ? pmem_memcpy_nodrain : memcpy;
 
 while (pending > 0) {
 size_t res;
@@ -492,7 +488,7 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 if (res == 0) {
 return done;
 }
-memcpy(buf, src, res);
+memcpy_func(buf, src, res);
 qemu_file_skip(f, res);
 buf += res;
 pending -= res;
@@ -501,6 +497,19 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 return done;
 }
 
+/*
+ * Read 'size' bytes of data from the file into buf.
+ * 'size' can be larger than the internal buffer.
+ *
+ * It will return size bytes unless there was an error, in which case it will
+ * return as many as it managed to read (assuming blocking fd's which
+ * all current QEMUFile are)
+ */
+size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+{
+return qemu_get_buffer_common(f, buf, size, false);
+}
+
 /*
  * Read 'size' bytes of data from the file.
  * 'size' can be larger than the internal buffer.
diff --git a/migration/ram.c b/migration/ram.c
index 3904ceee79..ea2ad7dff0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2959,7 +2959,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_PAGE:
-qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+qemu_get_buffer_common(f, host, TARGET_PAGE_SIZE, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_COMPRESS_PAGE:
diff --git a/stubs/pmem.c b/stubs/pmem.c
index a65b3bfc6b..e172f31174 100644
--- a/stubs/pmem.c
+++ b/stubs/pmem.c
@@ -26,3 +26,8 @@ void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
 void pmem_drain(void)
 {
 }
+
+void *pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 577eb573a2..37bb85f591 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -637,7 +637,7 @@ tests/test-qdev-global-props$(EXESUF): 
tests/test-qdev-global-props.o \
$(test-qapi-obj-y)
 

Re: [Qemu-devel] [PATCH v4 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-02-27 Thread Haozhong Zhang
On 02/28/18 15:25 +0800, Haozhong Zhang wrote:
> QEMU writes to vNVDIMM backends in the vNVDIMM label emulation and
> live migration. If the backend is on the persistent memory, QEMU needs
> to take proper operations to ensure its writes persistent on the
> persistent memory. Otherwise, a host power failure may result in the
> loss the guest data on the persistent memory.
>


> This v3 patch series is based on Marcel's patch "mem: add share
> parameter to memory-backend-ram" [1] because of the changes in patch 1.
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03858.html

I forgot to remove this part. v4 can be applied on the current master
branch now because above [1] has already been merged.



[Qemu-devel] [PATCH v4 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation

2018-02-27 Thread Haozhong Zhang
Guest writes to vNVDIMM labels are intercepted and performed on the
backend by QEMU. When the backend is real persistent memory, QEMU
needs to perform the proper operations to ensure its write persistence on the
persistent memory. Otherwise, a host power failure may result in the
loss of guest label configurations.
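
A minimal standalone sketch of the persist-in-one-call variant used here for
the small, infrequent label writes (the path and label contents are made up;
link with -lpmem):

    #include <stdio.h>
    #include <string.h>
    #include <libpmem.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;
        char *base = pmem_map_file("/mnt/pmem/labels", 128 * 1024,
                                   PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
        if (!base) {
            perror("pmem_map_file");
            return 1;
        }

        const char label[] = "namespace-label-v1";
        if (is_pmem) {
            /* copy + flush + drain in one shot; fine for small, rare writes */
            pmem_memcpy_persist(base, label, sizeof(label));
        } else {
            memcpy(base, label, sizeof(label));
            pmem_msync(base, sizeof(label));    /* msync fallback for non-pmem */
        }

        pmem_unmap(base, mapped_len);
        return 0;
    }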

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/mem/nvdimm.c |  9 -
 include/qemu/pmem.h | 23 +++
 stubs/Makefile.objs |  1 +
 stubs/pmem.c| 19 +++
 4 files changed, 51 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/pmem.h
 create mode 100644 stubs/pmem.c

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 61e677f92f..18861d1a7a 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/pmem.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
@@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 {
 MemoryRegion *mr;
 PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
+"pmem", NULL);
 uint64_t backend_offset;
 
 nvdimm_validate_rw_label_data(nvdimm, size, offset);
 
-memcpy(nvdimm->label_data + offset, buf, size);
+if (!is_pmem) {
+memcpy(nvdimm->label_data + offset, buf, size);
+} else {
+pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
+}
 
 mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
 backend_offset = memory_region_size(mr) - nvdimm->label_size + offset;
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
new file mode 100644
index 00..16f5b2653a
--- /dev/null
+++ b/include/qemu/pmem.h
@@ -0,0 +1,23 @@
+/*
+ * QEMU header file for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zh...@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_PMEM_H
+#define QEMU_PMEM_H
+
+#ifdef CONFIG_LIBPMEM
+#include <libpmem.h>
+#else  /* !CONFIG_LIBPMEM */
+
+void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
+
+#endif /* CONFIG_LIBPMEM */
+
+#endif /* !QEMU_PMEM_H */
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 2d59d84091..ba944b9739 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -43,3 +43,4 @@ stub-obj-y += xen-common.o
 stub-obj-y += xen-hvm.o
 stub-obj-y += pci-host-piix.o
 stub-obj-y += ram-block.o
+stub-obj-$(call lnot,$(CONFIG_LIBPMEM)) += pmem.o
\ No newline at end of file
diff --git a/stubs/pmem.c b/stubs/pmem.c
new file mode 100644
index 00..03d990e571
--- /dev/null
+++ b/stubs/pmem.c
@@ -0,0 +1,19 @@
+/*
+ * Stubs for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zh...@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include <string.h>
+
+#include "qemu/pmem.h"
+
+void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
-- 
2.14.1




[Qemu-devel] [PATCH v4 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-27 Thread Haozhong Zhang
When loading a compressed page to persistent memory, flush the CPU cache
after the data is decompressed. Combined with a call to pmem_drain()
at the end of memory loading, we can guarantee those compressed pages
are persistently loaded to PMEM.
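
Since the decompressor writes into the mapping itself, pmem_memcpy_nodrain()
cannot be used for these pages; the pattern is flush-what-was-written, then
one drain at the end. A standalone sketch (the decompressor is faked with
memset; the path and sizes are made up; link with -lpmem):

    #include <stdio.h>
    #include <string.h>
    #include <libpmem.h>

    /* Stand-in for zlib's uncompress() storing directly into the mapping. */
    static size_t fake_decompress(char *dst, size_t dst_len)
    {
        memset(dst, 'x', dst_len);
        return dst_len;
    }

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;
        char *dst = pmem_map_file("/mnt/pmem/pages", 16 * 4096,
                                  PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
        if (!dst) {
            perror("pmem_map_file");
            return 1;
        }

        for (int i = 0; i < 16; i++) {
            size_t len = fake_decompress(dst + i * 4096, 4096);
            if (is_pmem) {
                pmem_flush(dst + i * 4096, len);   /* flush, but don't drain yet */
            }
        }
        if (is_pmem) {
            pmem_drain();                          /* one drain for all pages */
        } else {
            pmem_msync(dst, mapped_len);
        }

        pmem_unmap(dst, mapped_len);
        return 0;
    }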

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  1 +
 migration/ram.c | 16 +++-
 stubs/pmem.c|  4 
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 127b87c326..120439ecb8 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -20,6 +20,7 @@ void *pmem_memcpy_nodrain(void *pmemdest, const void *src, 
size_t len);
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
 void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
 void pmem_drain(void);
+void pmem_flush(const void *addr, size_t len);
 
 #endif /* CONFIG_LIBPMEM */
 
diff --git a/migration/ram.c b/migration/ram.c
index ea2ad7dff0..37f3c39cee 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -276,6 +276,7 @@ struct DecompressParam {
 void *des;
 uint8_t *compbuf;
 int len;
+bool is_pmem;
 };
 typedef struct DecompressParam DecompressParam;
 
@@ -2498,7 +2499,7 @@ static void *do_data_decompress(void *opaque)
 DecompressParam *param = opaque;
 unsigned long pagesize;
 uint8_t *des;
-int len;
+int len, rc;
 
 qemu_mutex_lock(&param->mutex);
 while (!param->quit) {
@@ -2514,8 +2515,11 @@ static void *do_data_decompress(void *opaque)
  * not a problem because the dirty page will be retransferred
  * and uncompress() won't break the data in other pages.
  */
-uncompress((Bytef *)des, &pagesize,
-   (const Bytef *)param->compbuf, len);
+rc = uncompress((Bytef *)des, &pagesize,
+(const Bytef *)param->compbuf, len);
+if (rc == Z_OK && param->is_pmem) {
+pmem_flush(des, len);
+}
 
 qemu_mutex_lock(&decomp_done_lock);
 param->done = true;
@@ -2601,7 +2605,8 @@ static void compress_threads_load_cleanup(void)
 }
 
 static void decompress_data_with_multi_threads(QEMUFile *f,
-   void *host, int len)
+   void *host, int len,
+   bool is_pmem)
 {
 int idx, thread_count;
 
@@ -2615,6 +2620,7 @@ static void decompress_data_with_multi_threads(QEMUFile 
*f,
 qemu_get_buffer(f, decomp_param[idx].compbuf, len);
 decomp_param[idx].des = host;
 decomp_param[idx].len = len;
+decomp_param[idx].is_pmem = is_pmem;
 qemu_cond_signal(&decomp_param[idx].cond);
 qemu_mutex_unlock(&decomp_param[idx].mutex);
 break;
@@ -2969,7 +2975,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ret = -EINVAL;
 break;
 }
-decompress_data_with_multi_threads(f, host, len);
+decompress_data_with_multi_threads(f, host, len, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
diff --git a/stubs/pmem.c b/stubs/pmem.c
index e172f31174..cfab830131 100644
--- a/stubs/pmem.c
+++ b/stubs/pmem.c
@@ -31,3 +31,7 @@ void *pmem_memcpy_nodrain(void *pmemdest, const void *src, 
size_t len)
 {
 return memcpy(pmemdest, src, len);
 }
+
+void pmem_flush(const void *addr, size_t len)
+{
+}
-- 
2.14.1




[Qemu-devel] [PATCH v4 8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

2018-02-27 Thread Haozhong Zhang
When loading an xbzrle-encoded page to persistent memory, load the data
via the libpmem function pmem_memcpy_nodrain() instead of memcpy().
Combined with a call to pmem_drain() at the end of memory loading, we
can guarantee those xbzrle-encoded pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 migration/ram.c| 6 +++---
 migration/xbzrle.c | 8 ++--
 migration/xbzrle.h | 3 ++-
 tests/Makefile.include | 2 +-
 tests/test-xbzrle.c| 4 ++--
 5 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 37f3c39cee..70b196c4f5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2391,7 +2391,7 @@ static void ram_save_pending(QEMUFile *f, void *opaque, 
uint64_t max_size,
 }
 }
 
-static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
+static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host, bool is_pmem)
 {
 unsigned int xh_len;
 int xh_flags;
@@ -2417,7 +2417,7 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void 
*host)
 
 /* decode RLE */
 if (xbzrle_decode_buffer(loaded_data, xh_len, host,
- TARGET_PAGE_SIZE) == -1) {
+ TARGET_PAGE_SIZE, is_pmem) == -1) {
 error_report("Failed to load XBZRLE page - decode error!");
 return -1;
 }
@@ -2979,7 +2979,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
-if (load_xbzrle(f, addr, host) < 0) {
+if (load_xbzrle(f, addr, host, is_pmem) < 0) {
 error_report("Failed to decompress XBZRLE page at "
  RAM_ADDR_FMT, addr);
 ret = -EINVAL;
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..ca713c3697 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -12,6 +12,7 @@
  */
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/pmem.h"
 #include "xbzrle.h"
 
 /*
@@ -126,11 +127,14 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t 
*new_buf, int slen,
 return d;
 }
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen,
+ bool is_pmem)
 {
 int i = 0, d = 0;
 int ret;
 uint32_t count = 0;
+void *(*memcpy_func)(void *d, const void *s, size_t n) =
+is_pmem ? pmem_memcpy_nodrain : memcpy;
 
 while (i < slen) {
 
@@ -167,7 +171,7 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t 
*dst, int dlen)
 return -1;
 }
 
-memcpy(dst + d, src + i, count);
+memcpy_func(dst + d, src + i, count);
 d += count;
 i += count;
 }
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..f18f679f47 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -17,5 +17,6 @@
 int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
  uint8_t *dst, int dlen);
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen,
+ bool is_pmem);
 #endif
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 37bb85f591..be5b7e484b 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -616,7 +616,7 @@ tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o 
$(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) 
$(test-crypto-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
-tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o 
migration/page_cache.o $(test-util-obj-y)
+tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o 
migration/page_cache.o stubs/pmem.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o 
$(test-util-obj-y)
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
diff --git a/tests/test-xbzrle.c b/tests/test-xbzrle.c
index f5e08de91e..9afa0c4bcb 100644
--- a/tests/test-xbzrle.c
+++ b/tests/test-xbzrle.c
@@ -101,7 +101,7 @@ static void test_encode_decode_1_byte(void)
PAGE_SIZE);
 g_assert(dlen == (uleb128_encode_small(&buf[0], 4095) + 2));
 
-rc = xbzrle_decode_buffer(compressed, dlen, buffer, PAGE_SIZE);
+rc = xbzrle_decode_buffer(compressed, dlen, buffer, PAGE_SIZE, false);
 g_assert(rc == PAGE_SIZE);
 g_assert(memcmp(test, buffer, PAGE_SIZE) == 0);
 
@@ -156,7 +156,7 @@ static void encode_decode_range(void)
 dlen = xbzrle_encode_buffer(test, buffer, PAGE_SIZE, compressed,
 PAGE_SIZE);
 

[Qemu-devel] [PATCH v4 3/8] configure: add libpmem support

2018-02-27 Thread Haozhong Zhang
Add a pair of configure options --{enable,disable}-libpmem to control
whether QEMU is compiled with PMDK libpmem [1].

QEMU may write to the host persistent memory (e.g. in vNVDIMM label
emulation and live migration), so it must perform the proper operations
to ensure the persistence of its own writes. Depending on the CPU
models and available instructions, the optimal operations can vary [2].
PMDK libpmem has already implemented those operations for multiple CPU
models (x86 and ARM) and the logic to select the optimal ones, so QEMU
can just use libpmem rather than re-implement them.

[1] PMDK (formerly known as NVML), https://github.com/pmem/pmdk/
[2] 
https://github.com/pmem/pmdk/blob/38bfa652721a37fd94c0130ce0e3f5d8baa3ed40/src/libpmem/pmem.c#L33

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 configure | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/configure b/configure
index 39f3a43001..78e10f6d6d 100755
--- a/configure
+++ b/configure
@@ -450,6 +450,7 @@ jemalloc="no"
 replication="yes"
 vxhs=""
 libxml2=""
+libpmem=""
 
 supported_cpu="no"
 supported_os="no"
@@ -1360,6 +1361,10 @@ for opt do
   ;;
   --disable-git-update) git_update=no
   ;;
+  --enable-libpmem) libpmem=yes
+  ;;
+  --disable-libpmem) libpmem=no
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1612,6 +1617,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   crypto-afalgLinux AF_ALG crypto backend driver
   vhost-user  vhost-user support
   capstonecapstone disassembler support
+  libpmem libpmem support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -5347,6 +5353,30 @@ EOF
   fi
 fi
 
+##
+# check for libpmem
+
+if test "$libpmem" != "no"; then
+  cat > $TMPC <<EOF
+#include <libpmem.h>
+int main(void)
+{
+  pmem_is_pmem(0, 0);
+  return 0;
+}
+EOF
+  libpmem_libs="-lpmem"
+  if compile_prog "" "$libpmem_libs" ; then
+libs_softmmu="$libpmem_libs $libs_softmmu"
+libpmem="yes"
+  else
+if test "$libpmem" = "yes" ; then
+  feature_not_found "libpmem" "Install nvml or pmdk"
+fi
+libpmem="no"
+  fi
+fi
+
 ##
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -5817,6 +5847,7 @@ echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
 echo "VxHS block device $vxhs"
 echo "capstone  $capstone"
+echo "libpmem support   $libpmem"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -6542,6 +6573,10 @@ if test "$vxhs" = "yes" ; then
   echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
 fi
 
+if test "$libpmem" = "yes" ; then
+  echo "CONFIG_LIBPMEM=y" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
-- 
2.14.1




[Qemu-devel] [PATCH v4 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-27 Thread Haozhong Zhang
When loading a zero page, check whether it will be loaded to
persistent memory. If yes, load it via the libpmem function
pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
end of RAM loading, we can guarantee all those zero pages are
persistently loaded.

Depending on the host HW/SW configuration, pmem_drain() can be an
"sfence" instruction.  Therefore, to avoid unnecessary overhead, we do
not call pmem_drain() after each pmem_memset_nodrain(), nor do we use
pmem_memset_persist() (which is equivalent to pmem_memset_nodrain() +
pmem_drain()).

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  2 ++
 migration/ram.c | 25 +
 migration/ram.h |  2 +-
 migration/rdma.c|  2 +-
 stubs/pmem.c|  9 +
 5 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 16f5b2653a..ce96379f3c 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -17,6 +17,8 @@
 #else  /* !CONFIG_LIBPMEM */
 
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
+void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
+void pmem_drain(void);
 
 #endif /* CONFIG_LIBPMEM */
 
diff --git a/migration/ram.c b/migration/ram.c
index 5e33e5cc79..3904ceee79 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -51,6 +51,7 @@
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
 #include "migration/block.h"
+#include "qemu/pmem.h"
 
 /***/
 /* ram save/restore */
@@ -2479,11 +2480,16 @@ static inline void *host_from_ram_block_offset(RAMBlock 
*block,
  * @host: host address for the zero page
  * @ch: what the page is filled from.  We only support zero
  * @size: size of the zero page
+ * @is_pmem: whether @host is in the persistent memory
  */
-void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
+void ram_handle_compressed(void *host, uint8_t ch, uint64_t size, bool is_pmem)
 {
 if (ch != 0 || !is_zero_range(host, size)) {
-memset(host, ch, size);
+if (!is_pmem) {
+memset(host, ch, size);
+} else {
+pmem_memset_nodrain(host, ch, size);
+}
 }
 }
 
@@ -2839,6 +2845,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 bool postcopy_running = postcopy_is_running();
 /* ADVISE is earlier, it shows the source has the postcopy capability on */
 bool postcopy_advised = postcopy_is_advised();
+bool need_pmem_drain = false;
 
 seq_iter++;
 
@@ -2864,6 +2871,8 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ram_addr_t addr, total_ram_bytes;
 void *host = NULL;
 uint8_t ch;
+RAMBlock *block = NULL;
+bool is_pmem = false;
 
 addr = qemu_get_be64(f);
 flags = addr & ~TARGET_PAGE_MASK;
@@ -2880,7 +2889,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
  RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-RAMBlock *block = ram_block_from_stream(f, flags);
+block = ram_block_from_stream(f, flags);
 
 host = host_from_ram_block_offset(block, addr);
 if (!host) {
@@ -2890,6 +2899,9 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 ramblock_recv_bitmap_set(block, host);
 trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
+
+is_pmem = ramblock_is_pmem(block);
+need_pmem_drain = need_pmem_drain || is_pmem;
 }
 
 switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -2943,7 +2955,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 case RAM_SAVE_FLAG_ZERO:
 ch = qemu_get_byte(f);
-ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
+ram_handle_compressed(host, ch, TARGET_PAGE_SIZE, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_PAGE:
@@ -2986,6 +2998,11 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 
 wait_for_decompress_done();
+
+if (need_pmem_drain) {
+pmem_drain();
+}
+
 rcu_read_unlock();
 trace_ram_load_complete(ret, seq_iter);
 return ret;
diff --git a/migration/ram.h b/migration/ram.h
index f3a227b4fc..18934ae9e4 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -57,7 +57,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms);
 int ram_discard_range(const char *block_name, uint64_t start, size_t length);
 int ram_postcopy_incoming_init(MigrationIncomingState *mis);
 
-void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
+void ram_handle_compressed(void *host, uint8_t ch, uint64_t size, bool 
is_pmem);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bit

[Qemu-devel] [PATCH v4 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters

2018-02-27 Thread Haozhong Zhang
As more flag parameters besides the existing 'share' are going to be
added to the following functions
memory_region_init_ram_from_file
qemu_ram_alloc_from_fd
qemu_ram_alloc_from_file
, let's switch them to a single 'flags' parameter so as to ease future
flag additions.

The existing 'share' flag is converted to the QEMU_RAM_SHARE bit in
flags, and other flag bits are ignored by the above functions right now.
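
A tiny sketch of the resulting calling convention (the QEMU_RAM_PMEM bit
value is an assumption here; that flag is introduced by a later patch in
this series):

    #include <stdint.h>
    #include <stdio.h>

    #define QEMU_RAM_SHARE  (1UL << 0)
    #define QEMU_RAM_PMEM   (1UL << 1)   /* assumed value; added later in the series */

    /* New properties become bits in 'flags' instead of extra bool parameters. */
    static void alloc_from_file(const char *path, uint64_t flags)
    {
        printf("%s: share=%d pmem=%d\n", path,
               !!(flags & QEMU_RAM_SHARE), !!(flags & QEMU_RAM_PMEM));
    }

    int main(void)
    {
        alloc_from_file("/dev/dax0.0", QEMU_RAM_SHARE | QEMU_RAM_PMEM);
        alloc_from_file("/tmp/backing", 0);
        return 0;
    }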

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c |  3 ++-
 exec.c  |  7 ---
 include/exec/memory.h   | 10 --
 include/exec/ram_addr.h | 25 +++--
 memory.c|  8 +---
 numa.c  |  2 +-
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 134b08d63a..30df843d90 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -58,7 +58,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 path = object_get_canonical_path(OBJECT(backend));
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
- backend->size, fb->align, backend->share,
+ backend->size, fb->align,
+ backend->share ? QEMU_RAM_SHARE : 0,
  fb->mem_path, errp);
 g_free(path);
 }
diff --git a/exec.c b/exec.c
index 4d8addb263..537bf12412 100644
--- a/exec.c
+++ b/exec.c
@@ -2000,12 +2000,13 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp, bool shared)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- bool share, int fd,
+ uint64_t flags, int fd,
  Error **errp)
 {
 RAMBlock *new_block;
 Error *local_err = NULL;
 int64_t file_size;
+bool share = flags & QEMU_RAM_SHARE;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-   bool share, const char *mem_path,
+   uint64_t flags, const char *mem_path,
Error **errp)
 {
 int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 return NULL;
 }
 
-block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
 if (!block) {
 if (created) {
 unlink(mem_path);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 15e81113ba..0fc9d23a48 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -487,6 +487,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
void *host),
Error **errp);
 #ifdef __linux__
+
+#define QEMU_RAM_SHARE  (1UL << 0)
+
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
  *mmap-ed backend.
@@ -498,7 +501,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify properties of this memory region, which can be one or bit-or
+ * of following values:
+ * - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ * Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -510,7 +516,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
   const char *name,
   uint64_t size,
   uint64_t align,
-  bool share,
+  uint64_t flags,
   const char *path,
   Error **errp);
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index cf2446a176..b8b01d1eb9 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int 
ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd:  Allocate a ram block from the specified back
+ * 

[Qemu-devel] [PATCH v4 2/8] hostmem-file: add the 'pmem' option

2018-02-27 Thread Haozhong Zhang
When QEMU emulates vNVDIMM labels and migrates vNVDIMM devices, it
needs to know whether the backend storage is real persistent memory,
in order to decide whether special operations should be performed to
ensure data persistence.

This boolean option 'pmem' allows users to specify whether the backend
storage of memory-backend-file is real persistent memory. If
'pmem=on', QEMU will set the flag RAM_PMEM in the RAM block of the
corresponding memory region.
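
As a rough usage sketch (not part of this patch; the path, size and IDs
below are placeholders), a vNVDIMM backed by a file on host persistent
memory would then be started with something like:

    qemu-system-x86_64 -machine pc,nvdimm=on \
        -m 2G,slots=4,maxmem=16G \
        -object memory-backend-file,id=mem1,share=on,pmem=on,mem-path=/mnt/pmem0/guest-nvdimm0,size=4G \
        -device nvdimm,id=nv1,memdev=mem1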

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c | 26 +-
 docs/nvdimm.txt | 14 ++
 exec.c  | 13 -
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h |  3 +++
 qemu-options.hx |  9 -
 6 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 30df843d90..5d706d471f 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -34,6 +34,7 @@ struct HostMemoryBackendFile {
 bool discard_data;
 char *mem_path;
 uint64_t align;
+bool is_pmem;
 };
 
 static void
@@ -59,7 +60,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
  backend->size, fb->align,
- backend->share ? QEMU_RAM_SHARE : 0,
+ (backend->share ? QEMU_RAM_SHARE : 0) |
+ (fb->is_pmem ? QEMU_RAM_PMEM : 0),
  fb->mem_path, errp);
 g_free(path);
 }
@@ -131,6 +133,25 @@ static void file_memory_backend_set_align(Object *o, 
Visitor *v,
 error_propagate(errp, local_err);
 }
 
+static bool file_memory_backend_get_pmem(Object *o, Error **errp)
+{
+return MEMORY_BACKEND_FILE(o)->is_pmem;
+}
+
+static void file_memory_backend_set_pmem(Object *o, bool value, Error **errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(o);
+HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+
+if (host_memory_backend_mr_inited(backend)) {
+error_setg(errp, "cannot change property 'pmem' of %s '%s'",
+   object_get_typename(o), backend->id);
+return;
+}
+
+fb->is_pmem = value;
+}
+
 static void file_backend_unparent(Object *obj)
 {
 HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -162,6 +183,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
 file_memory_backend_get_align,
 file_memory_backend_set_align,
 NULL, NULL, &error_abort);
+object_class_property_add_bool(oc, "pmem",
+file_memory_backend_get_pmem, file_memory_backend_set_pmem,
+&error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..bcb2032672 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -153,3 +153,17 @@ guest NVDIMM region mapping structure.  This unarmed flag 
indicates
 guest software that this vNVDIMM device contains a region that cannot
 accept persistent writes. In result, for example, the guest Linux
 NVDIMM driver, marks such vNVDIMM device as read-only.
+
+If the vNVDIMM backend is on the host persistent memory that can be
+accessed in SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
+suggested to set the 'pmem' option of memory-backend-file to 'on'. When
+'pmem=on' and QEMU is built with libpmem [2] support (configured with
+--enable-libpmem), QEMU will take necessary operations to guarantee
+the persistence of its own writes to the vNVDIMM backend (e.g., in
+vNVDIMM label emulation and live migration).
+
+References
+--
+
+[1] SNIA NVM Programming Model: 
https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
+[2] PMDK: http://pmem.io/pmdk/
diff --git a/exec.c b/exec.c
index 537bf12412..3f3b61fb0a 100644
--- a/exec.c
+++ b/exec.c
@@ -99,6 +99,9 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* RAM is backed by the persistent memory. */
+#define RAM_PMEM   (1 << 3)
+
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -2007,6 +2010,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 Error *local_err = NULL;
 int64_t file_size;
 bool share = flags & QEMU_RAM_SHARE;
+bool is_pmem = flags & QEMU_RAM_PMEM;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2043,7 +2047,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 new_block->mr = mr;
 new_block->used_length = size;
 new_block->max_length = size;
-new_block->flags = share ? RAM_SHARED : 0;
+new_block->flags = (share ? RAM_SHARED : 0) |
+   (is_pmem ? RAM_PMEM : 0);
  

[Qemu-devel] [PATCH v4 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-02-27 Thread Haozhong Zhang
QEMU writes to vNVDIMM backends in the vNVDIMM label emulation and
live migration. If the backend is on persistent memory, QEMU needs
to take proper operations to ensure its writes are persistent on the
persistent memory. Otherwise, a host power failure may result in the
loss of the guest data on the persistent memory.

This v4 patch series is based on Marcel's patch "mem: add share
parameter to memory-backend-ram" [1] because of the changes in patch 1.

[1] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03858.html

Previous versions can be found at
v3: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04365.html
v2: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg01579.html
v1: https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html

Changes in v4:
 * (Patch 2) Fix compilation errors found by patchew.

Changes in v3:
 * (Patch 5) Add a is_pmem flag to ram_handle_compressed() and handle
   PMEM writes in it, so we don't need the _common function.
 * (Patch 6) Expose qemu_get_buffer_common so we can remove the
   unnecessary qemu_get_buffer_to_pmem wrapper.
 * (Patch 8) Add a is_pmem flag to xbzrle_decode_buffer() and handle
   PMEM writes in it, so we can remove the unnecessary
   xbzrle_decode_buffer_{common, to_pmem}.
 * Move libpmem stubs to stubs/pmem.c and fix the compilation failures
   of test-{xbzrle,vmstate}.c.

Changes in v2:
 * (Patch 1) Use a flags parameter in file ram allocation functions.
 * (Patch 2) Add a new option 'pmem' to hostmem-file.
 * (Patch 3) Use libpmem to operate on the persistent memory, rather
   than re-implementing those operations in QEMU.
 * (Patch 5-8) Consider the write persistence in the migration path.

Haozhong Zhang (8):
  [1/8] memory, exec: switch file ram allocation functions to 'flags' parameters
  [2/8] hostmem-file: add the 'pmem' option
  [3/8] configure: add libpmem support
  [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  [5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  [6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
  [7/8] migration/ram: ensure write persistence on loading compressed pages to 
PMEM
  [8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

 backends/hostmem-file.c | 27 +++-
 configure   | 35 ++
 docs/nvdimm.txt | 14 +++
 exec.c  | 20 ---
 hw/mem/nvdimm.c |  9 ++-
 include/exec/memory.h   | 12 +++--
 include/exec/ram_addr.h | 28 +++--
 include/migration/qemu-file-types.h |  2 ++
 include/qemu/pmem.h | 27 
 memory.c|  8 +++---
 migration/qemu-file.c   | 29 ++
 migration/ram.c | 49 +++--
 migration/ram.h |  2 +-
 migration/rdma.c|  2 +-
 migration/xbzrle.c  |  8 --
 migration/xbzrle.h  |  3 ++-
 numa.c  |  2 +-
 qemu-options.hx |  9 ++-
 stubs/Makefile.objs |  1 +
 stubs/pmem.c| 37 
 tests/Makefile.include  |  4 +--
 tests/test-xbzrle.c |  4 +--
 22 files changed, 285 insertions(+), 47 deletions(-)
 create mode 100644 include/qemu/pmem.h
 create mode 100644 stubs/pmem.c

-- 
2.14.1




[Qemu-devel] [PATCH v2 3/3] tests/bios-tables-test: add test cases for DIMM proximity

2018-02-27 Thread Haozhong Zhang
QEMU now builds one SRAT memory affinity structure for each
static-plugged PC-DIMM and NVDIMM device with the proximity domain
specified in the device option 'node', rather than only one SRAT
memory affinity structure covering the entire hotpluggable address
space with the proximity domain of the last node.

Add test cases on PC and Q35 machines with 3 proximity domains, and
one PC-DIMM and one NVDIMM attached to the second proximity domain.
Check whether the QEMU-built SRAT tables match with the expected ones.
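
The exact arguments live in tests/bios-tables-test.c (the hunk is not
fully shown in this excerpt); purely as an illustration, with made-up
IDs and sizes, the configuration such a test drives has this shape:

    -machine pc,nvdimm=on            (or q35,nvdimm=on)
    -m 128M,slots=3,maxmem=1G
    -numa node -numa node -numa node
    -object memory-backend-ram,id=ram0,size=128M
    -device pc-dimm,id=dimm0,memdev=ram0,node=1
    -object memory-backend-ram,id=nvm0,size=128M
    -device nvdimm,id=dimm1,memdev=nvm0,node=1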

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Igor Mammedov <imamm...@redhat.com>
---
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 136 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6710 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 416 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 136 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9394 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 416 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  33 +
 11 files changed, 33 insertions(+)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

diff --git a/tests/acpi-test-data/pc/APIC.dimmpxm 
b/tests/acpi-test-data/pc/APIC.dimmpxm
new file mode 100644
index 
..658d7e748e37540ff85a02f4391efc7eaae3c8b4
GIT binary patch
literal 136
zcmZ<^@O18AU|?W8>g4b25v<@85#a0y6k`O6f!H9Lf#JbFFwFr}2jX%tGD2u3CJ@cY
q0}?#&4@5F?0WpXHVzIIUX<iVElM}|`0xE!radU%NENuUQMgRcNAq@cl

literal 0
HcmV?d1

diff --git a/tests/acpi-test-data/pc/DSDT.dimmpxm 
b/tests/acpi-test-data/pc/DSDT.dimmpxm
new file mode 100644
index 
..20e6433725bb3e70085cf6227f981106772bdaea
GIT binary patch
literal 6710
zcmcgxUvJyi6~C9H9O_E4DVt54IBf){ZQ8C)^e1#1z*8BxwMFc>Mv!Q`St
z2satx2E`NwaMQjOT80hSgA(XD{s`Mg=tt<jLWi|^s&@{_ODnPov=5tr(D)<L{hv
z@44sR%jlNgUOGbv-KeZ<H7i%SX=*z3Q9=l|@vl;sZV|huS5_UG5+rIrO8ISgRAlvi
zy|S@N|Jrr`;=1>~aB0UQo6nV}n;q}*6L*s!=>De17$ErAXf5Fu1dD*Ge^>q0g
zCdy7(ZxPwqsOwZQ<N#BZYi700K@>os1~+PE+aPH|zWFglB>Rzq^4yJTQ_q<#-N~s-
zj@2#`4|`k>yE>n_OmT<luLmv}xT%AK5gAT@J?M}>chclv|4EF<h|S23*0Qo$HocdG
zh=H6)gzOUKt&8Xlx@-4On>Pz3-`BKAD7a!4N}52}fwG(!gK1LTDmwuV1{QIb^P0e1
z2JXK7yNk$zZxT|wL{2o!YLk+yMAXXI5VZ>YQM7ZHL~a<_?EVL>wg#lZkfmU-(BFCX
z+A8*X^&{euac8D;wOYHuYwTd3WMNv)qqY?$`zvvQ|P<U{B0}0phj$7mW3d
z=*5}2$rojoSR@Jp%kqk@MU!|U^k{+2uhQ?t??fW4(jUYhV4xP4$$OH|U07+DWj@&}
zdVMyh5SC!;EKk`!6WCkuZ<Z~v1NI5~p3N{>c2@Li_7qbw4aa{12zLM14YM8jDiL))
zn0g#icQ^}xFaR_O!rfhfz1J>Q?Iq^%nTKBx)(35lb5DZUhmyr}pz
zD@aqEpkYG912Y=SBfJ!VM+P3HCLbnI&(y3oO_3K?CZgB;w*!9*#>RmZJOu
zGb)9GR>@bdfuhnhS~R5u3KX<TbHm8lw9?Sli29bPRj
<

[Qemu-devel] [PATCH v2 0/3] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-02-27 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each static-plugged DIMM device,
including both PC-DIMM and NVDIMM, with the proximity domain specified
in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.


Changes in v2:
 * Build SRAT memory affinity structures of PC-DIMM devices as well.
 * Add test cases.


Haozhong Zhang (3):
  hw/acpi-build: build SRAT memory affinity structures for DIMM devices
  tests/bios-tables-test: allow setting extra machine options
  tests/bios-tables-test: add test cases for DIMM proximity

 hw/i386/acpi-build.c  |  50 --
 hw/mem/pc-dimm.c  |   8 
 include/hw/mem/pc-dimm.h  |  10 +
 tests/acpi-test-data/pc/APIC.dimmpxm  | Bin 0 -> 136 bytes
 tests/acpi-test-data/pc/DSDT.dimmpxm  | Bin 0 -> 6710 bytes
 tests/acpi-test-data/pc/NFIT.dimmpxm  | Bin 0 -> 224 bytes
 tests/acpi-test-data/pc/SRAT.dimmpxm  | Bin 0 -> 416 bytes
 tests/acpi-test-data/pc/SSDT.dimmpxm  | Bin 0 -> 685 bytes
 tests/acpi-test-data/q35/APIC.dimmpxm | Bin 0 -> 136 bytes
 tests/acpi-test-data/q35/DSDT.dimmpxm | Bin 0 -> 9394 bytes
 tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 0 -> 224 bytes
 tests/acpi-test-data/q35/SRAT.dimmpxm | Bin 0 -> 416 bytes
 tests/acpi-test-data/q35/SSDT.dimmpxm | Bin 0 -> 685 bytes
 tests/bios-tables-test.c  |  78 +++---
 14 files changed, 126 insertions(+), 20 deletions(-)
 create mode 100644 tests/acpi-test-data/pc/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/pc/SSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/APIC.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/DSDT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/NFIT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SRAT.dimmpxm
 create mode 100644 tests/acpi-test-data/q35/SSDT.dimmpxm

-- 
2.14.1




[Qemu-devel] [PATCH v2 2/3] tests/bios-tables-test: allow setting extra machine options

2018-02-27 Thread Haozhong Zhang
Some test cases may require extra machine options beyond those used
in the current test_acpi_one(), e.g., nvdimm test cases require
the machine option 'nvdimm=on'.
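
With the new parameter, such a test would then look roughly like the
sketch below (illustrative only; the function name, variant and device
options here are hypothetical, not part of this patch):

    static void test_acpi_q35_tcg_dimm_pxm(void)
    {
        test_data data;

        memset(&data, 0, sizeof(data));
        data.machine = MACHINE_Q35;
        data.variant = ".dimmpxm";
        test_acpi_one("nvdimm=on",          /* extra machine option */
                      "-m 128M,slots=3,maxmem=1G"
                      " -numa node -numa node"
                      " -object memory-backend-ram,id=ram0,size=128M"
                      " -device pc-dimm,id=dimm0,memdev=ram0,node=1",
                      &data);
        free_test_data();
    }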

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 tests/bios-tables-test.c | 45 +
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 65b271a173..d45181aa51 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -654,17 +654,22 @@ static void test_smbios_structs(test_data *data)
 }
 }
 
-static void test_acpi_one(const char *params, test_data *data)
+static void test_acpi_one(const char *extra_machine_opts,
+  const char *params, test_data *data)
 {
 char *args;
 
 /* Disable kernel irqchip to be able to override apic irq0. */
-args = g_strdup_printf("-machine %s,accel=%s,kernel-irqchip=off "
+args = g_strdup_printf("-machine %s,accel=%s,kernel-irqchip=off",
+   data->machine, "kvm:tcg");
+if (extra_machine_opts) {
+args = g_strdup_printf("%s,%s", args, extra_machine_opts);
+}
+args = g_strdup_printf("%s "
"-net none -display none %s "
"-drive id=hd0,if=none,file=%s,format=raw "
"-device ide-hd,drive=hd0 ",
-   data->machine, "kvm:tcg",
-   params ? params : "", disk);
+   args, params ? params : "", disk);
 
 qtest_start(args);
 
@@ -711,7 +716,7 @@ static void test_acpi_piix4_tcg(void)
 data.machine = MACHINE_PC;
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one(NULL, &data);
+test_acpi_one(NULL, NULL, &data);
 free_test_data();
 }
 
@@ -724,7 +729,7 @@ static void test_acpi_piix4_tcg_bridge(void)
 data.variant = ".bridge";
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one("-device pci-bridge,chassis_nr=1", &data);
+test_acpi_one(NULL, "-device pci-bridge,chassis_nr=1", &data);
 free_test_data();
 }
 
@@ -736,7 +741,7 @@ static void test_acpi_q35_tcg(void)
 data.machine = MACHINE_Q35;
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one(NULL, &data);
+test_acpi_one(NULL, NULL, &data);
 free_test_data();
 }
 
@@ -749,7 +754,7 @@ static void test_acpi_q35_tcg_bridge(void)
 data.variant = ".bridge";
 data.required_struct_types = base_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(base_required_struct_types);
-test_acpi_one("-device pci-bridge,chassis_nr=1",
+test_acpi_one(NULL, "-device pci-bridge,chassis_nr=1",
  &data);
 free_test_data();
 }
@@ -761,7 +766,8 @@ static void test_acpi_piix4_tcg_cphp(void)
memset(&data, 0, sizeof(data));
 data.machine = MACHINE_PC;
 data.variant = ".cphp";
-test_acpi_one("-smp 2,cores=3,sockets=2,maxcpus=6"
+test_acpi_one(NULL,
+  "-smp 2,cores=3,sockets=2,maxcpus=6"
   " -numa node -numa node"
   " -numa dist,src=0,dst=1,val=21",
  &data);
@@ -775,7 +781,8 @@ static void test_acpi_q35_tcg_cphp(void)
memset(&data, 0, sizeof(data));
 data.machine = MACHINE_Q35;
 data.variant = ".cphp";
-test_acpi_one(" -smp 2,cores=3,sockets=2,maxcpus=6"
+test_acpi_one(NULL,
+  " -smp 2,cores=3,sockets=2,maxcpus=6"
   " -numa node -numa node"
   " -numa dist,src=0,dst=1,val=21",
  &data);
@@ -795,7 +802,8 @@ static void test_acpi_q35_tcg_ipmi(void)
 data.variant = ".ipmibt";
 data.required_struct_types = ipmi_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(ipmi_required_struct_types);
-test_acpi_one("-device ipmi-bmc-sim,id=bmc0"
+test_acpi_one(NULL,
+  "-device ipmi-bmc-sim,id=bmc0"
   " -device isa-ipmi-bt,bmc=bmc0",
  &data);
 free_test_data();
@@ -813,7 +821,8 @@ static void test_acpi_piix4_tcg_ipmi(void)
 data.variant = ".ipmikcs";
 data.required_struct_types = ipmi_required_struct_types;
 data.required_struct_types_len = ARRAY_SIZE(ipmi_required_struct_types);
-test_acpi_one("-device ipmi-bmc-sim,id=bmc0"
+test_acpi_one(NULL,
+  "-device ipmi-b

[Qemu-devel] [PATCH v2 1/3] hw/acpi-build: build SRAT memory affinity structures for DIMM devices

2018-02-27 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such proximity domain mismatch, this patch builds
one SRAT memory affinity structure for each static-plugged DIMM device,
including both PC-DIMM and NVDIMM, with the proximity domain specified
in '-device pc-dimm' or '-device nvdimm'.

The remaining hot-pluggable address space is covered by one or multiple
SRAT memory affinity structures with the proximity domain of the last
node as before.
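
For example (addresses made up; assume three guest nodes, so the
default proximity domain for the remaining range is node 2, and one
1 GiB PC-DIMM plugged at 0x140000000 on node 1), the hotpluggable area
is now described by three structures instead of one:

    0x100000000 .. 0x140000000   node 2   HOTPLUGGABLE | ENABLED
    0x140000000 .. 0x180000000   node 1   HOTPLUGGABLE | ENABLED
    0x180000000 .. end of area   node 2   HOTPLUGGABLE | ENABLED

An NVDIMM in place of the PC-DIMM would additionally get
MEM_AFFINITY_NON_VOLATILE in its entry.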

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/i386/acpi-build.c | 50 
 hw/mem/pc-dimm.c |  8 
 include/hw/mem/pc-dimm.h | 10 ++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..a88de06d8f 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,49 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+GSList *dimms = pc_dimm_get_device_list();
+GSList *ent = dimms;
+PCDIMMDevice *dev;
+Object *obj;
+uint64_t end = base + len, addr, size;
+int node;
+AcpiSratMemoryAffinity *numamem;
+
+while (base < end) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!ent) {
+build_srat_memory(numamem, base, end - base, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+dev = PC_DIMM(ent->data);
+obj = OBJECT(dev);
+addr = object_property_get_uint(obj, PC_DIMM_ADDR_PROP, NULL);
+size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, NULL);
+node = object_property_get_uint(obj, PC_DIMM_NODE_PROP, NULL);
+
+if (base < addr) {
+build_srat_memory(numamem, base, addr - base, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+build_srat_memory(numamem, addr, size, node,
+  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED |
+  (object_dynamic_cast(obj, TYPE_NVDIMM) ?
+   MEM_AFFINITY_NON_VOLATILE : 0));
+
+base = addr + size;
+ent = g_slist_next(ent);
+}
+
+g_slist_free(dimms);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2477,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+   pcms->numa_nodes - 1);
 }
 
 build_header(linker, table_data,
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 6e74b61cb6..9fd901e87a 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -276,6 +276,14 @@ static int pc_dimm_built_list(Object *obj, void *opaque)
 return 0;
 }
 
+GSList *pc_dimm_get_device_list(void)
+{
+GSList *list = NULL;
+
+object_child_foreach(qdev_get_machine(), pc_dimm_built_list, &list);
+return list;
+}
+
 uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
uint64_t address_space_size,
uint64_t *hint, uint64_t align, uint64_t size,
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index d83b957829..4cf5cc49e9 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -100,4 +100,14 @@ void pc_dimm_memory_plug(DeviceState *dev, 
MemoryHotplugState *hpms,
  MemoryRegion *mr, uint64_t align, Error **errp);
 void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
Memor

Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-26 Thread Haozhong Zhang
On 02/26/18 14:59 +0100, Igor Mammedov wrote:
> On Thu, 22 Feb 2018 09:40:00 +0800
> Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> 
> > On 02/21/18 14:55 +0100, Igor Mammedov wrote:
> > > On Tue, 20 Feb 2018 17:17:58 -0800
> > > Dan Williams <dan.j.willi...@intel.com> wrote:
> > >   
> > > > On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov <imamm...@redhat.com> 
> > > > wrote:  
> > > > > On Sat, 17 Feb 2018 14:31:35 +0800
> > > > > Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> > > > >
> > > > >> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
> > > > >> domain of a NVDIMM SPA range must match with corresponding entry in
> > > > >> SRAT table.
> > > > >>
> > > > >> The address ranges of vNVDIMM in QEMU are allocated from the
> > > > >> hot-pluggable address space, which is entirely covered by one SRAT
> > > > >> memory affinity structure. However, users can set the vNVDIMM
> > > > >> proximity domain in NFIT SPA range structure by the 'node' property 
> > > > >> of
> > > > >> '-device nvdimm' to a value different than the one in the above SRAT
> > > > >> memory affinity structure.
> > > > >>
> > > > >> In order to solve such proximity domain mismatch, this patch build 
> > > > >> one
> > > > >> SRAT memory affinity structure for each NVDIMM device with the
> > > > >> proximity domain used in NFIT. The remaining hot-pluggable address
> > > > >> space is covered by one or multiple SRAT memory affinity structures
> > > > >> with the proximity domain of the last node as before.
> > > > >>
> > > > >> Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > > > > If we consider hotpluggable system, correctly implemented OS should
> > > > > be able pull proximity from Device::_PXM and override any value from 
> > > > > SRAT.
> > > > > Do we really have a problem here (anything that breaks if we would 
> > > > > use _PXM)?
> > > > > Maybe we should add _PXM object to nvdimm device nodes instead of 
> > > > > massaging SRAT?
> > > > 
> > > > Unfortunately _PXM is an awkward fit. Currently the proximity domain
> > > > is attached to the SPA range structure. The SPA range may be
> > > > associated with multiple DIMM devices and those individual NVDIMMs may
> > > > have conflicting _PXM properties.  
> > > There shouldn't be any conflict here as  NVDIMM device's _PXM method,
> > > should override in runtime any proximity specified by parent scope.
> > > (as parent scope I'd also count boot time NFIT/SRAT tables).
> > > 
> > > To make it more clear we could clear valid proximity domain flag in SPA
> > > like this:
> > > 
> > > diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> > > index 59d6e42..131bca5 100644
> > > --- a/hw/acpi/nvdimm.c
> > > +++ b/hw/acpi/nvdimm.c
> > > @@ -260,9 +260,7 @@ nvdimm_build_structure_spa(GArray *structures, 
> > > DeviceState *dev)
> > >   */
> > >  nfit_spa->flags = cpu_to_le16(1 /* Control region is strictly for
> > > management during hot add/online
> > > -   operation */ |
> > > -  2 /* Data in Proximity Domain field is
> > > -   valid*/);
> > > +   operation */);
> > >  
> > >  /* NUMA node. */
> > >  nfit_spa->proximity_domain = cpu_to_le32(node);
> > >   
> > > > Even if that was unified across
> > > > DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the
> > > > device's control interface, or the assembled persistent memory SPA
> > > > range.  
> > > I'm not sure what you mean under 'device's control interface',
> > > could you clarify where the ambiguity comes from?
> > > 
> > > I read spec as: _PXM applies to address range covered by NVDIMM
> > > device it belongs to.
> > > 
> > > As for assembled SPA, I'd assume that it applies to interleaved set
> > > and all NVDIMMs with it should be on the same node. It's somewhat
>

Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-23 Thread Haozhong Zhang
Hi Fam,

On 02/23/18 17:17 -0800, no-re...@patchew.org wrote:
> Hi,
> 
> This series failed build test on s390x host. Please find the details below.
> 
> N/A. Internal error while reading log file

What does this message mean? Where can I get the log file?

Thanks,
Haozhong



Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-21 Thread Haozhong Zhang
On 02/21/18 14:55 +0100, Igor Mammedov wrote:
> On Tue, 20 Feb 2018 17:17:58 -0800
> Dan Williams <dan.j.willi...@intel.com> wrote:
> 
> > On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov <imamm...@redhat.com> wrote:
> > > On Sat, 17 Feb 2018 14:31:35 +0800
> > > Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> > >  
> > >> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
> > >> domain of a NVDIMM SPA range must match with corresponding entry in
> > >> SRAT table.
> > >>
> > >> The address ranges of vNVDIMM in QEMU are allocated from the
> > >> hot-pluggable address space, which is entirely covered by one SRAT
> > >> memory affinity structure. However, users can set the vNVDIMM
> > >> proximity domain in NFIT SPA range structure by the 'node' property of
> > >> '-device nvdimm' to a value different than the one in the above SRAT
> > >> memory affinity structure.
> > >>
> > >> In order to solve such proximity domain mismatch, this patch build one
> > >> SRAT memory affinity structure for each NVDIMM device with the
> > >> proximity domain used in NFIT. The remaining hot-pluggable address
> > >> space is covered by one or multiple SRAT memory affinity structures
> > >> with the proximity domain of the last node as before.
> > >>
> > >> Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>  
> > > If we consider hotpluggable system, correctly implemented OS should
> > > be able pull proximity from Device::_PXM and override any value from SRAT.
> > > Do we really have a problem here (anything that breaks if we would use 
> > > _PXM)?
> > > Maybe we should add _PXM object to nvdimm device nodes instead of 
> > > massaging SRAT?  
> > 
> > Unfortunately _PXM is an awkward fit. Currently the proximity domain
> > is attached to the SPA range structure. The SPA range may be
> > associated with multiple DIMM devices and those individual NVDIMMs may
> > have conflicting _PXM properties.
> There shouldn't be any conflict here as  NVDIMM device's _PXM method,
> should override in runtime any proximity specified by parent scope.
> (as parent scope I'd also count boot time NFIT/SRAT tables).
> 
> To make it more clear we could clear valid proximity domain flag in SPA
> like this:
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 59d6e42..131bca5 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -260,9 +260,7 @@ nvdimm_build_structure_spa(GArray *structures, 
> DeviceState *dev)
>   */
>  nfit_spa->flags = cpu_to_le16(1 /* Control region is strictly for
> management during hot add/online
> -   operation */ |
> -  2 /* Data in Proximity Domain field is
> -   valid*/);
> +   operation */);
>  
>  /* NUMA node. */
>  nfit_spa->proximity_domain = cpu_to_le32(node);
> 
> > Even if that was unified across
> > DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the
> > device's control interface, or the assembled persistent memory SPA
> > range.
> I'm not sure what you mean under 'device's control interface',
> could you clarify where the ambiguity comes from?
> 
> I read spec as: _PXM applies to address range covered by NVDIMM
> device it belongs to.
> 
> As for assembled SPA, I'd assume that it applies to interleaved set
> and all NVDIMMs with it should be on the same node. It's somewhat
> irrelevant question though as QEMU so far implements only
>   1:1:1/SPA:Region Mapping:NVDIMM Device/
> mapping.
> 
> My main concern with using static configuration tables for proximity
> mapping, we'd miss on hotplug side of equation. However if we start
> from dynamic side first, we could later complement it with static
> tables if there really were need for it.

This patch affects only the static tables and static-plugged NVDIMM.
For hot-plugged NVDIMMs, guest OSPM still needs to evaluate _FIT to
get the information of the new NVDIMMs including their proximity
domains.

One intention of this patch is to simulate the bare metal as much as
possible. I have been using this patch to develop and test NVDIMM
enabling work on Xen, and think it might be useful for developers of
other OS and hypervisors.


Haozhong



[Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-16 Thread Haozhong Zhang
ACPI 6.2A Table 5-129 "SPA Range Structure" requires that the proximity
domain of an NVDIMM SPA range match the corresponding entry in the
SRAT table.

The address ranges of vNVDIMM in QEMU are allocated from the
hot-pluggable address space, which is entirely covered by one SRAT
memory affinity structure. However, users can set the vNVDIMM
proximity domain in NFIT SPA range structure by the 'node' property of
'-device nvdimm' to a value different than the one in the above SRAT
memory affinity structure.

In order to solve such a proximity domain mismatch, this patch builds one
SRAT memory affinity structure for each NVDIMM device with the
proximity domain used in NFIT. The remaining hot-pluggable address
space is covered by one or multiple SRAT memory affinity structures
with the proximity domain of the last node as before.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/acpi/nvdimm.c| 15 +--
 hw/i386/acpi-build.c| 47 +++
 include/hw/mem/nvdimm.h | 11 +++
 3 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 59d6e4254c..dff0818e77 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -33,12 +33,23 @@
 #include "hw/nvram/fw_cfg.h"
 #include "hw/mem/nvdimm.h"
 
+static gint nvdimm_addr_sort(gconstpointer a, gconstpointer b)
+{
+uint64_t addr0 = object_property_get_uint(OBJECT(NVDIMM(a)),
+  PC_DIMM_ADDR_PROP, NULL);
+uint64_t addr1 = object_property_get_uint(OBJECT(NVDIMM(b)),
+  PC_DIMM_ADDR_PROP, NULL);
+
+return addr0 < addr1 ? -1 :
+   addr0 > addr1 ?  1 : 0;
+}
+
 static int nvdimm_device_list(Object *obj, void *opaque)
 {
 GSList **list = opaque;
 
 if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
-*list = g_slist_append(*list, DEVICE(obj));
+*list = g_slist_insert_sorted(*list, DEVICE(obj), nvdimm_addr_sort);
 }
 
 object_child_foreach(obj, nvdimm_device_list, opaque);
@@ -52,7 +63,7 @@ static int nvdimm_device_list(Object *obj, void *opaque)
  * Note: it is the caller's responsibility to free the list to avoid
  * memory leak.
  */
-static GSList *nvdimm_get_device_list(void)
+GSList *nvdimm_get_device_list(void)
 {
 GSList *list = NULL;
 
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index deb440f286..637ac3a8f0 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2323,6 +2323,46 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, 
GArray *tcpalog)
 #define HOLE_640K_START  (640 * 1024)
 #define HOLE_640K_END   (1024 * 1024)
 
+static void build_srat_hotpluggable_memory(GArray *table_data, uint64_t base,
+   uint64_t len, int default_node)
+{
+GSList *nvdimms = nvdimm_get_device_list();
+GSList *ent = nvdimms;
+NVDIMMDevice *dev;
+uint64_t end = base + len, addr, size;
+int node;
+AcpiSratMemoryAffinity *numamem;
+
+while (base < end) {
+numamem = acpi_data_push(table_data, sizeof *numamem);
+
+if (!ent) {
+build_srat_memory(numamem, base, end - base, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+break;
+}
+
+dev = NVDIMM(ent->data);
+addr = object_property_get_uint(OBJECT(dev), PC_DIMM_ADDR_PROP, NULL);
+size = object_property_get_uint(OBJECT(dev), PC_DIMM_SIZE_PROP, NULL);
+node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP, NULL);
+
+if (base < addr) {
+build_srat_memory(numamem, base, addr - base, default_node,
+  MEM_AFFINITY_HOTPLUGGABLE | 
MEM_AFFINITY_ENABLED);
+numamem = acpi_data_push(table_data, sizeof *numamem);
+}
+build_srat_memory(numamem, addr, size, node,
+  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED |
+  MEM_AFFINITY_NON_VOLATILE);
+
+base = addr + size;
+ent = ent->next;
+}
+
+g_slist_free(nvdimms);
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2434,10 +2474,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  * providing _PXM method if necessary.
  */
 if (hotplugabble_address_space_size) {
-numamem = acpi_data_push(table_data, sizeof *numamem);
-build_srat_memory(numamem, pcms->hotplug_memory.base,
-  hotplugabble_address_space_size, pcms->numa_nodes - 
1,
-  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+build_srat_hotpluggable_memory(table_data, pcms->hotplug_memory.base,
+   hotplugabble_address_space_size,
+

[Qemu-devel] [PATCH v3 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM

2018-02-16 Thread Haozhong Zhang
When loading a normal page to persistent memory, load its data via the
libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
with a call to pmem_drain() at the end of memory loading, we can
guarantee all those normal pages are persistently loaded to PMEM.
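
A minimal sketch of the intended pattern (not the actual migration
code; the helper name and arguments are made up for illustration):

    #include "qemu/osdep.h"
    #include "qemu/pmem.h"

    /* Copy a batch of pages into a PMEM-backed RAM block: flush-on-copy
     * without draining per page, then issue a single drain at the end. */
    static void copy_pages_to_pmem(void *pmem_dst, const void *src,
                                   size_t page_size, size_t npages)
    {
        size_t i;

        for (i = 0; i < npages; i++) {
            pmem_memcpy_nodrain((char *)pmem_dst + i * page_size,
                                (const char *)src + i * page_size,
                                page_size);
        }
        pmem_drain();   /* one fence covers all the copies above */
    }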

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/migration/qemu-file-types.h |  2 ++
 include/qemu/pmem.h |  1 +
 migration/qemu-file.c   | 29 +++--
 migration/ram.c |  2 +-
 stubs/pmem.c|  5 +
 tests/Makefile.include  |  2 +-
 6 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/migration/qemu-file-types.h 
b/include/migration/qemu-file-types.h
index bd6d7dd7f9..c7c3f665f9 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -33,6 +33,8 @@ void qemu_put_byte(QEMUFile *f, int v);
 void qemu_put_be16(QEMUFile *f, unsigned int v);
 void qemu_put_be32(QEMUFile *f, unsigned int v);
 void qemu_put_be64(QEMUFile *f, uint64_t v);
+size_t qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size,
+  bool is_pmem);
 size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
 
 int qemu_get_byte(QEMUFile *f);
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index ce96379f3c..127b87c326 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -16,6 +16,7 @@
 #include <libpmem.h>
 #else  /* !CONFIG_LIBPMEM */
 
+void *pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len);
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
 void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
 void pmem_drain(void);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..d19f677796 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -26,6 +26,7 @@
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
+#include "qemu/pmem.h"
 #include "migration.h"
 #include "qemu-file.h"
 #include "trace.h"
@@ -471,18 +472,13 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, 
size_t size, size_t offset)
 return size;
 }
 
-/*
- * Read 'size' bytes of data from the file into buf.
- * 'size' can be larger than the internal buffer.
- *
- * It will return size bytes unless there was an error, in which case it will
- * return as many as it managed to read (assuming blocking fd's which
- * all current QEMUFile are)
- */
-size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+size_t qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size,
+  bool is_pmem)
 {
 size_t pending = size;
 size_t done = 0;
+void *(*memcpy_func)(void *d, const void *s, size_t n) =
+is_pmem ? pmem_memcpy_nodrain : memcpy;
 
 while (pending > 0) {
 size_t res;
@@ -492,7 +488,7 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 if (res == 0) {
 return done;
 }
-memcpy(buf, src, res);
+memcpy_func(buf, src, res);
 qemu_file_skip(f, res);
 buf += res;
 pending -= res;
@@ -501,6 +497,19 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 return done;
 }
 
+/*
+ * Read 'size' bytes of data from the file into buf.
+ * 'size' can be larger than the internal buffer.
+ *
+ * It will return size bytes unless there was an error, in which case it will
+ * return as many as it managed to read (assuming blocking fd's which
+ * all current QEMUFile are)
+ */
+size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+{
+return qemu_get_buffer_common(f, buf, size, false);
+}
+
 /*
  * Read 'size' bytes of data from the file.
  * 'size' can be larger than the internal buffer.
diff --git a/migration/ram.c b/migration/ram.c
index cb93f9fafe..96f33018cf 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2944,7 +2944,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_PAGE:
-qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+qemu_get_buffer_common(f, host, TARGET_PAGE_SIZE, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_COMPRESS_PAGE:
diff --git a/stubs/pmem.c b/stubs/pmem.c
index a65b3bfc6b..e172f31174 100644
--- a/stubs/pmem.c
+++ b/stubs/pmem.c
@@ -26,3 +26,8 @@ void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
 void pmem_drain(void)
 {
 }
+
+void *pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
diff --git a/tests/Makefile.include b/tests/Makefile.include
index a1bcbffe12..9ffb0cf8eb 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -637,7 +637,7 @@ tests/test-qdev-global-props$(EXESUF): 
tests/test-qdev-global-props.o \
$(test-qapi-obj-y)
 

[Qemu-devel] [PATCH v3 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-16 Thread Haozhong Zhang
When loading a zero page, check whether it will be loaded to
persistent memory. If yes, load it via the libpmem function
pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
end of RAM loading, we can guarantee all those zero pages are
persistently loaded.

Depending on the host HW/SW configurations, pmem_drain() can be
"sfence".  Therefore, we do not call pmem_drain() after each
pmem_memset_nodrain(), or use pmem_memset_persist() (equally
pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
overhead.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  2 ++
 migration/ram.c | 25 +
 migration/ram.h |  2 +-
 migration/rdma.c|  2 +-
 stubs/pmem.c|  9 +
 5 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 16f5b2653a..ce96379f3c 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -17,6 +17,8 @@
 #else  /* !CONFIG_LIBPMEM */
 
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
+void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
+void pmem_drain(void);
 
 #endif /* CONFIG_LIBPMEM */
 
diff --git a/migration/ram.c b/migration/ram.c
index 8333d8e35e..cb93f9fafe 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -51,6 +51,7 @@
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
 #include "migration/block.h"
+#include "qemu/pmem.h"
 
 /***/
 /* ram save/restore */
@@ -2477,11 +2478,16 @@ static inline void *host_from_ram_block_offset(RAMBlock 
*block,
  * @host: host address for the zero page
  * @ch: what the page is filled from.  We only support zero
  * @size: size of the zero page
+ * @is_pmem: whether @host is in the persistent memory
  */
-void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
+void ram_handle_compressed(void *host, uint8_t ch, uint64_t size, bool is_pmem)
 {
 if (ch != 0 || !is_zero_range(host, size)) {
-memset(host, ch, size);
+if (!is_pmem) {
+memset(host, ch, size);
+} else {
+pmem_memset_nodrain(host, ch, size);
+}
 }
 }
 
@@ -2824,6 +2830,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 bool postcopy_running = postcopy_is_running();
 /* ADVISE is earlier, it shows the source has the postcopy capability on */
 bool postcopy_advised = postcopy_is_advised();
+bool need_pmem_drain = false;
 
 seq_iter++;
 
@@ -2849,6 +2856,8 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ram_addr_t addr, total_ram_bytes;
 void *host = NULL;
 uint8_t ch;
+RAMBlock *block = NULL;
+bool is_pmem = false;
 
 addr = qemu_get_be64(f);
 flags = addr & ~TARGET_PAGE_MASK;
@@ -2865,7 +2874,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
  RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-RAMBlock *block = ram_block_from_stream(f, flags);
+block = ram_block_from_stream(f, flags);
 
 host = host_from_ram_block_offset(block, addr);
 if (!host) {
@@ -2875,6 +2884,9 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 ramblock_recv_bitmap_set(block, host);
 trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
+
+is_pmem = ramblock_is_pmem(block);
+need_pmem_drain = need_pmem_drain || is_pmem;
 }
 
 switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -2928,7 +2940,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 case RAM_SAVE_FLAG_ZERO:
 ch = qemu_get_byte(f);
-ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
+ram_handle_compressed(host, ch, TARGET_PAGE_SIZE, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_PAGE:
@@ -2971,6 +2983,11 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 
 wait_for_decompress_done();
+
+if (need_pmem_drain) {
+pmem_drain();
+}
+
 rcu_read_unlock();
 trace_ram_load_complete(ret, seq_iter);
 return ret;
diff --git a/migration/ram.h b/migration/ram.h
index f3a227b4fc..18934ae9e4 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -57,7 +57,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms);
 int ram_discard_range(const char *block_name, uint64_t start, size_t length);
 int ram_postcopy_incoming_init(MigrationIncomingState *mis);
 
-void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
+void ram_handle_compressed(void *host, uint8_t ch, uint64_t size, bool 
is_pmem);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bit

[Qemu-devel] [PATCH v3 8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

2018-02-16 Thread Haozhong Zhang
When loading an xbzrle-encoded page to persistent memory, load the data
via the libpmem function pmem_memcpy_nodrain() instead of memcpy().
Combined with a call to pmem_drain() at the end of memory loading, we
can guarantee those xbzrle-encoded pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 migration/ram.c| 6 +++---
 migration/xbzrle.c | 8 ++--
 migration/xbzrle.h | 3 ++-
 tests/Makefile.include | 2 +-
 tests/test-xbzrle.c| 4 ++--
 5 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 140f6886df..100083287c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2389,7 +2389,7 @@ static void ram_save_pending(QEMUFile *f, void *opaque, 
uint64_t max_size,
 }
 }
 
-static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
+static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host, bool is_pmem)
 {
 unsigned int xh_len;
 int xh_flags;
@@ -2415,7 +2415,7 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void 
*host)
 
 /* decode RLE */
 if (xbzrle_decode_buffer(loaded_data, xh_len, host,
- TARGET_PAGE_SIZE) == -1) {
+ TARGET_PAGE_SIZE, is_pmem) == -1) {
 error_report("Failed to load XBZRLE page - decode error!");
 return -1;
 }
@@ -2964,7 +2964,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
-if (load_xbzrle(f, addr, host) < 0) {
+if (load_xbzrle(f, addr, host, is_pmem) < 0) {
 error_report("Failed to decompress XBZRLE page at "
  RAM_ADDR_FMT, addr);
 ret = -EINVAL;
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..ca713c3697 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -12,6 +12,7 @@
  */
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/pmem.h"
 #include "xbzrle.h"
 
 /*
@@ -126,11 +127,14 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t 
*new_buf, int slen,
 return d;
 }
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen,
+ bool is_pmem)
 {
 int i = 0, d = 0;
 int ret;
 uint32_t count = 0;
+void *(*memcpy_func)(void *d, const void *s, size_t n) =
+is_pmem ? pmem_memcpy_nodrain : memcpy;
 
 while (i < slen) {
 
@@ -167,7 +171,7 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t 
*dst, int dlen)
 return -1;
 }
 
-memcpy(dst + d, src + i, count);
+memcpy_func(dst + d, src + i, count);
 d += count;
 i += count;
 }
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..f18f679f47 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -17,5 +17,6 @@
 int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
  uint8_t *dst, int dlen);
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen,
+ bool is_pmem);
 #endif
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 9ffb0cf8eb..1005195cdc 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -616,7 +616,7 @@ tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o 
$(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) 
$(test-crypto-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
-tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o 
migration/page_cache.o $(test-util-obj-y)
+tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o 
migration/page_cache.o stubs/pmem.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o 
$(test-util-obj-y)
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
diff --git a/tests/test-xbzrle.c b/tests/test-xbzrle.c
index f5e08de91e..9afa0c4bcb 100644
--- a/tests/test-xbzrle.c
+++ b/tests/test-xbzrle.c
@@ -101,7 +101,7 @@ static void test_encode_decode_1_byte(void)
PAGE_SIZE);
 g_assert(dlen == (uleb128_encode_small([0], 4095) + 2));
 
-rc = xbzrle_decode_buffer(compressed, dlen, buffer, PAGE_SIZE);
+rc = xbzrle_decode_buffer(compressed, dlen, buffer, PAGE_SIZE, false);
 g_assert(rc == PAGE_SIZE);
 g_assert(memcmp(test, buffer, PAGE_SIZE) == 0);
 
@@ -156,7 +156,7 @@ static void encode_decode_range(void)
 dlen = xbzrle_encode_buffer(test, buffer, PAGE_SIZE, compressed,
 PAGE_SIZE);
 

[Qemu-devel] [PATCH v3 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation

2018-02-16 Thread Haozhong Zhang
Guest writes to vNVDIMM labels are intercepted and performed on the
backend by QEMU. When the backend is real persistent memory, QEMU
needs to take proper operations to ensure the persistence of its writes
to the persistent memory. Otherwise, a host power failure may result in the
loss of guest label configurations.
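
A note on the choice of libpmem call here (sketch, not part of this
patch): label writes are single, small and rare, so the draining
variant is used directly rather than the batched nodrain-then-drain
pattern used by the migration patches later in this series:

    /*
     * One small write: copy and make it persistent in one call.
     * pmem_memcpy_persist() behaves like pmem_memcpy_nodrain()
     * followed by pmem_drain().
     */
    pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);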

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/mem/nvdimm.c |  9 -
 include/qemu/pmem.h | 23 +++
 stubs/Makefile.objs |  1 +
 stubs/pmem.c| 19 +++
 4 files changed, 51 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/pmem.h
 create mode 100644 stubs/pmem.c

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 61e677f92f..18861d1a7a 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/pmem.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
@@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 {
 MemoryRegion *mr;
 PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
+"pmem", NULL);
 uint64_t backend_offset;
 
 nvdimm_validate_rw_label_data(nvdimm, size, offset);
 
-memcpy(nvdimm->label_data + offset, buf, size);
+if (!is_pmem) {
+memcpy(nvdimm->label_data + offset, buf, size);
+} else {
+pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
+}
 
 mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
 backend_offset = memory_region_size(mr) - nvdimm->label_size + offset;
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
new file mode 100644
index 00..16f5b2653a
--- /dev/null
+++ b/include/qemu/pmem.h
@@ -0,0 +1,23 @@
+/*
+ * QEMU header file for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zh...@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_PMEM_H
+#define QEMU_PMEM_H
+
+#ifdef CONFIG_LIBPMEM
+#include <libpmem.h>
+#else  /* !CONFIG_LIBPMEM */
+
+void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
+
+#endif /* CONFIG_LIBPMEM */
+
+#endif /* !QEMU_PMEM_H */
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 2d59d84091..ba944b9739 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -43,3 +43,4 @@ stub-obj-y += xen-common.o
 stub-obj-y += xen-hvm.o
 stub-obj-y += pci-host-piix.o
 stub-obj-y += ram-block.o
+stub-obj-$(call lnot,$(CONFIG_LIBPMEM)) += pmem.o
\ No newline at end of file
diff --git a/stubs/pmem.c b/stubs/pmem.c
new file mode 100644
index 00..03d990e571
--- /dev/null
+++ b/stubs/pmem.c
@@ -0,0 +1,19 @@
+/*
+ * Stubs for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zh...@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include 
+
+#include "qemu/pmem.h"
+
+void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
-- 
2.16.1




[Qemu-devel] [PATCH v3 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-16 Thread Haozhong Zhang
When loading a compressed page to persistent memory, flush CPU cache
after the data is decompressed. Combined with a call to pmem_drain()
at the end of memory loading, we can guarantee those compressed pages
are persistently loaded to PMEM.
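
A minimal sketch of the decompress-then-flush step (illustrative only;
the helper name and signature are made up, and error handling is
reduced to the zlib return code):

    #include "qemu/osdep.h"
    #include "qemu/pmem.h"
    #include <zlib.h>

    /* Decompress one page straight into its PMEM destination and flush
     * the written bytes; the single pmem_drain() at the end of RAM
     * loading then makes all flushed data persistent. */
    static int load_compressed_page_to_pmem(void *pmem_dst, size_t page_size,
                                            const uint8_t *compbuf, size_t len,
                                            bool is_pmem)
    {
        uLongf out_len = page_size;
        int rc = uncompress((Bytef *)pmem_dst, &out_len,
                            (const Bytef *)compbuf, (uLong)len);

        if (rc == Z_OK && is_pmem) {
            pmem_flush(pmem_dst, out_len);
        }
        return rc == Z_OK ? 0 : -1;
    }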

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  1 +
 migration/ram.c | 16 +++-
 stubs/pmem.c|  4 
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 127b87c326..120439ecb8 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -20,6 +20,7 @@ void *pmem_memcpy_nodrain(void *pmemdest, const void *src, 
size_t len);
 void *pmem_memcpy_persist(void *pmemdest, const void *src, size_t len);
 void *pmem_memset_nodrain(void *pmemdest, int c, size_t len);
 void pmem_drain(void);
+void pmem_flush(const void *addr, size_t len);
 
 #endif /* CONFIG_LIBPMEM */
 
diff --git a/migration/ram.c b/migration/ram.c
index 96f33018cf..140f6886df 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -276,6 +276,7 @@ struct DecompressParam {
 void *des;
 uint8_t *compbuf;
 int len;
+bool is_pmem;
 };
 typedef struct DecompressParam DecompressParam;
 
@@ -2496,7 +2497,7 @@ static void *do_data_decompress(void *opaque)
 DecompressParam *param = opaque;
 unsigned long pagesize;
 uint8_t *des;
-int len;
+int len, rc;
 
 qemu_mutex_lock(&param->mutex);
 while (!param->quit) {
@@ -2512,8 +2513,11 @@ static void *do_data_decompress(void *opaque)
  * not a problem because the dirty page will be retransferred
  * and uncompress() won't break the data in other pages.
  */
-uncompress((Bytef *)des, &pagesize,
-   (const Bytef *)param->compbuf, len);
+rc = uncompress((Bytef *)des, &pagesize,
+(const Bytef *)param->compbuf, len);
+if (rc == Z_OK && param->is_pmem) {
+pmem_flush(des, len);
+}
 
 qemu_mutex_lock(&decomp_done_lock);
 param->done = true;
@@ -2599,7 +2603,8 @@ static void compress_threads_load_cleanup(void)
 }
 
 static void decompress_data_with_multi_threads(QEMUFile *f,
-   void *host, int len)
+   void *host, int len,
+   bool is_pmem)
 {
 int idx, thread_count;
 
@@ -2613,6 +2618,7 @@ static void decompress_data_with_multi_threads(QEMUFile 
*f,
 qemu_get_buffer(f, decomp_param[idx].compbuf, len);
 decomp_param[idx].des = host;
 decomp_param[idx].len = len;
+decomp_param[idx].is_pmem = is_pmem;
 qemu_cond_signal(&decomp_param[idx].cond);
 qemu_mutex_unlock(&decomp_param[idx].mutex);
 break;
@@ -2954,7 +2960,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ret = -EINVAL;
 break;
 }
-decompress_data_with_multi_threads(f, host, len);
+decompress_data_with_multi_threads(f, host, len, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
diff --git a/stubs/pmem.c b/stubs/pmem.c
index e172f31174..cfab830131 100644
--- a/stubs/pmem.c
+++ b/stubs/pmem.c
@@ -31,3 +31,7 @@ void *pmem_memcpy_nodrain(void *pmemdest, const void *src, 
size_t len)
 {
 return memcpy(pmemdest, src, len);
 }
+
+void pmem_flush(const void *addr, size_t len)
+{
+}
-- 
2.16.1




[Qemu-devel] [PATCH v3 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-02-16 Thread Haozhong Zhang
QEMU writes to vNVDIMM backends in the vNVDIMM label emulation and
live migration. If the backend is on persistent memory, QEMU needs
to take proper operations to ensure its writes are persistent on the
persistent memory. Otherwise, a host power failure may result in the
loss of the guest data on the persistent memory.

This v3 patch series is based on Marcel's patch "mem: add share
parameter to memory-backend-ram" [1] because of the changes in patch 1.

[1] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03858.html

Previous versions can be found at
v2: https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg01579.html
v1: https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html

Changes in v3:
 * (Patch 5) Add a is_pmem flag to ram_handle_compressed() and handle
   PMEM writes in it, so we don't need the _common function.
 * (Patch 6) Expose qemu_get_buffer_common so we can remove the
   unnecessary qemu_get_buffer_to_pmem wrapper.
 * (Patch 8) Add a is_pmem flag to xbzrle_decode_buffer() and handle
   PMEM writes in it, so we can remove the unnecessary
   xbzrle_decode_buffer_{common, to_pmem}.
 * Move libpmem stubs to stubs/pmem.c and fix the compilation failures
   of test-{xbzrle,vmstate}.c.

Changes in v2:
 * (Patch 1) Use a flags parameter in file ram allocation functions.
 * (Patch 2) Add a new option 'pmem' to hostmem-file.
 * (Patch 3) Use libpmem to operate on the persistent memory, rather
   than re-implementing those operations in QEMU.
 * (Patch 5-8) Consider the write persistence in the migration path.

Haozhong Zhang (8):
  [1/8] memory, exec: switch file ram allocation functions to 'flags' parameters
  [2/8] hostmem-file: add the 'pmem' option
  [3/8] configure: add libpmem support
  [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
  [5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
  [6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
  [7/8] migration/ram: ensure write persistence on loading compressed pages to 
PMEM
  [8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

 backends/hostmem-file.c | 27 +++-
 configure   | 35 ++
 docs/nvdimm.txt | 14 +++
 exec.c  | 23 ++---
 hw/mem/nvdimm.c |  9 ++-
 include/exec/memory.h   | 12 +++--
 include/exec/ram_addr.h | 28 +++--
 include/migration/qemu-file-types.h |  2 ++
 include/qemu/pmem.h | 27 
 memory.c|  8 +++---
 migration/qemu-file.c   | 29 ++
 migration/ram.c | 49 +++--
 migration/ram.h |  2 +-
 migration/rdma.c|  2 +-
 migration/xbzrle.c  |  8 --
 migration/xbzrle.h  |  3 ++-
 numa.c  |  2 +-
 qemu-options.hx |  9 ++-
 stubs/Makefile.objs |  1 +
 stubs/pmem.c| 37 
 tests/Makefile.include  |  4 +--
 tests/test-xbzrle.c |  4 +--
 22 files changed, 288 insertions(+), 47 deletions(-)
 create mode 100644 include/qemu/pmem.h
 create mode 100644 stubs/pmem.c

-- 
2.16.1




[Qemu-devel] [PATCH v3 3/8] configure: add libpmem support

2018-02-16 Thread Haozhong Zhang
Add a pair of configure options --{enable,disable}-libpmem to control
whether QEMU is compiled with PMDK libpmem [1].

QEMU may write to the host persistent memory (e.g. in vNVDIMM label
emulation and live migration), so it must take the proper operations
to ensure the persistence of its own writes. Depending on the CPU
models and available instructions, the optimal operation can vary [2].
PMDK libpmem has already implemented those operations on multiple CPU
models (x86 and ARM) and the logic to select the optimal ones, so QEMU
can just use libpmem rather than re-implement them.

[1] PMDK (formerly known as NVML), https://github.com/pmem/pmdk/
[2] 
https://github.com/pmem/pmdk/blob/38bfa652721a37fd94c0130ce0e3f5d8baa3ed40/src/libpmem/pmem.c#L33
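For illustration only (not part of this patch), a call site that relies on
libpmem to pick the optimal flush method could look like the sketch below;
persist_copy() and its arguments are invented names:

    #include <string.h>
    #include <libpmem.h>

    /* Sketch: persist 'len' bytes from 'buf' into the mapping at 'map'. */
    static void persist_copy(void *map, const void *buf, size_t len)
    {
        if (pmem_is_pmem(map, len)) {
            pmem_memcpy_persist(map, buf, len);   /* copy + flush + drain */
        } else {
            memcpy(map, buf, len);
            pmem_msync(map, len);                 /* non-pmem: fall back to msync() */
        }
    }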

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 configure | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/configure b/configure
index 913e14839d..ba9953fffe 100755
--- a/configure
+++ b/configure
@@ -450,6 +450,7 @@ jemalloc="no"
 replication="yes"
 vxhs=""
 libxml2=""
+libpmem=""
 
 supported_cpu="no"
 supported_os="no"
@@ -1359,6 +1360,10 @@ for opt do
   ;;
   --disable-git-update) git_update=no
   ;;
+  --enable-libpmem) libpmem=yes
+  ;;
+  --disable-libpmem) libpmem=no
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1611,6 +1616,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   crypto-afalgLinux AF_ALG crypto backend driver
   vhost-user  vhost-user support
   capstonecapstone disassembler support
+  libpmem libpmem support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -5345,6 +5351,30 @@ EOF
   fi
 fi
 
+##
+# check for libpmem
+
+if test "$libpmem" != "no"; then
+  cat > $TMPC <<EOF
+#include <libpmem.h>
+int main(void)
+{
+  pmem_is_pmem(0, 0);
+  return 0;
+}
+EOF
+  libpmem_libs="-lpmem"
+  if compile_prog "" "$libpmem_libs" ; then
+libs_softmmu="$libpmem_libs $libs_softmmu"
+libpmem="yes"
+  else
+if test "$libpmem" = "yes" ; then
+  feature_not_found "libpmem" "Install nvml or pmdk"
+fi
+libpmem="no"
+  fi
+fi
+
 ##
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -5815,6 +5845,7 @@ echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
 echo "VxHS block device $vxhs"
 echo "capstone  $capstone"
+echo "libpmem support   $libpmem"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -6540,6 +6571,10 @@ if test "$vxhs" = "yes" ; then
   echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
 fi
 
+if test "$libpmem" = "yes" ; then
+  echo "CONFIG_LIBPMEM=y" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
-- 
2.16.1




[Qemu-devel] [PATCH v3 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters

2018-02-16 Thread Haozhong Zhang
As more flag parameters besides the existing 'share' are going to be
added to following functions
memory_region_init_ram_from_file
qemu_ram_alloc_from_fd
qemu_ram_alloc_from_file
, let's switch them to use the 'flags' parameters so as to ease future
flag additions.

The existing 'share' flag is converted to the QEMU_RAM_SHARE bit in
flags, and other flag bits are ignored by above functions right now.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c |  3 ++-
 exec.c  |  7 ---
 include/exec/memory.h   | 10 --
 include/exec/ram_addr.h | 25 +++--
 memory.c|  8 +---
 numa.c  |  2 +-
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 134b08d63a..30df843d90 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -58,7 +58,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 path = object_get_canonical_path(OBJECT(backend));
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
- backend->size, fb->align, backend->share,
+ backend->size, fb->align,
+ backend->share ? QEMU_RAM_SHARE : 0,
  fb->mem_path, errp);
 g_free(path);
 }
diff --git a/exec.c b/exec.c
index 4d8addb263..537bf12412 100644
--- a/exec.c
+++ b/exec.c
@@ -2000,12 +2000,13 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp, bool shared)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- bool share, int fd,
+ uint64_t flags, int fd,
  Error **errp)
 {
 RAMBlock *new_block;
 Error *local_err = NULL;
 int64_t file_size;
+bool share = flags & QEMU_RAM_SHARE;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-   bool share, const char *mem_path,
+   uint64_t flags, const char *mem_path,
Error **errp)
 {
 int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 return NULL;
 }
 
-block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
 if (!block) {
 if (created) {
 unlink(mem_path);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 15e81113ba..0fc9d23a48 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -487,6 +487,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
void *host),
Error **errp);
 #ifdef __linux__
+
+#define QEMU_RAM_SHARE  (1UL << 0)
+
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
  *mmap-ed backend.
@@ -498,7 +501,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify properties of this memory region, which can be one or bit-or
+ * of following values:
+ * - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ * Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -510,7 +516,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
   const char *name,
   uint64_t size,
   uint64_t align,
-  bool share,
+  uint64_t flags,
   const char *path,
   Error **errp);
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index cf2446a176..b8b01d1eb9 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int 
ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd:  Allocate a ram block from the specified back
+ * 

[Qemu-devel] [PATCH v3 2/8] hostmem-file: add the 'pmem' option

2018-02-16 Thread Haozhong Zhang
When QEMU emulates vNVDIMM labels and migrates vNVDIMM devices, it
needs to know whether the backend storage is a real persistent memory,
in order to decide whether special operations should be performed to
ensure the data persistence.

This boolean option 'pmem' allows users to specify whether the backend
storage of memory-backend-file is a real persistent memory. If
'pmem=on', QEMU will set the flag RAM_PMEM in the RAM block of the
corresponding memory region.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c | 26 +-
 docs/nvdimm.txt | 14 ++
 exec.c  | 16 +++-
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h |  3 +++
 qemu-options.hx |  9 -
 6 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 30df843d90..5d706d471f 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -34,6 +34,7 @@ struct HostMemoryBackendFile {
 bool discard_data;
 char *mem_path;
 uint64_t align;
+bool is_pmem;
 };
 
 static void
@@ -59,7 +60,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
  backend->size, fb->align,
- backend->share ? QEMU_RAM_SHARE : 0,
+ (backend->share ? QEMU_RAM_SHARE : 0) |
+ (fb->is_pmem ? QEMU_RAM_PMEM : 0),
  fb->mem_path, errp);
 g_free(path);
 }
@@ -131,6 +133,25 @@ static void file_memory_backend_set_align(Object *o, 
Visitor *v,
 error_propagate(errp, local_err);
 }
 
+static bool file_memory_backend_get_pmem(Object *o, Error **errp)
+{
+return MEMORY_BACKEND_FILE(o)->is_pmem;
+}
+
+static void file_memory_backend_set_pmem(Object *o, bool value, Error **errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(o);
+HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+
+if (host_memory_backend_mr_inited(backend)) {
+error_setg(errp, "cannot change property 'pmem' of %s '%s'",
+   object_get_typename(o), backend->id);
+return;
+}
+
+fb->is_pmem = value;
+}
+
 static void file_backend_unparent(Object *obj)
 {
 HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -162,6 +183,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
 file_memory_backend_get_align,
 file_memory_backend_set_align,
 NULL, NULL, &error_abort);
+object_class_property_add_bool(oc, "pmem",
+file_memory_backend_get_pmem, file_memory_backend_set_pmem,
+&error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..bcb2032672 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -153,3 +153,17 @@ guest NVDIMM region mapping structure.  This unarmed flag 
indicates
 guest software that this vNVDIMM device contains a region that cannot
 accept persistent writes. In result, for example, the guest Linux
 NVDIMM driver, marks such vNVDIMM device as read-only.
+
+If the vNVDIMM backend is on the host persistent memory that can be
+accessed in SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
+suggested to set the 'pmem' option of memory-backend-file to 'on'. When
+'pmem=on' and QEMU is built with libpmem [2] support (configured with
+--enable-libpmem), QEMU will take necessary operations to guarantee
+the persistence of its own writes to the vNVDIMM backend (e.g., in
+vNVDIMM label emulation and live migration).
+
+References
+--
+
+[1] SNIA NVM Programming Model: 
https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
+[2] PMDK: http://pmem.io/pmdk/
diff --git a/exec.c b/exec.c
index 537bf12412..4b9b4678cf 100644
--- a/exec.c
+++ b/exec.c
@@ -99,6 +99,9 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* RAM is backed by the persistent memory. */
+#define RAM_PMEM   (1 << 3)
+
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -2007,6 +2010,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 Error *local_err = NULL;
 int64_t file_size;
 bool share = flags & QEMU_RAM_SHARE;
+bool is_pmem = flags & QEMU_RAM_PMEM;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2043,7 +2047,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 new_block->mr = mr;
 new_block->used_length = size;
 new_block->max_length = size;
-new_block->flags = share ? RAM_SHARED : 0;
+new_block->flags = (share ? RAM_SHARED : 0) |
+   (is_pmem ? RAM_PMEM : 

Re: [Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation

2018-02-09 Thread Haozhong Zhang
On 02/09/18 14:27 +, Stefan Hajnoczi wrote:
> On Wed, Feb 07, 2018 at 03:33:27PM +0800, Haozhong Zhang wrote:
> > @@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice 
> > *nvdimm, const void *buf,
> >  {
> >  MemoryRegion *mr;
> >  PCDIMMDevice *dimm = PC_DIMM(nvdimm);
> > +bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
> > +"pmem", NULL);
> >  uint64_t backend_offset;
> >  
> >  nvdimm_validate_rw_label_data(nvdimm, size, offset);
> >  
> > -memcpy(nvdimm->label_data + offset, buf, size);
> > +if (!is_pmem) {
> > +memcpy(nvdimm->label_data + offset, buf, size);
> > +} else {
> > +pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
> > +}
> 
> Is this enough to prevent label corruption in case of power failure?
> 
> pmem_memcpy_persist() is not atomic.  Power failure can result in a mix
> of the old and new label data.
> 
> If we want this operation to be 100% safe there needs to be some kind of
> update protocol that makes the change atomic, like a Label A and Label B
> area with a single Label Index field that can be updated atomically to
> point to the active Label A/B area.

What this patch series guarantees is: if the guest is still alive and
running, all of its previous writes to pmem that were performed by
QEMU will still be persistent on pmem.

If a power failure happens before QEMU returns to the guest, e.g., in
the middle of the above pmem_memcpy_persist(), yes, the guest label data
may be left in an inconsistent state, but the guest also has no chance to
make progress.  That is what could happen in a non-virtualized
environment as well, and it is the responsibility of the (guest) SW to
defend against such failures, e.g., by the protocol you mentioned.
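For reference, the Label A/B + index protocol you mention could look
roughly like the sketch below; this is only an illustration of the idea
(the layout, sizes and names are invented), not something this series
implements:

    #include <stdint.h>
    #include <stddef.h>
    #include <libpmem.h>

    #define LABEL_SIZE 256            /* invented size for the sketch */

    struct label_area {               /* assumed to live in a pmem mapping */
        uint8_t  slot[2][LABEL_SIZE]; /* Label A and Label B */
        uint32_t active;              /* selects the currently valid slot */
    };

    /* Sketch: write the inactive copy first, then flip 'active' last, so a
     * power failure leaves either the old or the new label fully intact. */
    static void label_update(struct label_area *la, const void *buf, size_t len)
    {
        uint32_t next = 1 - la->active;

        pmem_memcpy_persist(la->slot[next], buf, len); /* data is durable first */
        la->active = next;                             /* small aligned store */
        pmem_persist(&la->active, sizeof(la->active)); /* make the flip durable */
    }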

Haozhong



Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote:
> > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > > > When loading a compressed page to persistent memory, flush CPU cache
> > > > after the data is decompressed. Combined with a call to pmem_drain()
> > > > at the end of memory loading, we can guarantee those compressed pages
> > > > are persistently loaded to PMEM.
> > > 
> > > Can you explain why this can use the flush and doesn't need the special
> > > memset?
> > 
> > The best approach to ensure the write persistence is to operate pmem
> > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
> > the write to pmem in this case is performed by uncompress() which is
> > implemented out of QEMU and libpmem. It may or may not use libpmem,
> > which is not controlled by QEMU. Therefore, we have to use the less
> > optimal approach, that is to flush cache for all pmem addresses that
> > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
> > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
> 
> In what way is it less optimal?
> If that's a legal thing to do, then why not just do a pmem_flush +
> pmem_drain right at the end of the ram loading and leave all the rest of
> the code untouched?

For example, the implementation of pmem_memcpy_nodrain() prefers to use
movnt (non-temporal store) instructions without a flush to write pmem if
those instructions are available, and falls back to memcpy() + flush if
movnt is not available, so I suppose the latter is less optimal.

Haozhong

> 
> Dave
> 
> > Haozhong
> > 
> > > 
> > > Dave
> > > 
> > > > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > > > ---
> > > >  include/qemu/pmem.h |  4 
> > > >  migration/ram.c | 16 +++-
> > > >  2 files changed, 15 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > > > index 77ee1fc4eb..20e3f6e71d 100644
> > > > --- a/include/qemu/pmem.h
> > > > +++ b/include/qemu/pmem.h
> > > > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void 
> > > > *pmemdest, int c, size_t len)
> > > >  return memset(pmemdest, c, len);
> > > >  }
> > > >  
> > > > +static inline void pmem_flush(const void *addr, size_t len)
> > > > +{
> > > > +}
> > > > +
> > > >  static inline void pmem_drain(void)
> > > >  {
> > > >  }
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index 5a79bbff64..924d2b9537 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -274,6 +274,7 @@ struct DecompressParam {
> > > >  void *des;
> > > >  uint8_t *compbuf;
> > > >  int len;
> > > > +bool is_pmem;
> > > >  };
> > > >  typedef struct DecompressParam DecompressParam;
> > > >  
> > > > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> > > >  DecompressParam *param = opaque;
> > > >  unsigned long pagesize;
> > > >  uint8_t *des;
> > > > -int len;
> > > > +int len, rc;
> > > >  
> > > >  qemu_mutex_lock(>mutex);
> > > >  while (!param->quit) {
> > > > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> > > >   * not a problem because the dirty page will be 
> > > > retransferred
> > > >   * and uncompress() won't break the data in other pages.
> > > >   */
> > > > -uncompress((Bytef *)des, ,
> > > > -   (const Bytef *)param->compbuf, len);
> > > > +rc = uncompress((Bytef *)des, ,
> > > > +(const Bytef *)param->compbuf, len);
> > > > +if (rc == Z_OK && param->is_pmem) {
> > > > +pmem_flush(des, len);
> > > > +}
> > > >  
> > > >  qemu_mutex_lock(_done_lock);
> > > >  param->done = true;
> > > > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> > > >  }
> > > >  
> >

Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 19:52 +0800, Haozhong Zhang wrote:
> On 02/07/18 11:38 +, Dr. David Alan Gilbert wrote:
> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > > When loading a zero page, check whether it will be loaded to
> > > persistent memory If yes, load it by libpmem function
> > > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > > end of RAM loading, we can guarantee all those zero pages are
> > > persistently loaded.
> > 
> > I'm surprised pmem is this invasive to be honest;   I hadn't expected
> > the need for special memcpy's etc everywhere.  We're bound to miss some.
> > I assume there's separate kernel work needed to make postcopy work;
> > certainly the patches here don't seem to touch the postcopy path.
> 
> This link at
> https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
> that postcopy with memory-backend-file requires kernel support. Can
> you point me the details of the required kernel support, so that I can
> understand what would be needed to NVDIMM postcopy?

I saw that test_ramblock_postcopiable() requires the ram block not be
mmap'ed with MAP_SHARED. The only pmem device that can be safely used
as the backend of a vNVDIMM (i.e., a device DAX instance such as
/dev/dax0.0) must be shared-mapped, which is required by the kernel, so
postcopy does not work with pmem right now. Even if private mmap were
supported for device DAX, it would still make little sense to
private-map it in QEMU, because a vNVDIMM is intended to be non-volatile.

Haozhong



Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > When loading a compressed page to persistent memory, flush CPU cache
> > after the data is decompressed. Combined with a call to pmem_drain()
> > at the end of memory loading, we can guarantee those compressed pages
> > are persistently loaded to PMEM.
> 
> Can you explain why this can use the flush and doesn't need the special
> memset?

The best approach to ensure write persistence is to operate on pmem
entirely via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
the write to pmem in this case is performed by uncompress(), which is
implemented outside of QEMU and libpmem. It may or may not use libpmem
internally, which QEMU cannot control. Therefore, we have to use the less
optimal approach, that is, to flush the cache for all pmem addresses that
uncompress() may have written (e.g., by memcpy() and/or memset() inside
uncompress()), using pmem_flush() + pmem_drain() in QEMU.
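In code, the two options look roughly like this (a sketch; the function
and buffer names are placeholders, and memcpy() stands in for the writes
done inside uncompress()):

    #include <string.h>
    #include <stddef.h>
    #include <libpmem.h>

    /* Preferred: libpmem itself writes the data (may use movnt stores). */
    static void write_page_via_libpmem(void *dst, const void *src, size_t len)
    {
        pmem_memcpy_nodrain(dst, src, len);
        pmem_drain();      /* in the series this drain is batched at the end */
    }

    /* Fallback used here: the data is written by third-party code, so the
     * destination range has to be flushed afterwards. */
    static void write_page_via_third_party(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, len);   /* stands in for uncompress() writing to pmem */
        pmem_flush(dst, len);
        pmem_drain();      /* likewise batched once per load in the series */
    }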

Haozhong

> 
> Dave
> 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  include/qemu/pmem.h |  4 
> >  migration/ram.c | 16 +++-
> >  2 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 77ee1fc4eb..20e3f6e71d 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, 
> > int c, size_t len)
> >  return memset(pmemdest, c, len);
> >  }
> >  
> > +static inline void pmem_flush(const void *addr, size_t len)
> > +{
> > +}
> > +
> >  static inline void pmem_drain(void)
> >  {
> >  }
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5a79bbff64..924d2b9537 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -274,6 +274,7 @@ struct DecompressParam {
> >  void *des;
> >  uint8_t *compbuf;
> >  int len;
> > +bool is_pmem;
> >  };
> >  typedef struct DecompressParam DecompressParam;
> >  
> > @@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
> >  DecompressParam *param = opaque;
> >  unsigned long pagesize;
> >  uint8_t *des;
> > -int len;
> > +int len, rc;
> >  
> >  qemu_mutex_lock(>mutex);
> >  while (!param->quit) {
> > @@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
> >   * not a problem because the dirty page will be retransferred
> >   * and uncompress() won't break the data in other pages.
> >   */
> > -uncompress((Bytef *)des, ,
> > -   (const Bytef *)param->compbuf, len);
> > +rc = uncompress((Bytef *)des, ,
> > +(const Bytef *)param->compbuf, len);
> > +if (rc == Z_OK && param->is_pmem) {
> > +pmem_flush(des, len);
> > +}
> >  
> >  qemu_mutex_lock(_done_lock);
> >  param->done = true;
> > @@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
> >  }
> >  
> >  static void decompress_data_with_multi_threads(QEMUFile *f,
> > -   void *host, int len)
> > +   void *host, int len,
> > +   bool is_pmem)
> >  {
> >  int idx, thread_count;
> >  
> > @@ -2619,6 +2624,7 @@ static void 
> > decompress_data_with_multi_threads(QEMUFile *f,
> >  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
> >  decomp_param[idx].des = host;
> >  decomp_param[idx].len = len;
> > +decomp_param[idx].is_pmem = is_pmem;
> >  qemu_cond_signal(_param[idx].cond);
> >  qemu_mutex_unlock(_param[idx].mutex);
> >  break;
> > @@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
> > version_id)
> >  ret = -EINVAL;
> >  break;
> >  }
> > -decompress_data_with_multi_threads(f, host, len);
> > +decompress_data_with_multi_threads(f, host, len, is_pmem);
> >  break;
> >  
> >  case RAM_SAVE_FLAG_XBZRLE:
> > -- 
> > 2.14.1
> > 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 11:49 +, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > When loading a normal page to persistent memory, load its data by
> > libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
> > with a call to pmem_drain() at the end of memory loading, we can
> > guarantee all those normal pages are persistenly loaded to PMEM.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  include/migration/qemu-file-types.h |  1 +
> >  include/qemu/pmem.h |  6 ++
> >  migration/qemu-file.c   | 41 
> > -
> >  migration/ram.c |  6 +-
> >  4 files changed, 43 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/migration/qemu-file-types.h 
> > b/include/migration/qemu-file-types.h
> > index bd6d7dd7f9..bb5c547498 100644
> > --- a/include/migration/qemu-file-types.h
> > +++ b/include/migration/qemu-file-types.h
> > @@ -34,6 +34,7 @@ void qemu_put_be16(QEMUFile *f, unsigned int v);
> >  void qemu_put_be32(QEMUFile *f, unsigned int v);
> >  void qemu_put_be64(QEMUFile *f, uint64_t v);
> >  size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
> > +size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size);
> >  
> >  int qemu_get_byte(QEMUFile *f);
> >  
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 861d8ecc21..77ee1fc4eb 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -26,6 +26,12 @@ pmem_memcpy_persist(void *pmemdest, const void *src, 
> > size_t len)
> >  return memcpy(pmemdest, src, len);
> >  }
> >  
> > +static inline void *
> > +pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
> > +{
> > +return memcpy(pmemdest, src, len);
> > +}
> > +
> >  static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> >  {
> >  return memset(pmemdest, c, len);
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > index 2ab2bf362d..7e573010d9 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -26,6 +26,7 @@
> >  #include "qemu-common.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/iov.h"
> > +#include "qemu/pmem.h"
> >  #include "migration.h"
> >  #include "qemu-file.h"
> >  #include "trace.h"
> > @@ -471,15 +472,8 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, 
> > size_t size, size_t offset)
> >  return size;
> >  }
> >  
> > -/*
> > - * Read 'size' bytes of data from the file into buf.
> > - * 'size' can be larger than the internal buffer.
> > - *
> > - * It will return size bytes unless there was an error, in which case it 
> > will
> > - * return as many as it managed to read (assuming blocking fd's which
> > - * all current QEMUFile are)
> > - */
> > -size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
> > +static size_t
> > +qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size, bool 
> > is_pmem)
> >  {
> >  size_t pending = size;
> >  size_t done = 0;
> > @@ -492,7 +486,11 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, 
> > size_t size)
> >  if (res == 0) {
> >  return done;
> >  }
> > -memcpy(buf, src, res);
> > +if (!is_pmem) {
> > +memcpy(buf, src, res);
> > +} else {
> > +pmem_memcpy_nodrain(buf, src, res);
> > +}
> 
> I see why you're doing it, but again I'm surprised it's ended up having
> to modify qemu-file.

Well, the current programming model of pmem requires users to take
special care of data persistence, so QEMU (as a user of pmem) has
to do that as well.
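Concretely, that special care means every store to pmem has to be followed
by a cache-line write-back and, eventually, an ordering fence before the
data can be considered durable. On x86 this boils down to roughly the
following (a sketch of what libpmem does internally when CLWB is
available; not QEMU code, and it assumes a compiler built with -mclwb):

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>

    #define CACHELINE 64

    /* Sketch: write back and fence the cache lines covering [addr, addr+len). */
    static void flush_range(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);

        for (; p < (uintptr_t)addr + len; p += CACHELINE) {
            _mm_clwb((void *)p);
        }
        _mm_sfence();
    }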

> 
> >  qemu_file_skip(f, res);
> >  buf += res;
> >  pending -= res;
> > @@ -501,6 +499,29 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, 
> > size_t size)
> >  return done;
> >  }
> >  
> > +/*
> > + * Read 'size' bytes of data from the file into buf.
> > + * 'size' can be larger than the internal buffer.
> > + *
> > + * It will return size bytes unless there was an error, in which case it 
> > will
> > + * return as many as it managed to read (assuming blocking fd's which
> > + * all current QEMUFile are)
> > + */
> > +size

Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 11:38 +, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > When loading a zero page, check whether it will be loaded to
> > persistent memory If yes, load it by libpmem function
> > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > end of RAM loading, we can guarantee all those zero pages are
> > persistently loaded.
> 
> I'm surprised pmem is this invasive to be honest;   I hadn't expected
> the need for special memcpy's etc everywhere.  We're bound to miss some.
> I assume there's separate kernel work needed to make postcopy work;
> certainly the patches here don't seem to touch the postcopy path.

This link at
https://wiki.qemu.org/Features/PostCopyLiveMigration#Conflicts shows
that postcopy with memory-backend-file requires kernel support. Can
you point me to the details of the required kernel support, so that I can
understand what would be needed for NVDIMM postcopy?

> 
> > Depending on the host HW/SW configurations, pmem_drain() can be
> > "sfence".  Therefore, we do not call pmem_drain() after each
> > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > overhead.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  include/qemu/pmem.h |  9 +
> >  migration/ram.c | 34 +-
> >  2 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 9017596ff0..861d8ecc21 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, 
> > size_t len)
> >  return memcpy(pmemdest, src, len);
> >  }
> >  
> > +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > +{
> > +return memset(pmemdest, c, len);
> > +}
> > +
> > +static inline void pmem_drain(void)
> > +{
> > +}
> > +
> >  #endif /* CONFIG_LIBPMEM */
> >  
> >  #endif /* !QEMU_PMEM_H */
> > diff --git a/migration/ram.c b/migration/ram.c
> > index cb1950f3eb..5a0e503818 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -49,6 +49,7 @@
> >  #include "qemu/rcu_queue.h"
> >  #include "migration/colo.h"
> >  #include "migration/block.h"
> > +#include "qemu/pmem.h"
> >  
> >  /***/
> >  /* ram save/restore */
> > @@ -2467,6 +2468,20 @@ static inline void 
> > *host_from_ram_block_offset(RAMBlock *block,
> >  return block->host + offset;
> >  }
> >  
> > +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t 
> > size,
> > + bool is_pmem)
> 
> I don't think there's any advantage of splitting out this _common
> routine; lets just add the parameter to ram_handle_compressed.
> 
> > +{
> > +if (!ch && is_zero_range(host, size)) {
> > +return;
> > +}
> > +
> > +if (!is_pmem) {
> > +memset(host, ch, size);
> > +} else {
> > +pmem_memset_nodrain(host, ch, size);
> > +}
> 
> I'm wondering if it would be easier to pass in a memsetfunc ptr and call
> that (defualting to memset if it's NULL).

Yes, it would be more extensible if we have other special memory
devices in the future.
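As a sketch of that idea (the signature below is invented for illustration
and is not what the posted patch does):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    bool is_zero_range(uint8_t *p, uint64_t size);   /* existing QEMU helper */

    typedef void *(*memset_func)(void *s, int c, size_t n);

    /* Sketch: the caller picks the memset variant; NULL means plain memset().
     * pmem_memset_nodrain() has a compatible signature and could be passed
     * for pmem-backed blocks. */
    static void ram_handle_compressed_f(void *host, uint8_t ch, uint64_t size,
                                        memset_func f)
    {
        if (ch == 0 && is_zero_range(host, size)) {
            return;
        }
        (f ? f : memset)(host, ch, size);
    }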

Thanks,
Haozhong

> 
> > +}
> > +
> >  /**
> >   * ram_handle_compressed: handle the zero page case
> >   *
> > @@ -2479,9 +2494,7 @@ static inline void 
> > *host_from_ram_block_offset(RAMBlock *block,
> >   */
> >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
> >  {
> > -if (ch != 0 || !is_zero_range(host, size)) {
> > -memset(host, ch, size);
> > -}
> > +return ram_handle_compressed_common(host, ch, size, false);
> >  }
> >  
> >  static void *do_data_decompress(void *opaque)
> > @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
> > version_id)
> >  bool postcopy_running = postcopy_is_running();
> >  /* ADVISE is earlier, it shows the source has the postcopy capability 
> > on */
> >  bool postcopy_advised = postcopy_is_advised();
> > +bool need_pmem_drain = false;
> >  
> >  seq_iter++;
> >  
> > @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, in

Re: [Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-07 Thread Haozhong Zhang
On 02/07/18 05:17 -0500, Pankaj Gupta wrote:
> 
> > 
> > When loading a zero page, check whether it will be loaded to
> > persistent memory If yes, load it by libpmem function
> > pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
> > end of RAM loading, we can guarantee all those zero pages are
> > persistently loaded.
> > 
> > Depending on the host HW/SW configurations, pmem_drain() can be
> > "sfence".  Therefore, we do not call pmem_drain() after each
> > pmem_memset_nodrain(), or use pmem_memset_persist() (equally
> > pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
> > overhead.
> 
> Are you saying we don't need 'pmem_drain()'?

pmem_drain() is called only once after all ram blocks are loaded.

> 
> I can see its called in patch 5 in ram_load? But
> I also see empty definition. Anything I am missing here?

Functions defined in include/qemu/pmem.h are stubs and are used only when
QEMU is not compiled with libpmem. When QEMU is configured with
--enable-libpmem, the ones in libpmem are used.

Haozhong

> 
> Thanks,
> Pankaj
> 
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  include/qemu/pmem.h |  9 +
> >  migration/ram.c | 34 +-
> >  2 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
> > index 9017596ff0..861d8ecc21 100644
> > --- a/include/qemu/pmem.h
> > +++ b/include/qemu/pmem.h
> > @@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src,
> > size_t len)
> >  return memcpy(pmemdest, src, len);
> >  }
> >  
> > +static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
> > +{
> > +return memset(pmemdest, c, len);
> > +}
> > +
> > +static inline void pmem_drain(void)
> > +{
> > +}
> > +
> >  #endif /* CONFIG_LIBPMEM */
> >  
> >  #endif /* !QEMU_PMEM_H */
> > diff --git a/migration/ram.c b/migration/ram.c
> > index cb1950f3eb..5a0e503818 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -49,6 +49,7 @@
> >  #include "qemu/rcu_queue.h"
> >  #include "migration/colo.h"
> >  #include "migration/block.h"
> > +#include "qemu/pmem.h"
> >  
> >  /***/
> >  /* ram save/restore */
> > @@ -2467,6 +2468,20 @@ static inline void
> > *host_from_ram_block_offset(RAMBlock *block,
> >  return block->host + offset;
> >  }
> >  
> > +static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t
> > size,
> > + bool is_pmem)
> > +{
> > +if (!ch && is_zero_range(host, size)) {
> > +return;
> > +}
> > +
> > +if (!is_pmem) {
> > +memset(host, ch, size);
> > +} else {
> > +pmem_memset_nodrain(host, ch, size);
> > +}
> > +}
> > +
> >  /**
> >   * ram_handle_compressed: handle the zero page case
> >   *
> > @@ -2479,9 +2494,7 @@ static inline void 
> > *host_from_ram_block_offset(RAMBlock
> > *block,
> >   */
> >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
> >  {
> > -if (ch != 0 || !is_zero_range(host, size)) {
> > -memset(host, ch, size);
> > -}
> > +return ram_handle_compressed_common(host, ch, size, false);
> >  }
> >  
> >  static void *do_data_decompress(void *opaque)
> > @@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
> > version_id)
> >  bool postcopy_running = postcopy_is_running();
> >  /* ADVISE is earlier, it shows the source has the postcopy capability 
> > on
> >  */
> >  bool postcopy_advised = postcopy_is_advised();
> > +bool need_pmem_drain = false;
> >  
> >  seq_iter++;
> >  
> > @@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int
> > version_id)
> >  ram_addr_t addr, total_ram_bytes;
> >  void *host = NULL;
> >  uint8_t ch;
> > +RAMBlock *block = NULL;
> > +bool is_pmem = false;
> >  
> >  addr = qemu_get_be64(f);
> >  flags = addr & ~TARGET_PAGE_MASK;
> > @@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
> > version_id)
> >  
> >  if (flags & (RAM_SAVE_F

[Qemu-devel] [PATCH v2 4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation

2018-02-06 Thread Haozhong Zhang
Guest writes to vNVDIMM labels are intercepted and performed on the
backend by QEMU. When the backend is a real persistent memory, QEMU
needs to take proper operations to ensure its write persistence on the
persistent memory. Otherwise, a host power failure may result in the
loss of guest label configurations.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 hw/mem/nvdimm.c |  9 -
 include/qemu/pmem.h | 31 +++
 2 files changed, 39 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/pmem.h

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 61e677f92f..18861d1a7a 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/pmem.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
@@ -156,11 +157,17 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 {
 MemoryRegion *mr;
 PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+bool is_pmem = object_property_get_bool(OBJECT(dimm->hostmem),
+"pmem", NULL);
 uint64_t backend_offset;
 
 nvdimm_validate_rw_label_data(nvdimm, size, offset);
 
-memcpy(nvdimm->label_data + offset, buf, size);
+if (!is_pmem) {
+memcpy(nvdimm->label_data + offset, buf, size);
+} else {
+pmem_memcpy_persist(nvdimm->label_data + offset, buf, size);
+}
 
 mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
 backend_offset = memory_region_size(mr) - nvdimm->label_size + offset;
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
new file mode 100644
index 00..9017596ff0
--- /dev/null
+++ b/include/qemu/pmem.h
@@ -0,0 +1,31 @@
+/*
+ * Stub functions for libpmem.
+ *
+ * Copyright (c) 2018 Intel Corporation.
+ *
+ * Author: Haozhong Zhang <haozhong.zh...@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_PMEM_H
+#define QEMU_PMEM_H
+
+#ifdef CONFIG_LIBPMEM
+#include <libpmem.h>
+#else  /* !CONFIG_LIBPMEM */
+
+#include <string.h>
+
+/* Stubs */
+
+static inline void *
+pmem_memcpy_persist(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
+
+#endif /* CONFIG_LIBPMEM */
+
+#endif /* !QEMU_PMEM_H */
-- 
2.14.1




[Qemu-devel] [PATCH v2 3/8] configure: add libpmem support

2018-02-06 Thread Haozhong Zhang
Add a pair of configure options --{enable,disable}-libpmem to control
whether QEMU is compiled with PMDK libpmem [1].

QEMU may write to the host persistent memory (e.g. in vNVDIMM label
emulation and live migration), so it must take the proper operations
to ensure the persistence of its own writes. Depending on the CPU
models and available instructions, the optimal operation can vary [2].
PMDK libpmem has already implemented those operations on multiple CPU
models (x86 and ARM) and the logic to select the optimal ones, so QEMU
can just use libpmem rather than re-implement them.

[1] PMDK (formerly known as NVML), https://github.com/pmem/pmdk/
[2] 
https://github.com/pmem/pmdk/blob/38bfa652721a37fd94c0130ce0e3f5d8baa3ed40/src/libpmem/pmem.c#L33

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 configure | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/configure b/configure
index 302fdc92ff..595967e5df 100755
--- a/configure
+++ b/configure
@@ -436,6 +436,7 @@ jemalloc="no"
 replication="yes"
 vxhs=""
 libxml2=""
+libpmem=""
 
 supported_cpu="no"
 supported_os="no"
@@ -1341,6 +1342,10 @@ for opt do
   ;;
   --disable-git-update) git_update=no
   ;;
+  --enable-libpmem) libpmem=yes
+  ;;
+  --disable-libpmem) libpmem=no
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1592,6 +1597,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   crypto-afalgLinux AF_ALG crypto backend driver
   vhost-user  vhost-user support
   capstonecapstone disassembler support
+  libpmem libpmem support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -5205,6 +5211,30 @@ if compile_prog "" "" ; then
 have_utmpx=yes
 fi
 
+##
+# check for libpmem
+
+if test "$libpmem" != "no"; then
+  cat > $TMPC <<EOF
+#include <libpmem.h>
+int main(void)
+{
+  pmem_is_pmem(0, 0);
+  return 0;
+}
+EOF
+  libpmem_libs="-lpmem"
+  if compile_prog "" "$libpmem_libs" ; then
+libs_softmmu="$libpmem_libs $libs_softmmu"
+libpmem="yes"
+  else
+if test "$libpmem" = "yes" ; then
+  feature_not_found "libpmem" "Install nvml or pmdk"
+fi
+libpmem="no"
+  fi
+fi
+
 ##
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -5657,6 +5687,7 @@ echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
 echo "VxHS block device $vxhs"
 echo "capstone  $capstone"
+echo "libpmem support   $libpmem"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -6374,6 +6405,10 @@ if test "$vxhs" = "yes" ; then
   echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
 fi
 
+if test "$libpmem" = "yes" ; then
+  echo "CONFIG_LIBPMEM=y" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
-- 
2.14.1




[Qemu-devel] [PATCH v2 6/8] migration/ram: ensure write persistence on loading normal pages to PMEM

2018-02-06 Thread Haozhong Zhang
When loading a normal page to persistent memory, load its data by
libpmem function pmem_memcpy_nodrain() instead of memcpy(). Combined
with a call to pmem_drain() at the end of memory loading, we can
guarantee all those normal pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/migration/qemu-file-types.h |  1 +
 include/qemu/pmem.h |  6 ++
 migration/qemu-file.c   | 41 -
 migration/ram.c |  6 +-
 4 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/include/migration/qemu-file-types.h 
b/include/migration/qemu-file-types.h
index bd6d7dd7f9..bb5c547498 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -34,6 +34,7 @@ void qemu_put_be16(QEMUFile *f, unsigned int v);
 void qemu_put_be32(QEMUFile *f, unsigned int v);
 void qemu_put_be64(QEMUFile *f, uint64_t v);
 size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size);
+size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size);
 
 int qemu_get_byte(QEMUFile *f);
 
diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 861d8ecc21..77ee1fc4eb 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -26,6 +26,12 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t 
len)
 return memcpy(pmemdest, src, len);
 }
 
+static inline void *
+pmem_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
+{
+return memcpy(pmemdest, src, len);
+}
+
 static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
 {
 return memset(pmemdest, c, len);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..7e573010d9 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -26,6 +26,7 @@
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
+#include "qemu/pmem.h"
 #include "migration.h"
 #include "qemu-file.h"
 #include "trace.h"
@@ -471,15 +472,8 @@ size_t qemu_peek_buffer(QEMUFile *f, uint8_t **buf, size_t 
size, size_t offset)
 return size;
 }
 
-/*
- * Read 'size' bytes of data from the file into buf.
- * 'size' can be larger than the internal buffer.
- *
- * It will return size bytes unless there was an error, in which case it will
- * return as many as it managed to read (assuming blocking fd's which
- * all current QEMUFile are)
- */
-size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+static size_t
+qemu_get_buffer_common(QEMUFile *f, uint8_t *buf, size_t size, bool is_pmem)
 {
 size_t pending = size;
 size_t done = 0;
@@ -492,7 +486,11 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 if (res == 0) {
 return done;
 }
-memcpy(buf, src, res);
+if (!is_pmem) {
+memcpy(buf, src, res);
+} else {
+pmem_memcpy_nodrain(buf, src, res);
+}
 qemu_file_skip(f, res);
 buf += res;
 pending -= res;
@@ -501,6 +499,29 @@ size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t 
size)
 return done;
 }
 
+/*
+ * Read 'size' bytes of data from the file into buf.
+ * 'size' can be larger than the internal buffer.
+ *
+ * It will return size bytes unless there was an error, in which case it will
+ * return as many as it managed to read (assuming blocking fd's which
+ * all current QEMUFile are)
+ */
+size_t qemu_get_buffer(QEMUFile *f, uint8_t *buf, size_t size)
+{
+return qemu_get_buffer_common(f, buf, size, false);
+}
+
+/*
+ * Mostly the same as qemu_get_buffer(), except that
+ * 1) it's for the case that 'buf' is in the persistent memory, and
+ * 2) it takes necessary operations to ensure the data persistence in 'buf'.
+ */
+size_t qemu_get_buffer_to_pmem(QEMUFile *f, uint8_t *buf, size_t size)
+{
+return qemu_get_buffer_common(f, buf, size, true);
+}
+
 /*
  * Read 'size' bytes of data from the file.
  * 'size' can be larger than the internal buffer.
diff --git a/migration/ram.c b/migration/ram.c
index 5a0e503818..5a79bbff64 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2950,7 +2950,11 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_PAGE:
-qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+if (!is_pmem) {
+qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+} else {
+qemu_get_buffer_to_pmem(f, host, TARGET_PAGE_SIZE);
+}
 break;
 
 case RAM_SAVE_FLAG_COMPRESS_PAGE:
-- 
2.14.1




[Qemu-devel] [PATCH v2 8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

2018-02-06 Thread Haozhong Zhang
When loading a xbzrle encoded page to persistent memory, load the data
via libpmem function pmem_memcpy_nodrain() instead of memcpy().
Combined with a call to pmem_drain() at the end of memory loading, we
can guarantee those xbzrle encoded pages are persistently loaded to PMEM.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 migration/ram.c| 15 ++-
 migration/xbzrle.c | 20 ++--
 migration/xbzrle.h |  1 +
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 924d2b9537..87f977617d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2388,10 +2388,10 @@ static void ram_save_pending(QEMUFile *f, void *opaque, 
uint64_t max_size,
 }
 }
 
-static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
+static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host, bool is_pmem)
 {
 unsigned int xh_len;
-int xh_flags;
+int xh_flags, rc;
 uint8_t *loaded_data;
 
 /* extract RLE header */
@@ -2413,8 +2413,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, 
void *host)
 qemu_get_buffer_in_place(f, &loaded_data, xh_len);
 
 /* decode RLE */
-if (xbzrle_decode_buffer(loaded_data, xh_len, host,
- TARGET_PAGE_SIZE) == -1) {
+if (!is_pmem) {
+rc = xbzrle_decode_buffer(loaded_data, xh_len, host, TARGET_PAGE_SIZE);
+} else {
+rc = xbzrle_decode_buffer_to_pmem(loaded_data, xh_len, host,
+  TARGET_PAGE_SIZE);
+}
+if (rc == -1) {
 error_report("Failed to load XBZRLE page - decode error!");
 return -1;
 }
@@ -2974,7 +2979,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
-if (load_xbzrle(f, addr, host) < 0) {
+if (load_xbzrle(f, addr, host, is_pmem) < 0) {
 error_report("Failed to decompress XBZRLE page at "
  RAM_ADDR_FMT, addr);
 ret = -EINVAL;
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..499d8e1bfb 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -12,6 +12,7 @@
  */
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/pmem.h"
 #include "xbzrle.h"
 
 /*
@@ -126,7 +127,8 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t 
*new_buf, int slen,
 return d;
 }
 
-int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+static int xbzrle_decode_buffer_common(uint8_t *src, int slen, uint8_t *dst,
+   int dlen, bool is_pmem)
 {
 int i = 0, d = 0;
 int ret;
@@ -167,10 +169,24 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t 
*dst, int dlen)
 return -1;
 }
 
-memcpy(dst + d, src + i, count);
+if (!is_pmem) {
+memcpy(dst + d, src + i, count);
+} else {
+pmem_memcpy_nodrain(dst + d, src + i, count);
+}
 d += count;
 i += count;
 }
 
 return d;
 }
+
+int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
+{
+return xbzrle_decode_buffer_common(src, slen, dst, dlen, false);
+}
+
+int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int 
dlen)
+{
+return xbzrle_decode_buffer_common(src, slen, dst, dlen, true);
+}
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..ac5ae32666 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -18,4 +18,5 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, 
int slen,
  uint8_t *dst, int dlen);
 
 int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+int xbzrle_decode_buffer_to_pmem(uint8_t *src, int slen, uint8_t *dst, int 
dlen);
 #endif
-- 
2.14.1




[Qemu-devel] [PATCH v2 2/8] hostmem-file: add the 'pmem' option

2018-02-06 Thread Haozhong Zhang
When QEMU emulates vNVDIMM labels and migrates vNVDIMM devices, it
needs to know whether the backend storage is a real persistent memory,
in order to decide whether special operations should be performed to
ensure the data persistence.

This boolean option 'pmem' allows users to specify whether the backend
storage of memory-backend-file is a real persistent memory. If
'pmem=on', QEMU will set the flag RAM_PMEM in the RAM block of the
corresponding memory region.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c | 26 +-
 docs/nvdimm.txt | 14 ++
 exec.c  | 16 +++-
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h |  3 +++
 qemu-options.hx |  9 -
 6 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 30df843d90..5d706d471f 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -34,6 +34,7 @@ struct HostMemoryBackendFile {
 bool discard_data;
 char *mem_path;
 uint64_t align;
+bool is_pmem;
 };
 
 static void
@@ -59,7 +60,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
  backend->size, fb->align,
- backend->share ? QEMU_RAM_SHARE : 0,
+ (backend->share ? QEMU_RAM_SHARE : 0) |
+ (fb->is_pmem ? QEMU_RAM_PMEM : 0),
  fb->mem_path, errp);
 g_free(path);
 }
@@ -131,6 +133,25 @@ static void file_memory_backend_set_align(Object *o, 
Visitor *v,
 error_propagate(errp, local_err);
 }
 
+static bool file_memory_backend_get_pmem(Object *o, Error **errp)
+{
+return MEMORY_BACKEND_FILE(o)->is_pmem;
+}
+
+static void file_memory_backend_set_pmem(Object *o, bool value, Error **errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(o);
+HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+
+if (host_memory_backend_mr_inited(backend)) {
+error_setg(errp, "cannot change property 'pmem' of %s '%s'",
+   object_get_typename(o), backend->id);
+return;
+}
+
+fb->is_pmem = value;
+}
+
 static void file_backend_unparent(Object *obj)
 {
 HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -162,6 +183,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
 file_memory_backend_get_align,
 file_memory_backend_set_align,
 NULL, NULL, &error_abort);
+object_class_property_add_bool(oc, "pmem",
+file_memory_backend_get_pmem, file_memory_backend_set_pmem,
+&error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..bcb2032672 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -153,3 +153,17 @@ guest NVDIMM region mapping structure.  This unarmed flag 
indicates
 guest software that this vNVDIMM device contains a region that cannot
 accept persistent writes. In result, for example, the guest Linux
 NVDIMM driver, marks such vNVDIMM device as read-only.
+
+If the vNVDIMM backend is on the host persistent memory that can be
+accessed in SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
+suggested to set the 'pmem' option of memory-backend-file to 'on'. When
+'pmem=on' and QEMU is built with libpmem [2] support (configured with
+--enable-libpmem), QEMU will take necessary operations to guarantee
+the persistence of its own writes to the vNVDIMM backend (e.g., in
+vNVDIMM label emulation and live migration).
+
+References
+--
+
+[1] SNIA NVM Programming Model: 
https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
+[2] PMDK: http://pmem.io/pmdk/
diff --git a/exec.c b/exec.c
index 16b373a86b..1d83441afe 100644
--- a/exec.c
+++ b/exec.c
@@ -99,6 +99,9 @@ static MemoryRegion io_mem_unassigned;
  */
 #define RAM_RESIZEABLE (1 << 2)
 
+/* RAM is backed by the persistent memory. */
+#define RAM_PMEM   (1 << 3)
+
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -2007,6 +2010,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 Error *local_err = NULL;
 int64_t file_size;
 bool share = flags & QEMU_RAM_SHARE;
+bool is_pmem = flags & QEMU_RAM_PMEM;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2043,7 +2047,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 new_block->mr = mr;
 new_block->used_length = size;
 new_block->max_length = size;
-new_block->flags = share ? RAM_SHARED : 0;
+new_block->flags = (share ? RAM_SHARED : 0) |
+   (is_pmem ? RAM_PMEM : 

[Qemu-devel] [PATCH v2 5/8] migration/ram: ensure write persistence on loading zero pages to PMEM

2018-02-06 Thread Haozhong Zhang
When loading a zero page, check whether it will be loaded to
persistent memory. If yes, load it by the libpmem function
pmem_memset_nodrain().  Combined with a call to pmem_drain() at the
end of RAM loading, we can guarantee all those zero pages are
persistently loaded.

Depending on the host HW/SW configurations, pmem_drain() can be
"sfence".  Therefore, we do not call pmem_drain() after each
pmem_memset_nodrain(), or use pmem_memset_persist() (equally
pmem_memset_nodrain() + pmem_drain()), in order to avoid unnecessary
overhead.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  9 +
 migration/ram.c | 34 +-
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 9017596ff0..861d8ecc21 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -26,6 +26,15 @@ pmem_memcpy_persist(void *pmemdest, const void *src, size_t 
len)
 return memcpy(pmemdest, src, len);
 }
 
+static inline void *pmem_memset_nodrain(void *pmemdest, int c, size_t len)
+{
+return memset(pmemdest, c, len);
+}
+
+static inline void pmem_drain(void)
+{
+}
+
 #endif /* CONFIG_LIBPMEM */
 
 #endif /* !QEMU_PMEM_H */
diff --git a/migration/ram.c b/migration/ram.c
index cb1950f3eb..5a0e503818 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -49,6 +49,7 @@
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
 #include "migration/block.h"
+#include "qemu/pmem.h"
 
 /***/
 /* ram save/restore */
@@ -2467,6 +2468,20 @@ static inline void *host_from_ram_block_offset(RAMBlock 
*block,
 return block->host + offset;
 }
 
+static void ram_handle_compressed_common(void *host, uint8_t ch, uint64_t size,
+ bool is_pmem)
+{
+if (!ch && is_zero_range(host, size)) {
+return;
+}
+
+if (!is_pmem) {
+memset(host, ch, size);
+} else {
+pmem_memset_nodrain(host, ch, size);
+}
+}
+
 /**
  * ram_handle_compressed: handle the zero page case
  *
@@ -2479,9 +2494,7 @@ static inline void *host_from_ram_block_offset(RAMBlock 
*block,
  */
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
 {
-if (ch != 0 || !is_zero_range(host, size)) {
-memset(host, ch, size);
-}
+return ram_handle_compressed_common(host, ch, size, false);
 }
 
 static void *do_data_decompress(void *opaque)
@@ -2823,6 +2836,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 bool postcopy_running = postcopy_is_running();
 /* ADVISE is earlier, it shows the source has the postcopy capability on */
 bool postcopy_advised = postcopy_is_advised();
+bool need_pmem_drain = false;
 
 seq_iter++;
 
@@ -2848,6 +2862,8 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ram_addr_t addr, total_ram_bytes;
 void *host = NULL;
 uint8_t ch;
+RAMBlock *block = NULL;
+bool is_pmem = false;
 
 addr = qemu_get_be64(f);
 flags = addr & ~TARGET_PAGE_MASK;
@@ -2864,7 +2880,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
  RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-RAMBlock *block = ram_block_from_stream(f, flags);
+block = ram_block_from_stream(f, flags);
 
 host = host_from_ram_block_offset(block, addr);
 if (!host) {
@@ -2874,6 +2890,9 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 ramblock_recv_bitmap_set(block, host);
 trace_ram_load_loop(block->idstr, (uint64_t)addr, flags, host);
+
+is_pmem = ramblock_is_pmem(block);
+need_pmem_drain = need_pmem_drain || is_pmem;
 }
 
 switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
@@ -2927,7 +2946,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 
 case RAM_SAVE_FLAG_ZERO:
 ch = qemu_get_byte(f);
-ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
+ram_handle_compressed_common(host, ch, TARGET_PAGE_SIZE, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_PAGE:
@@ -2970,6 +2989,11 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 }
 
 wait_for_decompress_done();
+
+if (need_pmem_drain) {
+pmem_drain();
+}
+
 rcu_read_unlock();
 trace_ram_load_complete(ret, seq_iter);
 return ret;
-- 
2.14.1




[Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-06 Thread Haozhong Zhang
When loading a compressed page to persistent memory, flush CPU cache
after the data is decompressed. Combined with a call to pmem_drain()
at the end of memory loading, we can guarantee those compressed pages
are persistently loaded to PMEM.
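
For illustration, the per-page handling added here amounts to the sketch
below (is_pmem comes from ramblock_is_pmem() in this series; the helper name
is made up):

    #include <stdbool.h>
    #include <stddef.h>
    #include <libpmem.h>

    /* Sketch: after a page is decompressed into PMEM-backed guest RAM, flush
     * its cache lines now; a single pmem_drain() at the end of ram_load()
     * (see patch 5/8) then makes everything flushed so far persistent. */
    static void post_decompress(void *host, size_t len, bool is_pmem)
    {
        if (is_pmem) {
            pmem_flush(host, len);
        }
    }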

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/qemu/pmem.h |  4 
 migration/ram.c | 16 +++-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/qemu/pmem.h b/include/qemu/pmem.h
index 77ee1fc4eb..20e3f6e71d 100644
--- a/include/qemu/pmem.h
+++ b/include/qemu/pmem.h
@@ -37,6 +37,10 @@ static inline void *pmem_memset_nodrain(void *pmemdest, int 
c, size_t len)
 return memset(pmemdest, c, len);
 }
 
+static inline void pmem_flush(const void *addr, size_t len)
+{
+}
+
 static inline void pmem_drain(void)
 {
 }
diff --git a/migration/ram.c b/migration/ram.c
index 5a79bbff64..924d2b9537 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -274,6 +274,7 @@ struct DecompressParam {
 void *des;
 uint8_t *compbuf;
 int len;
+bool is_pmem;
 };
 typedef struct DecompressParam DecompressParam;
 
@@ -2502,7 +2503,7 @@ static void *do_data_decompress(void *opaque)
 DecompressParam *param = opaque;
 unsigned long pagesize;
 uint8_t *des;
-int len;
+int len, rc;
 
 qemu_mutex_lock(&param->mutex);
 while (!param->quit) {
@@ -2518,8 +2519,11 @@ static void *do_data_decompress(void *opaque)
  * not a problem because the dirty page will be retransferred
  * and uncompress() won't break the data in other pages.
  */
-uncompress((Bytef *)des, &pagesize,
-   (const Bytef *)param->compbuf, len);
+rc = uncompress((Bytef *)des, &pagesize,
+                (const Bytef *)param->compbuf, len);
+if (rc == Z_OK && param->is_pmem) {
+pmem_flush(des, len);
+}
 
 qemu_mutex_lock(&decomp_done_lock);
 param->done = true;
@@ -2605,7 +2609,8 @@ static void compress_threads_load_cleanup(void)
 }
 
 static void decompress_data_with_multi_threads(QEMUFile *f,
-   void *host, int len)
+   void *host, int len,
+   bool is_pmem)
 {
 int idx, thread_count;
 
@@ -2619,6 +2624,7 @@ static void decompress_data_with_multi_threads(QEMUFile 
*f,
 qemu_get_buffer(f, decomp_param[idx].compbuf, len);
 decomp_param[idx].des = host;
 decomp_param[idx].len = len;
+decomp_param[idx].is_pmem = is_pmem;
 qemu_cond_signal(&decomp_param[idx].cond);
 qemu_mutex_unlock(&decomp_param[idx].mutex);
 break;
@@ -2964,7 +2970,7 @@ static int ram_load(QEMUFile *f, void *opaque, int 
version_id)
 ret = -EINVAL;
 break;
 }
-decompress_data_with_multi_threads(f, host, len);
+decompress_data_with_multi_threads(f, host, len, is_pmem);
 break;
 
 case RAM_SAVE_FLAG_XBZRLE:
-- 
2.14.1




[Qemu-devel] [PATCH v2 1/8] memory, exec: switch file ram allocation functions to 'flags' parameters

2018-02-06 Thread Haozhong Zhang
As more flag parameters besides the existing 'share' are going to be
added to the following functions
memory_region_init_ram_from_file
qemu_ram_alloc_from_fd
qemu_ram_alloc_from_file
, let's switch them to use a 'flags' parameter so as to ease future
flag additions.

The existing 'share' flag is converted to the QEMU_RAM_SHARE bit in
flags, and other flag bits are ignored by the above functions for now.
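
For example, a caller that used to pass a bool now builds and tests the bit
roughly as in the sketch below (the helper names are made up for
illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define QEMU_RAM_SHARE  (1UL << 0)

    /* Sketch of the bool <-> flags conversion done at the call sites. */
    static uint64_t ram_flags_from_share(bool share)
    {
        return share ? QEMU_RAM_SHARE : 0;
    }

    static bool ram_flags_is_shared(uint64_t flags)
    {
        return (flags & QEMU_RAM_SHARE) != 0;
    }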

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c |  3 ++-
 exec.c  |  7 ---
 include/exec/memory.h   | 10 --
 include/exec/ram_addr.h | 25 +++--
 memory.c|  8 +---
 numa.c  |  2 +-
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 134b08d63a..30df843d90 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -58,7 +58,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 path = object_get_canonical_path(OBJECT(backend));
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
- backend->size, fb->align, backend->share,
+ backend->size, fb->align,
+ backend->share ? QEMU_RAM_SHARE : 0,
  fb->mem_path, errp);
 g_free(path);
 }
diff --git a/exec.c b/exec.c
index 5e56efefeb..16b373a86b 100644
--- a/exec.c
+++ b/exec.c
@@ -2000,12 +2000,13 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp, bool shared)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- bool share, int fd,
+ uint64_t flags, int fd,
  Error **errp)
 {
 RAMBlock *new_block;
 Error *local_err = NULL;
 int64_t file_size;
+bool share = flags & QEMU_RAM_SHARE;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-   bool share, const char *mem_path,
+   uint64_t flags, const char *mem_path,
Error **errp)
 {
 int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 return NULL;
 }
 
-block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
 if (!block) {
 if (created) {
 unlink(mem_path);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1b02bbd334..d87258b6ae 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -479,6 +479,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
void *host),
Error **errp);
 #ifdef __linux__
+
+#define QEMU_RAM_SHARE  (1UL << 0)
+
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
  *mmap-ed backend.
@@ -490,7 +493,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify properties of this memory region, which can be one or bit-or
+ * of following values:
+ * - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ * Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -502,7 +508,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
   const char *name,
   uint64_t size,
   uint64_t align,
-  bool share,
+  uint64_t flags,
   const char *path,
   Error **errp);
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index cf2446a176..b8b01d1eb9 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int 
ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd:  Allocate a ram block from the specified back
+ * 

[Qemu-devel] [PATCH v2 0/8] nvdimm: guarantee persistence of QEMU writes to persistent memory

2018-02-06 Thread Haozhong Zhang
This v2 patch series extends v1 [1] by covering the migration path as
well.

QEMU writes to vNVDIMM backends in the vNVDIMM label emulation and
live migration. If the backend is on the persistent memory, QEMU needs
to take proper operations to ensure its writes are persistent on the
persistent memory. Otherwise, a host power failure may result in the
loss of the guest data on the persistent memory.
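
As a concrete illustration of one such write path, the label emulation case
(patch 4/8) reduces to a sketch like the one below; the function name and
the is_pmem flag are illustrative only, the libpmem call is real:

    #include <string.h>
    #include <libpmem.h>

    /* Sketch: persist a guest label write.  On a PMEM backend, use libpmem's
     * memcpy + flush + drain helper; otherwise a plain memcpy() suffices. */
    static void write_label_data(void *label_area, const void *buf, size_t len,
                                 int is_pmem)
    {
        if (is_pmem) {
            pmem_memcpy_persist(label_area, buf, len);
        } else {
            memcpy(label_area, buf, len);
        }
    }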

This patch series is based on Marcel's patch "mem: add share parameter
to memory-backend-ram" [2] because of the changes in patch 1.

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg05040.html
[2] http://lists.gnu.org/archive/html/qemu-devel/2018-02/msg00768.html

Changes in v2:
 * (Patch 1) Use a flags parameter in file ram allocation functions.
 * (Patch 2) Add a new option 'pmem' to hostmem-file.
 * (Patch 3) Use libpmem to operate on the persistent memory, rather
   than re-implementing those operations in QEMU.
 * (Patch 5-8) Consider the write persistence in the migration path.

Haozhong Zhang (8):
 [1/8] memory, exec: switch file ram allocation functions to 'flags' parameters
 [2/8] hostmem-file: add the 'pmem' option
 [3/8] configure: add libpmem support
 [4/8] mem/nvdimm: ensure write persistence to PMEM in label emulation
 [5/8] migration/ram: ensure write persistence on loading zero pages to PMEM
 [6/8] migration/ram: ensure write persistence on loading normal pages to PMEM
 [7/8] migration/ram: ensure write persistence on loading compressed pages to 
PMEM
 [8/8] migration/ram: ensure write persistence on loading xbzrle pages to PMEM

 backends/hostmem-file.c | 27 +-
 configure   | 35 ++
 docs/nvdimm.txt | 14 
 exec.c  | 23 +---
 hw/mem/nvdimm.c |  9 -
 include/exec/memory.h   | 12 +--
 include/exec/ram_addr.h | 28 +--
 include/migration/qemu-file-types.h |  1 +
 include/qemu/pmem.h | 50 ++
 memory.c|  8 +++--
 migration/qemu-file.c   | 41 +++--
 migration/ram.c | 71 -
 migration/xbzrle.c  | 20 +--
 migration/xbzrle.h  |  1 +
 numa.c  |  2 +-
 qemu-options.hx |  9 -
 16 files changed, 308 insertions(+), 43 deletions(-)
 create mode 100644 include/qemu/pmem.h

-- 
2.14.1




Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-02-01 Thread Haozhong Zhang
On 01/31/18 19:02 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
> > + vfio maintainer Alex Williamson in case my understanding of vfio is 
> > incorrect.
> >
> > On 01/31/18 16:32 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> >> <haozhong.zh...@intel.com> wrote:
> >> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> >> <haozhong.zh...@intel.com> wrote:
> >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> >> <haozhong.zh...@intel.com> wrote:
> >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used 
> >> >> >> > to
> >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX 
> >> >> >> > (e.g.,
> >> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >> >>
> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> >> metadata is in sync after a fault. However, that does not make
> >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need 
> >> >> >> to
> >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> >> pages to be mapped in EPT entries unless / until we have a solution 
> >> >> >> to
> >> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> >> earlier.
> >> >> >
> >> >> > QEMU does not truncate or punch holes of the file once it has been
> >> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >> >>
> >> >> Something else on the system might. The only agent that could enforce
> >> >> protection is the kernel, and the kernel will likely just disallow
> >> >> passing addresses from filesystem-dax vmas through to a guest
> >> >> altogether. I think there's even a problem in the non-DAX case unless
> >> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> >> is that we don't have a page cache page to pin in the DAX case.
> >> >>
> >> >
> >> > Does it mean any user-space code like
> >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >> >   // make DMA to ptr
> >> > is unsafe?
> >>
> >> Yes, it is currently unsafe because there is no coordination with the
> >> filesytem if it decides to make block layout changes. We can fix that
> >> in the non-virtualization case by having the filesystem wait for DMA
> >> completion callbacks (i.e. what for all pages to be idle), but as far
> >> as I can see we can't do the same coordination for DMA initiated by a
> >> guest device driver.
> >>
> >
> > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > performed on two types of devices:
> >
> > 1. For emulated devices, the guest DMA requests are trapped and
> >actually performed by QEMU on the host side. The host side fix [1]
> >can cover this case.
> >
> > 2. For passthrough devices, vfio pins all pages, including those
> >backed by dax mode files, used by the guest if any device is
> >passthroughed to it. If I read the commit message in [2] correctly,
> >operations that change the page-to-file offset association of pages
> >from dax mode files will be deferred until the reference count of
> >the affected pages becomes 1.  That is, if any passthrough device
> >is used with a VM, the changes of page-to-file offset will not be
> >able to happen until the VM is shutdown, so the fix [1] still takes
> >effect here.
> 
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_lon

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Haozhong Zhang
+ vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.

On 01/31/18 16:32 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> <haozhong.zh...@intel.com> wrote:
> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> <haozhong.zh...@intel.com> wrote:
> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >>
> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> metadata is in sync after a fault. However, that does not make
> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> earlier.
> >> >
> >> > QEMU does not truncate or punch holes of the file once it has been
> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >>
> >> Something else on the system might. The only agent that could enforce
> >> protection is the kernel, and the kernel will likely just disallow
> >> passing addresses from filesystem-dax vmas through to a guest
> >> altogether. I think there's even a problem in the non-DAX case unless
> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> is that we don't have a page cache page to pin in the DAX case.
> >>
> >
> > Does it mean any user-space code like
> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >   // make DMA to ptr
> > is unsafe?
> 
> Yes, it is currently unsafe because there is no coordination with the
> filesytem if it decides to make block layout changes. We can fix that
> in the non-virtualization case by having the filesystem wait for DMA
> completion callbacks (i.e. what for all pages to be idle), but as far
> as I can see we can't do the same coordination for DMA initiated by a
> guest device driver.
> 

I think that fix [1] also works for KVM/QEMU. The guest DMA are
performed on two types of devices:

1. For emulated devices, the guest DMA requests are trapped and
   actually performed by QEMU on the host side. The host side fix [1]
   can cover this case.

2. For passthrough devices, vfio pins all pages, including those
   backed by dax mode files, used by the guest if any device is
   passthroughed to it. If I read the commit message in [2] correctly,
   operations that change the page-to-file offset association of pages
   from dax mode files will be deferred until the reference count of
   the affected pages becomes 1.  That is, if any passthrough device
   is used with a VM, the changes of page-to-file offset will not be
   able to happen until the VM is shutdown, so the fix [1] still takes
   effect here.

Another question is how a user-space application (e.g., QEMU) knows
whether it's safe to mmap a file on the DAX file system?

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013713.html


Thanks,
Haozhong



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Haozhong Zhang
On 01/31/18 16:08 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> <haozhong.zh...@intel.com> wrote:
> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> > files on ext4/xfs file system mounted with '-o dax').
> >>
> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> metadata is in sync after a fault. However, that does not make
> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> coordinate DMA with fileystem operations. There is no way to do that
> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> the DMA synchronization problem. Apologies for not noticing this
> >> earlier.
> >
> > QEMU does not truncate or punch holes of the file once it has been
> > mmap()'ed. Does the problem [1] still exist in such case?
> 
> Something else on the system might. The only agent that could enforce
> protection is the kernel, and the kernel will likely just disallow
> passing addresses from filesystem-dax vmas through to a guest
> altogether. I think there's even a problem in the non-DAX case unless
> KVM is pinning pages while they are handed out to a guest. The problem
> is that we don't have a page cache page to pin in the DAX case.
> 

Does it mean any user-space code like
  ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
  // make DMA to ptr
is unsafe?



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Haozhong Zhang
On 01/31/18 14:25 -0800, Dan Williams wrote:
> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > files on ext4/xfs file system mounted with '-o dax').
> 
> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> metadata is in sync after a fault. However, that does not make
> filesystem-DAX safe for use with QEMU, because we still need to
> coordinate DMA with fileystem operations. There is no way to do that
> coordination from within a guest. QEMU needs to use device-dax if the
> guest might ever perform DMA to a virtual-pmem range. See this patch
> set for more details on the DAX vs DMA problem [1]. I think we need to
> enforce this in the host kernel. I.e. do not allow file backed DAX
> pages to be mapped in EPT entries unless / until we have a solution to
> the DMA synchronization problem. Apologies for not noticing this
> earlier.

QEMU does not truncate or punch holes of the file once it has been
mmap()'ed. Does the problem [1] still exist in such case?

Thanks,
Haozhong

> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html



[Qemu-devel] [PATCH v4 5/6] hostmem: add more information in error messages

2018-01-30 Thread Haozhong Zhang
When there are multiple memory backends in use, including the object type
name, ID and the property name in the error message can help users to
locate the error.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
---
 backends/hostmem-file.c |  9 ++---
 backends/hostmem.c  | 11 +++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 67ecfed895..df06b547a6 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -80,7 +80,8 @@ static void set_mem_path(Object *o, const char *str, Error 
**errp)
 HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
 
 if (host_memory_backend_mr_inited(backend)) {
-error_setg(errp, "cannot change property value");
+error_setg(errp, "cannot change property 'mem-path' of %s '%s'",
+   object_get_typename(o), backend->id);
 return;
 }
 g_free(fb->mem_path);
@@ -100,7 +101,8 @@ static void file_memory_backend_set_share(Object *o, bool 
value, Error **errp)
 HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
 
 if (host_memory_backend_mr_inited(backend)) {
-error_setg(errp, "cannot change property value");
+error_setg(errp, "cannot change property 'share' of %s '%s'",
+   object_get_typename(o), backend->id);
 return;
 }
 fb->share = value;
@@ -137,7 +139,8 @@ static void file_memory_backend_set_align(Object *o, 
Visitor *v,
 uint64_t val;
 
 if (host_memory_backend_mr_inited(backend)) {
-error_setg(&local_err, "cannot change property value");
+error_setg(&local_err, "cannot change property '%s' of %s '%s'",
+   name, object_get_typename(o), backend->id);
 goto out;
 }
 
diff --git a/backends/hostmem.c b/backends/hostmem.c
index ee2c2d5bfd..6853d19bc5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -46,7 +46,8 @@ host_memory_backend_set_size(Object *obj, Visitor *v, const 
char *name,
 uint64_t value;
 
 if (host_memory_backend_mr_inited(backend)) {
-error_setg(&local_err, "cannot change property value");
+error_setg(&local_err, "cannot change property %s of %s '%s'",
+   name, object_get_typename(obj), backend->id);
 goto out;
 }
 
@@ -55,8 +56,9 @@ host_memory_backend_set_size(Object *obj, Visitor *v, const 
char *name,
 goto out;
 }
 if (!value) {
-error_setg(&local_err, "Property '%s.%s' doesn't take value '%"
-   PRIu64 "'", object_get_typename(obj), name, value);
+error_setg(&local_err,
+   "property '%s' of %s '%s' doesn't take value '%" PRIu64 "'",
+   name, object_get_typename(obj), backend->id, value);
 goto out;
 }
 backend->size = value;
@@ -363,7 +365,8 @@ static void set_id(Object *o, const char *str, Error **errp)
 HostMemoryBackend *backend = MEMORY_BACKEND(o);
 
 if (backend->id) {
-error_setg(errp, "cannot change property value");
+error_setg(errp, "cannot change property 'id' of %s '%s'",
+   object_get_typename(o), backend->id);
 return;
 }
 backend->id = g_strdup(str);
-- 
2.14.1




[Qemu-devel] [PATCH v4 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()

2018-01-30 Thread Haozhong Zhang
When a file supporting DAX is used as the vNVDIMM backend, additionally
mmap'ing it with the MAP_SYNC flag can guarantee the persistence of guest
writes to the backend file without other QEMU actions (e.g., periodic
fsync() by QEMU).

A set of QEMU_RAM_SYNC_{AUTO,ON,OFF} flags are added to qemu_ram_mmap():

- If QEMU_RAM_SYNC_ON is present, qemu_ram_mmap() will try to pass
  MAP_SYNC to mmap(). It will then fail if the host OS or the backend
  file does not support MAP_SYNC, or MAP_SYNC conflicts with other
  flags.

- If QEMU_RAM_SYNC_OFF is present, qemu_ram_mmap() will never pass
  MAP_SYNC to mmap().

- If QEMU_RAM_SYNC_AUTO is present, and
  * if the host OS and the backend file support MAP_SYNC, and MAP_SYNC
does not conflict with other flags, qemu_ram_mmap() will work as if
QEMU_RAM_SYNC_ON is present;
  * otherwise, qemu_ram_mmap() will work as if QEMU_RAM_SYNC_OFF is
present.
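
A rough sketch of how these three values could translate into mmap(2) flags
(this is not the patch code; error reporting for the sync=on failure case is
left to the caller):

    #include <stdint.h>
    #include <sys/mman.h>

    /* Sketch only: QEMU_RAM_SYNC_* and QEMU_RAM_SYNC_MASK come from this
     * patch; MAP_SYNC/MAP_SHARED_VALIDATE come from the Linux 4.15 headers. */
    static void *map_backend(int fd, size_t size, uint64_t flags)
    {
        uint64_t sync = flags & QEMU_RAM_SYNC_MASK;
        int prot = PROT_READ | PROT_WRITE;
        void *ptr = MAP_FAILED;

        if (sync != QEMU_RAM_SYNC_OFF) {
            /* MAP_SYNC is only valid together with MAP_SHARED_VALIDATE */
            ptr = mmap(NULL, size, prot, MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        }
        if (ptr == MAP_FAILED && sync != QEMU_RAM_SYNC_ON) {
            /* sync=off, or sync=auto falling back when MAP_SYNC is unusable */
            ptr = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
        }
        return ptr;
    }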

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 include/exec/memory.h | 26 ++
 include/exec/ram_addr.h   |  4 
 include/qemu/mmap-alloc.h |  4 
 include/standard-headers/linux/mman.h | 42 +++
 util/mmap-alloc.c | 23 ++-
 5 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 include/standard-headers/linux/mman.h

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6b547da6a3..96a60e9c1d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -458,6 +458,28 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
 
 #define QEMU_RAM_SHARE  (1UL << 0)
 
+#define QEMU_RAM_SYNC_SHIFT 1
+#define QEMU_RAM_SYNC_MASK  0x6
+#define QEMU_RAM_SYNC_OFF   ((0UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+#define QEMU_RAM_SYNC_ON((1UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+#define QEMU_RAM_SYNC_AUTO  ((2UL << QEMU_RAM_SYNC_SHIFT) & QEMU_RAM_SYNC_MASK)
+
+static inline uint64_t qemu_ram_sync_flags(OnOffAuto v)
+{
+return v == ON_OFF_AUTO_OFF ? QEMU_RAM_SYNC_OFF :
+   v == ON_OFF_AUTO_ON ? QEMU_RAM_SYNC_ON : QEMU_RAM_SYNC_AUTO;
+}
+
+static inline OnOffAuto qemu_ram_sync_val(uint64_t flags)
+{
+unsigned int v = (flags & QEMU_RAM_SYNC_MASK) >> QEMU_RAM_SYNC_SHIFT;
+
+assert(v < 3);
+
+return v == 0 ? ON_OFF_AUTO_OFF :
+   v == 1 ? ON_OFF_AUTO_ON : ON_OFF_AUTO_AUTO;
+}
+
 #ifdef __linux__
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
@@ -473,6 +495,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @flags: specify properties of this memory region, which can be one or bit-or
  * of following values:
  * - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ * - One of
+ *   QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *   QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *   QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  * Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index e24aae75a2..a2cc5a9f60 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -84,6 +84,10 @@ unsigned long last_ram_page(void);
  *  @flags: specify the properties of the ram block, which can be one
  *  or bit-or of following values
  *  - QEMU_RAM_SHARE: mmap the back file or device with MAP_SHARED
+ *  - One of
+ *QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  *  Other bits are ignored.
  *  @mem_path or @fd: specify the back file or device
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index dc5e8b5efb..74346bdd3a 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -18,6 +18,10 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @flags: specifies additional properties of the mapping, which can be one or
  *  bit-or of following values
  *  - QEMU_RAM_SHARE: mmap with MAP_SHARED flag
+ *  - One of
+ *QEMU_RAM_SYNC_ON:   mmap with MAP_SYNC flag
+ *QEMU_RAM_SYNC_OFF:  do not mmap with MAP_SYNC flag
+ *QEMU_RAM_SYNC_AUTO: automatically decide the use of MAP_SYNC flag
  *  Other bits are ignored.
  *
  * Return:
diff --git a/include/standard-headers/linux/mman.h 
b/include/standard-headers/linux/mman.h
new file mode 100644
index 00..02ad4f
--- /dev/null
+++ b/include/standard-headers/linux/mman.h
@@ -0,0 +1,42 @@
+/*
+ * Definitions of Linux-specific mmap flags.
+ *
+ * Copyright Intel Corpor

[Qemu-devel] [PATCH v4 2/6] exec: switch qemu_ram_alloc_from_{file, fd} to the 'flags' parameter

2018-01-30 Thread Haozhong Zhang
As more flag parameters besides the existing 'share' are going to be
added to qemu_ram_alloc_from_{file,fd}(), let's switch 'share' to a
'flags' parameter in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 exec.c  | 15 ---
 include/exec/ram_addr.h | 25 +++--
 memory.c|  8 ++--
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/exec.c b/exec.c
index c0a5a52c4a..0b46b03d87 100644
--- a/exec.c
+++ b/exec.c
@@ -1607,6 +1607,7 @@ static void *file_ram_alloc(RAMBlock *block,
 ram_addr_t memory,
 int fd,
 bool truncate,
+uint64_t flags,
 Error **errp)
 {
 void *area;
@@ -1652,8 +1653,7 @@ static void *file_ram_alloc(RAMBlock *block,
 perror("ftruncate");
 }
 
-area = qemu_ram_mmap(fd, memory, block->mr->align,
- (block->flags & RAM_SHARED) ? QEMU_RAM_SHARE : 0);
+area = qemu_ram_mmap(fd, memory, block->mr->align, flags);
 if (area == MAP_FAILED) {
 error_setg_errno(errp, errno,
  "unable to map backing store for guest RAM");
@@ -2000,7 +2000,7 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp)
 
 #ifdef __linux__
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- bool share, int fd,
+ uint64_t flags, int fd,
  Error **errp)
 {
 RAMBlock *new_block;
@@ -2042,8 +2042,9 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 new_block->mr = mr;
 new_block->used_length = size;
 new_block->max_length = size;
-new_block->flags = share ? RAM_SHARED : 0;
-new_block->host = file_ram_alloc(new_block, size, fd, !file_size, errp);
+new_block->flags = (flags & QEMU_RAM_SHARE) ? RAM_SHARED : 0;
+new_block->host = file_ram_alloc(new_block, size, fd, !file_size, flags,
+ errp);
 if (!new_block->host) {
 g_free(new_block);
 return NULL;
@@ -2061,7 +2062,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 
 
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-   bool share, const char *mem_path,
+   uint64_t flags, const char *mem_path,
Error **errp)
 {
 int fd;
@@ -2073,7 +2074,7 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 return NULL;
 }
 
-block = qemu_ram_alloc_from_fd(size, mr, share, fd, errp);
+block = qemu_ram_alloc_from_fd(size, mr, flags, fd, errp);
 if (!block) {
 if (created) {
 unlink(mem_path);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7633ef6342..e24aae75a2 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -72,12 +72,33 @@ static inline unsigned long int 
ramblock_recv_bitmap_offset(void *host_addr,
 
 long qemu_getrampagesize(void);
 unsigned long last_ram_page(void);
+
+/**
+ * qemu_ram_alloc_from_file,
+ * qemu_ram_alloc_from_fd: Allocate a ram block from the specified back
+ * file or device
+ *
+ * Parameters:
+ *  @size: the size in bytes of the ram block
+ *  @mr: the memory region where the ram block is
+ *  @flags: specify the properties of the ram block, which can be one
+ *  or bit-or of following values
+ *  - QEMU_RAM_SHARE: mmap the back file or device with MAP_SHARED
+ *  Other bits are ignored.
+ *  @mem_path or @fd: specify the back file or device
+ *  @errp: pointer to Error*, to store an error if it happens
+ *
+ * Return:
+ *  On success, return a pointer to the ram block.
+ *  On failure, return NULL.
+ */
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
-   bool share, const char *mem_path,
+   uint64_t flags, const char *mem_path,
Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- bool share, int fd,
+ uint64_t flags, int fd,
  Error **errp);
+
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
   MemoryRegion *mr, Error **errp);
 RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp);
diff --git a/memory.c b/memory.c
index 449a1429b9..1ac4ebcaca 100644
--- a/memory.c
+++ b/memory.c
@@ -1580,7 +1580,9 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
 mr->terminates = true;
 mr->destru

[Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-30 Thread Haozhong Zhang
Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
files on ext4/xfs file system mounted with '-o dax').

A description of MAP_SYNC and MAP_SHARED_VALIDATE can be found at
https://patchwork.kernel.org/patch/10028151/

This patchset enables QEMU to use MAP_SYNC flag for memory-backend-file,
in order to guarantee the guest write persistence to backend files
supporting DAX.

A new auto on/off option 'sync' is added to memory-backend-file:
 - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
'share=off', QEMU will abort
 - off: never pass MAP_SYNC to mmap(2)
 - auto (default): if MAP_SYNC is supported and 'share=on', work as if
'sync=on'; otherwise, work as if 'sync=off'
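
For illustration, a vNVDIMM backend using the new option could be set up as
follows (the backing path is just an example of a DAX-capable file):

    qemu-system-x86_64 -machine pc,nvdimm \
        -m 4G,slots=4,maxmem=32G \
        -object memory-backend-file,id=mem1,share=on,sync=on,mem-path=/mnt/dax/nvdimm0,size=4G \
        -device nvdimm,id=nvdimm1,memdev=mem1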

Changes in v4:
 * Add patch 1-3 to switch some functions to a single 'flags'
   parameters. (Michael S. Tsirkin)
 * v3 patch 1-3 become v4 patch 4-6.
 * Patch 4: move definitions of MAP_SYNC and MAP_SHARED_VALIDATE to a
   new header file under include/standard-headers/linux/. (Michael S. Tsirkin)
 * Patch 6: refine the description of the 'sync' option. (Michael S. Tsirkin)

Changes in v3:
 * Patch 1: add MAP_SHARED_VALIDATE in both sync=on and sync=auto
   cases, and add back the retry mechanism. MAP_SYNC will be ignored
   by Linux kernel 4.15 if MAP_SHARED_VALIDATE is missed.
 * Patch 1: define MAP_SYNC and MAP_SHARED_VALIDATE as 0 on non-Linux
   platforms in order to make qemu_ram_mmap() compile on those platforms.
 * Patch 2&3: include more information in error messages of
   memory-backend in hope to help user to identify the error.
   (Dr. David Alan Gilbert)
 * Patch 3: fix typo in the commit message. (Dr. David Alan Gilbert)

Changes in v2:
 * Add 'sync' option to control the use of MAP_SYNC. (Eduardo Habkost)
 * Remove the unnecessary set of MAP_SHARED_VALIDATE in some cases and
   the retry mechanism in qemu_ram_mmap(). (Michael S. Tsirkin)
 * Move OS dependent definitions of MAP_SYNC and MAP_SHARED_VALIDATE
   to osdep.h. (Michael S. Tsirkin)


Haozhong Zhang (6):
  util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter
  exec: switch qemu_ram_alloc_from_{file,fd} to the 'flags' parameter
  memory: switch memory_region_init_ram_from_file() to 'flags' parameter
  util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
  hostmem: add more information in error messages
  hostmem-file: add 'sync' option

 backends/hostmem-file.c   | 51 ---
 backends/hostmem.c| 11 +---
 docs/nvdimm.txt   | 20 +-
 exec.c| 15 ++-
 include/exec/memory.h | 36 +++--
 include/exec/ram_addr.h   | 29 ++--
 include/qemu/mmap-alloc.h | 23 +++-
 include/standard-headers/linux/mman.h | 42 +
 memory.c  |  8 +++---
 numa.c|  2 +-
 qemu-options.hx   | 22 ++-
 util/mmap-alloc.c | 31 ++---
 util/oslib-posix.c|  2 +-
 13 files changed, 261 insertions(+), 31 deletions(-)
 create mode 100644 include/standard-headers/linux/mman.h

-- 
2.14.1




[Qemu-devel] [PATCH v4 1/6] util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter

2018-01-30 Thread Haozhong Zhang
As more flag parameters besides the existing 'shared' are going to be
added to qemu_ram_mmap(), let's switch 'shared' to a 'flags' parameter
in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: "Michael S. Tsirkin" <m...@redhat.com>
---
 exec.c|  2 +-
 include/exec/memory.h |  3 +++
 include/qemu/mmap-alloc.h | 19 ++-
 util/mmap-alloc.c |  8 +---
 util/oslib-posix.c|  2 +-
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 629a508385..c0a5a52c4a 100644
--- a/exec.c
+++ b/exec.c
@@ -1653,7 +1653,7 @@ static void *file_ram_alloc(RAMBlock *block,
 }
 
 area = qemu_ram_mmap(fd, memory, block->mr->align,
- block->flags & RAM_SHARED);
+ (block->flags & RAM_SHARED) ? QEMU_RAM_SHARE : 0);
 if (area == MAP_FAILED) {
 error_setg_errno(errp, errno,
  "unable to map backing store for guest RAM");
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 07c5d6d597..4790cd9e13 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -455,6 +455,9 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
uint64_t length,
void *host),
Error **errp);
+
+#define QEMU_RAM_SHARE  (1UL << 0)
+
 #ifdef __linux__
 /**
  * memory_region_init_ram_from_file:  Initialize RAM memory region with a
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 50385e3f81..dc5e8b5efb 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,7 +7,24 @@ size_t qemu_fd_getpagesize(int fd);
 
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
+/**
+ * qemu_ram_mmap: mmap the specified file or device.
+ *
+ * Parameters:
+ *  @fd: the file or the device to mmap
+ *  @size: the number of bytes to be mmaped
+ *  @align: if not zero, specify the alignment of the starting mapping address;
+ *  otherwise, the alignment in use will be determined by QEMU.
+ *  @flags: specifies additional properties of the mapping, which can be one or
+ *  bit-or of following values
+ *  - QEMU_RAM_SHARE: mmap with MAP_SHARED flag
+ *  Other bits are ignored.
+ *
+ * Return:
+ *  On success, return a pointer to the mapped area.
+ *  On failure, return MAP_FAILED.
+ */
+void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags);
 
 void qemu_ram_munmap(void *ptr, size_t size);
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 2fd8cbcc6f..cd95566800 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "exec/memory.h"
 
 #define HUGETLBFS_MAGIC   0x958458f6
 
@@ -73,7 +74,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
 return getpagesize();
 }
 
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
+void *qemu_ram_mmap(int fd, size_t size, size_t align, uint64_t flags)
 {
 /*
  * Note: this always allocates at least one extra page of virtual address
@@ -90,11 +91,12 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool 
shared)
  * anonymous memory is OK.
  */
 int anonfd = fd == -1 || qemu_fd_getpagesize(fd) == getpagesize() ? -1 : 
fd;
-int flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;
-void *ptr = mmap(0, total, PROT_NONE, flags | MAP_PRIVATE, anonfd, 0);
+int mmap_flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;
+void *ptr = mmap(0, total, PROT_NONE, mmap_flags | MAP_PRIVATE, anonfd, 0);
 #else
 void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
 #endif
+bool shared = flags & QEMU_RAM_SHARE;
 size_t offset;
 void *ptr1;
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 77369c92ce..2a78cfb67e 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -130,7 +130,7 @@ void *qemu_memalign(size_t alignment, size_t size)
 void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
 {
 size_t align = QEMU_VMALLOC_ALIGN;
-void *ptr = qemu_ram_mmap(-1, size, align, false);
+void *ptr = qemu_ram_mmap(-1, size, align, 0);
 
 if (ptr == MAP_FAILED) {
 return NULL;
-- 
2.14.1




[Qemu-devel] [PATCH v4 3/6] memory: switch memory_region_init_ram_from_file() to 'flags' parameter

2018-01-30 Thread Haozhong Zhang
As more flag parameters besides the existing 'share' are going to be
added to memory_region_init_ram_from_file(), let's switch 'share' to
a 'flags' parameter in advance, so as to ease the further additions.

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
---
 backends/hostmem-file.c | 3 ++-
 include/exec/memory.h   | 7 +--
 memory.c| 6 ++
 numa.c  | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e319ec1ad8..67ecfed895 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -59,7 +59,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 path = object_get_canonical_path(OBJECT(backend));
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
- backend->size, fb->align, fb->share,
+ backend->size, fb->align,
+ fb->share ? QEMU_RAM_SHARE : 0,
  fb->mem_path, errp);
 g_free(path);
 }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4790cd9e13..6b547da6a3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -470,7 +470,10 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @flags: specify properties of this memory region, which can be one or bit-or
+ * of following values:
+ * - QEMU_RAM_SHARE: memory must be mmaped with the MAP_SHARED flag
+ * Other bits are ignored.
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
  *
@@ -482,7 +485,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
   const char *name,
   uint64_t size,
   uint64_t align,
-  bool share,
+  uint64_t flags,
   const char *path,
   Error **errp);
 
diff --git a/memory.c b/memory.c
index 1ac4ebcaca..a4f19a5d30 100644
--- a/memory.c
+++ b/memory.c
@@ -1571,7 +1571,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
   const char *name,
   uint64_t size,
   uint64_t align,
-  bool share,
+  uint64_t flags,
   const char *path,
   Error **errp)
 {
@@ -1580,9 +1580,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
 mr->terminates = true;
 mr->destructor = memory_region_destructor_ram;
 mr->align = align;
-mr->ram_block = qemu_ram_alloc_from_file(size, mr,
- share ? QEMU_RAM_SHARE : 0,
- path, errp);
+mr->ram_block = qemu_ram_alloc_from_file(size, mr, flags, path, errp);
 mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
diff --git a/numa.c b/numa.c
index 83675a03f3..fa202a376d 100644
--- a/numa.c
+++ b/numa.c
@@ -456,7 +456,7 @@ static void allocate_system_memory_nonnuma(MemoryRegion 
*mr, Object *owner,
 if (mem_path) {
 #ifdef __linux__
 Error *err = NULL;
-memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, false,
+memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, 0,
  mem_path, &err);
 if (err) {
 error_report_err(err);
-- 
2.14.1




[Qemu-devel] [PATCH v4 6/6] hostmem-file: add 'sync' option

2018-01-30 Thread Haozhong Zhang
This option controls whether QEMU mmap(2)s the memory backend file with
the MAP_SYNC flag, which can fully guarantee the guest write persistence
to the backend, if the MAP_SYNC flag is supported by the host kernel
(Linux kernel 4.15 and later) and the backend is a file supporting
DAX (e.g., a file on an ext4/xfs file system mounted with '-o dax').

It can take one of the following values:
 - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
'share=off', QEMU will abort
 - off: never pass MAP_SYNC to mmap(2)
 - auto (default): if MAP_SYNC is supported and 'share=on', work as if
'sync=on'; otherwise, work as if 'sync=off'

Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
Suggested-by: Eduardo Habkost <ehabk...@redhat.com>
Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
---
 backends/hostmem-file.c | 41 -
 docs/nvdimm.txt | 20 +++-
 qemu-options.hx | 22 +-
 3 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index df06b547a6..ade80d76f1 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -15,6 +15,7 @@
 #include "sysemu/hostmem.h"
 #include "sysemu/sysemu.h"
 #include "qom/object_interfaces.h"
+#include "qapi-visit.h"
 
 /* hostmem-file.c */
 /**
@@ -35,6 +36,7 @@ struct HostMemoryBackendFile {
 bool discard_data;
 char *mem_path;
 uint64_t align;
+OnOffAuto sync;
 };
 
 static void
@@ -60,7 +62,8 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  path,
  backend->size, fb->align,
- fb->share ? QEMU_RAM_SHARE : 0,
+ (fb->share ? QEMU_RAM_SHARE : 0) |
+ qemu_ram_sync_flags(fb->sync),
  fb->mem_path, errp);
 g_free(path);
 }
@@ -154,6 +157,39 @@ static void file_memory_backend_set_align(Object *o, 
Visitor *v,
 error_propagate(errp, local_err);
 }
 
+static void file_memory_backend_get_sync(
+Object *obj, Visitor *v, const char *name, void *opaque, Error **errp)
+{
+HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+OnOffAuto value = fb->sync;
+
+visit_type_OnOffAuto(v, name, &value, errp);
+}
+
+static void file_memory_backend_set_sync(
+Object *obj, Visitor *v, const char *name, void *opaque, Error **errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+Error *local_err = NULL;
+OnOffAuto value;
+
+if (host_memory_backend_mr_inited(backend)) {
+error_setg(_err, "cannot change property '%s' of %s '%s'",
+   name, object_get_typename(obj), backend->id);
+goto out;
+}
+
+visit_type_OnOffAuto(v, name, &value, &local_err);
+if (local_err) {
+goto out;
+}
+fb->sync = value;
+
+ out:
+error_propagate(errp, local_err);
+}
+
 static void file_backend_unparent(Object *obj)
 {
 HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -188,6 +224,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
 file_memory_backend_get_align,
 file_memory_backend_set_align,
NULL, NULL, &error_abort);
+object_class_property_add(oc, "sync", "OnOffAuto",
+file_memory_backend_get_sync, file_memory_backend_set_sync,
NULL, NULL, &error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index e903d8bb09..5e9cee5f5e 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -142,11 +142,29 @@ backend of vNVDIMM:
 Guest Data Persistence
 --
 
+vNVDIMM is designed and implemented to guarantee the guest data
+persistence on the backends even on the host crash and power
+failures. However, there are still some requirements and limitations
+as explained below.
+
 Though QEMU supports multiple types of vNVDIMM backends on Linux,
-currently the only one that can guarantee the guest write persistence
+if MAP_SYNC is not supported by the host kernel and the backends,
+the only backend that can guarantee the guest write persistence
 is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
 which all guest access do not involve any host-side kernel cache.
 
+mmap(2) flag MAP_SYNC is added since Linux kernel 4.15. On such
+systems, QEMU can mmap(2) the backend with MAP_SYNC, which can
+guarantee the guest write persistence to vNVDIMM. Besides the host
+kernel support, enabling MAP_SYNC in QEMU also requires:
+
+ - the backend is a file supporting DAX, e.g., a file on an ext4 or
+   xfs file system mounted with '-o dax',
+
+ - 'sync' option of memory-backend-f

Re: [Qemu-devel] [PATCH v3 3/3] hostmem-file: add 'sync' option

2018-01-24 Thread Haozhong Zhang
On 01/24/18 22:23 +0200, Michael S. Tsirkin wrote:
> On Wed, Jan 17, 2018 at 04:13:25PM +0800, Haozhong Zhang wrote:
> > This option controls whether QEMU mmap(2) the memory backend file with
> > MAP_SYNC flag, which can fully guarantee the guest write persistence
> > to the backend, if MAP_SYNC flag is supported by the host kernel
> > (Linux kernel 4.15 and later) and the backend is a file supporting
> > DAX (e.g., file on ext4/xfs file system mounted with '-o dax').
> > 
> > It can take one of following values:
> >  - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
> > 'share=off', QEMU will abort
> >  - off: never pass MAP_SYNC to mmap(2)
> >  - auto (default): if MAP_SYNC is supported and 'share=on', work as if
> >     'sync=on'; otherwise, work as if 'sync=off'
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > Suggested-by: Eduardo Habkost <ehabk...@redhat.com>

[..]
> >  
> >  @table @option
> >  
> > -@item -object 
> > memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align}
> > +@item -object 
> > memory-backend-file,id=@var{id},size=@var{size},mem-path=@var{dir},share=@var{on|off},discard-data=@var{on|off},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave},align=@var{align},sync=@var{on|off|auto}
> >  
> >  Creates a memory file backend object, which can be used to back
> >  the guest RAM with huge pages.
> > @@ -4034,6 +4034,25 @@ requires an alignment different than the default one 
> > used by QEMU, eg
> >  the device DAX /dev/dax0.0 requires 2M alignment rather than 4K. In
> >  such cases, users can specify the required alignment via this option.
> >  
> > +The @option{sync} option specifies whether QEMU mmap(2) @option{mem-path}
> > +with MAP_SYNC flag, which can fully guarantee the guest write
> > +persistence to @option{mem-path}.
> 
> I would add ... even in case of a host power loss.
> Here and wherever you say "fully".

Without MAP_SYNC, QEMU can only guarantee the guest data is written to
the host NVDIMM after, for example, a guest clwb+sfence. However, if
some host file system metadata of the mapped file has not been
written back to the host NVDIMM when a host power failure happens, the
mapped file may be broken even though all its data may still be there.
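
For illustration, the guest-side clwb+sfence sequence referred to above is
roughly the following sketch (x86 intrinsics; assumes 64-byte cache lines
and compilation with -mclwb):

    #include <stddef.h>
    #include <immintrin.h>

    /* Sketch: persist the *data* of a DAX-mapped range from the guest side.
     * MAP_SYNC is what additionally keeps the file system metadata of the
     * mapping safe across a host power failure. */
    static void persist_range(char *p, size_t len)
    {
        for (size_t off = 0; off < len; off += 64) {
            _mm_clwb(p + off);      /* write back this cache line */
        }
        _mm_sfence();               /* order the write-backs */
    }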

Anyway, I'll remove the confusing word "fully" and add your suggestion.

Thanks,
Haozhong

> 
> > MAP_SYNC requires supports from both
> > +the host kernel (since Linux kernel 4.15) and @option{mem-path} (only
> > +files supporting DAX). It can take one of following values:
> > +
> > +@table @option
> > +@item @var{on}
> > +try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
> > +@option{share}=@var{off}, QEMU will abort
> > +
> > +@item @var{off}
> > +never pass MAP_SYNC to mmap(2)
> > +
> > +@item @var{auto} (default)
> > +if MAP_SYNC is supported and @option{share}=@var{on}, work as if
> > +@option{sync}=@var{on}; otherwise, work as if @option{sync}=@var{off}
> > +@end table
> > +
> >  @item -object 
> > memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
> >  
> >  Creates a memory backend object, which can be used to back the guest RAM.
> > -- 
> > 2.14.1



Re: [Qemu-devel] [PATCH v3 1/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()

2018-01-24 Thread Haozhong Zhang
On 01/24/18 22:20 +0200, Michael S. Tsirkin wrote:
> > index 50385e3f81..dd5876471f 100644
> > --- a/include/qemu/mmap-alloc.h
> > +++ b/include/qemu/mmap-alloc.h
> > @@ -7,7 +7,8 @@ size_t qemu_fd_getpagesize(int fd);
> >  
> >  size_t qemu_mempath_getpagesize(const char *mem_path);
> >  
> > -void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
> > +void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared,
> > +OnOffAuto sync);
> >  
> >  void qemu_ram_munmap(void *ptr, size_t size);
> >  
> 
> And Marcel plans to add a remappable flag ...  Is it time we
> switched to a flags field?

Yes. Some patches I have on hand are going to add another field to this
function, so let's switch to flags.

> 
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index adb3758275..0ff10cb529 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -372,6 +372,24 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> >  #  define QEMU_VMALLOC_ALIGN getpagesize()
> >  #endif
> >  
> > +/*
> > + * MAP_SHARED_VALIDATE and MAP_SYNC were introduced in Linux kernel
> > + * 4.15, so they may not be defined when compiling on older kernels.
> > + */
> > +#ifdef CONFIG_LINUX
> > +#ifndef MAP_SHARED_VALIDATE
> > +#define MAP_SHARED_VALIDATE   0x3
> > +#endif
> > +#ifndef MAP_SYNC
> > +#define MAP_SYNC  0x8
> > +#endif
> > +#define QEMU_HAS_MAP_SYNC true
> > +#else  /* !CONFIG_LINUX */
> > +#define MAP_SHARED_VALIDATE   0x0
> > +#define MAP_SYNC  0x0
> > +#define QEMU_HAS_MAP_SYNC false
> > +#endif /* CONFIG_LINUX */
> > +
> >  #ifdef CONFIG_POSIX
> >  struct qemu_signalfd_siginfo {
> >  uint32_t ssi_signo;   /* Signal number */
> 
> Please just import this into standard-headers from Linux.
>

Sure, I'll move it to a new file include/standard-headers/linux/mman.h.

Thanks,
Haozhong


