Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device

2019-12-16 Thread Shivaprasad G Bhat

Hi David,


On 11/22/2019 10:41 AM, David Gibson wrote:

On Mon, Oct 14, 2019 at 01:38:16PM -0500, Shivaprasad G Bhat wrote:

device_add/del phase itself instead.

The guest kernel makes bind/unbind requests for the virtual NVDIMM device
at the region level granularity. Without interleaving, each virtual NVDIMM

It's not clear to me what a "region" means in this context.


That is PMEM terminology. A "region" in this context is a guest physical
address range.

Fixing all the rest of the things you pointed out.

Thanks,
Shivaprasad


Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device

2019-11-21 Thread David Gibson
On Mon, Oct 14, 2019 at 01:38:16PM -0500, Shivaprasad G Bhat wrote:
> This patch implements a few of the necessary hcalls for the nvdimm support.
> 
> PAPR semantics are such that each NVDIMM device comprises multiple
> SCM (Storage Class Memory) blocks. The guest requests the hypervisor to
> bind each of the SCM blocks of the NVDIMM device using hcalls. There can
> be SCM block unbind requests in case of driver errors or unplug (not
> supported now) use cases. The NVDIMM label reads/writes are done through
> hcalls.
> 
> Since each virtual NVDIMM device is divided into multiple SCM blocks,
> the bind, unbind, and queries using hcalls on those blocks can come
> independently. This doesn't fit well into the QEMU device semantics,
> where the map/unmap is done at the (whole) device/object level granularity.
> The patch doesn't actually bind/unbind on hcalls but lets it happen at the
> device_add/del phase itself instead.
> 
> The guest kernel makes bind/unbind requests for the virtual NVDIMM device
> at the region level granularity. Without interleaving, each virtual NVDIMM

It's not clear to me what a "region" means in this context.

> device is presented as a separate region. There is no way to configure the
> virtual NVDIMM interleaving for the guests today. So, there is no way a
> partial bind/unbind request can come for the vNVDIMM in an hcall for a
> subset of SCM blocks of a virtual NVDIMM. Hence it is safe to
> bind/unbind everything during the device_add/del.
> 
> Signed-off-by: Shivaprasad G Bhat 
> ---
> ---
>  hw/ppc/spapr_hcall.c   |  300 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h |    8 +
>  2 files changed, 307 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 23e4bdb829..4e9ad96f7c 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c

This is large enough and sufficiently non-core that I'd suggest
putting it in its own file.

> @@ -18,6 +18,10 @@
>  #include "hw/ppc/spapr_ovec.h"
>  #include "mmu-book3s-v3.h"
>  #include "hw/mem/memory-device.h"
> +#include "hw/ppc/spapr_drc.h"
> +#include "hw/mem/nvdimm.h"
> +#include "qemu/range.h"
> +#include "qemu/nvdimm-utils.h"
>  
>  static bool has_spr(PowerPCCPU *cpu, int spr)
>  {
> @@ -1961,6 +1965,295 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>  return H_SUCCESS;
>  }
>  
> +static target_ulong h_scm_read_metadata(PowerPCCPU *cpu,
> +SpaprMachineState *spapr,
> +target_ulong opcode,
> +target_ulong *args)
> +{
> +uint32_t drc_index = args[0];
> +uint64_t offset = args[1];
> +uint64_t numBytesToRead = args[2];
> +SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +NVDIMMDevice *nvdimm;
> +NVDIMMClass *ddc;
> +__be64 data_be = 0;
> +uint64_t data = 0;
> +
> +if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

I'm pretty sure you want if (!drc || ... ) here.  Otherwise if the
given index is totally bogus you'll crash a few lines down.
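
I.e. something along these lines (sketch only):

    if (!drc || spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
        return H_PARAMETER;
    }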

> +return H_PARAMETER;
> +}
> +
> +if (numBytesToRead != 1 && numBytesToRead != 2 &&
> +numBytesToRead != 4 && numBytesToRead != 8) {
> +return H_P3;
> +}
> +
> +nvdimm = NVDIMM(drc->dev);

What if the drc slot is not plugged at present (so drc->dev is NULL)?
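
Something like this before the dereference would guard against that
(sketch only):

    if (!drc->dev) {
        /* nothing is currently plugged into this DRC */
        return H_PARAMETER;
    }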

> +if ((offset + numBytesToRead < offset) ||
> +(nvdimm->label_size < numBytesToRead + offset)) {
> +return H_P2;
> +}
> +
> +ddc = NVDIMM_GET_CLASS(nvdimm);
> +ddc->read_label_data(nvdimm, &data_be, numBytesToRead, offset);
> +
> +switch (numBytesToRead) {
> +case 1:
> +data = data_be & 0xff;

The read_label_data above is only filling in the first byte of
data_be.  That only corresponds to data_be & 0xff if the host is
little endian which you shouldn't rely on.

Also, I'm not sure in what sense "data_be" is big-endian as the name
suggests.  AFAICT the label data is a byte-addressable array, which
means it doesn't really have an endianness.

I think what you're trying to do here is put the 0th byte of label
into the LSB of the return value ... the 7th byte of label data into
the MSB of the return value.

From the point of view of the 64-bit value you're returning in a
register, that makes the label data effectively *LE* (LSB first), not
BE.  Since you're initializing data_be to 0, you can just load into it
and le64_to_cpu() without reference to numBytesToRead to accomplish
this.
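
Roughly, something along these lines (just a sketch of the idea, not the
posted code):

    uint64_t data = 0;

    /* label byte 0 lands in the buffer first, i.e. LSB first */
    ddc->read_label_data(nvdimm, &data, numBytesToRead, offset);
    args[0] = le64_to_cpu(data);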

> +break;
> +case 2:
> +data = be16_to_cpu(data_be & 0xffff);
> +break;
> +case 4:
> +data = be32_to_cpu(data_be & 0xffffffff);
> +break;
> +case 8:
> +data = be64_to_cpu(data_be);
> +break;
> +default:
> +break;
> +}
> +
> +args[0] = data;
> +
> +return H_SUCCESS;
> +}
> +
> +static target_ulong 

[PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device

2019-10-14 Thread Shivaprasad G Bhat
This patch implements a few of the necessary hcalls for the nvdimm support.

PAPR semantics are such that each NVDIMM device comprises multiple
SCM (Storage Class Memory) blocks. The guest requests the hypervisor to
bind each of the SCM blocks of the NVDIMM device using hcalls. There can
be SCM block unbind requests in case of driver errors or unplug (not
supported now) use cases. The NVDIMM label reads/writes are done through
hcalls.

Since each virtual NVDIMM device is divided into multiple SCM blocks,
the bind, unbind, and queries using hcalls on those blocks can come
independently. This doesn't fit well into the QEMU device semantics,
where the map/unmap is done at the (whole) device/object level granularity.
The patch doesn't actually bind/unbind on hcalls but lets it happen at the
device_add/del phase itself instead.

The guest kernel makes bind/unbind requests for the virtual NVDIMM device
at the region level granularity. Without interleaving, each virtual NVDIMM
device is presented as a separate region. There is no way to configure the
virtual NVDIMM interleaving for the guests today. So, there is no way a
partial bind/unbind request can come for the vNVDIMM in an hcall for a
subset of SCM blocks of a virtual NVDIMM. Hence it is safe to
bind/unbind everything during the device_add/del.

Signed-off-by: Shivaprasad G Bhat 
---
---
 hw/ppc/spapr_hcall.c   |  300 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |    8 +
 2 files changed, 307 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 23e4bdb829..4e9ad96f7c 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -18,6 +18,10 @@
 #include "hw/ppc/spapr_ovec.h"
 #include "mmu-book3s-v3.h"
 #include "hw/mem/memory-device.h"
+#include "hw/ppc/spapr_drc.h"
+#include "hw/mem/nvdimm.h"
+#include "qemu/range.h"
+#include "qemu/nvdimm-utils.h"
 
 static bool has_spr(PowerPCCPU *cpu, int spr)
 {
@@ -1961,6 +1965,295 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
 return H_SUCCESS;
 }
 
+static target_ulong h_scm_read_metadata(PowerPCCPU *cpu,
+SpaprMachineState *spapr,
+target_ulong opcode,
+target_ulong *args)
+{
+uint32_t drc_index = args[0];
+uint64_t offset = args[1];
+uint64_t numBytesToRead = args[2];
+SpaprDrc *drc = spapr_drc_by_index(drc_index);
+NVDIMMDevice *nvdimm;
+NVDIMMClass *ddc;
+__be64 data_be = 0;
+uint64_t data = 0;
+
+if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+return H_PARAMETER;
+}
+
+if (numBytesToRead != 1 && numBytesToRead != 2 &&
+numBytesToRead != 4 && numBytesToRead != 8) {
+return H_P3;
+}
+
+nvdimm = NVDIMM(drc->dev);
+if ((offset + numBytesToRead < offset) ||
+(nvdimm->label_size < numBytesToRead + offset)) {
+return H_P2;
+}
+
+ddc = NVDIMM_GET_CLASS(nvdimm);
+ddc->read_label_data(nvdimm, &data_be, numBytesToRead, offset);
+
+switch (numBytesToRead) {
+case 1:
+data = data_be & 0xff;
+break;
+case 2:
+data = be16_to_cpu(data_be & 0xffff);
+break;
+case 4:
+data = be32_to_cpu(data_be & 0xffffffff);
+break;
+case 8:
+data = be64_to_cpu(data_be);
+break;
+default:
+break;
+}
+
+args[0] = data;
+
+return H_SUCCESS;
+}
+
+static target_ulong h_scm_write_metadata(PowerPCCPU *cpu,
+ SpaprMachineState *spapr,
+ target_ulong opcode,
+ target_ulong *args)
+{
+uint32_t drc_index = args[0];
+uint64_t offset = args[1];
+uint64_t data = args[2];
+uint64_t numBytesToWrite = args[3];
+SpaprDrc *drc = spapr_drc_by_index(drc_index);
+NVDIMMDevice *nvdimm;
+DeviceState *dev;
+NVDIMMClass *ddc;
+__be64 data_be = 0;
+
+if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+return H_PARAMETER;
+}
+
+if (numBytesToWrite != 1 && numBytesToWrite != 2 &&
+numBytesToWrite != 4 && numBytesToWrite != 8) {
+return H_P4;
+}
+
+dev = drc->dev;
+nvdimm = NVDIMM(dev);
+
+switch (numBytesToWrite) {
+case 1:
+if (data & 0xffffffffffffff00) {
+return H_P2;
+}
+data_be = data & 0xff;
+break;
+case 2:
+if (data & 0xffffffffffff0000) {
+return H_P2;
+}
+data_be = cpu_to_be16(data & 0xffff);
+break;
+case 4:
+if (data & 0xffffffff00000000) {
+return H_P2;
+}
+data_be = cpu_to_be32(data & 0xffffffff);
+break;
+case 8:
+data_be = cpu_to_be64(data);
+break;
+default: /* lint */
+break;
+}
+
+ddc =