RE: [PATCH] ARM/arm64: KVM: test properly for a PTE's uncachedness

2015-11-08 Thread Pavel Fedin
 Hello!

 I have tested this patch, it also fixes the crash on Exynos5410, and is indeed 
a better approach.

Tested-by: Pavel Fedin 

CC'ed general KVM mailing list too.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


> -Original Message-
> From: kvmarm-boun...@lists.cs.columbia.edu 
> [mailto:kvmarm-boun...@lists.cs.columbia.edu] On
> Behalf Of Ard Biesheuvel
> Sent: Friday, November 06, 2015 2:43 PM
> To: linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu; 
> marc.zyng...@arm.com;
> christoffer.d...@linaro.org
> Cc: Ard Biesheuvel
> Subject: [PATCH] ARM/arm64: KVM: test properly for a PTE's uncachedness
> 
> The open coded tests for checking whether a PTE maps a page as
> uncached use a flawed 'pte_val(xxx) & CONST != CONST' pattern,
> which is not guaranteed to work since the type of a mapping is an
> index into the MAIR table, not a set of mutually exclusive bits.
> 
> Considering that, on arm64, the S2 type definitions use the following
> MAIR indexes
> 
> #define MT_S2_NORMAL0xf
> #define MT_S2_DEVICE_nGnRE  0x1
> 
> we have been getting lucky merely because the S2 device mappings also
> have the PTE_UXN bit set, which means that a device PTE still does not
> equal a normal PTE after masking with the former type.
> 
> Instead, implement proper checking against the MAIR indexes that are
> known to define uncached memory attributes.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 11 +++
>  arch/arm/kvm/mmu.c   |  5 ++---
>  arch/arm64/include/asm/kvm_mmu.h | 12 
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa1883307..422973835d41 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -279,6 +279,17 @@ static inline void __kvm_extend_hypmap(pgd_t 
> *boot_hyp_pgd,
>  pgd_t *merged_hyp_pgd,
>  unsigned long hyp_idmap_start) { }
> 
> +static inline bool __kvm_pte_is_uncached(pte_t pte)
> +{
> + switch (pte_val(pte) & L_PTE_MT_MASK) {
> + case L_PTE_MT_UNCACHED:
> + case L_PTE_MT_BUFFERABLE:
> + case L_PTE_MT_DEV_SHARED:
> + return true;
> + }
> + return false;
> +}
> +
>  #endif   /* !__ASSEMBLY__ */
> 
>  #endif /* __ARM_KVM_MMU_H__ */
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 6984342da13d..eb9a06e3dbee 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -213,7 +213,7 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>   kvm_tlb_flush_vmid_ipa(kvm, addr);
> 
>   /* No need to invalidate the cache for device mappings 
> */
> - if ((pte_val(old_pte) & PAGE_S2_DEVICE) != 
> PAGE_S2_DEVICE)
> + if (!__kvm_pte_is_uncached(old_pte))
>   kvm_flush_dcache_pte(old_pte);
> 
>   put_page(virt_to_page(pte));
> @@ -305,8 +305,7 @@ static void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
> 
>   pte = pte_offset_kernel(pmd, addr);
>   do {
> - if (!pte_none(*pte) &&
> - (pte_val(*pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
> + if (!pte_none(*pte) && !__kvm_pte_is_uncached(*pte))
>   kvm_flush_dcache_pte(*pte);
>   } while (pte++, addr += PAGE_SIZE, addr != end);
>  }
> diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> b/arch/arm64/include/asm/kvm_mmu.h
> index 61505676d085..5806f412a47a 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -302,5 +302,17 @@ static inline void __kvm_extend_hypmap(pgd_t 
> *boot_hyp_pgd,
>   merged_hyp_pgd[idmap_idx] = __pgd(__pa(boot_hyp_pgd) | PMD_TYPE_TABLE);
>  }
> 
> +static inline bool __kvm_pte_is_uncached(pte_t pte)
> +{
> + switch (pte_val(pte) & PTE_ATTRINDX_MASK) {
> + case PTE_ATTRINDX(MT_DEVICE_nGnRnE):
> + case PTE_ATTRINDX(MT_DEVICE_nGnRE):
> + case PTE_ATTRINDX(MT_DEVICE_GRE):
> + case PTE_ATTRINDX(MT_NORMAL_NC):
> + return true;
> + }
> + return false;
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> --
> 1.9.1
> 
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 32/33] nvdimm acpi: support _FIT method

2015-11-08 Thread Xiao Guangrong



On 11/09/2015 01:50 AM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:26PM +0800, Xiao Guangrong wrote:

FIT buffer is not completely mapped into guest address space, so a new
function, Read FIT, function index 0x, is reserved by QEMU to
read the piece of FIT buffer. The buffer is concatenated before _FIT
return

Refer to docs/specs/acpi-nvdimm.txt for detailed design

Signed-off-by: Xiao Guangrong 
---
  hw/acpi/nvdimm.c | 168 +--
  1 file changed, 164 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index f8d7d19..3f35220 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -384,6 +384,18 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
  g_array_free(structures, true);
  }

+/*
+ * define UUID for NVDIMM Root Device according to Chapter 3 DSM Interface
+ * for NVDIMM Root Device - Example in DSM Spec Rev1.
+ */
+#define NVDIMM_DSM_ROOT_UUID "2F10E7A4-9E91-11E4-89D3-123B93F75CBA"
+
+/*
+ * Read FIT Function, which is a QEMU internal use only function, more detail
+ * refer to docs/specs/acpi_nvdimm.txt
+ */
+#define NVDIMM_DSM_FUNC_READ_FIT 0x
+
  /* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
  enum {
  /* Common return status codes. */
@@ -420,6 +432,11 @@ struct NvdimmFuncInSetLabelData {
  } QEMU_PACKED;
  typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;

+struct NvdimmFuncInReadFit {
+uint32_t offset; /* fit offset */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInReadFit NvdimmFuncInReadFit;
+
  struct NvdimmDsmIn {
  uint32_t handle;
  uint32_t revision;
@@ -429,6 +446,7 @@ struct NvdimmDsmIn {
  uint8_t arg3[0];
  NvdimmFuncInSetLabelData func_set_label_data;
  NvdimmFuncInGetLabelData func_get_label_data;
+NvdimmFuncInReadFit func_read_fit;
  };
  } QEMU_PACKED;
  typedef struct NvdimmDsmIn NvdimmDsmIn;
@@ -450,13 +468,71 @@ struct NvdimmFuncOutGetLabelData {
  } QEMU_PACKED;
  typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;

+struct NvdimmFuncOutReadFit {
+uint32_t status;/* return status code. */
+uint32_t length;/* the length of fit data we read. */
+uint8_t fit_data[0]; /* fit data. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutReadFit NvdimmFuncOutReadFit;
+
  static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
  {
  status = cpu_to_le32(status);
  build_append_int_noprefix(out, status, sizeof(status));
  }

-static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
+/* Build fit memory which is presented to guest via _FIT method. */
+static void nvdimm_build_fit(AcpiNVDIMMState *state)
+{
+if (!state->fit) {
+GSList *device_list = nvdimm_get_plugged_device_list();
+
+nvdimm_debug("Rebuild FIT...\n");
+state->fit = nvdimm_build_device_structure(device_list);
+g_slist_free(device_list);
+}
+}
+
+/* Read FIT data, defined in docs/specs/acpi_nvdimm.txt. */
+static void nvdimm_dsm_func_read_fit(AcpiNVDIMMState *state,
+ NvdimmDsmIn *in, GArray *out)
+{
+NvdimmFuncInReadFit *read_fit = &in->func_read_fit;
+NvdimmFuncOutReadFit fit_out;
+uint32_t read_length = TARGET_PAGE_SIZE - sizeof(NvdimmFuncOutReadFit);
+uint32_t status = NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS;
+
+nvdimm_build_fit(state);
+
+le32_to_cpus(&read_fit->offset);
+
+nvdimm_debug("Read FIT offset %#x.\n", read_fit->offset);
+
+if (read_fit->offset > state->fit->len) {
+nvdimm_debug("offset %#x is beyond fit size (%#x).\n",
+ read_fit->offset, state->fit->len);
+goto exit;
+}
+
+read_length = MIN(read_length, state->fit->len - read_fit->offset);
+nvdimm_debug("read length %#x.\n", read_length);
+
+fit_out.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+fit_out.length = cpu_to_le32(read_length);


Is array always empty at this point?
If yes, better assert this here to make sure guest can not
use unlimited memory.


It's unnecessary. At the end of dsm handler, we have asserted it that the
memory size can not beyond the size of dsm memory region:

static uint64_t
nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
{
   ..
exit:
assert(out->len <= memory_region_size(dsm_ram_mr));

/* Write output result to dsm memory. */
memcpy(dsm_ram_addr, out->data, out->len);
memory_region_set_dirty(dsm_ram_mr, 0, out->len);

buf_size = cpu_to_le32(out->len);
..
}




+g_array_append_vals(out, &fit_out, sizeof(fit_out));
+
+if (read_length) {
+g_array_append_vals(out, state->fit->data + read_fit->offset,
+read_length);
+}
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_root(AcpiNVDIMMState *state, NvdimmDsmIn *in,
+GArray *out)
 

Re: [PATCH v6 25/33] nvdimm acpi: build ACPI nvdimm devices

2015-11-08 Thread Xiao Guangrong



On 11/09/2015 01:38 AM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:19PM +0800, Xiao Guangrong wrote:

NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices

There is a root device under \_SB and specified NVDIMM devices are under the
root device. Each NVDIMM device has _ADR which returns its handle used to
associate MEMDEV structure in NFIT

We reserve handle 0 for root device. In this patch, we save handle, handle,
arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch

Signed-off-by: Xiao Guangrong 
---
  hw/acpi/nvdimm.c | 184 +++
  1 file changed, 184 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index dd84e5f..53ed675 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
  g_array_free(structures, true);
  }

+struct NvdimmDsmIn {
+uint32_t handle;
+uint32_t revision;
+uint32_t function;
+   /* the remaining size in the page is used by arg3. */
+uint8_t arg3[0];
+} QEMU_PACKED;
+typedef struct NvdimmDsmIn NvdimmDsmIn;
+
  static uint64_t
  nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
  {
@@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
  static void
  nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
  {
+fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
  }

  static const MemoryRegionOps nvdimm_dsm_ops = {
@@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
MemoryRegion *io,
  memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, &state->io_mr);
  }

+#define BUILD_STA_METHOD(_dev_, _method_)  \
+do {   \
+_method_ = aml_method("_STA", 0);  \
+aml_append(_method_, aml_return(aml_int(0x0f)));   \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
+do {   \
+Aml *ifctx, *uuid; \
+_method_ = aml_method("_DSM", 4);  \
+/* check UUID if it is we expect, return the errorcode if not.*/   \


check that UUID is what we expect?


Yes, it is, better English indeed.




+uuid = aml_touuid(_uuid_); \
+ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
+aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
+aml_append(method, ifctx); \
+aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
+   aml_arg(1), aml_arg(2), aml_arg(3;  \


So name NCAL here matches the name below.
Pls define a macro for it so we aren't limited
by silly 4-letter limitations of AML.
Same applies to all other names you use here and elsewhere.


Okay, it's good to me.




+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
+aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
+
+#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
+BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
+


why are these macros? Make them functions pls.


Since Igor thought it hidden the details and it was not good to the readers.
I will just drop this macros and inline it in the place where it is used.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 12/35] util: let qemu_fd_getlength support block device

2015-11-08 Thread Xiao Guangrong



On 11/06/2015 11:54 PM, Eduardo Habkost wrote:

On Mon, Nov 02, 2015 at 05:13:14PM +0800, Xiao Guangrong wrote:

lseek can not work for all block devices as the man page says:
| Some devices are incapable of seeking and POSIX does not specify
| which devices must support lseek().

This patch tries to add the support on Linux by using BLKGETSIZE64
ioctl

Signed-off-by: Xiao Guangrong 


On which cases is this patch necessary? Do you know any examples of
Linux block devices that won't work with lseek(SEEK_END)?


To be honest, i have not checked all block device, this patch was made
based on the man page. However, i do not mind drop this patch (and maybe
other patches) to make this pachset smaller. BLKGETSIZE64 can be added
in the future if we meet such device.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 11/35] util: introduce qemu_file_getlength()

2015-11-08 Thread Xiao Guangrong



On 11/06/2015 11:50 PM, Eduardo Habkost wrote:

As this patch affects raw_getlength(), CCing the raw block driver
maintainer and the qemu-block mailing list.


Eduardo, thanks for your reminder. I will keep CCing Kevin and qemu-block mail
list for future version.



On Mon, Nov 02, 2015 at 05:13:13PM +0800, Xiao Guangrong wrote:

It is used to get the size of the specified file, also qemu_fd_getlength()
is introduced to unify the code with raw_getlength() in block/raw-posix.c

Signed-off-by: Xiao Guangrong 
---
  block/raw-posix.c|  7 +--
  include/qemu/osdep.h |  2 ++
  util/osdep.c | 31 +++


I know I was the one who suggested osdep.c, but maybe oslib-posix.c is a
more appropriate place for the new function?



Since the function we introduced here can work on both windows and posix, so
i thing osdep.c is the right place. Otherwise we should implement it for 
multiple
platforms.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 07/35] util: introduce qemu_file_get_page_size()

2015-11-08 Thread Xiao Guangrong



On 11/06/2015 11:36 PM, Eduardo Habkost wrote:

On Mon, Nov 02, 2015 at 05:13:09PM +0800, Xiao Guangrong wrote:

There are three places use the some logic to get the page size on
the file path or file fd

Windows did not support file hugepage, so it will return normal page
for this case. And this interface has not been used on windows so far

This patch introduces qemu_file_get_page_size() to unify the code

Signed-off-by: Xiao Guangrong 

[...]

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 914cef5..51437ff 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -340,7 +340,7 @@ static void sigbus_handler(int signal)
  siglongjmp(sigjump, 1);
  }

-static size_t fd_getpagesize(int fd)
+static size_t fd_getpagesize(int fd, Error **errp)
  {
  #ifdef CONFIG_LINUX
  struct statfs fs;
@@ -351,7 +351,12 @@ static size_t fd_getpagesize(int fd)
  ret = fstatfs(fd, &fs);
  } while (ret != 0 && errno == EINTR);

-if (ret == 0 && fs.f_type == HUGETLBFS_MAGIC) {
+if (ret) {
+error_setg_errno(errp, errno, "fstatfs is failed");
+return 0;
+}
+
+if (fs.f_type == HUGETLBFS_MAGIC) {
  return fs.f_bsize;
  }


You are changing os_mem_prealloc() behavior when fstatfs() fails, here.
Have you ensured there are no cases where fstatfs() fails but this code
is still expected to work?


stat() is supported for all kinds of files, so failed stat() is caused by
file is not exist or kernel internal error (e,g memory is not enough) or
security check is not passed. Whichever we should not do any operation on
the file if stat() failed. The origin code did not check it but it is worth
being fixed i think.



The change looks safe: gethugepagesize() seems to be always called in
the path that would make fd_getpagesize() be called from
os_mem_prealloc(), so allocation would abort much earlier if statfs()
failed. But I haven't confirmed that yet, and I wanted to be sure.



Yes, I am entirely agree with you. :)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-08 Thread Jason Wang


On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series fixes two issues of fast mmio eventfd:
>>
>> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS
>>and KVM_FAST_MMIO_BUS. This will cause double in
>>ioeventfd_destructor()
>> 2) A zero length iodev on KVM_MMIO_BUS will never be found but
>>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by
>>qemu instead of host.
>>
>> 1 is fixed by allocating two instances of iodev and introduce a new
>> capability for userspace. 2 is fixed by ignore the actual length if
>> the length of iodev is zero in kvm_io_bus_cmp().
>>
>> Please review.
>> Changes from V5:
>> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
>>   remove the unnecessary checks
>> - even more grammar and typo fixes
>> - rabase to kvm.git
>> - document KVM_CAP_FAST_MMIO
> What's up with userspace using this capability?

It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.

> Did patches ever get posted?

See https://lkml.org/lkml/2015/9/28/208

>
>> Changes from V4:
>> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
>>   the change set.
>> - commit log typo fixes
>> - switch to use kvm_deassign_ioeventfd_id) when fail to register to
>>   fast mmio bus
>> - change kvm_io_bus_cmp() as Paolo's suggestions
>> - introduce a new capability to avoid new userspace crash old kernel
>> - add a new patch that only try to register mmio eventfd on fast mmio
>>   bus
>>
>> Changes from V3:
>>
>> - Don't do search on two buses when trying to do write on
>>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
>> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
>>   let it can find zero length iodevs.
>> - Fix the unnecessary lines in tracepoint patch.
>>
>> Changes from V2:
>> - Tweak styles and comment suggested by Cornelia.
>>
>> Changes from v1:
>> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
>>   needed to save lots of unnecessary changes.
>>
>> Jason Wang (6):
>>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
>>   kvm: factor out core eventfd assign/deassign logic
>>   kvm: fix double free for fast mmio eventfd
>>   kvm: fix zero length mmio searching
>>   kvm: add tracepoint for fast mmio
>>   kvm: add fast mmio capabilitiy
>>
>>  Documentation/virtual/kvm/api.txt |   7 ++-
>>  arch/x86/kvm/trace.h  |  18 ++
>>  arch/x86/kvm/vmx.c|   1 +
>>  arch/x86/kvm/x86.c|   1 +
>>  include/uapi/linux/kvm.h  |   1 +
>>  virt/kvm/eventfd.c| 124 
>> ++
>>  virt/kvm/kvm_main.c   |  20 +-
>>  7 files changed, 118 insertions(+), 54 deletions(-)
>>
>> -- 
>> 2.1.4
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-08 Thread Feng Wu
Use vector-hashing to handle lowest-priority interrupts for
posted-interrupts. As an example, modern Intel CPUs use this
method to handle lowest-priority interrupts.

Signed-off-by: Feng Wu 
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/irq_comm.c | 52 +
 arch/x86/kvm/lapic.c| 57 +
 arch/x86/kvm/lapic.h|  2 ++
 arch/x86/kvm/vmx.c  | 14 --
 5 files changed, 125 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9265196..e225106 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1258,6 +1258,8 @@ bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 
 bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 struct kvm_vcpu **dest_vcpu);
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq);
 
 void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
 struct kvm_lapic_irq *irq);
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 84b96d3..8156e45 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -266,6 +266,58 @@ out:
return r;
 }
 
+/*
+ * This routine handles lowest-priority interrupts using vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here is the details about the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ *vCPUs in an array.
+ * 2. Use "guest vector % max number of destination vCPUs" to find the right
+ *destination vCPU in the array for the lowest-priority interrupt.
+ */
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq)
+
+{
+   unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
+   unsigned int dest_vcpus = 0;
+   struct kvm_vcpu *vcpu;
+   unsigned int i, mod, idx = 0;
+
+   vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
+   if (vcpu)
+   return vcpu;
+
+   memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+   irq->dest_id, irq->dest_mode))
+   continue;
+
+   __set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
+   dest_vcpus++;
+   }
+
+   if (dest_vcpus == 0)
+   return NULL;
+
+   mod = irq->vector % dest_vcpus;
+
+   for (i = 0; i <= mod; i++) {
+   idx = find_next_bit(dest_vcpu_bitmap, KVM_MAX_VCPUS, idx) + 1;
+   BUG_ON(idx >= KVM_MAX_VCPUS);
+   }
+
+   return kvm_get_vcpu(kvm, idx - 1);
+}
+EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
+
 bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 struct kvm_vcpu **dest_vcpu)
 {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ecd4ea1..4937aa4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -816,6 +816,63 @@ out:
return ret;
 }
 
+struct kvm_vcpu *kvm_intr_vector_hashing_dest_fast(struct kvm *kvm,
+  struct kvm_lapic_irq *irq)
+{
+   struct kvm_apic_map *map;
+   struct kvm_vcpu *vcpu = NULL;
+
+   if (irq->shorthand)
+   return NULL;
+
+   rcu_read_lock();
+   map = rcu_dereference(kvm->arch.apic_map);
+
+   if (!map)
+   goto out;
+
+   if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
+   kvm_lowest_prio_delivery(irq)) {
+   u16 cid;
+   int i, idx = 0;
+   unsigned long bitmap = 1;
+   unsigned int mod, dest_vcpus = 0;
+   struct kvm_lapic **dst = NULL;
+
+
+   if (!kvm_apic_logical_map_valid(map))
+   goto out;
+
+   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
+
+   if (cid >= ARRAY_SIZE(map->logical_map))
+   goto out;
+
+   dst = map->logical_map[cid];
+
+   for_each_set_bit(i, &bitmap, 16) {
+   if (!dst[i])
+   continue;
+
+   dest_vcpus++;
+   }
+
+   mod = irq->vector % dest_vcpus;
+
+   for (i = 0; i <= mod; i++) {
+   idx = find_next_bit(&bitmap, KVM_MAX_VCPUS, idx) + 1;
+   BUG_ON(idx >= KVM_MAX_VCPUS);
+   }
+
+   if (kvm_apic_present(dst[idx-1]->vcpu))
+

Re: [PATCH v2 0/4] KVM: VMX: enable LBR virtualization

2015-11-08 Thread Jian Zhou

Hi Paolo,

May I ask that any suggestion about the version 2 of VMX LBRV?
This version is updated following your advices in version 1.
BTW the kvm-unit-test for this feature has sent too, and I
have tested the CPUs emulated by QEMU.

Thanks,
Jian

On 2015/10/23 17:15, Jian Zhou wrote:

Changelog in v2:
   (1) move the implementation into vmx.c
   (2) migraton is supported
   (3) add arrays in kvm_vcpu_arch struct to save/restore
   LBR MSRs at vm exit/entry time.
   (4) add a parameter of kvm_intel module to permanently
   disable LBRV
   (5) table of supported CPUs is reorgnized, LBRV
   can be enabled or not according to the guest CPUID

Jian Zhou (4):
   KVM: X86: Add arrays to save/restore LBR MSRs
   KVM: X86: LBR MSRs of supported CPU types
   KVM: X86: Migration is supported
   KVM: VMX: details of LBR virtualization implementation

  arch/x86/include/asm/kvm_host.h  |  26 -
  arch/x86/include/asm/msr-index.h |  26 -
  arch/x86/kvm/vmx.c   | 245 +++
  arch/x86/kvm/x86.c   |  88 --
  4 files changed, 366 insertions(+), 19 deletions(-)

--
1.7.12.4



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-08 Thread haozhong . zhang
On 11/06/15 13:12, Eduardo Habkost wrote:
> On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote:
> > On 11/05/15 14:05, Eduardo Habkost wrote:
> > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote:
> > > > On 11/04/15 19:42, Eduardo Habkost wrote:
> [...]
> > > > > > +env->tsc_khz_saved = r;
> > > > > > +}
> > > > > 
> > > > > Why do you need a separate tsc_khz_saved field, and don't simply use
> > > > > tsc_khz? It would have the additional feature of letting QMP clients
> > > > > query the current TSC rate by asking for the tsc-freq property on CPU
> > > > > objects.
> > > > >
> > > > 
> > > > It's to avoid overriding env->tsc_khz on the destination in the
> > > > migration. I can change this line to
> > > >  env->tsc_khz = env->tsc_khz_saved = r;
> > > 
> > > You are already avoiding overriding env->tsc_khz, because you use
> > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why
> > > you need a tsc_khz_saved field that requires duplicating the SET_TSC_KHZ
> > > code, if you could just do this:
> > > 
> > > if (!env->tsc_khz) {
> > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> > > }
> > >
> > 
> > Consider an example that we migrate a VM from machine A to machine B
> > and then to machine C, and QEMU on machine B is launched with the cpu
> > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the
> > beginning):
> >  1) In the migration from B to C, the user-specified TSC frequency by
> > 'tsc-freq' on B is expected to be migrated to C. That is, the
> > value of env->tsc_khz on B is migrated.
> >  2) If TSC frequency is migrated through env->tsc_khz, then
> > env->tsc_khz on B will be overrode in the migration from A to B
> > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is
> > different than the user-specified TSC frequency on B, the
> > expectation in 1) will not be satisfied anymore.
> 
> Setting tsc-freq on B when tsc-freq was not used on A is invalid usage.
> This is not different from changing the CPU model and adding or removing
> CPU flags when migrating, which is also incorrect. The command-line
> parameters defining the VM must be the same when you migrate.
>

Good to know it's an invalid usage. Then the question is what QEMU is
expected to do for this invalid usage?

 1) Abort the migration? But I find that the current QEMU does not
abort the migration between different CPU models (e.g. Nehalem and
Haswell).

 2) Or do not abort the migration and ignore tsc-freq option? If so,
tsc_khz_saved will be not needed.

Haozhong

> -- 
> Eduardo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-11-08 Thread haozhong . zhang
On 11/06/15 21:40, Paolo Bonzini wrote:
> 
> 
> On 06/11/2015 13:42, Haozhong Zhang wrote:
> > On 11/06/15 11:49, Paolo Bonzini wrote:
> >>
> >>
> >> On 20/10/2015 09:39, Haozhong Zhang wrote:
> >>> This patchset adds support for VMX TSC scaling feature which is
> >>> available on Intel Skylake CPU. The specification of VMX TSC scaling
> >>> can be found at
> >>> http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html
> >>>
> >>> VMX TSC scaling allows guest TSC which is read by guest rdtsc(p)
> >>> instructions increases in a rate that is customized by the hypervisor
> >>> and can be different than the host TSC rate. Basically, VMX TSC
> >>> scaling adds a 64-bit field called TSC multiplier in VMCS so that, if
> >>> VMX TSC scaling is enabled, TSC read by guest rdtsc(p) instructions
> >>> will be calculated by the following formula:
> >>>
> >>>   guest EDX:EAX = (Host TSC * TSC multiplier) >> 48 + VMX TSC Offset
> >>>
> >>> where, Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.
> >>>
> >>> This patchset, when cooperating with another QEMU patchset (sent in
> >>> another email "target-i386: save/restore vcpu's TSC rate during
> >>> migration"), allows guest programs observe a consistent TSC rate even
> >>> though they are migrated among machines with different host TSC rates.
> >>>
> >>> VMX TSC scaling shares some common logics with SVM TSC ratio which
> >>> is already supported by KVM. Patch 1 ~ 8 move those common logics from
> >>> SVM code to the common code. Upon them, patch 9 ~ 12 add VMX-specific
> >>> support for VMX TSC scaling.
> >>>
> >>> Changes in v2:
> >>>  * Remove the duplicated variable 'kvm_tsc_scaling_ratio_rsvd'.
> >>>  * Remove an unnecessary error check in original patch 2.
> >>>  * Do 64-bit arithmetic by functions recommended by Paolo.
> >>>  * Make kvm_set_tsc_khz() returns an error number so that ioctl
> >>>KVM_SET_TSC_KHZ does not return 0 if errors happen.
> >>>
> >>> Reviewed-by: Eric Northup 
> >>
> >> Thanks for the patches.  There are a couple changes that I can do myself:
> >>
> >> 1) kvm_default_tsc_scaling_ratio can be initialized in
> >> kvm_arch_hardware_setup, since it's just 1ULL <<
> >> kvm_tsc_scaling_ratio_frac_bits
> >>
> > Agree
> > 
> >> 2) things that you are adding to include/linux/kvm_host.h should instead
> >> go in arch/x86/include/linux/kvm_host.h
> >>
> > Agree, because they are x86 specific.
> > 
> >> That's it.  I'll commit it as soon as I test on AMD (today hopefully).
> 
> It tested fine.  I'll give it a shot with the 32-bit mul_u64_u64_shr on
> Monday as well, but I don't expect any issue.
> 
> Thanks, the patches are neat!
> 
> Paolo

Thank you for the test!

Haozhong

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 32/33] nvdimm acpi: support _FIT method

2015-11-08 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:26PM +0800, Xiao Guangrong wrote:
> FIT buffer is not completely mapped into guest address space, so a new
> function, Read FIT, function index 0x, is reserved by QEMU to
> read the piece of FIT buffer. The buffer is concatenated before _FIT
> return
> 
> Refer to docs/specs/acpi-nvdimm.txt for detailed design
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 168 
> +--
>  1 file changed, 164 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index f8d7d19..3f35220 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -384,6 +384,18 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +/*
> + * define UUID for NVDIMM Root Device according to Chapter 3 DSM Interface
> + * for NVDIMM Root Device - Example in DSM Spec Rev1.
> + */
> +#define NVDIMM_DSM_ROOT_UUID "2F10E7A4-9E91-11E4-89D3-123B93F75CBA"
> +
> +/*
> + * Read FIT Function, which is a QEMU internal use only function, more detail
> + * refer to docs/specs/acpi_nvdimm.txt
> + */
> +#define NVDIMM_DSM_FUNC_READ_FIT 0x
> +
>  /* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
>  enum {
>  /* Common return status codes. */
> @@ -420,6 +432,11 @@ struct NvdimmFuncInSetLabelData {
>  } QEMU_PACKED;
>  typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;
>  
> +struct NvdimmFuncInReadFit {
> +uint32_t offset; /* fit offset */
> +} QEMU_PACKED;
> +typedef struct NvdimmFuncInReadFit NvdimmFuncInReadFit;
> +
>  struct NvdimmDsmIn {
>  uint32_t handle;
>  uint32_t revision;
> @@ -429,6 +446,7 @@ struct NvdimmDsmIn {
>  uint8_t arg3[0];
>  NvdimmFuncInSetLabelData func_set_label_data;
>  NvdimmFuncInGetLabelData func_get_label_data;
> +NvdimmFuncInReadFit func_read_fit;
>  };
>  } QEMU_PACKED;
>  typedef struct NvdimmDsmIn NvdimmDsmIn;
> @@ -450,13 +468,71 @@ struct NvdimmFuncOutGetLabelData {
>  } QEMU_PACKED;
>  typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;
>  
> +struct NvdimmFuncOutReadFit {
> +uint32_t status;/* return status code. */
> +uint32_t length;/* the length of fit data we read. */
> +uint8_t fit_data[0]; /* fit data. */
> +} QEMU_PACKED;
> +typedef struct NvdimmFuncOutReadFit NvdimmFuncOutReadFit;
> +
>  static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
>  {
>  status = cpu_to_le32(status);
>  build_append_int_noprefix(out, status, sizeof(status));
>  }
>  
> -static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
> +/* Build fit memory which is presented to guest via _FIT method. */
> +static void nvdimm_build_fit(AcpiNVDIMMState *state)
> +{
> +if (!state->fit) {
> +GSList *device_list = nvdimm_get_plugged_device_list();
> +
> +nvdimm_debug("Rebuild FIT...\n");
> +state->fit = nvdimm_build_device_structure(device_list);
> +g_slist_free(device_list);
> +}
> +}
> +
> +/* Read FIT data, defined in docs/specs/acpi_nvdimm.txt. */
> +static void nvdimm_dsm_func_read_fit(AcpiNVDIMMState *state,
> + NvdimmDsmIn *in, GArray *out)
> +{
> +NvdimmFuncInReadFit *read_fit = &in->func_read_fit;
> +NvdimmFuncOutReadFit fit_out;
> +uint32_t read_length = TARGET_PAGE_SIZE - sizeof(NvdimmFuncOutReadFit);
> +uint32_t status = NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS;
> +
> +nvdimm_build_fit(state);
> +
> +le32_to_cpus(&read_fit->offset);
> +
> +nvdimm_debug("Read FIT offset %#x.\n", read_fit->offset);
> +
> +if (read_fit->offset > state->fit->len) {
> +nvdimm_debug("offset %#x is beyond fit size (%#x).\n",
> + read_fit->offset, state->fit->len);
> +goto exit;
> +}
> +
> +read_length = MIN(read_length, state->fit->len - read_fit->offset);
> +nvdimm_debug("read length %#x.\n", read_length);
> +
> +fit_out.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
> +fit_out.length = cpu_to_le32(read_length);

Is array always empty at this point?
If yes, better assert this here to make sure guest can not
use unlimited memory.

> +g_array_append_vals(out, &fit_out, sizeof(fit_out));
> +
> +if (read_length) {
> +g_array_append_vals(out, state->fit->data + read_fit->offset,
> +read_length);
> +}
> +return;
> +
> +exit:
> +nvdimm_dsm_write_status(out, status);
> +}
> +
> +static void nvdimm_dsm_root(AcpiNVDIMMState *state, NvdimmDsmIn *in,
> +GArray *out)
>  {
>  uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
>  
> @@ -475,6 +551,10 @@ static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
>  return;
>  }
>  
> +if (in->function == NVDIMM_DSM_FUNC_READ_FIT /* FIT Read */) {
> +return nvdimm_dsm_func_read_fit(state, in, out)

Re: [PATCH v6 25/33] nvdimm acpi: build ACPI nvdimm devices

2015-11-08 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:19PM +0800, Xiao Guangrong wrote:
> NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices
> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR which returns its handle used to
> associate MEMDEV structure in NFIT
> 
> We reserve handle 0 for root device. In this patch, we save handle, handle,
> arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 184 
> +++
>  1 file changed, 184 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index dd84e5f..53ed675 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +struct NvdimmDsmIn {
> +uint32_t handle;
> +uint32_t revision;
> +uint32_t function;
> +   /* the remaining size in the page is used by arg3. */
> +uint8_t arg3[0];
> +} QEMU_PACKED;
> +typedef struct NvdimmDsmIn NvdimmDsmIn;
> +
>  static uint64_t
>  nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
>  {
> @@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
>  static void
>  nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
>  {
> +fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
>  }
>  
>  static const MemoryRegionOps nvdimm_dsm_ops = {
> @@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
> MemoryRegion *io,
>  memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, &state->io_mr);
>  }
>  
> +#define BUILD_STA_METHOD(_dev_, _method_)  \
> +do {   \
> +_method_ = aml_method("_STA", 0);  \
> +aml_append(_method_, aml_return(aml_int(0x0f)));   \
> +aml_append(_dev_, _method_);   \
> +} while (0)
> +
> +#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
> +do {   \
> +Aml *ifctx, *uuid; \
> +_method_ = aml_method("_DSM", 4);  \
> +/* check UUID if it is we expect, return the errorcode if not.*/   \

check that UUID is what we expect?

> +uuid = aml_touuid(_uuid_); \
> +ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
> +aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
> +aml_append(method, ifctx); \
> +aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
> +   aml_arg(1), aml_arg(2), aml_arg(3;  \

So name NCAL here matches the name below.
Pls define a macro for it so we aren't limited
by silly 4-letter limitations of AML.
Same applies to all other names you use here and elsewhere.

> +aml_append(_dev_, _method_);   \
> +} while (0)
> +
> +#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
> +aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
> +
> +#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
> +BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
> +

why are these macros? Make them functions pls.

> +static void build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +NVDIMMDevice *nvdimm = device_list->data;
> +int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *dev, *method;
> +
> +dev = aml_device("NV%02X", slot);
> +aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +BUILD_STA_METHOD(dev, method);
> +
> +/*
> + * Chapter 4: _DSM Interface for NVDIMM Device (non-root) - Example
> + * in DSM Spec Rev1.
> + */
> +BUILD_DSM_METHOD(dev, method,
> + handle /* NVDIMM Device Handle */,
> + "4309AC30-0D11-11E4-9191-0800200C9A66"
> + /* UUID for NVDIMM Devices. */);
> +
> +aml_append(root_dev, dev);
> +}
> +}
> +
> +static void nvdimm_build_acpi_devices(GSList *device_list, Aml *sb_scope)
> +{
> +Aml *dev, *method, *field;
> +uint64_t page_size = TARGET_PAGE_SIZE;
> +
> +dev = aml_device("NVDR");
> +aml_append(dev, aml_nam

[PATCH] kvmtool: Makefile: remove LDFLAGS from guest_init linking

2015-11-08 Thread Andre Przywara
Looking back at the HEAD from a few commits ago, it's obvious that
using the LDFLAGS variable for linking the guest_init binary was
rather pointless, as it was zeroed in the beginning and then never
set.

As guest_init is a rather special binary that does not cope well with
arbitrary linker flags, let's reinstantiate the previous state by
removing the LDFLAGS variable from those linking steps. This allows
LDFLAGS to be used for linking the actual kvmtool binary only and
helps to re-merge commit d0e2772b93a ("Makefile: allow overriding
CFLAGS on the command line").

Signed-off-by: Andre Przywara 
---
Hi,

Riku: can you confirm that this patch together with the one that was
reverted (d0e2772b93a) still works in your environment?

Will, if that works for Riku and you acknowledge this patch, can you
also re-merge the reverted patch mentioned above? Turns out that the
revert also made my other patch useless ;-)

Cheers,
Andre.

 Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 9138942..8095d59 100644
--- a/Makefile
+++ b/Makefile
@@ -385,13 +385,13 @@ ifneq ($(ARCH_PRE_INIT),)
 $(GUEST_PRE_INIT): $(ARCH_PRE_INIT)
$(E) "  LINK" $@
$(Q) $(CC) -s -nostdlib $(ARCH_PRE_INIT) -o $@
-   $(Q) $(LD) $(LDFLAGS) -r -b binary -o guest/guest_pre_init.o 
$(GUEST_PRE_INIT)
+   $(Q) $(LD) -r -b binary -o guest/guest_pre_init.o $(GUEST_PRE_INIT)
 endif
 
 $(GUEST_INIT): guest/init.c
$(E) "  LINK" $@
$(Q) $(CC) $(GUEST_INIT_FLAGS) guest/init.c -o $@
-   $(Q) $(LD) $(LDFLAGS) -r -b binary -o guest/guest_init.o $(GUEST_INIT)
+   $(Q) $(LD) -r -b binary -o guest/guest_init.o $(GUEST_INIT)
 
 %.s: %.c
$(Q) $(CC) -o $@ -S $(CFLAGS) -fverbose-asm $<
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-08 Thread Michael S. Tsirkin
On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
> Hi:
> 
> This series fixes two issues of fast mmio eventfd:
> 
> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS
>and KVM_FAST_MMIO_BUS. This will cause double in
>ioeventfd_destructor()
> 2) A zero length iodev on KVM_MMIO_BUS will never be found but
>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by
>qemu instead of host.
> 
> 1 is fixed by allocating two instances of iodev and introduce a new
> capability for userspace. 2 is fixed by ignore the actual length if
> the length of iodev is zero in kvm_io_bus_cmp().
> 
> Please review.
> Changes from V5:
> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
>   remove the unnecessary checks
> - even more grammar and typo fixes
> - rabase to kvm.git
> - document KVM_CAP_FAST_MMIO

What's up with userspace using this capability?
Did patches ever get posted?

> 
> Changes from V4:
> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
>   the change set.
> - commit log typo fixes
> - switch to use kvm_deassign_ioeventfd_id) when fail to register to
>   fast mmio bus
> - change kvm_io_bus_cmp() as Paolo's suggestions
> - introduce a new capability to avoid new userspace crash old kernel
> - add a new patch that only try to register mmio eventfd on fast mmio
>   bus
> 
> Changes from V3:
> 
> - Don't do search on two buses when trying to do write on
>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
>   let it can find zero length iodevs.
> - Fix the unnecessary lines in tracepoint patch.
> 
> Changes from V2:
> - Tweak styles and comment suggested by Cornelia.
> 
> Changes from v1:
> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
>   needed to save lots of unnecessary changes.
> 
> Jason Wang (6):
>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
>   kvm: factor out core eventfd assign/deassign logic
>   kvm: fix double free for fast mmio eventfd
>   kvm: fix zero length mmio searching
>   kvm: add tracepoint for fast mmio
>   kvm: add fast mmio capabilitiy
> 
>  Documentation/virtual/kvm/api.txt |   7 ++-
>  arch/x86/kvm/trace.h  |  18 ++
>  arch/x86/kvm/vmx.c|   1 +
>  arch/x86/kvm/x86.c|   1 +
>  include/uapi/linux/kvm.h  |   1 +
>  virt/kvm/eventfd.c| 124 
> ++
>  virt/kvm/kvm_main.c   |  20 +-
>  7 files changed, 118 insertions(+), 54 deletions(-)
> 
> -- 
> 2.1.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-08 Thread David Woodhouse
On Sun, 2015-11-08 at 12:37 +0200, Michael S. Tsirkin wrote:
> On Thu, Oct 29, 2015 at 05:18:56PM +0100, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 11:01 +0200, Michael S. Tsirkin wrote:
> > > 
> > > But you trust your hypervisor (you have no choice anyway),
> > > and you don't want the overhead of tweaking IOMMU
> > > on data path for virtio. Thus iommu=on is out too.
> > 
> > That's not at all special for virtio or guest VMs. Even with real
> > hardware, we might want performance from *some* devices, and security
> > from others. See the DMA_ATTR_IOMMU_BYPASS which is currently being
> > discussed.
> 
> Right. So let's wait for that discussion to play out?

That discussion is purely about a requested optimisation. This one is
about correctness.

> > But of course the easy answer in *your* case it just to ask the
> > hypervisor not to put the virtio devices behind an IOMMU at all. Which
> > we were planning to remain the default behaviour.
> 
> One can't do this for x86 ATM, can one?

The converse is true, in fact — currently, there's no way to tell 
qemu-system-x86 that you *do* want it to put the virtio devices behind
the emulated IOMMU, as it has no support for that.

Which is a bit sad really, since the DMAR table that qemu advertises to
the guest does *tell* the guest that the virtio devices are behind the
emulated IOMMU.

In the short term, we'll be fixing the DMAR table, and still not
actually making it possible to put the virtio devices behind the
emulated IOMMU.

In the fullness of time, however, we *will* be fixing the qemu IOMMU
code so that it can translate for virtio devices — and for assigned
physical devices, which I believe are also broken at the moment when
qemu emulates an IOMMU.

> > In all cases, the DMA API shall do the right thing.
> 
> I have no problem with that. For example, can we teach
> the DMA API on intel x86 to use PT for virtio by default?
> That would allow merging Andy's patches with
> full compatibility with old guests and hosts.

A quirk so that we *notice* the bug in the existing qemu DMAR table,
and disbelieve what it says about the virtio devices?

Alternatively, we could just recognise that the emulated IOMMU support
in qemu is an experimental feature and doesn't work right, yet. Are
people really using it in anger?

If we do want to do a quirk, then we should make it get it right for
assigned devices too.

To start with, do you want to try to express the criteria for "the DMAR
table lies and  device is actually untranslated" in a form of
prose which could reasonably be translated into code?

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-08 Thread Joerg Roedel
On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> I have no problem with that. For example, can we teach
> the DMA API on intel x86 to use PT for virtio by default?
> That would allow merging Andy's patches with
> full compatibility with old guests and hosts.

Well, the only incompatibility comes from an experimental qemu feature,
more explicitly from a bug in that features implementation. So why
should we work around that in the kernel? I think it is not too hard to
fix qemu to generate a correct DMAR table which excludes the virtio
devices from iommu translation.


Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-08 Thread Michael S. Tsirkin
On Thu, Oct 29, 2015 at 05:18:56PM +0100, David Woodhouse wrote:
> On Thu, 2015-10-29 at 11:01 +0200, Michael S. Tsirkin wrote:
> > 
> > Example: you have a mix of assigned devices and virtio devices. You
> > don't trust your assigned device vendor not to corrupt your memory so
> > you want to limit the damage your assigned device can do to your
> > guest,
> > so you use an IOMMU for that.  Thus existing iommu=pt within guest is
> > out.
> > 
> > But you trust your hypervisor (you have no choice anyway),
> > and you don't want the overhead of tweaking IOMMU
> > on data path for virtio. Thus iommu=on is out too.
> 
> That's not at all special for virtio or guest VMs. Even with real
> hardware, we might want performance from *some* devices, and security
> from others. See the DMA_ATTR_IOMMU_BYPASS which is currently being
> discussed.

Right. So let's wait for that discussion to play out?

> But of course the easy answer in *your* case it just to ask the
> hypervisor not to put the virtio devices behind an IOMMU at all. Which
> we were planning to remain the default behaviour.

One can't do this for x86 ATM, can one?

> In all cases, the DMA API shall do the right thing.

I have no problem with that. For example, can we teach
the DMA API on intel x86 to use PT for virtio by default?
That would allow merging Andy's patches with
full compatibility with old guests and hosts.

> -- 
> dwmw2
> 
> 



-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html