Re: [Qemu-devel] How to reserve guest physical region for ACPI

2016-01-07 Thread Igor Mammedov
On Tue, 5 Jan 2016 18:43:02 +0200
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
> > > > bios-linker-loader is a great interface for initializing some
> > > > guest owned data and linking it together but I think it adds
> > > > unnecessary complexity and is misused if it's used to handle
> > > > device owned data/on device memory in this and VMGID cases.
> > > 
> > > I want a generic interface for guest to enumerate these things.  linker
> > > seems quite reasonable but if you see a reason why it won't do, or want
> > > to propose a better interface, fine.
> > > 
> > > PCI would do, too - though windows guys had concerns about
> > > returning PCI BARs from ACPI.  
> > There were potential issues with pSeries bootloader that treated
> > PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> > Could you point out to discussion about windows issues?
> > 
> > What VMGEN patches that used PCI for mapping purposes were
> > stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> > class id but we couldn't agree on it.
> > 
> > VMGEN v13 with full discussion is here
> > https://patchwork.ozlabs.org/patch/443554/
> > So to continue with this route we would need to pick some other
> > driver less class id so windows won't prompt for driver or
> > maybe supply our own driver stub to guarantee that no one
> > would touch it. Any suggestions?  
> 
> Pick any device/vendor id pair for which windows specifies no driver.
> There's a small risk that this will conflict with some
> guest but I think it's minimal.
The device/vendor id pair was QEMU specific so it doesn't conflict with anything;
the issue we were trying to solve was to prevent Windows from asking for a driver,
even though it does so only once if told not to ask again.

That's why PCI_CLASS_MEMORY_RAM was selected: it's the generic driver-less
device descriptor in the INF file which matches as a last resort if
there isn't any other driver that matches the device by device/vendor id pair.

> 
> 
> > > 
> > >   
> > > > There was RFC on list to make BIOS boot from NVDIMM already
> > > > doing some ACPI table lookup/parsing. Now if they were forced
> > > > to also parse and execute AML to initialize QEMU with guest
> > > > allocated address that would complicate them quite a bit.
> > > 
> > > If they just need to find a table by name, it won't be
> > > too bad, would it?  
> > that's what they were doing scanning memory for static NVDIMM table.
> > However if it were DataTable, BIOS side would have to execute
> > AML so that the table address could be told to QEMU.  
> 
> Not at all. You can find any table by its signature without
> parsing AML.
yep, and then the BIOS would need to tell its address to QEMU by
writing to an IO port which is allocated statically in QEMU
for this purpose and is described in AML only on the guest side.
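
A rough sketch of the QEMU side of such a scheme (the struct and field
names below are made up for illustration, this is not the code from the
series): a statically allocated IO port pair whose write handler latches
the address the guest-side AML stores into it:

  /* illustration only: latch a 64-bit guest address written as two
   * 32-bit halves to a fixed IO port pair */
  typedef struct DsmAddrState {
      uint64_t dsm_mem_addr;   /* hypothetical field */
  } DsmAddrState;

  static void dsm_addr_write(void *opaque, hwaddr addr,
                             uint64_t val, unsigned size)
  {
      DsmAddrState *s = opaque;

      if (addr == 0) {         /* low 32 bits */
          s->dsm_mem_addr = (s->dsm_mem_addr & ~0xffffffffULL) | (uint32_t)val;
      } else {                 /* high 32 bits */
          s->dsm_mem_addr = (s->dsm_mem_addr & 0xffffffffULL) | (val << 32);
      }
  }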

> 
> 
> > In case of direct mapping or PCI BAR there is no need to initialize
> > QEMU side from AML.
> > That also saves us IO port where this address should be written
> > if bios-linker-loader approach is used.
> >   
> > >   
> > > > While with NVDIMM control memory region mapped directly by QEMU,
> > > > respective patches don't need in any way to initialize QEMU,
> > > > all they would need just read necessary data from control region.
> > > > 
> > > > Also using bios-linker-loader takes away some usable RAM
> > > > from guest and in the end that doesn't scale,
> > > > the more devices I add the less usable RAM is left for guest OS
> > > > while all the device needs is a piece of GPA address space
> > > > that would belong to it.
> > > 
> > > I don't get this comment. I don't think it's MMIO that is wanted.
> > > If it's backed by qemu virtual memory then it's RAM.  
> > Then why don't allocate video card VRAM the same way and try to explain
> > user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
> > only has 64Mb of available RAM because of we think that on device VRAM
> > is also RAM.
> > 
> > Maybe I've used MMIO term wrongly here but it roughly reflects the idea
> > that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
> > area) is not allocated from guest's usable RAM (as described in E820)
> > but rather directly mapped in guest's GPA and doesn't consume available
> > RAM as guest sees it. That's also the way it's done on real hardware.

Re: [PATCH 5/6] nvdimm acpi: let qemu handle _DSM method

2016-01-07 Thread Igor Mammedov
On Tue,  5 Jan 2016 02:52:07 +0800
Xiao Guangrong  wrote:

> If dsm memory is successfully patched, we let qemu fully emulate
> the dsm method
> 
> This patch saves _DSM input parameters into dsm memory, tell dsm
> memory address to QEMU, then fetch the result from the dsm memory
you also need to add an NVDR._CRS method that reports the
resources used by the operation regions.

NVDIMM_COMMON_DSM should probably be serialized, otherwise
there is a risk of a race when several callers write to the
control region.
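
For illustration, with the existing aml-build helpers the two points
above could look roughly like this (a sketch only, not part of the
patch; it reuses NVDIMM_COMMON_DSM, NVDIMM_ACPI_IO_BASE/LEN and the
dev/method variables from this series):

  method = aml_method(NVDIMM_COMMON_DSM, 4, AML_SERIALIZED);

  /* NVDR._CRS advertising the IO range that backs the NPIO operation region */
  Aml *crs = aml_resource_template();
  aml_append(crs, aml_io(AML_DECODE16, NVDIMM_ACPI_IO_BASE,
                         NVDIMM_ACPI_IO_BASE, 0, NVDIMM_ACPI_IO_LEN));
  aml_append(dev, aml_name_decl("_CRS", crs));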


> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c |  27 ++
>  hw/acpi/nvdimm.c| 124 
> ++--
>  include/hw/acpi/aml-build.h |   2 +
>  3 files changed, 150 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 677c1a6..e65171f 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1013,6 +1013,19 @@ Aml *create_field_common(int opcode, Aml *srcbuf, Aml 
> *index, const char *name)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateField */
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x13); /* CreateFieldOp */
> +aml_append(var, srcbuf);
> +aml_append(var, index);
> +aml_append(var, len);
> +build_append_namestring(var->buf, "%s", name);
> +return var;
> +}
> +
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateDWordField */
>  Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, const char *name)
>  {
> @@ -1439,6 +1452,20 @@ Aml *aml_alias(const char *source_object, const char 
> *alias_object)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefConcat */
> +Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target)
> +{
> +Aml *var = aml_opcode(0x73 /* ConcatOp */);
> +aml_append(var, source1);
> +aml_append(var, source2);
> +
> +if (target) {
> +aml_append(var, target);
> +}
> +
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev,
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index a72104c..dfccbc0 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -369,6 +369,24 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +struct NvdimmDsmIn {
> +uint32_t handle;
> +uint32_t revision;
> +uint32_t function;
> +   /* the remaining size in the page is used by arg3. */
> +union {
> +uint8_t arg3[0];
> +};
> +} QEMU_PACKED;
> +typedef struct NvdimmDsmIn NvdimmDsmIn;
> +
> +struct NvdimmDsmOut {
> +/* the size of buffer filled by QEMU. */
> +uint32_t len;
> +uint8_t data[0];
> +} QEMU_PACKED;
> +typedef struct NvdimmDsmOut NvdimmDsmOut;
> +
>  static uint64_t
>  nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
>  {
> @@ -408,11 +426,21 @@ void nvdimm_init_acpi_state(AcpiNVDIMMState *state, 
> MemoryRegion *io,
>  
>  static void nvdimm_build_common_dsm(Aml *dev)
>  {
> -Aml *method, *ifctx, *function;
> +Aml *method, *ifctx, *function, *unpatched, *field, *high_dsm_mem;
> +Aml *result_size, *dsm_mem;
>  uint8_t byte_list[1];
>  
>  method = aml_method(NVDIMM_COMMON_DSM, 4, AML_NOTSERIALIZED);
>  function = aml_arg(2);
> +dsm_mem = aml_arg(3);
> +
> +aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));
> +
> +/*
> + * do not support any method if DSM memory address has not been
> + * patched.
> + */
> +unpatched = aml_if(aml_equal(dsm_mem, aml_int64(0x0)));
>  
>  /*
>   * function 0 is called to inquire what functions are supported by
> @@ -421,12 +449,102 @@ static void nvdimm_build_common_dsm(Aml *dev)
>  ifctx = aml_if(aml_equal(function, aml_int(0)));
>  byte_list[0] = 0 /* No function Supported */;
>  aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
> -aml_append(method, ifctx);
> +aml_append(unpatched, ifctx);
>  
>  /* No function is supported yet. */
>  byte_list[0] = 1 /* Not Supported */;
> -aml_append(method, aml_return(aml_buffer(1, byte_list)));
> +aml_append(unpatched, aml_return(aml_buffer(1, byte_list)));
> +aml_append(method, unpatched);
> +
> +/* map DSM memory and IO into ACPI namespace. */
> +aml_append(method, aml_operation_region("NPIO", AML_SYSTEM_IO,
> +   aml_int(NVDIMM_ACPI_IO_BASE), NVDIMM_ACPI_IO_LEN));
> +aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
> +dsm_mem, TARGET_PAGE_SIZE));
> +
> +/*
> + * DSM notifier:
> + * LNTF: write 

Re: How to reserve guest physical region for ACPI

2016-01-07 Thread Igor Mammedov
On Mon, 4 Jan 2016 21:17:31 +0100
Laszlo Ersek <ler...@redhat.com> wrote:

> Michael CC'd me on the grandparent of the email below. I'll try to add
> my thoughts in a single go, with regard to OVMF.
> 
> On 12/30/15 20:52, Michael S. Tsirkin wrote:
> > On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
> >> On Mon, 28 Dec 2015 14:50:15 +0200
> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >>  
> >>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> >>>>
> >>>> Hi Michael, Paolo,
> >>>>
> >>>> Now it is the time to return to the challenge that how to reserve guest
> >>>> physical region internally used by ACPI.
> >>>>
> >>>> Igor suggested that:
> >>>> | An alternative place to allocate reserve from could be high memory.
> >>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>> | that hotpluggable memory range isn't used by firmware
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html) 
> >>>>  
> 
> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> reason is that nobody wrote that patch, nor asked for the patch to be
> written. (Not implying that just requesting the patch would be
> sufficient for the patch to be written.)
Hijacking this part of the thread to check whether OVMF would work with
memory hotplug and whether it needs "reserved-memory-end" support at all.

How does OVMF determine which GPA ranges to use for initializing PCI BARs
at boot time, more specifically 64-bit BARs?


Re: [Qemu-devel] [PATCH 0/6] NVDIMM ACPI: introduce the framework of QEMU emulated DSM

2016-01-07 Thread Igor Mammedov
On Tue,  5 Jan 2016 02:52:02 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> This patchset is against commit 5530427f0ca (acpi: extend aml_and() to
> accept target argument) on pci branch of Michael's git tree
> and can be found at:
>   https://github.com/xiaogr/qemu.git nvdimm-acpi-v1
> 
> This is the second part of vNVDIMM implementation which implements the
> BIOS patched dsm memory and introduces the framework that allows QEMU
> to emulate DSM method
> 
> Thanks to Michael's idea, we do not reserve any memory for NVDIMM ACPI,
> instead we let BIOS allocate the memory and patch the address to the
> offset we want
> 
> IO port is still enabled as it plays as the way to notify QEMU and pass
> the patched dsm memory address, so that IO port region, 0x0a18 - 0xa20,
> is reserved and it is divided into two 32 bits ports and used to pass
> the low 32 bits and high 32 bits of dsm memory address to QEMU
> 
> Thanks Igor's idea, this patchset also extends DSDT/SSDT to revision 2
> to apply 64 bit operations, in order to keeping compatibility, old
> version (<= 2.5) still uses revision 1. Since 64 bit operations breaks
> old guests (such as windows XP), we should keep the 64 bits stuff in
> the private place where common ACPI operation does not touch it
> 

general notes:
1. could you split out the AML API additions/changes into separate patches?
   even if the series' nvdimm patches couldn't be accepted on the next respin,
   the AML API patches could be good and we could pick them up just
   for API completeness. That would also make them easier to review
   and reduce the count of patches you'd need to respin.
2. add a test case for the NVDIMM table blob, see tests/bios-tables-test.c,
   at the beginning of the series.
3. "make V=1 check" will show you the ASL diff your patches are introducing;
   it will save you from booting a real guest and dumping/decompiling
   tables manually.
4. at the end of the series add an NVDIMM table test blob with the new table.
   you can use tests/acpi-test-data/rebuild-expected-aml.sh to make it.
5. if "make check" by some miracle passes with these patches,
   dump the NVDIMM table in the guest, try to decompile it and then compile it
   back with IASL; that will show you what needs to be fixed.

PS:
 by NVDIMM table I mean the SSDT NVDIMM table.

> Igor Mammedov (1):
>   pc: acpi: bump DSDT/SSDT compliance revision to v2
> 
> Xiao Guangrong (5):
>   nvdimm acpi: initialize the resource used by NVDIMM ACPI
>   nvdimm acpi: introduce patched dsm memory
>   acpi: allow using acpi named offset for OperationRegion
>   nvdimm acpi: let qemu handle _DSM method
>   nvdimm acpi: emulate dsm method
> 
>  hw/acpi/Makefile.objs   |   2 +-
>  hw/acpi/aml-build.c |  45 +++-
>  hw/acpi/ich9.c  |  32 +
>  hw/acpi/nvdimm.c| 276 
> ++--
>  hw/acpi/piix4.c |   3 +
>  hw/i386/acpi-build.c|  41 ---
>  hw/i386/pc.c|   8 +-
>  hw/i386/pc_piix.c   |   5 +
>  hw/i386/pc_q35.c|   8 +-
>  include/hw/acpi/aml-build.h |   6 +-
>  include/hw/acpi/ich9.h  |   2 +
>  include/hw/i386/pc.h|  19 ++-
>  include/hw/mem/nvdimm.h |  44 ++-
>  13 files changed, 449 insertions(+), 42 deletions(-)
> 



Re: How to reserve guest physical region for ACPI

2016-01-06 Thread Igor Mammedov
On Tue, 5 Jan 2016 18:22:33 +0100
Laszlo Ersek <ler...@redhat.com> wrote:

> On 01/05/16 18:08, Igor Mammedov wrote:
> > On Mon, 4 Jan 2016 21:17:31 +0100
> > Laszlo Ersek <ler...@redhat.com> wrote:
> >   
> >> Michael CC'd me on the grandparent of the email below. I'll try to add
> >> my thoughts in a single go, with regard to OVMF.
> >>
> >> On 12/30/15 20:52, Michael S. Tsirkin wrote:  
> >>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> >>>> On Mon, 28 Dec 2015 14:50:15 +0200
> >>>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >>>>
> >>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> >>>>>>
> >>>>>> Hi Michael, Paolo,
> >>>>>>
> >>>>>> Now it is the time to return to the challenge that how to reserve guest
> >>>>>> physical region internally used by ACPI.
> >>>>>>
> >>>>>> Igor suggested that:
> >>>>>> | An alternative place to allocate reserve from could be high memory.
> >>>>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>>>> | that hotpluggable memory range isn't used by firmware
> >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> >>>>>> 
> >>
> >> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> >> reason is that nobody wrote that patch, nor asked for the patch to be
> >> written. (Not implying that just requesting the patch would be
> >> sufficient for the patch to be written.)
> >>  
> >>>>> I don't want to tie things to reserved-memory-end because this
> >>>>> does not scale: next time we need to reserve memory,
> >>>>> we'll need to find yet another way to figure out what is where.
> >>>> Could you elaborate a bit more on a problem you're seeing?
> >>>>
> >>>> To me it looks like it scales rather well.
> >>>> For example lets imagine that we adding a device
> >>>> that has some on device memory that should be mapped into GPA
> >>>> code to do so would look like:
> >>>>
> >>>>   pc_machine_device_plug_cb(dev)
> >>>>   {
> >>>>...
> >>>>if (dev == OUR_NEW_DEVICE_TYPE) {
> >>>>memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >>>>set_new_reserved_end(current_reserved_end + 
> >>>> memory_region_size(&dev->mr));
> >>>>}
> >>>>   }
> >>>>
> >>>> we can practically add any number of new devices that way.
> >>>
> >>> Yes but we'll have to build a host side allocator for these, and that's
> >>> nasty. We'll also have to maintain these addresses indefinitely (at
> >>> least per machine version) as they are guest visible.
> >>> Not only that, there's no way for guest to know if we move things
> >>> around, so basically we'll never be able to change addresses.
> >>>
> >>> 
> >>>>  
> >>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> >>>>> support 64 bit RAM instead
> >>
> >> This looks quite doable in OVMF, as long as the blob to allocate from
> >> high memory contains *zero* ACPI tables.
> >>
> >> (
> >> Namely, each ACPI table is installed from the containing fw_cfg blob
> >> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
> >> own allocation policy for the *copies* of ACPI tables it installs.
> >>
> >> This allocation policy is left unspecified in the section of the UEFI
> >> spec that governs EFI_ACPI_TABLE_PROTOCOL.
> >>
> >> The current policy in edk2 (= the reference implementation) seems to be
> >> "allocate from under 4GB". It is currently being changed to "try to
> >> allocate from under 4GB, and if that fails, retry from high memory". (It
> >> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
> >> )
> >>  
> >>>>> (and maybe a way to allocate and
> >>>>> zero-initialize buffer without loading it through fwcfg),
> >>
> >> Sounds reas

Re: [PATCH 3/6] nvdimm acpi: introduce patched dsm memory

2016-01-06 Thread Igor Mammedov
On Tue,  5 Jan 2016 02:52:05 +0800
Xiao Guangrong  wrote:

> The dsm memory is used to save the input parameters and store
> the dsm result which is filled by QEMU.
> 
> The address of dsm memory is decided by bios and patched into
> int64 object returned by "MEMA" method
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 12 
>  hw/acpi/nvdimm.c| 24 ++--
>  include/hw/acpi/aml-build.h |  1 +
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 78e1290..83eadb3 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -394,6 +394,18 @@ Aml *aml_int(const uint64_t val)
>  }
>  
>  /*
> + * ACPI 1.0b: 16.2.3 Data Objects Encoding:
> + * encode: QWordConst
> + */
> +Aml *aml_int64(const uint64_t val)
> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x0E); /* QWordPrefix */
> +build_append_int_noprefix(var->buf, val, 8);
> +return var;
> +}
> +
> +/*
>   * helper to construct NameString, which returns Aml object
>   * for using with aml_append or other aml_* terms
>   */
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index bc7cd8f..a72104c 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -28,6 +28,7 @@
>  
>  #include "hw/acpi/acpi.h"
>  #include "hw/acpi/aml-build.h"
> +#include "hw/acpi/bios-linker-loader.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/mem/nvdimm.h"
>  
> @@ -402,7 +403,8 @@ void nvdimm_init_acpi_state(AcpiNVDIMMState *state, 
> MemoryRegion *io,
>  state->dsm_mem->len);
>  }
>  
> -#define NVDIMM_COMMON_DSM  "NCAL"
> +#define NVDIMM_GET_DSM_MEM  "MEMA"
> +#define NVDIMM_COMMON_DSM   "NCAL"
>  
>  static void nvdimm_build_common_dsm(Aml *dev)
>  {
> @@ -468,7 +470,8 @@ static void nvdimm_build_ssdt(GSList *device_list, GArray 
> *table_offsets,
>GArray *table_data, GArray *linker,
>uint8_t revision)
>  {
> -Aml *ssdt, *sb_scope, *dev;
> +Aml *ssdt, *sb_scope, *dev, *method;
> +int offset;
>  
>  acpi_add_table(table_offsets, table_data);
>  
> @@ -499,9 +502,26 @@ static void nvdimm_build_ssdt(GSList *device_list, 
> GArray *table_offsets,
>  
>  aml_append(sb_scope, dev);
>  
> +/*
> + * leave it at the end of ssdt so that we can conveniently get the
> + * offset of int64 object returned by the function which will be
> + * patched with the real address of the dsm memory by BIOS.
> + */
> +method = aml_method(NVDIMM_GET_DSM_MEM, 0, AML_NOTSERIALIZED);
> +aml_append(method, aml_return(aml_int64(0x0)));
there is no need for a dedicated aml_int64(); you can use the aml_int(0x64) trick

> +aml_append(sb_scope, method);
>  aml_append(ssdt, sb_scope);
>  /* copy AML table into ACPI tables blob and patch header there */
>  g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> +
> +offset = table_data->len - 8;
> +
> +bios_linker_loader_alloc(linker, NVDIMM_DSM_MEM_FILE, TARGET_PAGE_SIZE,
> + false /* high memory */);
> +bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
> +   NVDIMM_DSM_MEM_FILE, table_data,
> +   table_data->data + offset,
> +   sizeof(uint64_t));
this offset magic will break badly as soon as someone adds something
to the end of the SSDT.


>  build_header(linker, table_data,
>  (void *)(table_data->data + table_data->len - ssdt->buf->len),
>  "SSDT", ssdt->buf->len, revision, "NVDIMM");
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index ef44d02..b4726a4 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -246,6 +246,7 @@ Aml *aml_name(const char *name_format, ...) 
> GCC_FMT_ATTR(1, 2);
>  Aml *aml_name_decl(const char *name, Aml *val);
>  Aml *aml_return(Aml *val);
>  Aml *aml_int(const uint64_t val);
> +Aml *aml_int64(const uint64_t val);
>  Aml *aml_arg(int pos);
>  Aml *aml_to_integer(Aml *arg);
>  Aml *aml_to_hexstring(Aml *src, Aml *dst);



Re: How to reserve guest physical region for ACPI

2016-01-05 Thread Igor Mammedov
On Wed, 30 Dec 2015 21:52:32 +0200
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> > On Mon, 28 Dec 2015 14:50:15 +0200
> > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >   
> > > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> > > > 
> > > > Hi Michael, Paolo,
> > > > 
> > > > Now it is the time to return to the challenge that how to reserve guest
> > > > physical region internally used by ACPI.
> > > > 
> > > > Igor suggested that:
> > > > | An alternative place to allocate reserve from could be high memory.
> > > > | For pc we have "reserved-memory-end" which currently makes sure
> > > > | that hotpluggable memory range isn't used by firmware
> > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> > > >   
> > > 
> > > I don't want to tie things to reserved-memory-end because this
> > > does not scale: next time we need to reserve memory,
> > > we'll need to find yet another way to figure out what is where.  
> > Could you elaborate a bit more on a problem you're seeing?
> > 
> > To me it looks like it scales rather well.
> > For example lets imagine that we adding a device
> > that has some on device memory that should be mapped into GPA
> > code to do so would look like:
> > 
> >   pc_machine_device_plug_cb(dev)
> >   {
> >...
> >if (dev == OUR_NEW_DEVICE_TYPE) {
> >memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >set_new_reserved_end(current_reserved_end + 
> > memory_region_size(&dev->mr));
> >}
> >   }
> > 
> > we can practically add any number of new devices that way.  
> 
> Yes but we'll have to build a host side allocator for these, and that's
> nasty. We'll also have to maintain these addresses indefinitely (at
> least per machine version) as they are guest visible.
> Not only that, there's no way for guest to know if we move things
> around, so basically we'll never be able to change addresses.
The simplistic GPA allocator in the snippet above does the job.

If one unconditionally adds a device in a new version then yes,
the code has to have compat code based on machine version.
But that applies to any device that has state to migrate
or to any address space layout change.

However a device that directly maps addresses doesn't have to
have a fixed address; it could behave the same way as a
PCI device with BARs, the only difference being that its
MemoryRegions are mapped before the guest is running vs
BARs mapped by the BIOS.
It could be worth creating a generic base device class
that does the above. It could then be inherited from and
extended by concrete device implementations.
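
For illustration, a minimal QOM sketch of such an abstract base class
(all type and field names below are made up, not existing QEMU code);
concrete devices would inherit from it and fill in the memory region:

  #define TYPE_GPA_MAPPED_DEVICE "gpa-mapped-device"

  typedef struct GPAMappedDevice {
      DeviceState parent_obj;
      MemoryRegion mr;        /* on-device memory exposed to the guest */
  } GPAMappedDevice;

  static const TypeInfo gpa_mapped_device_info = {
      .name          = TYPE_GPA_MAPPED_DEVICE,
      .parent        = TYPE_DEVICE,
      .instance_size = sizeof(GPAMappedDevice),
      .abstract      = true,  /* only meant to be subclassed */
  };

  static void gpa_mapped_device_register_types(void)
  {
      type_register_static(&gpa_mapped_device_info);
  }
  type_init(gpa_mapped_device_register_types)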

> >
> > > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > > support 64 bit RAM instead (and maybe a way to allocate and
> > > zero-initialize buffer without loading it through fwcfg), this way bios
> > > does the allocation, and addresses can be patched into acpi.  
> > and then guest side needs to parse/execute some AML that would
> > initialize QEMU side so it would know where to write data.  
> 
> Well not really - we can put it in a data table, by itself
> so it's easy to find.
> 
> AML is only needed if access from ACPI is desired.
in both cases (VMGEN, NVDIMM) access from ACPI is required,
at minimum to write the address back to QEMU, and for NVDIMM
to pass _DSM method data between guest and QEMU.

> 
> 
> > bios-linker-loader is a great interface for initializing some
> > guest owned data and linking it together but I think it adds
> > unnecessary complexity and is misused if it's used to handle
> > device owned data/on device memory in this and VMGID cases.  
> 
> I want a generic interface for guest to enumerate these things.  linker
> seems quite reasonable but if you see a reason why it won't do, or want
> to propose a better interface, fine.
> 
> PCI would do, too - though windows guys had concerns about
> returning PCI BARs from ACPI.
There were potential issues with the pSeries bootloader, which treated
PCI_CLASS_MEMORY_RAM as conventional RAM, but that was fixed.
Could you point me to the discussion about the Windows issues?

Where the VMGEN patches that used PCI for mapping purposes got
stuck was that it was suggested to use the PCI_CLASS_MEMORY_RAM
class id but we couldn't agree on it.

VMGEN v13 with full discussion is here
https://patchwork.ozlabs.org/patch/443554/
So to continue with this route we would need to pick some other
driver less class id so windows won

Re: How to reserve guest physical region for ACPI

2016-01-05 Thread Igor Mammedov
On Mon, 4 Jan 2016 21:17:31 +0100
Laszlo Ersek <ler...@redhat.com> wrote:

> Michael CC'd me on the grandparent of the email below. I'll try to add
> my thoughts in a single go, with regard to OVMF.
> 
> On 12/30/15 20:52, Michael S. Tsirkin wrote:
> > On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
> >> On Mon, 28 Dec 2015 14:50:15 +0200
> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >>  
> >>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> >>>>
> >>>> Hi Michael, Paolo,
> >>>>
> >>>> Now it is the time to return to the challenge that how to reserve guest
> >>>> physical region internally used by ACPI.
> >>>>
> >>>> Igor suggested that:
> >>>> | An alternative place to allocate reserve from could be high memory.
> >>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>> | that hotpluggable memory range isn't used by firmware
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html) 
> >>>>  
> 
> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> reason is that nobody wrote that patch, nor asked for the patch to be
> written. (Not implying that just requesting the patch would be
> sufficient for the patch to be written.)
> 
> >>> I don't want to tie things to reserved-memory-end because this
> >>> does not scale: next time we need to reserve memory,
> >>> we'll need to find yet another way to figure out what is where.  
> >> Could you elaborate a bit more on a problem you're seeing?
> >>
> >> To me it looks like it scales rather well.
> >> For example lets imagine that we adding a device
> >> that has some on device memory that should be mapped into GPA
> >> code to do so would look like:
> >>
> >>   pc_machine_device_plug_cb(dev)
> >>   {
> >>...
> >>if (dev == OUR_NEW_DEVICE_TYPE) {
> >>memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >>set_new_reserved_end(current_reserved_end + 
> >> memory_region_size(&dev->mr));
> >>}
> >>   }
> >>
> >> we can practically add any number of new devices that way.  
> > 
> > Yes but we'll have to build a host side allocator for these, and that's
> > nasty. We'll also have to maintain these addresses indefinitely (at
> > least per machine version) as they are guest visible.
> > Not only that, there's no way for guest to know if we move things
> > around, so basically we'll never be able to change addresses.
> > 
> >   
> >>
> >>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> >>> support 64 bit RAM instead  
> 
> This looks quite doable in OVMF, as long as the blob to allocate from
> high memory contains *zero* ACPI tables.
> 
> (
> Namely, each ACPI table is installed from the containing fw_cfg blob
> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
> own allocation policy for the *copies* of ACPI tables it installs.
> 
> This allocation policy is left unspecified in the section of the UEFI
> spec that governs EFI_ACPI_TABLE_PROTOCOL.
> 
> The current policy in edk2 (= the reference implementation) seems to be
> "allocate from under 4GB". It is currently being changed to "try to
> allocate from under 4GB, and if that fails, retry from high memory". (It
> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
> )
> 
> >>> (and maybe a way to allocate and
> >>> zero-initialize buffer without loading it through fwcfg),  
> 
> Sounds reasonable.
> 
> >>> this way bios
> >>> does the allocation, and addresses can be patched into acpi.  
> >> and then guest side needs to parse/execute some AML that would
> >> initialize QEMU side so it would know where to write data.  
> > 
> > Well not really - we can put it in a data table, by itself
> > so it's easy to find.  
> 
> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
> acpi_get_table_with_size()?
> 
> > 
> > AML is only needed if access from ACPI is desired.
> > 
> >   
> >> bios-linker-loader is a great interface for initializing some
> >> guest owned data and linking it together but I think it adds
> >> unnecessary complexity and is misused if it's used to handle
> >> device owned data/on dev

Re: How to reserve guest physical region for ACPI

2015-12-30 Thread Igor Mammedov
On Mon, 28 Dec 2015 14:50:15 +0200
"Michael S. Tsirkin"  wrote:

> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > 
> > Hi Michael, Paolo,
> > 
> > Now it is the time to return to the challenge that how to reserve guest
> > physical region internally used by ACPI.
> > 
> > Igor suggested that:
> > | An alternative place to allocate reserve from could be high memory.
> > | For pc we have "reserved-memory-end" which currently makes sure
> > | that hotpluggable memory range isn't used by firmware
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> 
> I don't want to tie things to reserved-memory-end because this
> does not scale: next time we need to reserve memory,
> we'll need to find yet another way to figure out what is where.
Could you elaborate a bit more on a problem you're seeing?

To me it looks like it scales rather well.
For example let's imagine that we're adding a device
that has some on-device memory that should be mapped into GPA;
the code to do so would look like:

  pc_machine_device_plug_cb(dev)
  {
   ...
   if (dev == OUR_NEW_DEVICE_TYPE) {
   memory_region_add_subregion(as, current_reserved_end, &dev->mr);
   set_new_reserved_end(current_reserved_end + 
memory_region_size(&dev->mr));
   }
  }

we can practically add any number of new devices that way.

 
> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> support 64 bit RAM instead (and maybe a way to allocate and
> zero-initialize buffer without loading it through fwcfg), this way bios
> does the allocation, and addresses can be patched into acpi.
and then guest side needs to parse/execute some AML that would
initialize QEMU side so it would know where to write data.

bios-linker-loader is a great interface for initializing some
guest owned data and linking it together but I think it adds
unnecessary complexity and is misused if it's used to handle
device owned data/on device memory in this and VMGID cases.

There was an RFC on the list to make BIOS boot from NVDIMM, already
doing some ACPI table lookup/parsing. Now if they were forced
to also parse and execute AML to initialize QEMU with a guest
allocated address, that would complicate them quite a bit.
With the NVDIMM control memory region mapped directly by QEMU,
the respective patches don't need to initialize QEMU in any way;
all they would need is to read the necessary data from the control region.

Also, using bios-linker-loader takes away some usable RAM
from the guest and in the end that doesn't scale:
the more devices I add, the less usable RAM is left for the guest OS,
while all the device needs is a piece of GPA address space
that would belong to it.
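
To make the contrast concrete, a direct-mapping sketch (the base address
macro is made up for illustration): the region is RAM-backed on the host
but sits in the guest's GPA space outside the E820 RAM map, so it doesn't
shrink what the guest sees as usable memory:

  /* sketch: on-device memory mapped straight into guest GPA space,
   * not allocated out of the guest's E820 RAM */
  memory_region_init_ram(&dev->mr, OBJECT(dev), "on-device-mem",
                         size, &error_abort);
  memory_region_add_subregion(get_system_memory(),
                              DEVICE_GPA_BASE /* made up */, &dev->mr);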

> 
> See patch at the bottom that might be handy.
> 
> > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > | when writing ASL one shall make sure that only XP supported
> > | features are in global scope, which is evaluated when tables
> > | are loaded and features of rev2 and higher are inside methods.
> > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> 
> Yes, this technique works.
> 
> An alternative is to add an XSDT, XP ignores that.
> XSDT at the moment breaks OVMF (because it loads both
> the RSDT and the XSDT, which is wrong), but I think
> Laszlo was working on a fix for that.
Using XSDT would increase the RAM occupied by ACPI tables,
as it would duplicate the DSDT + the non-XP-supported AML
at global namespace scope.

So far we've managed to keep the DSDT compatible with XP while
introducing features from v2 and higher ACPI revisions as
AML that is only evaluated on demand.
We can continue doing so unless we have to unconditionally
add incompatible AML at global scope.
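
The pattern can be sketched with the aml-build helpers roughly like
this ("FOO" and sb_scope are placeholders, not code from any series):
the rev2-only operations live inside a method body, so a rev1
interpreter never evaluates them unless the method is called, and a
_REV check guards them even then:

  /* sketch: keep rev2-only AML inside a method, guarded by _REV */
  Aml *method = aml_method("FOO", 0, AML_NOTSERIALIZED);
  Aml *ifctx = aml_if(aml_lnot(aml_lless(aml_name("_REV"), aml_int(2))));
  /* ... 64-bit/rev2-and-later operations would be appended to ifctx ... */
  aml_append(method, ifctx);
  aml_append(sb_scope, method);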


> 
> > Michael, Paolo, what do you think about these ideas?
> > 
> > Thanks!
> 
> 
> 
> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> current offset so we can add that to the linker.
> 
> Won't work if you append the Name to the Aml structure (these can be
> nested to arbitrary depth using aml_append), so using plain GArray for
> this API makes sense to me.
> 
> --->
> 
> acpi: add build_append_named_dword, returning an offset in buffer
> 
> This is a very limited form of support for runtime patching -
> similar in functionality to what we can do with ACPI_EXTRACT
> macros in python, but implemented in C.
> 
> This is to allow ACPI code direct access to data tables -
> which is exactly what DataTableRegion is there for, except
> no known windows release so far implements DataTableRegion.
unsupported means Windows will BSOD, so it's practically
unusable unless MS patches currently existing Windows
versions.

Another thing about DataTableRegion is that ACPI tables are
supposed to have static content which matches checksum in
table the 

Re: [PATCH v6 05/33] acpi: add aml_object_type

2015-11-09 Thread Igor Mammedov
On Mon, 9 Nov 2015 13:35:51 +0200
"Michael S. Tsirkin"  wrote:

> On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote:
> > Implement ObjectType which is used by NVDIMM _DSM method in
> > later patch
> > 
> > Signed-off-by: Xiao Guangrong 
> 
> I had to go dig in the _DSM patch to see how it's used.
> And sure enough, callers have to know AML to make
> sense of it. So pls don't split out tiny patches like this.
> include the callee with the caller.
I'd prefer the AML API patches as separate ones, as that makes them
easier to review against the ACPI spec and also easier to reuse
in another series.
And once they are ok/reviewed we can merge them ahead of their
users, so the next series respins won't have to post the same
patches over and over, we won't have to review them again every
respin, and others could use the already merged API instead of
duplicating work that's already been done.

> 
> > ---
> >  hw/acpi/aml-build.c | 8 
> >  include/hw/acpi/aml-build.h | 1 +
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index efc06ab..9f792ab 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
> > *target)
> >  return var;
> >  }
> >  
> > +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
> > +Aml *aml_object_type(Aml *object)
> > +{
> > +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
> > +aml_append(var, object);
> > +return var;
> > +}
> > +
> 
> It would be better to have a higher level API
> that can be used without knowning AML.
> For example:
> 
>   aml_object_type_is_package()
A higher level API is fine but it's better done
on top of the AML one, i.e. added in addition to
aml_object_type().
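
For instance, a thin wrapper layered on top of it could look like this
(a sketch only; per the ACPI spec, ObjectType returns 4 for a Package):

  /* sketch: higher-level helper built on top of aml_object_type() */
  Aml *aml_object_type_is_package(Aml *object)
  {
      return aml_equal(aml_object_type(object), aml_int(4 /* Package */));
  }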


> 
> 
> 
> >  void
> >  build_header(GArray *linker, GArray *table_data,
> >   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 325782d..5b8a118 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
> >  Aml *aml_sizeof(Aml *arg);
> >  Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
> >  Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
> > +Aml *aml_object_type(Aml *object);
> >  
> >  void
> >  build_header(GArray *linker, GArray *table_data,
> > -- 
> > 1.8.3.1



Re: [Qemu-devel] [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-09 Thread Igor Mammedov
On Fri, 6 Nov 2015 16:31:43 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 11/05/2015 10:49 PM, Igor Mammedov wrote:
> > On Thu, 5 Nov 2015 21:33:39 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >>
> >>
> >> On 11/05/2015 09:03 PM, Igor Mammedov wrote:
> >>> On Thu, 5 Nov 2015 18:15:31 +0800
> >>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 11/05/2015 05:58 PM, Igor Mammedov wrote:
> >>>>> On Mon,  2 Nov 2015 17:13:27 +0800
> >>>>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>>>
> >>>>>> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> >>>>>   ^^ missing one 0???
> >>>>>
> >>>>>> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> >>>>>> for detailed design
> >>>>>>
> >>>>>> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
> >>>>>> that controls if nvdimm support is enabled, it is true on default and
> >>>>>> it is false on 2.4 and its earlier version to keep compatibility
> >>>>>>
> >>>>>> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >>>>> [...]
> >>>>>
> >>>>>> @@ -33,6 +33,15 @@
> >>>>>>  */
> >>>>>> #define MIN_NAMESPACE_LABEL_SIZE  (128UL << 10)
> >>>>>>
> >>>>>> +/*
> >>>>>> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest 
> >>>>>> are
> >>>>> ^^^ missing 0 or value in define 
> >>>>> below has an extra 0
> >>>>>
> >>>>>> + * reserved for NVDIMM ACPI emulation, refer to 
> >>>>>> docs/specs/acpi_nvdimm.txt
> >>>>>> + * for detailed design.
> >>>>>> + */
> >>>>>> +#define NVDIMM_ACPI_MEM_BASE  0xFF00ULL
> >>>>> it still maps RAM at arbitrary place,
> >>>>> that's the reason why VMGenID patches hasn't been merged for
> >>>>> more than several months.
> >>>>> I'm not against of using (more exactly I'm for it) direct mapping
> >>>>> but we should reach consensus when and how to use it first.
> >>>>>
> >>>>> I'd wouldn't use addresses below 4G as it may be used firmware or
> >>>>> legacy hardware and I won't bet that 0xFF00ULL won't conflict
> >>>>> with anything.
> >>>>> An alternative place to allocate reserve from could be high memory.
> >>>>> For pc we have "reserved-memory-end" which currently makes sure
> >>>>> that hotpluggable memory range isn't used by firmware.
> >>>>>
> >>>>> How about making API that allows to map additional memory
> >>>>> ranges before reserved-memory-end and pushes it up as mappings are
> >>>>> added.
[...]

> 
> Really a good study case to me, i tried your patch and moved the 64 bit
> staffs to the private method, it worked. :)
> 
> Igor, is this the API you want?

Let's get an ack from Michael on the idea of RAM mapping before
"reserved-memory-end" first.
If he rejects it then there isn't any other way except switching
to MMIO instead.

> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 6bf569a..aba29df 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1291,6 +1291,38 @@ FWCfgState *xen_load_linux(PCMachineState *pcms,
>   return fw_cfg;
>   }
> 
> +static void pc_reserve_high_memory_init(PCMachineState *pcms,
> +uint64_t base, uint64_t align)
> +{
> +pcms->reserve_high_memory.current_addr = ROUND_UP(base, align);
> +}
> +
> +static uint64_t
> +pc_reserve_high_memory_end(PCMachineState *pcms, int64_t align)
> +{
> +return ROUND_UP(pcms->reserve_high_memory.current_addr, align);
> +}
> +
> +uint64_t pc_reserve_high_memory(PCMachineState *pcms, uint64_t size,
> +int64_t align, Error **errp)
> +{
> +uint64_t end_addr, current_addr = pcms->reserve_high_memory.current_addr;
> +
> +if (!curre

Re: [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-05 Thread Igor Mammedov
On Mon,  2 Nov 2015 17:13:27 +0800
Xiao Guangrong  wrote:

> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
   ^^ missing one 0???

> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> for detailed design
> 
> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
> that controls if nvdimm support is enabled, it is true on default and
> it is false on 2.4 and its earlier version to keep compatibility
> 
> Signed-off-by: Xiao Guangrong 
[...]

> @@ -33,6 +33,15 @@
>   */
>  #define MIN_NAMESPACE_LABEL_SIZE  (128UL << 10)
>  
> +/*
> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
 ^^^ missing 0 or value in define below has an 
extra 0

> + * reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> + * for detailed design.
> + */
> +#define NVDIMM_ACPI_MEM_BASE  0xFF00ULL
it still maps RAM at an arbitrary place;
that's the reason why the VMGenID patches haven't been merged for
more than several months.
I'm not against using direct mapping (more exactly, I'm for it)
but we should reach consensus on when and how to use it first.

I wouldn't use addresses below 4G as they may be used by firmware or
legacy hardware and I won't bet that 0xFF00ULL won't conflict
with anything.
An alternative place to allocate the reservation from could be high memory.
For pc we have "reserved-memory-end" which currently makes sure
that the hotpluggable memory range isn't used by firmware.

How about making an API that allows mapping additional memory
ranges before reserved-memory-end and pushes it up as mappings are
added?

Michael, Paolo what do you think about it?
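
A minimal sketch of what I have in mind (the helper name and the
variable holding the current reserved end are made up for illustration):

  /* sketch: map a new range at the current reserved end and push the
   * boundary up; the final boundary is what gets advertised to firmware
   * via the "reserved-memory-end" fw_cfg file */
  static uint64_t reserved_end;   /* hypothetical machine state field */

  static uint64_t pc_map_above_reserved_end(MemoryRegion *sysmem,
                                            MemoryRegion *mr, uint64_t align)
  {
      uint64_t base = ROUND_UP(reserved_end, align);

      memory_region_add_subregion(sysmem, base, mr);
      reserved_end = base + memory_region_size(mr);
      return base;
  }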


> +#define NVDIMM_ACPI_IO_BASE   0x0a18
> +#define NVDIMM_ACPI_IO_LEN4
> +
>  #define TYPE_NVDIMM  "nvdimm"
>  #define NVDIMM(obj)  OBJECT_CHECK(NVDIMMDevice, (obj), TYPE_NVDIMM)
>  #define NVDIMM_CLASS(oc) OBJECT_CLASS_CHECK(NVDIMMClass, (oc), TYPE_NVDIMM)
> @@ -80,4 +89,29 @@ struct NVDIMMClass {
>  };
>  typedef struct NVDIMMClass NVDIMMClass;
>  
> +/*
> + * AcpiNVDIMMState:
> + * @is_enabled: detect if NVDIMM support is enabled.
> + *
> + * @fit: fit buffer which will be accessed via ACPI _FIT method. It is
> + *   dynamically built based on current NVDIMM devices so that it does
> + *   not require to keep consistent during live migration.
> + *
> + * @ram_mr: RAM-based memory region which is mapped into guest address
> + *  space and used to transfer data between OSPM and QEMU.
> + * @io_mr: the IO region used by OSPM to transfer control to QEMU.
> + */
> +struct AcpiNVDIMMState {
> +bool is_enabled;
> +
> +GArray *fit;
> +
> +MemoryRegion ram_mr;
> +MemoryRegion io_mr;
> +};
> +typedef struct AcpiNVDIMMState AcpiNVDIMMState;
> +
> +/* Initialize the memory and IO region needed by NVDIMM ACPI emulation.*/
> +void nvdimm_init_acpi_state(MemoryRegion *memory, MemoryRegion *io,
> +Object *owner, AcpiNVDIMMState *state);
>  #endif



Re: [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-05 Thread Igor Mammedov
On Thu, 5 Nov 2015 18:15:31 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 11/05/2015 05:58 PM, Igor Mammedov wrote:
> > On Mon,  2 Nov 2015 17:13:27 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> > ^^ missing one 0???
> >
> >> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> >> for detailed design
> >>
> >> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
> >> that controls if nvdimm support is enabled, it is true on default and
> >> it is false on 2.4 and its earlier version to keep compatibility
> >>
> >> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> > [...]
> >
> >> @@ -33,6 +33,15 @@
> >>*/
> >>   #define MIN_NAMESPACE_LABEL_SIZE  (128UL << 10)
> >>
> >> +/*
> >> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> >   ^^^ missing 0 or value in define below 
> > has an extra 0
> >
> >> + * reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> >> + * for detailed design.
> >> + */
> >> +#define NVDIMM_ACPI_MEM_BASE  0xFF00ULL
> > it still maps RAM at arbitrary place,
> > that's the reason why VMGenID patches hasn't been merged for
> > more than several months.
> > I'm not against of using (more exactly I'm for it) direct mapping
> > but we should reach consensus when and how to use it first.
> >
> > I'd wouldn't use addresses below 4G as it may be used firmware or
> > legacy hardware and I won't bet that 0xFF00ULL won't conflict
> > with anything.
> > An alternative place to allocate reserve from could be high memory.
> > For pc we have "reserved-memory-end" which currently makes sure
> > that hotpluggable memory range isn't used by firmware.
> >
> > How about making API that allows to map additional memory
> > ranges before reserved-memory-end and pushes it up as mappings are
> > added.
> 
> That what i did in the v1/v2 versions, however, as you noticed, using 64-bit
> address in ACPI in QEMU is not a easy work - we can not simply make
> SSDT.rev = 2 to apply 64 bit address, the reason i have documented in
> v3's changelog:
> 
>3) we figure out a unused memory hole below 4G that is 0xFF0 ~
>   0xFFF0, this range is large enough for NVDIMM ACPI as build 64-bit
>   ACPI SSDT/DSDT table will break windows XP.
>   BTW, only make SSDT.rev = 2 can not work since the width is only 
> depended
>   on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
>   in ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
> | the global integer width for all integers, including integers in SSDTs.
>4) use the lowest ACPI spec version to document AML terms.
> 
> The only way introducing 64 bit address is adding XSDT support that what
> Michael did before, however, it seems it need great efforts to do it as
> it will break OVMF. It's a long term workload. :(
to enable 64-bit integers in AML it's sufficient to change the DSDT revision to 2;
I already have a patch that switches DSDT/SSDT to rev2.
Tests show it doesn't break WindowsXP (which is rev1) and that 64-bit integers
are used on Linux & later Windows versions.

> 
> The luck thing is, the ACPI part is not ABI, we can move it to the high
> memory if QEMU supports XSDT is ready in future development.
But the mapped control region is ABI and we can't change it if we find out later
that it breaks something.

> 
> Thanks!



Re: [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-05 Thread Igor Mammedov
On Thu, 5 Nov 2015 21:33:39 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 11/05/2015 09:03 PM, Igor Mammedov wrote:
> > On Thu, 5 Nov 2015 18:15:31 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >>
> >>
> >> On 11/05/2015 05:58 PM, Igor Mammedov wrote:
> >>> On Mon,  2 Nov 2015 17:13:27 +0800
> >>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>
> >>>> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> >>>  ^^ missing one 0???
> >>>
> >>>> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> >>>> for detailed design
> >>>>
> >>>> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
> >>>> that controls if nvdimm support is enabled, it is true on default and
> >>>> it is false on 2.4 and its earlier version to keep compatibility
> >>>>
> >>>> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >>> [...]
> >>>
> >>>> @@ -33,6 +33,15 @@
> >>>> */
> >>>>#define MIN_NAMESPACE_LABEL_SIZE  (128UL << 10)
> >>>>
> >>>> +/*
> >>>> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> >>>^^^ missing 0 or value in define below 
> >>> has an extra 0
> >>>
> >>>> + * reserved for NVDIMM ACPI emulation, refer to 
> >>>> docs/specs/acpi_nvdimm.txt
> >>>> + * for detailed design.
> >>>> + */
> >>>> +#define NVDIMM_ACPI_MEM_BASE  0xFF00ULL
> >>> it still maps RAM at arbitrary place,
> >>> that's the reason why VMGenID patches hasn't been merged for
> >>> more than several months.
> >>> I'm not against of using (more exactly I'm for it) direct mapping
> >>> but we should reach consensus when and how to use it first.
> >>>
> >>> I'd wouldn't use addresses below 4G as it may be used firmware or
> >>> legacy hardware and I won't bet that 0xFF00ULL won't conflict
> >>> with anything.
> >>> An alternative place to allocate reserve from could be high memory.
> >>> For pc we have "reserved-memory-end" which currently makes sure
> >>> that hotpluggable memory range isn't used by firmware.
> >>>
> >>> How about making API that allows to map additional memory
> >>> ranges before reserved-memory-end and pushes it up as mappings are
> >>> added.
> >>
> >> That what i did in the v1/v2 versions, however, as you noticed, using 
> >> 64-bit
> >> address in ACPI in QEMU is not a easy work - we can not simply make
> >> SSDT.rev = 2 to apply 64 bit address, the reason i have documented in
> >> v3's changelog:
> >>
> >> 3) we figure out a unused memory hole below 4G that is 0xFF0 ~
> >>0xFFF0, this range is large enough for NVDIMM ACPI as build 
> >> 64-bit
> >>ACPI SSDT/DSDT table will break windows XP.
> >>BTW, only make SSDT.rev = 2 can not work since the width is only 
> >> depended
> >>on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition 
> >> Block)
> >>in ACPI spec:
> >> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> >> | width of Integer objects is dependent on the ComplianceRevision of the 
> >> DSDT.
> >> | If the ComplianceRevision is less than 2, all integers are restricted to 
> >> 32
> >> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT 
> >> sets
> >> | the global integer width for all integers, including integers in SSDTs.
> >> 4) use the lowest ACPI spec version to document AML terms.
> >>
> >> The only way introducing 64 bit address is adding XSDT support that what
> >> Michael did before, however, it seems it need great efforts to do it as
> >> it will break OVMF. It's a long term workload. :(
> > to enable 64-bit integers in AML it's sufficient to change DSDT revision to 
> > 2,
> > I already have a patch that switches DSDT/SSDT to rev2.
> > Tests show it doesn't break WindowsXP (which is rev1) and uses 64-bit 
> > integers
> > on linux & later Windows versions.
> 

Re: [PATCH v7 27/35] nvdimm acpi: build ACPI nvdimm devices

2015-11-04 Thread Igor Mammedov
On Tue, 3 Nov 2015 22:22:40 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 11/03/2015 09:13 PM, Igor Mammedov wrote:
> > On Mon,  2 Nov 2015 17:13:29 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >> NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices
> >>
> >> There is a root device under \_SB and specified NVDIMM devices are under 
> >> the
> >> root device. Each NVDIMM device has _ADR which returns its handle used to
> >> associate MEMDEV structure in NFIT
> >>
> >> We reserve handle 0 for root device. In this patch, we save handle, handle,
> >> arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch
> >>
> >> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >> ---
> >>   hw/acpi/nvdimm.c | 184 
> >> +++
> >>   1 file changed, 184 insertions(+)
> >>
> >> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> >> index dd84e5f..53ed675 100644
> >> --- a/hw/acpi/nvdimm.c
> >> +++ b/hw/acpi/nvdimm.c
> >> @@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, 
> >> GArray *table_offsets,
> >>   g_array_free(structures, true);
> >>   }
> >>
> >> +struct NvdimmDsmIn {
> >> +uint32_t handle;
> >> +uint32_t revision;
> >> +uint32_t function;
> >> +   /* the remaining size in the page is used by arg3. */
> >> +uint8_t arg3[0];
> >> +} QEMU_PACKED;
> >> +typedef struct NvdimmDsmIn NvdimmDsmIn;
> >> +
> >>   static uint64_t
> >>   nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
> >>   {
> >> @@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned 
> >> size)
> >>   static void
> >>   nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
> >>   {
> >> +fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
> > it doesn't seem like this hunk belongs here
> 
> Er, we have changed the logic:
> - others:
>1) the buffer length is directly got from IO read rather than got
>   from dsm memory
> [ This has documented in v5's changelog. ]
> 
> So, the IO write is replaced by IO read, nvdimm_dsm_write() should not be
> triggered.
> 
> >
> >>   }
> >>
> >>   static const MemoryRegionOps nvdimm_dsm_ops = {
> >> @@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
> >> MemoryRegion *io,
> >>   memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, >io_mr);
> >>   }
> >>
> >> +#define BUILD_STA_METHOD(_dev_, _method_) 
> >>  \
> >> +do {  
> >>  \
> >> +_method_ = aml_method("_STA", 0); 
> >>  \
> >> +aml_append(_method_, aml_return(aml_int(0x0f)));  
> >>  \
> >> +aml_append(_dev_, _method_);  
> >>  \
> >> +} while (0)
> > _STA doesn't have any logic here so drop macro and just
> > replace its call sites with:
> 
> Okay, I was just wanting to save some code lines. I will drop this macro.
> 
> >
> > aml_append(foo_dev, aml_name_decl("_STA", aml_int(0xf)));
> 
> _STA is required as a method with zero argument but this statement just
> define a object. It is okay?
The spec doesn't say that it must be a method; it says that OSPM will evaluate
the _STA object and the result must be a combination of the defined flags.
AML-wise, calling a method with 0 arguments and referencing a named variable
are the same thing: both end up being just a namestring.

Also note that _STA here returns 0xF, and the spec says that if _STA is missing
OSPM shall assume an implicit value of 0xF, so you can just drop the _STA
object here altogether.

> 
> >
> >
> >> +
> >> +#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)   
> >>  \
> >> +do {  
> >>  \
> >> +Aml *ifctx, *uuid;
> >>  \
> >> +_method_ = aml_method("_DSM", 4); 
> >>  \
> >> +/* check UUID if it is we expect, return the errorcode if n

Re: [PATCH v7 27/35] nvdimm acpi: build ACPI nvdimm devices

2015-11-03 Thread Igor Mammedov
On Mon,  2 Nov 2015 17:13:29 +0800
Xiao Guangrong  wrote:

> NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices
> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR which returns its handle used to
> associate MEMDEV structure in NFIT
> 
> We reserve handle 0 for root device. In this patch, we save handle, handle,
> arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 184 
> +++
>  1 file changed, 184 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index dd84e5f..53ed675 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +struct NvdimmDsmIn {
> +uint32_t handle;
> +uint32_t revision;
> +uint32_t function;
> +   /* the remaining size in the page is used by arg3. */
> +uint8_t arg3[0];
> +} QEMU_PACKED;
> +typedef struct NvdimmDsmIn NvdimmDsmIn;
> +
>  static uint64_t
>  nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
>  {
> @@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
>  static void
>  nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
>  {
> +fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
it doesn't seem like this hunk belongs here

>  }
>  
>  static const MemoryRegionOps nvdimm_dsm_ops = {
> @@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
> MemoryRegion *io,
>  memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, &state->io_mr);
>  }
>  
> +#define BUILD_STA_METHOD(_dev_, _method_)  \
> +do {   \
> +_method_ = aml_method("_STA", 0);  \
> +aml_append(_method_, aml_return(aml_int(0x0f)));   \
> +aml_append(_dev_, _method_);   \
> +} while (0)
_STA doesn't have any logic here so drop macro and just
replace its call sites with:

aml_append(foo_dev, aml_name_decl("_STA", aml_int(0xf)));


> +
> +#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
> +do {   \
> +Aml *ifctx, *uuid; \
> +_method_ = aml_method("_DSM", 4);  \
> +/* check UUID if it is we expect, return the errorcode if not.*/   \
> +uuid = aml_touuid(_uuid_); \
> +ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
> +aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
> +aml_append(method, ifctx); \
> +aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
> >> +   aml_arg(1), aml_arg(2), aml_arg(3))));  \
> +aml_append(_dev_, _method_);   \
> +} while (0)
> +
> +#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
> +aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
> +
> +#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
> +BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
> +
> +static void build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +NVDIMMDevice *nvdimm = device_list->data;
> +int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *dev, *method;
> +
> +dev = aml_device("NV%02X", slot);
> +aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +BUILD_STA_METHOD(dev, method);
> +
> +/*
> + * Chapter 4: _DSM Interface for NVDIMM Device (non-root) - Example
> + * in DSM Spec Rev1.
> + */
> +BUILD_DSM_METHOD(dev, method,
> + handle /* NVDIMM Device Handle */,
> + "4309AC30-0D11-11E4-9191-0800200C9A66"
> + /* UUID for NVDIMM Devices. */);
this will add N bytes * #NVDIMMs in the worst case.
Please drop the macro and consolidate this method into a _DSM method in the
parent scope, and then call it from here like this (see the aml_* sketch below):
   Method(_DSM, 4)
   Return(^_DSM(Arg[0-3]))
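
For illustration, a minimal aml_* sketch of that consolidation, assuming a
_DSM object really exists in the parent (root NVDIMM device) scope as
suggested; the helper name below is made up for this example:

/* Per-NVDIMM _DSM stub: forward all four arguments to the parent scope's
 * _DSM ("^" is the AML parent-scope prefix), so the bulky method body
 * exists only once instead of once per NVDIMM device. */
static void nvdimm_build_device_dsm_stub(Aml *dev)
{
    Aml *method = aml_method("_DSM", 4);

    aml_append(method, aml_return(aml_call4("^_DSM", aml_arg(0), aml_arg(1),
                                            aml_arg(2), aml_arg(3))));
    aml_append(dev, method);
}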

> +
> +aml_append(root_dev, dev);
> +}
> +}
> +
> +static void nvdimm_build_acpi_devices(GSList 

Re: [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-03 Thread Igor Mammedov
On Mon,  2 Nov 2015 17:13:11 +0800
Xiao Guangrong  wrote:

> Currently, file_ram_alloc() only works on directory - it creates a file
> under @path and do mmap on it
> 
> This patch tries to allow it to work on file directly, if @path is a
> directory it works as before, otherwise it treats @path as the target
> file then directly allocate memory from it
Paolo has just queued
https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg06513.html
perhaps that's what you can reuse here.
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  exec.c | 80 
> ++
>  1 file changed, 51 insertions(+), 29 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 9075f4d..db0fdaf 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> +static bool path_is_dir(const char *path)
> +{
> +struct stat fs;
> +
> +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
> +}
> +
> +static int open_ram_file_path(RAMBlock *block, const char *path, size_t size)
> +{
> +char *filename;
> +char *sanitized_name;
> +char *c;
> +int fd;
> +
> +if (!path_is_dir(path)) {
> +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
> +
> +flags |= O_EXCL;
> +return open(path, flags);
> +}
> +
> +/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> +sanitized_name = g_strdup(memory_region_name(block->mr));
> +for (c = sanitized_name; *c != '\0'; c++) {
> +if (*c == '/') {
> +*c = '_';
> +}
> +}
> +filename = g_strdup_printf("%s/qemu_back_mem.%s.XXXXXX", path,
> +   sanitized_name);
> +g_free(sanitized_name);
> +fd = mkstemp(filename);
> +if (fd >= 0) {
> +unlink(filename);
> +/*
> + * ftruncate is not supported by hugetlbfs in older
> + * hosts, so don't bother bailing out on errors.
> + * If anything goes wrong with it under other filesystems,
> + * mmap will fail.
> + */
> +if (ftruncate(fd, size)) {
> +perror("ftruncate");
> +}
> +}
> +g_free(filename);
> +
> +return fd;
> +}
> +
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
>  Error **errp)
>  {
> -char *filename;
> -char *sanitized_name;
> -char *c;
>  void *area;
>  int fd;
>  uint64_t pagesize;
> @@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  goto error;
>  }
>  
> -/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> -sanitized_name = g_strdup(memory_region_name(block->mr));
> -for (c = sanitized_name; *c != '\0'; c++) {
> -if (*c == '/')
> -*c = '_';
> -}
> -
> -filename = g_strdup_printf("%s/qemu_back_mem.%s.XXXXXX", path,
> -   sanitized_name);
> -g_free(sanitized_name);
> +memory = ROUND_UP(memory, pagesize);
>  
> -fd = mkstemp(filename);
> +fd = open_ram_file_path(block, path, memory);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
>   "unable to create backing store for path %s", path);
> -g_free(filename);
>  goto error;
>  }
> -unlink(filename);
> -g_free(filename);
> -
> -memory = ROUND_UP(memory, pagesize);
> -
> -/*
> - * ftruncate is not supported by hugetlbfs in older
> - * hosts, so don't bother bailing out on errors.
> - * If anything goes wrong with it under other filesystems,
> - * mmap will fail.
> - */
> -if (ftruncate(fd, memory)) {
> -perror("ftruncate");
> -}
>  
>  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {



Re: [Qemu-devel] [PATCH v7 06/35] acpi: add aml_method_serialized

2015-11-03 Thread Igor Mammedov
On Mon,  2 Nov 2015 17:13:08 +0800
Xiao Guangrong  wrote:

> It avoid explicit Mutex and will be used by NVDIMM ACPI
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 26 --
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 9f792ab..8bee8b2 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
>  }
>  
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
> -Aml *aml_method(const char *name, int arg_count)
> +static Aml *__aml_method(const char *name, int arg_count, bool serialized)
We don't have many users of aml_method() yet, so I'd prefer to have a single
function rather than multiple variants.

I suggest to do something like:
typedef enum {
AML_NONSERIALIZED = 0,
AML_SERIALIZED = 1,
} AmlSerializeRule;

aml_method(const char *name, int arg_count, AmlSerializeRule rule, int sync_level);

with current users fixed up with AML_NONSERIALIZED argument. 
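
A minimal sketch of that single entry point (living in hw/acpi/aml-build.c
next to the quoted code), using the names from the suggestion above; per the
ACPI DefMethod encoding, bits 4-7 of MethodFlags carry the SyncLevel of a
serialized method:

typedef enum {
    AML_NONSERIALIZED = 0,
    AML_SERIALIZED = 1,
} AmlSerializeRule;

/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
Aml *aml_method(const char *name, int arg_count,
                AmlSerializeRule rule, uint8_t sync_level)
{
    Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
    int flags = arg_count;

    assert(!(arg_count & ~7));    /* ArgCount is 3 bits wide */
    assert(!(sync_level & ~0xf)); /* SyncLevel is 4 bits wide */
    if (rule == AML_SERIALIZED) {
        flags |= 1 << 3;          /* SerializeFlag */
        flags |= sync_level << 4; /* SyncLevel, bits 4-7 */
    }
    build_append_namestring(var->buf, "%s", name);
    build_append_byte(var->buf, flags);
    return var;
}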

>  {
>  Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
> +int methodflags;
> +
> +/*
> + * MethodFlags:
> + *   bit 0-2: ArgCount (0-7)
> + *   bit 3: SerializeFlag
> + * 0: NotSerialized
> + * 1: Serialized
> + *   bit 4-7: reserved (must be 0)
> + */
> +assert(!(arg_count & ~7));
> +methodflags = arg_count | (serialized << 3);
>  build_append_namestring(var->buf, "%s", name);
> -build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */
> +build_append_byte(var->buf, methodflags);
>  return var;
>  }
>  
> +Aml *aml_method(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, false);
> +}
> +
> +Aml *aml_method_serialized(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, true);
> +}
> +
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */
>  Aml *aml_device(const char *name_format, ...)
>  {
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 5b8a118..00cf40e 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed 
> min_fixed,
>  Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_method(const char *name, int arg_count);
> +Aml *aml_method_serialized(const char *name, int arg_count);
>  Aml *aml_if(Aml *predicate);
>  Aml *aml_else(void);
>  Aml *aml_while(Aml *predicate);



Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function

2015-10-29 Thread Igor Mammedov
On Wed, 21 Oct 2015 21:32:38 +0800
Xiao Guangrong  wrote:

> 
> 
> On 10/21/2015 06:49 PM, Stefan Hajnoczi wrote:
> > On Wed, Oct 21, 2015 at 12:26:35AM +0800, Xiao Guangrong wrote:
> >>
> >>
> >> On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote:
> >>> On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:
>  +exit:
>  +/* Write our output result to dsm memory. */
>  +((dsm_out *)dsm_ram_addr)->len = out->len;
> >>>
> >>> Missing byteswap?
> >>>
> >>> I thought you were going to remove this field because it wasn't needed
> >>> by the guest.
> >>>
> >>
> >> The @len is the size of _DSM result buffer, for example, for the function 
> >> of
> >> DSM_FUN_IMPLEMENTED the result buffer is 8 bytes, and for
> >> DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells ASL 
> >> code
> >> how much size of memory we need to return to the _DSM caller.
> >>
> >> In _DSM code, it's handled like this:
> >>
> >> "RLEN" is @len, “OBUF” is the left memory in DSM page.
> >>
> >>  /* get @len*/
> >>  aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
> >>  /* @len << 3 to get bits. */
> >>  aml_append(method, aml_store(aml_shiftleft(aml_local(6),
> >> aml_int(3)), aml_local(6)));
> >>
> >>  /* get @len << 3 bits from OBUF, and return it to the caller. */
> >>  aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0),
> >>  aml_local(6) , "OBUF"));
> >>
> >> Since @len is our internally used, it's not return to guest, so i did not 
> >> do
> >> byteswap here.
> >
> > I am not familiar with the ACPI details, but I think this emits bytecode
> > that will be run by the guest's ACPI interpreter?
> >
> > You still need to define the endianness of fields since QEMU and the
> > guest could have different endianness.
> >
> > In other words, will the following work if a big-endian ppc host is
> > running a little-endian x86 guest?
> >
> >((dsm_out *)dsm_ram_addr)->len = out->len;
> >
> 
> Er... If we do byteswap in QEMU then it is also needed in ASL code, however,
> ASL lacks this kind of instruction.  I guess ACPI interpreter is smart enough
> to change value to Littel-Endian for all 2 bytes / 4 bytes / 8 bytes accesses
> 
> I will do the change in next version, thanks for you pointing it out, Stefan!
According to the ACPI spec, integers are encoded as little endian,
so QEMU needs to convert the fields accessible by OSPM to little endian
(i.e. do cpu_to_le*()), for example as sketched below.
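
A small sketch of what that means for the length field from the quoted patch,
assuming dsm_out keeps its uint16_t len; cpu_to_le16() is QEMU's byte-swap
helper from "qemu/bswap.h", and the function name here is only illustrative:

#include "qemu/bswap.h"    /* cpu_to_le16() */

/* dsm_out.len is read by AML via the RLEN field, so store it in
 * little-endian (guest/ACPI) byte order regardless of host endianness. */
static void nvdimm_dsm_store_output_len(dsm_out *out_buf, uint16_t len)
{
    out_buf->len = cpu_to_le16(len);
}

/* call site, replacing the plain assignment discussed above:
 *   nvdimm_dsm_store_output_len((dsm_out *)dsm_ram_addr, out->len);
 */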

> 


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 09:56:12 +0300
"Michael S. Tsirkin"  wrote:

> On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:
[...]
> > diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> > index f6bd2c4..aa95961 100644
> > --- a/include/hw/mem/nvdimm.h
> > +++ b/include/hw/mem/nvdimm.h
> > @@ -15,6 +15,10 @@
> >  
> >  #include "hw/mem/dimm.h"
> >  
> > +/* Memory region 0xFF0 ~ 0xFFF0 is reserved for NVDIMM
> > ACPI. */ +#define NVDIMM_ACPI_MEM_BASE   0xFF00ULL
Michael,

If it's OK to map a control RAM region directly from QEMU at an arbitrary
location, let's do the same for VMGENID too (i.e. use the v16
implementation, which does exactly the same thing as this series).

> > +#define NVDIMM_ACPI_MEM_SIZE   0xF0ULL
[...]



Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 12:17:22 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
> > >On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> > +nvdimm_init_memory_state(>nvdimm_memory,
> > system_memory, machine,
> > + TARGET_PAGE_SIZE);
> > +
> > >>>
> > >>>Shouldn't this be conditional on presence of the nvdimm device?
> > >>>
> > >>
> > >>We will enable hotplug on nvdimm devices in the near future once
> > >>Linux driver is ready. I'd keep it here for future development.
> > >
> > >No, I don't think we should add stuff unconditionally. If not
> > >nvdimm, some other flag should indicate user intends to hotplug
> > >things.
> > >
> > 
> > Actually, it is not unconditionally which is called if parameter
> > "-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path of
> > memoy-hotplug initiation.
> > 
> 
> Right, but that's not the same as nvdimm.
> 

it could be a pc-machine property; then it could be turned on like this:
 -machine nvdimm_support=on
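
A minimal sketch of such a switch; the property/field names here are only
illustrative, not an agreed-upon interface:

/* in hw/i386/pc.c: expose a boolean machine property so NVDIMM ACPI/MMIO
 * resources are only reserved when the user asks for them with
 * "-machine nvdimm_support=on" */
static bool pc_machine_get_nvdimm(Object *obj, Error **errp)
{
    PCMachineState *pcms = PC_MACHINE(obj);

    return pcms->nvdimm_support;      /* hypothetical new field */
}

static void pc_machine_set_nvdimm(Object *obj, bool value, Error **errp)
{
    PCMachineState *pcms = PC_MACHINE(obj);

    pcms->nvdimm_support = value;
}

/* registered from pc_machine_initfn():
 *   object_property_add_bool(obj, "nvdimm_support", pc_machine_get_nvdimm,
 *                            pc_machine_set_nvdimm, NULL);
 */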


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 18:01:17 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 10/19/2015 05:46 PM, Igor Mammedov wrote:
> > On Mon, 19 Oct 2015 12:17:22 +0300
> > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >
> >> On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>> On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
> >>>> On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> >>>>>>> +nvdimm_init_memory_state(>nvdimm_memory,
> >>>>>>> system_memory, machine,
> >>>>>>> + TARGET_PAGE_SIZE);
> >>>>>>> +
> >>>>>>
> >>>>>> Shouldn't this be conditional on presence of the nvdimm device?
> >>>>>>
> >>>>>
> >>>>> We will enable hotplug on nvdimm devices in the near future once
> >>>>> Linux driver is ready. I'd keep it here for future development.
> >>>>
> >>>> No, I don't think we should add stuff unconditionally. If not
> >>>> nvdimm, some other flag should indicate user intends to hotplug
> >>>> things.
> >>>>
> >>>
> >>> Actually, it is not unconditionally which is called if parameter
> >>> "-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path
> >>> of memoy-hotplug initiation.
> >>>
> >>
> >> Right, but that's not the same as nvdimm.
> >>
> >
> > it could be pc-machine property, then it could be turned on like
> > this: -machine nvdimm_support=on
> 
> Er, I do not understand why this separate switch is needed and why
> nvdimm and pc-dimm is different. :(
> 
> NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and
> dimm device, even some of ACPI logic to do hotplug things, etc. Both
> nvdimm and pc-dimm are built on the same infrastructure.
NVDIMM support consumes precious low RAM and MMIO resources, and no small
amount at that. So turning it on unconditionally with memory hotplug, even if
NVDIMM wouldn't ever be used, isn't nice.

However that concern could be dropped if, instead of allocating its own
control MMIO/RAM regions, NVDIMM would reuse memory hotplug's MMIO
region and replace the RAM region with serializing/marshaling label data
over the same MMIO interface (yes, it's slower, but it's not a
performance-critical path).

> 
> 
> 
> 
> 


Re: [Qemu-devel] [RESEND PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-16 Thread Igor Mammedov
On Fri, 16 Oct 2015 08:26:14 -0600
Alex Williamson  wrote:

> On Fri, 2015-10-16 at 09:30 +0200, Paolo Bonzini wrote:
> > 
> > On 16/10/2015 00:16, Alex Williamson wrote:
> > > According to Microsoft documentation, the signature in the standard
> > > hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
> > > for reporting and diagnostic purposes only.  We can therefore allow
> > > the user to change it to whatever they want, within the 12 character
> > > limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
> > > for this, ex:
> > > 
> > >  -cpu host,hv_time,hv_vendor_id=KeenlyKVM
> > > 
> > > Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
> > > Signed-off-by: Alex Williamson 
> > > ---
> > > 
> > > Cc'ing get_maintainers this time.  Any takers?  Thanks,
> > > Alex
> > > 
> > >  target-i386/cpu-qom.h |1 +
> > >  target-i386/cpu.c |1 +
> > >  target-i386/kvm.c |   14 +-
> > >  3 files changed, 15 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
> > > index c35b624..6c1eaaa 100644
> > > --- a/target-i386/cpu-qom.h
> > > +++ b/target-i386/cpu-qom.h
> > > @@ -88,6 +88,7 @@ typedef struct X86CPU {
> > >  bool hyperv_vapic;
> > >  bool hyperv_relaxed_timing;
> > >  int hyperv_spinlock_attempts;
> > > +char *hyperv_vendor_id;
> > >  bool hyperv_time;
> > >  bool hyperv_crash;
> > >  bool check_cpuid;
> > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > > index 05d7f26..71df546 100644
> > > --- a/target-i386/cpu.c
> > > +++ b/target-i386/cpu.c
> > > @@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
> > >  DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
> > >  DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
> > >  DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> > > +DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
> > >  DEFINE_PROP_END_OF_LIST()
> > >  };
> > >  
> > > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > > index 80d1a7e..5e3ab22 100644
> > > --- a/target-i386/kvm.c
> > > +++ b/target-i386/kvm.c
> > > @@ -490,7 +490,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
> > >  if (hyperv_enabled(cpu)) {
> > >  c = _data.entries[cpuid_i++];
> > >  c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
> > > -memcpy(signature, "Microsoft Hv", 12);
> > > +if (!cpu->hyperv_vendor_id) {
> > > +memcpy(signature, "Microsoft Hv", 12);
> > > +} else {
> > > +size_t len = strlen(cpu->hyperv_vendor_id);
> > > +
> > > +if (len > 12) {
> > > +fprintf(stderr,
> > > +"hyperv-vendor-id too long, limited to 12 
> > > charaters");
> > > +abort();
> > 
> > I'm removing this abort and queueing the patch.  I'll send a pull
> > request today.
> 
> If we don't abort then we should really set len = 12 here.  Thanks,
or make a custom property setter that checks the value's validity,
so it could safely fail CPU creation during hotplug.
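
Something along these lines, as a sketch (the setter/getter names are
illustrative; error_setg() reports the failure to the caller instead of
killing QEMU with abort()):

static void x86_cpu_set_hv_vendor_id(Object *obj, const char *value,
                                     Error **errp)
{
    X86CPU *cpu = X86_CPU(obj);

    if (strlen(value) > 12) {
        error_setg(errp, "hv-vendor-id is limited to 12 characters");
        return;
    }
    g_free(cpu->hyperv_vendor_id);
    cpu->hyperv_vendor_id = g_strdup(value);
}

/* registered from the CPU's instance_init() together with a trivial getter:
 *   object_property_add_str(obj, "hv-vendor-id", x86_cpu_get_hv_vendor_id,
 *                           x86_cpu_set_hv_vendor_id, NULL);
 * so an over-long value fails cleanly during (hot)plug.
 */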

> 
> Alex
> 
> > > +}
> > > +memset(signature, 0, 12);
> > > +memcpy(signature, cpu->hyperv_vendor_id, len);
> > > +}
> > >  c->eax = HYPERV_CPUID_MIN;
> > >  c->ebx = signature[0];
> > >  c->ecx = signature[1];
> > > 
> 
> 
> 
> 



Re: [PATCH v3 04/32] acpi: add aml_mutex, aml_acquire, aml_release

2015-10-13 Thread Igor Mammedov
On Sun, 11 Oct 2015 11:52:36 +0800
Xiao Guangrong  wrote:

> Implement Mutex, Acquire and Release terms which are used by NVDIMM _DSM 
> method
> in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 32 
>  include/hw/acpi/aml-build.h |  3 +++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 9fe5e7b..ab52692 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1164,6 +1164,38 @@ Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml 
> *len, const char *name)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMutex */
> +Aml *aml_mutex(const char *name, uint8_t flags)
s/flags/sync_level/

> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x01); /* MutexOp */
> +build_append_namestring(var->buf, "%s", name);

add an assert here to check that the reserved bits are 0 (see the sketch after
this hunk)
> +build_append_byte(var->buf, flags);
> +return var;
> +}
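
i.e. something like this, a sketch of the two comments above applied to the
quoted hunk (still inside hw/acpi/aml-build.c, reusing its helpers):

/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMutex */
Aml *aml_mutex(const char *name, uint8_t sync_level)
{
    Aml *var = aml_alloc();

    assert(!(sync_level & 0xf0));      /* bits 4-7 of SyncFlags are reserved */
    build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
    build_append_byte(var->buf, 0x01); /* MutexOp */
    build_append_namestring(var->buf, "%s", name);
    build_append_byte(var->buf, sync_level);
    return var;
}
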
> +
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefAcquire */
> +Aml *aml_acquire(Aml *mutex, uint16_t timeout)
> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x23); /* AcquireOp */
> +aml_append(var, mutex);
> +build_append_int_noprefix(var->buf, timeout, sizeof(timeout));
> +return var;
> +}
> +
> +/* ACPI 1.0b: 16.2.5.3 Type 1 Opcodes Encoding: DefRelease */
> +Aml *aml_release(Aml *mutex)
> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x27); /* ReleaseOp */
> +aml_append(var, mutex);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 7e1c43b..d494c0c 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -277,6 +277,9 @@ Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
>  Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
> +Aml *aml_mutex(const char *name, uint8_t flags);
> +Aml *aml_acquire(Aml *mutex, uint16_t timeout);
> +Aml *aml_release(Aml *mutex);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [PATCH v3 02/32] acpi: add aml_sizeof

2015-10-13 Thread Igor Mammedov
On Sun, 11 Oct 2015 11:52:34 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> Implement SizeOf term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>

> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index cbd53f4..a72214d 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1143,6 +1143,14 @@ Aml *aml_derefof(Aml *arg)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefSizeOf */
> +Aml *aml_sizeof(Aml *arg)
> +{
> +Aml *var = aml_opcode(0x87 /* SizeOfOp */);
> +aml_append(var, arg);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 5a03d33..7296efb 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -275,6 +275,7 @@ Aml *aml_varpackage(uint32_t num_elements);
>  Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
> +Aml *aml_sizeof(Aml *arg);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [PATCH v3 03/32] acpi: add aml_create_field

2015-10-13 Thread Igor Mammedov
On Sun, 11 Oct 2015 11:52:35 +0800
Xiao Guangrong  wrote:

> Implement CreateField term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 13 +
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index a72214d..9fe5e7b 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1151,6 +1151,19 @@ Aml *aml_sizeof(Aml *arg)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateField */
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
you haven't addressed the v2 comment wrt index and len:
 https://lists.gnu.org/archive/html/qemu-devel/2015-09/msg00435.html

> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x13); /* CreateFieldOp */
> +aml_append(var, srcbuf);
> +aml_append(var, index);
> +aml_append(var, len);
> +build_append_namestring(var->buf, "%s", name);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 7296efb..7e1c43b 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -276,6 +276,7 @@ Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [PATCH v3 01/32] acpi: add aml_derefof

2015-10-13 Thread Igor Mammedov
On Sun, 11 Oct 2015 11:52:33 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> Implement DeRefOf term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>

> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 0d4b324..cbd53f4 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1135,6 +1135,14 @@ Aml *aml_unicode(const char *str)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefDerefOf */
> +Aml *aml_derefof(Aml *arg)
> +{
> +Aml *var = aml_opcode(0x83 /* DerefOfOp */);
> +aml_append(var, arg);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 1b632dc..5a03d33 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -274,6 +274,7 @@ Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, 
> const char *name);
>  Aml *aml_varpackage(uint32_t num_elements);
>  Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
> +Aml *aml_derefof(Aml *arg);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [Qemu-devel] [PATCH v3 25/32] nvdimm: build ACPI nvdimm devices

2015-10-13 Thread Igor Mammedov
On Sun, 11 Oct 2015 11:52:57 +0800
Xiao Guangrong  wrote:

> NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices
> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR which returns its handle used to
> associate MEMDEV structure in NFIT
> 
> We reserve handle 0 for root device. In this patch, we save handle, arg0,
> arg1 and arg2. Arg3 is conditionally saved in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/mem/nvdimm/acpi.c | 203 
> +++
>  1 file changed, 203 insertions(+)
> 
> diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
I'd suggest putting the ACPI parts into a hw/acpi/nvdimm.c file so that the
ACPI maintainers won't miss changes to these files.


> index 1450a6a..d9fa0fd 100644
> --- a/hw/mem/nvdimm/acpi.c
> +++ b/hw/mem/nvdimm/acpi.c
> @@ -308,15 +308,38 @@ static void build_nfit(void *fit, GSList *device_list, 
> GArray *table_offsets,
>   "NFIT", table_data->len - nfit_start, 1);
>  }
>  
> +#define NOTIFY_VALUE  0x99
> +
> +struct dsm_in {
> +uint32_t handle;
> +uint8_t arg0[16];
> +uint32_t arg1;
> +uint32_t arg2;
> +   /* the remaining size in the page is used by arg3. */
> +uint8_t arg3[0];
> +} QEMU_PACKED;
> +typedef struct dsm_in dsm_in;
> +
> +struct dsm_out {
> +/* the size of buffer filled by QEMU. */
> +uint16_t len;
> +uint8_t data[0];
> +} QEMU_PACKED;
> +typedef struct dsm_out dsm_out;
> +
>  static uint64_t dsm_read(void *opaque, hwaddr addr,
>   unsigned size)
>  {
> +fprintf(stderr, "BUG: we never read DSM notification MMIO.\n");
>  return 0;
>  }
>  
>  static void dsm_write(void *opaque, hwaddr addr,
>uint64_t val, unsigned size)
>  {
> +if (val != NOTIFY_VALUE) {
> +fprintf(stderr, "BUG: unexepected notify value 0x%" PRIx64, val);
> +}
>  }
>  
>  static const MemoryRegionOps dsm_ops = {
> @@ -372,6 +395,183 @@ static MemoryRegion *build_dsm_memory(NVDIMMState 
> *state)
>  return dsm_fit_mr;
>  }
>  
> +#define BUILD_STA_METHOD(_dev_, _method_)  \
> +do {   \
> +_method_ = aml_method("_STA", 0);  \
> +aml_append(_method_, aml_return(aml_int(0x0f)));   \
> +aml_append(_dev_, _method_);   \
> +} while (0)
> +
> +#define SAVE_ARG012_HANDLE_LOCK(_method_, _handle_)\
> +do {   \
> +aml_append(_method_, aml_acquire(aml_name("NLCK"), 0xFFFF));   \
how about making the method serialized? Then you could drop the explicit
lock/unlock logic; for that you'd need to extend the existing aml_method() to
something like this:

  aml_method("FOO", 3/*count*/, AML_SERIALIZED, 0 /* sync_level */)

> +aml_append(_method_, aml_store(_handle_, aml_name("HDLE")));   \
> +aml_append(_method_, aml_store(aml_arg(0), aml_name("ARG0"))); \
Could you describe the QEMU<->ASL interface in a separate spec
file (for example like docs/specs/acpi_mem_hotplug.txt)?
It will help with the review process, as there will be something to compare
the patches with.
Once that is finalized/agreed upon, it should be easy to review and probably
to write the corresponding patches.

Also I'd try to minimize the QEMU<->ASL interface and implement as much as
possible of the ASL logic in AML instead of pushing it into hardware (QEMU).
For example there isn't really any need to tell QEMU ARG0 (UUID); the _DSM
method could just compare UUIDs itself and execute a corresponding branch,
as sketched below.
Probably something else could be optimized as well, but that we can find out
during discussion over the QEMU<->ASL interface spec.
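
For example (a sketch using only aml_* helpers and the UUID that already
appear in this series; "method" is the _DSM method being built):

    /* keep the UUID check in AML so ARG0 never has to be exposed to QEMU */
    Aml *uuid = aml_touuid("4309AC30-0D11-11E4-9191-0800200C9A66");
    Aml *ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid)));

    aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */)));
    aml_append(method, ifctx);
    /* only handle/revision/function (plus the Arg3 payload) still go to QEMU */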

> +aml_append(_method_, aml_store(aml_arg(1), aml_name("ARG1"))); \
> +aml_append(_method_, aml_store(aml_arg(2), aml_name("ARG2"))); \
> +} while (0)
> +
> +#define NOTIFY_AND_RETURN_UNLOCK(_method_)   \
> +do {   \
> +aml_append(_method_, aml_store(aml_int(NOTIFY_VALUE),  \
> +   aml_name("NOTI"))); \
> +aml_append(_method_, aml_store(aml_name("RLEN"), aml_local(6)));   \
> +aml_append(_method_, aml_store(aml_shiftleft(aml_local(6), \
> +  aml_int(3)), aml_local(6))); \
> +aml_append(_method_, aml_create_field(aml_name("ODAT"), aml_int(0),\
> +  aml_local(6) , "OBUF")); \
> +aml_append(_method_, aml_name_decl("ZBUF", aml_buffer(0, NULL)));  \
> +aml_append(_method_, 

Re: [PATCH v3 00/32] implement vNVDIMM

2015-10-12 Thread Igor Mammedov
On Mon, 12 Oct 2015 11:06:20 +0800
Xiao Guangrong  wrote:

> 
> 
> On 10/12/2015 10:59 AM, Bharata B Rao wrote:
> > Xiao,
> >
> > Are these patches present in any git tree so that they can be easily tried 
> > out.
> >
> 
> Sorry, currently no git tree out of my workspace is available :(
Is it possible for you to put a working tree on GitHub?

> 
> BTW, this patchset is based on top of the commit b37686f7e on qemu tree:
> commit b37686f7e84b22cfaf7fd01ac5133f2617cc3027
> Merge: 8be6e62 98cf48f
> Author: Peter Maydell 
> Date:   Fri Oct 9 12:18:13 2015 +0100
> 
>  Merge remote-tracking branch 
> 'remotes/stefanha/tags/tracing-pull-request' into staging
> 
> Thanks.
> 
> > Regards,
> > Bharata.
> >
> > On Sun, Oct 11, 2015 at 9:22 AM, Xiao Guangrong
> >  wrote:
> >> Changelog in v3:
> >> There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> >> Michael for their valuable comments, the patchset finally gets better 
> >> shape.
> >> - changes from Igor's comments:
> >>1) abstract dimm device type from pc-dimm and create nvdimm device 
> >> based on
> >>   dimm, then it uses memory backend device as nvdimm's memory and NUMA 
> >> has
> >>   easily been implemented.
> >>2) let file-backend device support any kind of filesystem not only for
> >>   hugetlbfs and let it work on file not only for directory which is
> >>   achieved by extending 'mem-path' - if it's a directory then it works 
> >> as
> >>   current behavior, otherwise if it's file then directly allocates 
> >> memory
> >>   from it.
> >>3) we figure out a unused memory hole below 4G that is 0xFF0 ~
> >>   0xFFF0, this range is large enough for NVDIMM ACPI as build 
> >> 64-bit
> >>   ACPI SSDT/DSDT table will break windows XP.
> >>   BTW, only make SSDT.rev = 2 can not work since the width is only 
> >> depended
> >>   on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition 
> >> Block)
> >>   in ACPI spec:
> >> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> >> | width of Integer objects is dependent on the ComplianceRevision of the 
> >> DSDT.
> >> | If the ComplianceRevision is less than 2, all integers are restricted to 
> >> 32
> >> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT 
> >> sets
> >> | the global integer width for all integers, including integers in SSDTs.
> >>4) use the lowest ACPI spec version to document AML terms.
> >>5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
> >>
> >> - changes from Stefan's comments:
> >>1) do not do endian adjustment in-place since _DSM memory is visible to 
> >> guest
> >>2) use target platform's target page size instead of fixed PAGE_SIZE
> >>   definition
> >>3) lots of code style improvement and typo fixes.
> >>4) live migration fix
> >> - changes from Paolo's comments:
> >>1) improve the name of memory region
> >>
> >> - other changes:
> >>1) return exact buffer size for _DSM method instead of the page size.
> >>2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all 
> >> nvdimm
> >>   devices.
> >>3) NUMA support
> >>4) implement _FIT method
> >>5) rename "configdata" to "reserve-label-data"
> >>6) simplify _DSM arg3 determination
> >>7) main changelog update to let it reflect v3.
> >>
> >> Changlog in v2:
> >> - Use litten endian for DSM method, thanks for Stefan's suggestion
> >>
> >> - introduce a new parameter, @configdata, if it's false, Qemu will
> >>build a static and readonly namespace in memory and use it serveing
> >>for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
> >>reserved region is needed at the end of the @file, it is good for
> >>the user who want to pass whole nvdimm device and make its data
> >>completely be visible to guest
> >>
> >> - divide the source code into separated files and add maintain info
> >>
> >> BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will
> >> be posted on next week
> >>
> >> == Background ==
> >> NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
> >> on Intel's platform. They are discovered via ACPI and configured by _DSM
> >> method of NVDIMM device in ACPI. There has some supporting documents which
> >> can be found at:
> >> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> >> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> >> DSM Interface Example: 
> >> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> >> Driver Writer's Guide: 
> >> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> >>
> >> Currently, the NVDIMM driver has been merged into upstream Linux Kernel and
> >> this patchset tries to enable it in virtualization field
> >>
> >> == Design ==
> >> NVDIMM 

[PATCH] kvm: svm: reset mmu on VCPU reset

2015-09-18 Thread Igor Mammedov
When INIT/SIPI sequence is sent to VCPU which before that
was in use by OS, VMRUN might fail with:

 KVM: entry failed, hardware error 0x
 EAX= EBX= ECX= EDX=06d3
 ESI= EDI= EBP= ESP=
 EIP= EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =9a00 0009a000  9a00
 [...]
 CR0=6010 CR2=b6f3e000 CR3=01942000 CR4=07e0
 [...]
 EFER=

with corresponding SVM error:
 KVM: FAILED VMRUN WITH VMCB:
 [...]
 cpl:0efer: 1000
 cr0:80010010 cr2:  7fd7fe85bf90
 cr3:000187d0c000 cr4:  0020
 [...]

What happens is that the VCPU state right after offlining is:
CR0: 0x80050033  EFER: 0xd01  CR4: 0x7e0
  -> long mode with CR3 pointing to longmode page tables

and when VCPU gets INIT/SIPI following transition happens
CR0: 0 -> 0x6010 EFER: 0x0  CR4: 0x7e0
  -> paging disabled with stale CR3

However SVM under the hood puts VCPU in Paged Real Mode*
which effectively translates CR0 0x6010 -> 80010010 after

   svm_vcpu_reset()
   -> init_vmcb()
   -> kvm_set_cr0()
   -> svm_set_cr0()

but from  kvm_set_cr0() perspective CR0: 0 -> 0x6010
only caching bits are changed and
commit d81135a57aa6
 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")'
regressed svm_vcpu_reset() which relied on MMU being reset.

As a result, VMRUN after svm_vcpu_reset() tries to run the
VCPU in Paged Real Mode with a stale MMU context (long-mode page tables),
which causes some AMD CPUs** to bail out with VMEXIT_INVALID.

Fix issue by unconditionally resetting MMU context
at init_vmcb() time.

--
* AMD64 Architecture Programmer’s Manual,
Volume 2: System Programming, rev: 3.25
  15.19 Paged Real Mode
** Opteron 1216

Signed-off-by: Igor Mammedov <imamm...@redhat.com>
---
 arch/x86/kvm/svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index fdb8cb6..89173af 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1264,6 +1264,7 @@ static void init_vmcb(struct vcpu_svm *svm, bool 
init_event)
 * It also updates the guest-visible cr0 value.
 */
(void)kvm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
+   kvm_mmu_reset_context(&svm->vcpu);
 
save->cr4 = X86_CR4_PAE;
/* rdx = ?? */
-- 
1.8.3.1



Re: [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-17 Thread Igor Mammedov
On Thu, 17 Sep 2015 16:39:12 +0800
Xiao Guangrong  wrote:

> 
> 
> On 09/16/2015 12:10 AM, Paolo Bonzini wrote:
> >
> >
> > On 01/09/2015 11:14, Stefan Hajnoczi wrote:
> 
>  When I was digging into live migration code, i noticed that the same MR 
>  name may
>  cause the name "idstr", please refer to qemu_ram_set_idstr().
> 
>  Since nvdimm devices do not have parent-bus, it will trigger the abort() 
>  in that
>  function.
> >> I see.  The other devices that use a constant name are on a bus so the
> >> abort doesn't trigger.
> >
> > However, the MR name must be the same across the two machines.  Indices
> > are not friendly to hotplug.  Even though hotplug isn't supported now,
> > we should prepare and try not to change migration format when we support
> > hotplug in the future.
> >
> 
> Thanks for your reminder.
> 
> > Is there any other fixed value that we can use, for example the base
> > address of the NVDIMM?
> 
> How about use object_get_canonical_path(OBJECT(dev)) (the @dev is NVDIMM
> device) ?
if you use the split backend/frontend idea then existing backends
already have a stable name derived from the backend's ID, and you won't need
to care about it.



Re: [PATCH v2 09/18] nvdimm: build ACPI NFIT table

2015-09-15 Thread Igor Mammedov
On Tue, 15 Sep 2015 18:12:43 +0200
Paolo Bonzini  wrote:

> 
> 
> On 14/08/2015 16:52, Xiao Guangrong wrote:
> > NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> > 
> > Currently, we only support PMEM mode. Each device has 3 tables:
> > - SPA table, define the PMEM region info
> > 
> > - MEM DEV table, it has the @handle which is used to associate specified
> >   ACPI NVDIMM  device we will introduce in later patch.
> >   Also we can happily ignored the memory device's interleave, the real
> >   nvdimm hardware access is hidden behind host
> > 
> > - DCR table, it defines Vendor ID used to associate specified vendor
> >   nvdimm driver. Since we only implement PMEM mode this time, Command
> >   window and Data window are not needed
> > 
> > Signed-off-by: Xiao Guangrong 
> > ---
> >  hw/i386/acpi-build.c   |   3 +
> >  hw/mem/Makefile.objs   |   2 +-
> >  hw/mem/nvdimm/acpi.c   | 285 
> > +
> >  hw/mem/nvdimm/internal.h   |  29 +
> >  hw/mem/nvdimm/pc-nvdimm.c  |  27 -
> >  include/hw/mem/pc-nvdimm.h |   2 +
> >  6 files changed, 346 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/mem/nvdimm/acpi.c
> >  create mode 100644 hw/mem/nvdimm/internal.h
> > 
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 8ead1c1..092ed2f 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -39,6 +39,7 @@
> >  #include "hw/loader.h"
> >  #include "hw/isa/isa.h"
> >  #include "hw/acpi/memory_hotplug.h"
> > +#include "hw/mem/pc-nvdimm.h"
> >  #include "sysemu/tpm.h"
> >  #include "hw/acpi/tpm.h"
> >  #include "sysemu/tpm_backend.h"
> > @@ -1741,6 +1742,8 @@ void acpi_build(PcGuestInfo *guest_info, 
> > AcpiBuildTables *tables)
> >  build_dmar_q35(tables_blob, tables->linker);
> >  }
> >  
> > +pc_nvdimm_build_nfit_table(table_offsets, tables_blob, tables->linker);
> > +
> >  /* Add tables supplied by user (if any) */
> >  for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
> >  unsigned len = acpi_table_len(u);
> > diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
> > index 4df7482..7a6948d 100644
> > --- a/hw/mem/Makefile.objs
> > +++ b/hw/mem/Makefile.objs
> > @@ -1,2 +1,2 @@
> >  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
> > -common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o
> > +common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o
> > diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> > new file mode 100644
> > index 000..f28752f
> > --- /dev/null
> > +++ b/hw/mem/nvdimm/acpi.c
> > @@ -0,0 +1,285 @@
> > +/*
> > + * NVDIMM (A Non-Volatile Dual In-line Memory Module) NFIT Implement
> > + *
> > + * Copyright(C) 2015 Intel Corporation.
> > + *
> > + * Author:
> > + *  Xiao Guangrong 
> > + *
> > + * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table 
> > (NFIT)
> > + * and the DSM specfication can be found at:
> > + *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > + *
> > + * Currently, it only supports PMEM Virtualization.
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2 of the License, or (at your option) any later version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with this library; if not, see 
> > 
> > + */
> > +
> > +#include "qemu-common.h"
> > +
> > +#include "hw/acpi/aml-build.h"
> > +#include "hw/mem/pc-nvdimm.h"
> > +
> > +#include "internal.h"
> > +
> > +static void nfit_spa_uuid_pm(void *uuid)
> > +{
> > +uuid_le uuid_pm = UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d,
> > +  0x33, 0x18, 0xb7, 0x8c, 0xdb);
> > +memcpy(uuid, &uuid_pm, sizeof(uuid_pm));
> > +}
> > +
> > +enum {
> > +NFIT_TABLE_SPA = 0,
> > +NFIT_TABLE_MEM = 1,
> > +NFIT_TABLE_IDT = 2,
> > +NFIT_TABLE_SMBIOS = 3,
> > +NFIT_TABLE_DCR = 4,
> > +NFIT_TABLE_BDW = 5,
> > +NFIT_TABLE_FLUSH = 6,
> > +};
> > +
> > +enum {
> > +EFI_MEMORY_UC = 0x1ULL,
> > +EFI_MEMORY_WC = 0x2ULL,
> > +EFI_MEMORY_WT = 0x4ULL,
> > +EFI_MEMORY_WB = 0x8ULL,
> > +EFI_MEMORY_UCE = 0x10ULL,
> > +EFI_MEMORY_WP = 0x1000ULL,
> > +EFI_MEMORY_RP = 0x2000ULL,
> > +EFI_MEMORY_XP = 0x4000ULL,
> > +EFI_MEMORY_NV = 0x8000ULL,
> > +EFI_MEMORY_MORE_RELIABLE = 0x1ULL,
> > +};
> > +
> > +/*
> > + * struct nfit - 

Re: [Qemu-devel] [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-10 Thread Igor Mammedov
On Tue, 8 Sep 2015 21:38:17 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 09/07/2015 10:11 PM, Igor Mammedov wrote:
> > On Fri, 14 Aug 2015 22:52:01 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >> The parameter @file is used as backed memory for NVDIMM which is
> >> divided into two parts if @dataconfig is true:
> >> - first parts is (0, size - 128K], which is used as PMEM (Persistent
> >>Memory)
> >> - 128K at the end of the file, which is used as Config Data Area, it's
> >>used to store Label namespace data
> >>
> >> The @file supports both regular file and block device, of course we
> >> can assign any these two kinds of files for test and emulation, however,
> >> in the real word for performance reason, we usually used these files as
> >> NVDIMM backed file:
> >> - the regular file in the filesystem with DAX enabled created on NVDIMM
> >>device on host
> >> - the raw PMEM device on host, e,g /dev/pmem0
> >
> > A lot of code in this series could reuse what QEMU already
> > uses for implementing pc-dimm devices.
> >
> > here is common concepts that could be reused.
> >- on physical system both DIMM and NVDIMM devices use
> >  the same slots. We could share QEMU's '-m slots' option between
> >  both devices. An alternative to not sharing would be to introduce
> >  '-machine nvdimm_slots' option.
> >  And yes, we need to know number of NVDIMMs to describe
> >  them all in ACPI table (taking in amount future hotplug
> >  include in this possible NVDIMM devices)
> >  I'd go the same way as on real hardware on make them share the same 
> > slots.
> 
> I'd prefer sharing slots for pc-dimm and nvdimm, it's easier to reuse the
> logic of slot-assignment and plug/unplug.
> 
> >- they share the same physical address space and limits
> >  on how much memory system can handle. So I'd suggest sharing existing
> >  '-m maxmem' option and reuse hotplug_memory address space.
> 
> Sounds good to me.
> 
> >
> > Essentially what I'm suggesting is to inherit NVDIMM's implementation
> > from pc-dimm reusing all of its code/backends and
> > just override parts that do memory mapping into guest's address space to
> > accommodate NVDIMM's requirements.
> 
> Good idea!
> 
> We have to differentiate pc-dimm and nvdimm in the common code and nvdimm
> has different points with pc-dimm (for example, its has reserved-region, and
> need support live migration of label data). How about rename 'pc-nvdimm' to
> 'memory-device' and make it as a common device type, then build pc-dimm and
> nvdimm on top of it?
Sounds good, only I'd call it just 'dimm' as 'memory-device' is too broad.
Also I'd make the base class abstract, e.g. something like the sketch below.
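
A minimal sketch under those naming suggestions (TYPE_DIMM, DIMMDevice and
DIMMDeviceClass are illustrative names only):

static const TypeInfo dimm_info = {
    .name          = TYPE_DIMM,            /* common base for DIMM-like devices */
    .parent        = TYPE_DEVICE,
    .instance_size = sizeof(DIMMDevice),
    .class_size    = sizeof(DIMMDeviceClass),
    .abstract      = true,                 /* base type can't be instantiated */
};

static const TypeInfo pc_dimm_info = {
    .name   = TYPE_PC_DIMM,
    .parent = TYPE_DIMM,
};

static const TypeInfo nvdimm_info = {
    .name   = TYPE_NVDIMM,
    .parent = TYPE_DIMM,
};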

> 
> Something like:
> static TypeInfo memory_device_info = {
>  .name  = TYPE_MEM_DEV,
>  .parent= TYPE_DEVICE,
> };
> 
> static TypeInfo pc_dimm_info = {
> .name = TYPE_PC_DIMM,
> .parent = TYPE_MEM_DEV,
> };
> 
> static TypeInfo nvdimm_info = {
> .name = TYPE_NVDIMM,
> .parent = TYPE_MEM_DEV,
> };
> 
> It also make CONIFG_NVDIMM and CONFIG_HOT_PLUG be independent.
> 
> >
> >>
> >> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >> ---
> >>   hw/mem/nvdimm/pc-nvdimm.c  | 109 
> >> -
> >>   include/hw/mem/pc-nvdimm.h |   7 +++
> >>   2 files changed, 115 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
> >> index 7a270a8..97710d1 100644
> >> --- a/hw/mem/nvdimm/pc-nvdimm.c
> >> +++ b/hw/mem/nvdimm/pc-nvdimm.c
> >> @@ -22,12 +22,20 @@
> >>* License along with this library; if not, see 
> >> <http://www.gnu.org/licenses/>
> >>*/
> >>
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#include "exec/address-spaces.h"
> >>   #include "hw/mem/pc-nvdimm.h"
> >>
> >> -#define PAGE_SIZE  (1UL << 12)
> >> +#define PAGE_SIZE   (1UL << 12)
> >> +
> >> +#define MIN_CONFIG_DATA_SIZE(128 << 10)
> >>
> >>   static struct nvdimms_info {
> >>   ram_addr_t current_addr;
> >> +int device_index;
> >>   } nvdimms_info;
> >>
> >>   /* the address range [offset, ~0ULL) is reserved for NV

Re: [Qemu-devel] [PATCH v2 06/18] pc: implement NVDIMM device abstract

2015-09-10 Thread Igor Mammedov
On Tue, 8 Sep 2015 22:03:01 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 09/07/2015 09:40 PM, Igor Mammedov wrote:
> > On Sun, 6 Sep 2015 14:07:21 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >>
> >>
> >> On 09/02/2015 07:31 PM, Igor Mammedov wrote:
> >>> On Wed, 2 Sep 2015 18:36:43 +0800
> >>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 09/02/2015 05:58 PM, Igor Mammedov wrote:
> >>>>> On Fri, 14 Aug 2015 22:51:59 +0800
> >>>>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>>>
> >>>>>> Introduce "pc-nvdimm" device and it has two parameters:
> >>>>> Why do you use prefix "pc-", I suppose we potentially
> >>>>> could use this device not only with x86 targets but with
> >>>>> other targets as well.
> >>>>> I'd just drop 'pc' prefix through out patchset.
> >>>>
> >>>> Yeah, the prefix is stolen from pc-dimm, will drop this
> >>>> prefix as your suggestion.
> >>>>
> >>>>>
> >>>>>> - @file, which is the backed memory file for NVDIMM device
> >>>>> Could you try to split device into backend/frontend parts,
> >>>>> like it's done with pc-dimm. As I understand it's preferred
> >>>>> way to implement this kind of devices.
> >>>>> Then you could reuse memory backends that we already have
> >>>>> including file backend.
> >>>>
> >>>> I considered it too and Stefan, Paolo got the some idea in
> >>>> V1's review, however:
> >>>>
> >>>> | However, file-based memory used by NVDIMM is special, it divides the 
> >>>> file
> >>>> | to two parts, one part is used as PMEM and another part is used to 
> >>>> store
> >>>> | NVDIMM's configure data.
> >>>> |
> >>>> | Maybe we can introduce "end-reserved" property to reserve specified 
> >>>> size
> >>>> | at the end of the file. Or create a new class type based on
> >>>> | memory-backend-file (named nvdimm-backend-file) class to hide this 
> >>>> magic
> >>>> | thing?
> >>> I'd go with separate backend/frontend idea.
> >>>
> >>> Question is if this config area is part backend or frontend?
> >>
> >> Configdata area is used to store nvdimm device's configuration, normally, 
> >> it's
> >> namespace info.
> >>
> >> Currently, we chosen configdata located at the end of nvdimm's 
> >> backend-memory
> >> as it's easy to configure / use and configdata is naturally non-volatile 
> >> and it
> >> is like the layout on physical device.
> >>
> >> However, using two separated backed-memory is okay, for example:
> >> -object memory-backend-file,id=mem0,file=/storage/foo
> >> -object memory-backend-file,id=mem1,file=/storage/bar
> >> -device nvdimm,memdev=mem0,configdata=mem1
> >> then configdata is written to a single backend.
> >>
> >> Which one is better for you? :)
> >>
> >>> If we pass-through NVDIMM device do we need to set configdata=true
> >>> and QEMU would skip building config structures and use structures
> >>> that are already present on passed-through device in that place?
> >>>
> >>
> >> The file specified by @file is something like a normal disk, like 
> >> /dev/sda/,
> >> host process can use whole space on it. If we want to directly pass it to 
> >> guest,
> >> we can specify 'configdata=false'. If we allow guest to 'partition' (create
> >> namespace on) it then we use 'configdata=true' to reserve some space to 
> >> store
> >> its partition info (namesapce info).
> > As far as I understand currently linux provides to userspace only one 
> > interface
> > which is block device i.e. /dev/sdX and on top of it userspace can put
> > PM/DAX aware filesystem and use files from it. In either cases kernel
> > just provides access to separate namespaces and not to a whole NVDIMM which
> > includes 'labels area'. Hence /dev/sdX is not passed-though NVDIMM,
> > so we could consider it as just a file/storage that could be used by 
> > userspace.
> &g

Re: [Qemu-devel] [PATCH v2 06/18] pc: implement NVDIMM device abstract

2015-09-07 Thread Igor Mammedov
On Sun, 6 Sep 2015 14:07:21 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 09/02/2015 07:31 PM, Igor Mammedov wrote:
> > On Wed, 2 Sep 2015 18:36:43 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >>
> >>
> >> On 09/02/2015 05:58 PM, Igor Mammedov wrote:
> >>> On Fri, 14 Aug 2015 22:51:59 +0800
> >>> Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >>>
> >>>> Introduce "pc-nvdimm" device and it has two parameters:
> >>> Why do you use prefix "pc-", I suppose we potentially
> >>> could use this device not only with x86 targets but with
> >>> other targets as well.
> >>> I'd just drop 'pc' prefix through out patchset.
> >>
> >> Yeah, the prefix is stolen from pc-dimm, will drop this
> >> prefix as your suggestion.
> >>
> >>>
> >>>> - @file, which is the backed memory file for NVDIMM device
> >>> Could you try to split device into backend/frontend parts,
> >>> like it's done with pc-dimm. As I understand it's preferred
> >>> way to implement this kind of devices.
> >>> Then you could reuse memory backends that we already have
> >>> including file backend.
> >>
> >> I considered it too and Stefan, Paolo got the some idea in
> >> V1's review, however:
> >>
> >> | However, file-based memory used by NVDIMM is special, it divides the file
> >> | to two parts, one part is used as PMEM and another part is used to store
> >> | NVDIMM's configure data.
> >> |
> >> | Maybe we can introduce "end-reserved" property to reserve specified size
> >> | at the end of the file. Or create a new class type based on
> >> | memory-backend-file (named nvdimm-backend-file) class to hide this magic
> >> | thing?
> > I'd go with separate backend/frontend idea.
> >
> > Question is if this config area is part backend or frontend?
> 
> Configdata area is used to store nvdimm device's configuration, normally, it's
> namespace info.
> 
> Currently, we chosen configdata located at the end of nvdimm's backend-memory
> as it's easy to configure / use and configdata is naturally non-volatile and 
> it
> is like the layout on physical device.
> 
> However, using two separated backed-memory is okay, for example:
> -object memory-backend-file,id=mem0,file=/storage/foo
> -object memory-backend-file,id=mem1,file=/storage/bar
> -device nvdimm,memdev=mem0,configdata=mem1
> then configdata is written to a single backend.
> 
> Which one is better for you? :)
> 
> > If we pass-through NVDIMM device do we need to set configdata=true
> > and QEMU would skip building config structures and use structures
> > that are already present on passed-through device in that place?
> >
> 
> The file specified by @file is something like a normal disk, like /dev/sda/,
> host process can use whole space on it. If we want to directly pass it to 
> guest,
> we can specify 'configdata=false'. If we allow guest to 'partition' (create
> namespace on) it then we use 'configdata=true' to reserve some space to store
> its partition info (namesapce info).
As far as I understand, Linux currently provides userspace with only one interface,
which is a block device, i.e. /dev/sdX; on top of it userspace can put a
PM/DAX-aware filesystem and use files from it. In either case the kernel
just provides access to separate namespaces and not to the whole NVDIMM, which
includes the 'labels area'. Hence /dev/sdX is not a passed-through NVDIMM,
so we could consider it just a file/storage that could be used by userspace.

Let's assume that an NVDIMM should always have a 'labels area'.
In that case I'd always reserve space for it and
 * format it (build a new one) if the backend doesn't have a
   valid labels area, dropping the configdata parameter along the way
 * or, if the backing file already has a valid labels area, just use it.

If you need to make the labels area read-only you can introduce an
'NVDIMM.readonly_labels' option and just use the backend's labels without
allowing changes to be written back. It would be better to make that another
series on top of the basic NVDIMM implementation if there is an actual
use case for it.
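
For illustration only, roughly what I have in mind at realize time; the
label_area_is_valid()/label_area_format() helpers and the field names are
made-up placeholders, not existing QEMU API:

    static void nvdimm_realize(DeviceState *dev, Error **errp)
    {
        NVDIMMDevice *nvdimm = NVDIMM(dev);
        uint64_t backend_size = memory_region_size(&nvdimm->backend_mr);
        /* the last LABEL_AREA_SIZE bytes of the backend are always the labels area */
        char *labels = (char *)memory_region_get_ram_ptr(&nvdimm->backend_mr) +
                       backend_size - LABEL_AREA_SIZE;

        if (!label_area_is_valid(labels, LABEL_AREA_SIZE)) {
            /* backend has no usable labels area yet: build a fresh one */
            label_area_format(labels, LABEL_AREA_SIZE);
        } /* otherwise just reuse whatever labels the backing file already has */
    }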

PS:
Also, when you write commit messages, comments and variable names, try to use
terms from the relevant spec, and cite the spec wherever you describe data
structures taken from it.


Re: [Qemu-ppc] KVM memory slots limit on powerpc

2015-09-07 Thread Igor Mammedov
On Fri, 4 Sep 2015 12:04:41 +0200
Alexander Graf  wrote:

> 
> 
> On 04.09.15 11:59, Christian Borntraeger wrote:
> > Am 04.09.2015 um 11:35 schrieb Thomas Huth:
> >>
> >>  Hi all,
> >>
> >> now that we get memory hotplugging for the spapr machine on qemu-ppc,
> >> too, it seems like we easily can hit the amount of KVM-internal memory
> >> slots now ("#define KVM_USER_MEM_SLOTS 32" in
> >> arch/powerpc/include/asm/kvm_host.h). For example, start
> >> qemu-system-ppc64 with a couple of "-device secondary-vga" and "-m
> >> 4G,slots=32,maxmem=40G" and then try to hot-plug all 32 DIMMs ... and
> >> you'll see that it aborts way earlier already.
> >>
> >> The x86 code already increased the amount of KVM_USER_MEM_SLOTS to 509
> >> already (+3 internal slots = 512) ... maybe we should now increase the
> >> amount of slots on powerpc, too? Since we don't use internal slots on
> >> POWER, would 512 be a good value? Or would less be sufficient, too?
> > 
> > When you are at it, the s390 value should also be increased I guess.
> 
> That constant defines the array size for the memslot array in struct kvm
> which in turn again gets allocated by kzalloc, so it's pinned kernel
> memory that is physically contiguous. Doing big allocations can turn
> into problems during runtime.
> 
> So maybe there is another way? Can we extend the memslot array size
> dynamically somehow? Allocate it separately? How much memory does the
> memslot array use up with 512 entries?

KVM switched memslots allocation to kvm_kvzalloc(), so it would fall back to
vmalloc:
 commit 744961341d472db6272ed9b42319a90f5a2aa7c4
 kvm: avoid page allocation failure in kvm_set_memory_region()
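
For reference, that helper is roughly the following (quoting from memory,
see the commit above for the exact code):

    static void *kvm_kvzalloc(unsigned long size)
    {
        if (size > PAGE_SIZE)
            return vzalloc(size);   /* no need for physically contiguous memory */
        else
            return kzalloc(size, GFP_KERNEL);
    }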

> 
> Alex



Re: [Qemu-devel] [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-07 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:52:01 +0800
Xiao Guangrong  wrote:

> The parameter @file is used as backed memory for NVDIMM which is
> divided into two parts if @dataconfig is true:
> - first parts is (0, size - 128K], which is used as PMEM (Persistent
>   Memory)
> - 128K at the end of the file, which is used as Config Data Area, it's
>   used to store Label namespace data
> 
> The @file supports both regular file and block device, of course we
> can assign any these two kinds of files for test and emulation, however,
> in the real word for performance reason, we usually used these files as
> NVDIMM backed file:
> - the regular file in the filesystem with DAX enabled created on NVDIMM
>   device on host
> - the raw PMEM device on host, e,g /dev/pmem0

A lot of code in this series could reuse what QEMU already
uses for implementing pc-dimm devices.

Here are the common concepts that could be reused:
  - on a physical system both DIMM and NVDIMM devices use
the same slots. We could share QEMU's '-m slots' option between
both devices. The alternative of not sharing would be to introduce
a '-machine nvdimm_slots' option.
And yes, we need to know the number of NVDIMMs to describe
them all in the ACPI table (taking into account possible future
hotplug of NVDIMM devices).
I'd go the same way as real hardware and make them share the same slots;
see the CLI sketch below.
  - they share the same physical address space and the limits
on how much memory the system can handle. So I'd suggest sharing the existing
'-m maxmem' option and reusing the hotplug_memory address space.
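
As a CLI sketch of what I mean (assuming nvdimm ends up inheriting pc-dimm's
memdev/slot properties, which it doesn't have yet):

    -m 4G,slots=32,maxmem=40G \
    -object memory-backend-file,id=nv0,mem-path=/dev/pmem0,size=4G \
    -device nvdimm,memdev=nv0,slot=0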

Essentially what I'm suggesting is to inherit NVDIMM's implementation
from pc-dimm reusing all of its code/backends and
just override parts that do memory mapping into guest's address space to
accommodate NVDIMM's requirements.
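
i.e. something like this (only a sketch, the type/callback names are placeholders):

    static const TypeInfo nvdimm_info = {
        .name          = "nvdimm",
        .parent        = TYPE_PC_DIMM,        /* inherit pc-dimm's code and backends */
        .instance_size = sizeof(NVDIMMDevice),
        .class_init    = nvdimm_class_init,   /* override only realize/memory mapping */
    };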

> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/mem/nvdimm/pc-nvdimm.c  | 109 
> -
>  include/hw/mem/pc-nvdimm.h |   7 +++
>  2 files changed, 115 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
> index 7a270a8..97710d1 100644
> --- a/hw/mem/nvdimm/pc-nvdimm.c
> +++ b/hw/mem/nvdimm/pc-nvdimm.c
> @@ -22,12 +22,20 @@
>   * License along with this library; if not, see 
> 
>   */
>  
> +#include 
> +#include 
> +#include 
> +
> +#include "exec/address-spaces.h"
>  #include "hw/mem/pc-nvdimm.h"
>  
> -#define PAGE_SIZE  (1UL << 12)
> +#define PAGE_SIZE   (1UL << 12)
> +
> +#define MIN_CONFIG_DATA_SIZE(128 << 10)
>  
>  static struct nvdimms_info {
>  ram_addr_t current_addr;
> +int device_index;
>  } nvdimms_info;
>  
>  /* the address range [offset, ~0ULL) is reserved for NVDIMM. */
> @@ -37,6 +45,26 @@ void pc_nvdimm_reserve_range(ram_addr_t offset)
>  nvdimms_info.current_addr = offset;
>  }
>  
> +static ram_addr_t reserved_range_push(uint64_t size)
> +{
> +uint64_t current;
> +
> +current = ROUND_UP(nvdimms_info.current_addr, PAGE_SIZE);
> +
> +/* do not have enough space? */
> +if (current + size < current) {
> +return 0;
> +}
> +
> +nvdimms_info.current_addr = current + size;
> +return current;
> +}
You can't use all of the memory above the hotplug_memory area, since
we have to tell the guest where the 64-bit PCI window starts,
and currently it should start at reserved-memory-end
(but it doesn't, due to a bug; I've just posted a fix to qemu-devel:
 "[PATCH 0/2] pc: fix 64-bit PCI window clashing with memory hotplug region"
)
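
In other words, if NVDIMMs get their own fixed-size window right above the
hotplug_memory area, etc/reserved-memory-end (which firmware uses as the
start of the 64-bit PCI window) has to be bumped to cover it; a rough sketch,
assuming a pcms->nvdimm_memory field similar to hotplug_memory:

    *val = cpu_to_le64(ROUND_UP(pcms->nvdimm_memory.base +
                                memory_region_size(&pcms->nvdimm_memory.mr),
                                1ULL << 30));
    fw_cfg_add_file(fw_cfg, "etc/reserved-memory-end", val, sizeof(*val));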

> +
> +static uint32_t new_device_index(void)
> +{
> +return nvdimms_info.device_index++;
> +}
> +
>  static char *get_file(Object *obj, Error **errp)
>  {
>  PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
> @@ -48,6 +76,11 @@ static void set_file(Object *obj, const char *str, Error 
> **errp)
>  {
>  PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
>  
> +if (memory_region_size(&nvdimm->mr)) {
> +error_setg(errp, "cannot change property value");
> +return;
> +}
> +
>  if (nvdimm->file) {
>  g_free(nvdimm->file);
>  }
> @@ -76,13 +109,87 @@ static void pc_nvdimm_init(Object *obj)
>   set_configdata, NULL);
>  }
>  
> +static uint64_t get_file_size(int fd)
> +{
> +struct stat stat_buf;
> +uint64_t size;
> +
> +if (fstat(fd, &stat_buf) < 0) {
> +return 0;
> +}
> +
> +if (S_ISREG(stat_buf.st_mode)) {
> +return stat_buf.st_size;
> +}
> +
> +if (S_ISBLK(stat_buf.st_mode) && !ioctl(fd, BLKGETSIZE64, &size)) {
> +return size;
> +}
> +
> +return 0;
> +}
All this file handling I'd leave to the already existing backends like
memory-backend-file or even memory-backend-ram, which already do the
above and more, allowing persistent and volatile NVDIMMs to be configured
without changing the NVDIMM frontend code.
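
e.g. then the CLI would just be (existing backend options; the nvdimm
frontend and its memdev property are still to be added):

    -object memory-backend-file,id=nv0,mem-path=/dev/pmem0,size=4G \
    -device nvdimm,memdev=nv0

or, for a volatile NVDIMM useful for testing:

    -object memory-backend-ram,id=nv1,size=4G \
    -device nvdimm,memdev=nv1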

> +
>  static void pc_nvdimm_realize(DeviceState *dev, Error **errp)
>  {
>  PCNVDIMMDevice *nvdimm = 

Re: [Qemu-devel] [PATCH v2 07/18] nvdimm: reserve address range for NVDIMM

2015-09-04 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:52:00 +0800
Xiao Guangrong  wrote:

> NVDIMM reserves all the free range above 4G to do:
> - Persistent Memory (PMEM) mapping
> - implement NVDIMM ACPI device _DSM method
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/i386/pc.c   | 12 ++--
>  hw/mem/nvdimm/pc-nvdimm.c  | 13 +
>  include/hw/mem/pc-nvdimm.h |  1 +
>  3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 7661ea9..41af6ea 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -64,6 +64,7 @@
>  #include "hw/pci/pci_host.h"
>  #include "acpi-build.h"
>  #include "hw/mem/pc-dimm.h"
> +#include "hw/mem/pc-nvdimm.h"
>  #include "qapi/visitor.h"
>  #include "qapi-visit.h"
>  
> @@ -1302,6 +1303,7 @@ FWCfgState *pc_memory_init(MachineState *machine,
>  MemoryRegion *ram_below_4g, *ram_above_4g;
>  FWCfgState *fw_cfg;
>  PCMachineState *pcms = PC_MACHINE(machine);
> +ram_addr_t offset;
>  
>  assert(machine->ram_size == below_4g_mem_size + above_4g_mem_size);
>  
> @@ -1339,6 +1341,8 @@ FWCfgState *pc_memory_init(MachineState *machine,
>  exit(EXIT_FAILURE);
>  }
>  
> +offset = 0x100000000ULL + above_4g_mem_size;
> +
>  /* initialize hotplug memory address space */
>  if (guest_info->has_reserved_memory &&
>  (machine->ram_size < machine->maxram_size)) {
> @@ -1358,8 +1362,7 @@ FWCfgState *pc_memory_init(MachineState *machine,
>  exit(EXIT_FAILURE);
>  }
>  
> -pcms->hotplug_memory.base =
> -ROUND_UP(0x100000000ULL + above_4g_mem_size, 1ULL << 30);
> +pcms->hotplug_memory.base = ROUND_UP(offset, 1ULL << 30);
>  
>  if (pcms->enforce_aligned_dimm) {
>  /* size hotplug region assuming 1G page max alignment per slot */
> @@ -1377,8 +1380,13 @@ FWCfgState *pc_memory_init(MachineState *machine,
> "hotplug-memory", hotplug_mem_size);
>  memory_region_add_subregion(system_memory, pcms->hotplug_memory.base,
> &pcms->hotplug_memory.mr);
> +
> +offset = pcms->hotplug_memory.base + hotplug_mem_size;
>  }
>  
> + /* all the space left above 4G is reserved for NVDIMM. */
> +pc_nvdimm_reserve_range(offset);
I'd drop 'offset' in this patch and just use:
  foo(pcms->hotplug_memory.base + hotplug_mem_size)

> +
>  /* Initialize PC system firmware */
>  pc_system_firmware_init(rom_memory, guest_info->isapc_ram_fw);
>  
> diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
> index a53d235..7a270a8 100644
> --- a/hw/mem/nvdimm/pc-nvdimm.c
> +++ b/hw/mem/nvdimm/pc-nvdimm.c
> @@ -24,6 +24,19 @@
>  
>  #include "hw/mem/pc-nvdimm.h"
>  
> +#define PAGE_SIZE  (1UL << 12)
> +
> +static struct nvdimms_info {
> +ram_addr_t current_addr;
> +} nvdimms_info;
No globals please; so far this looks like pcms->hotplug_memory,
so add a similar nvdimm_memory field to PCMachineState.
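
i.e. something along these lines (sketch; it reuses the existing
MemoryHotplugState type, the field name is made up):

    struct PCMachineState {
        ...
        MemoryHotplugState hotplug_memory;
        MemoryHotplugState nvdimm_memory;   /* base + container MR for NVDIMMs */
        ...
    };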

> +
> +/* the address range [offset, ~0ULL) is reserved for NVDIMM. */
> +void pc_nvdimm_reserve_range(ram_addr_t offset)
Do you plan to reuse this function? If not, just inline it at the call site.

> +{
> +offset = ROUND_UP(offset, PAGE_SIZE);
I'd suggest rounding up to 1GB as we do with memory hotplug.
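
i.e. (same ROUND_UP as above, just with 1GB alignment):

    offset = ROUND_UP(offset, 1ULL << 30);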

> +nvdimms_info.current_addr = offset;
> +}
> +
>  static char *get_file(Object *obj, Error **errp)
>  {
>  PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
> diff --git a/include/hw/mem/pc-nvdimm.h b/include/hw/mem/pc-nvdimm.h
> index 51152b8..8601e9b 100644
> --- a/include/hw/mem/pc-nvdimm.h
> +++ b/include/hw/mem/pc-nvdimm.h
> @@ -28,4 +28,5 @@ typedef struct PCNVDIMMDevice {
>  #define PC_NVDIMM(obj) \
>  OBJECT_CHECK(PCNVDIMMDevice, (obj), TYPE_PC_NVDIMM)
>  
> +void pc_nvdimm_reserve_range(ram_addr_t offset);
>  #endif



Re: [PATCH v2 01/18] acpi: allow aml_operation_region() working on 64 bit offset

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:54 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> Currently, the offset in OperationRegion is limited to 32 bit, extend it
> to 64 bit so that we can switch SSDT to 64 bit in later patch
> 
> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>

> ---
>  hw/acpi/aml-build.c | 2 +-
>  include/hw/acpi/aml-build.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 0d4b324..02f9e3d 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -752,7 +752,7 @@ Aml *aml_package(uint8_t num_elements)
>  
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefOpRegion */
>  Aml *aml_operation_region(const char *name, AmlRegionSpace rs,
> -  uint32_t offset, uint32_t len)
> +  uint64_t offset, uint32_t len)
>  {
>  Aml *var = aml_alloc();
>  build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index e3afa13..996ac5b 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -222,7 +222,7 @@ Aml *aml_interrupt(AmlConsumerAndProducer con_and_pro,
>  Aml *aml_io(AmlIODecode dec, uint16_t min_base, uint16_t max_base,
>  uint8_t aln, uint8_t len);
>  Aml *aml_operation_region(const char *name, AmlRegionSpace rs,
> -  uint32_t offset, uint32_t len);
> +  uint64_t offset, uint32_t len);
>  Aml *aml_irq_no_flags(uint8_t irq);
>  Aml *aml_named_field(const char *name, unsigned length);
>  Aml *aml_reserved_field(unsigned length);



Re: [Qemu-devel] [PATCH v2 06/18] pc: implement NVDIMM device abstract

2015-09-02 Thread Igor Mammedov
On Wed, 2 Sep 2015 18:36:43 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 09/02/2015 05:58 PM, Igor Mammedov wrote:
> > On Fri, 14 Aug 2015 22:51:59 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >> Introduce "pc-nvdimm" device and it has two parameters:
> > Why do you use prefix "pc-", I suppose we potentially
> > could use this device not only with x86 targets but with
> > other targets as well.
> > I'd just drop 'pc' prefix through out patchset.
> 
> Yeah, the prefix is stolen from pc-dimm, will drop this
> prefix as your suggestion.
> 
> >
> >> - @file, which is the backed memory file for NVDIMM device
> > Could you try to split device into backend/frontend parts,
> > like it's done with pc-dimm. As I understand it's preferred
> > way to implement this kind of devices.
> > Then you could reuse memory backends that we already have
> > including file backend.
> 
> I considered it too and Stefan, Paolo got the some idea in
> V1's review, however:
> 
> | However, file-based memory used by NVDIMM is special, it divides the file
> | to two parts, one part is used as PMEM and another part is used to store
> | NVDIMM's configure data.
> |
> | Maybe we can introduce "end-reserved" property to reserve specified size
> | at the end of the file. Or create a new class type based on
> | memory-backend-file (named nvdimm-backend-file) class to hide this magic
> | thing?
I'd go with separate backend/frontend idea.

Question is if this config area is part backend or frontend?
If we pass-through NVDIMM device do we need to set configdata=true
and QEMU would skip building config structures and use structures
that are already present on passed-through device in that place?


> 
> Your idea?
[...]


Re: [PATCH v2 05/18] acpi: add aml_create_field

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:58 +0800
Xiao Guangrong  wrote:

> Implement CreateField term which are used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 14 ++
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 15 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index a526eed..debdad2 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1151,6 +1151,20 @@ Aml *aml_sizeof(Aml *arg)
>  return var;
>  }
>  
> +/* ACPI 6.0: 20.2.5.2 Named Objects Encoding: DefCreateField */
Ditto, refer to the first spec revision where it appeared.

> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
index and len can only be of 'Integer' type, so there is no point
in passing them in as Aml; just use uintFOO_t here and convert
them with aml_int() internally. That way call sites will be smaller
and have less chance of passing a wrong Aml variable.
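
i.e. roughly this (a sketch of the suggested signature, using aml_int() to
build the Integer arguments internally):

    Aml *aml_create_field(Aml *srcbuf, uint32_t index, uint32_t len, const char *name)
    {
        Aml *var = aml_alloc();

        build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
        build_append_byte(var->buf, 0x13); /* CreateFieldOp */
        aml_append(var, srcbuf);
        aml_append(var, aml_int(index));   /* BitIndex as Integer */
        aml_append(var, aml_int(len));     /* NumBits as Integer */
        build_append_namestring(var->buf, "%s", name);
        return var;
    }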

> +{
> +Aml *var = aml_alloc();
> +
drop newline

> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x13); /* CreateFieldOp */
> +aml_append(var, srcbuf);
> +aml_append(var, index);
> +aml_append(var, len);
> +build_append_namestring(var->buf, "%s", name);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 6b591ab..d4dbd44 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -277,6 +277,7 @@ Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [Qemu-devel] [PATCH v2 02/18] i386/acpi-build: allow SSDT to operate on 64 bit

2015-09-02 Thread Igor Mammedov
On Wed, 2 Sep 2015 18:43:41 +0800
Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:

> 
> 
> On 09/02/2015 06:06 PM, Igor Mammedov wrote:
> > On Fri, 14 Aug 2015 22:51:55 +0800
> > Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:
> >
> >> Only 512M is left for MMIO below 4G and that are used by PCI, BIOS etc.
> >> Other components also reserve regions from their internal usage, e.g,
> >> [0xFED00000, 0xFED00000 + 0x400) is reserved for HPET
> >>
> >> Switch SSDT to 64 bit to use the huge free room above 4G. In the later
> >> patches, we will dynamical allocate free space within this region which
> >> is used by NVDIMM _DSM method
> >>
> >> Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >> ---
> >>   hw/i386/acpi-build.c  | 4 ++--
> >>   hw/i386/acpi-dsdt.dsl | 2 +-
> >>   2 files changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> >> index 46eddb8..8ead1c1 100644
> >> --- a/hw/i386/acpi-build.c
> >> +++ b/hw/i386/acpi-build.c
> >> @@ -1348,7 +1348,7 @@ build_ssdt(GArray *table_data, GArray *linker,
> >>   g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> >>   build_header(linker, table_data,
> >>   (void *)(table_data->data + table_data->len - ssdt->buf->len),
> >> -"SSDT", ssdt->buf->len, 1);
> >> +"SSDT", ssdt->buf->len, 2);
> > That might break Windows XP, since it supports only 1.0b ACPI with some
> > 2.0 extensions.
> > there is 2 way to work around it:
> >   - add an additional Rev2 ssdt table if NVDIMMs are present
> > and describe them there
> 
> I like this way, it's more straightforward to me.
> 
> BTW, IIUC the DSDT still need to be changed to Rev2 to recognise SSDT with 
> Rev2,
> does it hurt Windows XP?
Probably it will, but why should the DSDT be v2 for one of the SSDTs to be v2?
They are separate tables.

Also you might find following interesting wrt Windows compatibility
http://www.acpi.info/presentations/S01USMOBS169_OS%20new.ppt




Re: [PATCH v2 02/18] i386/acpi-build: allow SSDT to operate on 64 bit

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:55 +0800
Xiao Guangrong  wrote:

> Only 512M is left for MMIO below 4G and that are used by PCI, BIOS etc.
> Other components also reserve regions from their internal usage, e.g,
> [0xFED00000, 0xFED00000 + 0x400) is reserved for HPET
> 
> Switch SSDT to 64 bit to use the huge free room above 4G. In the later
> patches, we will dynamical allocate free space within this region which
> is used by NVDIMM _DSM method
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/i386/acpi-build.c  | 4 ++--
>  hw/i386/acpi-dsdt.dsl | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 46eddb8..8ead1c1 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1348,7 +1348,7 @@ build_ssdt(GArray *table_data, GArray *linker,
>  g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
>  build_header(linker, table_data,
>  (void *)(table_data->data + table_data->len - ssdt->buf->len),
> -"SSDT", ssdt->buf->len, 1);
> +"SSDT", ssdt->buf->len, 2);
That might break Windows XP, since it supports only 1.0b ACPI with some
2.0 extensions.
There are two ways to work around it:
 - add an additional Rev2 SSDT table if NVDIMMs are present
   and describe them there (see the sketch after this list)
 - make sure that you use only 32-bit arithmetic in AML
   (and emulate 64-bit like it has been done for memory hotplug)
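
For the first option, a rough (untested) sketch reusing the helpers already
visible in this patch; nvdimm_present and the NVDIMM-specific AML are
placeholders:

    if (nvdimm_present) {
        Aml *ssdt2 = init_aml_allocator();

        /* ... build NVDIMM-only objects here, 64-bit arithmetic is fine ... */

        g_array_append_vals(table_data, ssdt2->buf->data, ssdt2->buf->len);
        build_header(linker, table_data,
            (void *)(table_data->data + table_data->len - ssdt2->buf->len),
            "SSDT", ssdt2->buf->len, 2 /* rev 2 => 64-bit AML integers */);
        free_aml_allocator();
    }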

>  free_aml_allocator();
>  }
>  
> @@ -1586,7 +1586,7 @@ build_dsdt(GArray *table_data, GArray *linker, 
> AcpiMiscInfo *misc)
>  
>  memset(dsdt, 0, sizeof *dsdt);
>  build_header(linker, table_data, dsdt, "DSDT",
> - misc->dsdt_size, 1);
> + misc->dsdt_size, 2);
>  }
>  
>  static GArray *
> diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
> index a2d84ec..5cd3f0e 100644
> --- a/hw/i386/acpi-dsdt.dsl
> +++ b/hw/i386/acpi-dsdt.dsl
> @@ -22,7 +22,7 @@ ACPI_EXTRACT_ALL_CODE AcpiDsdtAmlCode
>  DefinitionBlock (
>  "acpi-dsdt.aml",// Output Filename
>  "DSDT", // Signature
> -0x01,   // DSDT Compliance Revision
> +0x02,   // DSDT Compliance Revision
>  "BXPC", // OEMID
>  "BXDSDT",   // TABLE ID
>  0x1 // OEM Revision



Re: [PATCH v2 03/18] acpi: add aml_derefof

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:56 +0800
Xiao Guangrong  wrote:

> Implement DeRefOf term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 02f9e3d..9e89efc 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1135,6 +1135,14 @@ Aml *aml_unicode(const char *str)
>  return var;
>  }
>  
> +/* ACPI 6.0: 20.2.5.4 Type 2 Opcodes Encoding: DefDerefOf */
Please put here the lowest doc revision where the term first appeared.

> +Aml *aml_derefof(Aml *arg)
> +{
> +Aml *var = aml_opcode(0x83 /* DerefOfOp */);
> +aml_append(var, arg);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 996ac5b..21dc5e9 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -275,6 +275,7 @@ Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, 
> const char *name);
>  Aml *aml_varpackage(uint32_t num_elements);
>  Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
> +Aml *aml_derefof(Aml *arg);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [Qemu-devel] [PATCH v2 06/18] pc: implement NVDIMM device abstract

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:59 +0800
Xiao Guangrong  wrote:

> Introduce "pc-nvdimm" device and it has two parameters:
Why do you use the prefix "pc-"? I suppose we could potentially
use this device not only with x86 targets but with
other targets as well.
I'd just drop the 'pc' prefix throughout the patchset.

> - @file, which is the backed memory file for NVDIMM device
Could you try to split the device into backend/frontend parts,
like it's done with pc-dimm? As I understand it, that's the preferred
way to implement this kind of device.
Then you could reuse the memory backends that we already have,
including the file backend.

So CLI could look like:
-object memory-backend-file,id=mem0,file=/storage/foo
-device nvdimm,memdev=mem0,configdata=on

> 
> - @configdata, specify if we need to reserve 128k at the end of
>   @file for nvdimm device's config data. Default is false
> 
> If @configdata is false, Qemu will build a static and readonly
> namespace in memory and use it serveing for
> DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests.
> This is good for the user who want to pass whole nvdimm device
> and make its data is complete visible to guest
> 
> We can use "-device pc-nvdimm,file=/dev/pmem,configdata" in the
> Qemu command to create NVDIMM device for the guest
PS:
please try to fix commit message spelling/grammar wise.

[...]
> +++ b/hw/mem/nvdimm/pc-nvdimm.c
> @@ -0,0 +1,99 @@
> +/*
> + * NVDIMM (A Non-Volatile Dual In-line Memory Module) Virtualization 
> Implement
s/Implement/Implementation/ in all new files
and maybe s/NVDIMM (A // as it's redundant

[...]
> +static bool has_configdata(Object *obj, Error **errp)
> +{
> +PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
> +
> +return nvdimm->configdata;
> +}
> +
> +static void set_configdata(Object *obj, bool value, Error **errp)
> +{
> +PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
> +
> +nvdimm->configdata = value;
> +}
Usually for property setters/getters we use the form:
 "device_prefix"_[g|s]et_foo
so
 nvdimm_get_configdata ...

[...]



Re: [PATCH v2 04/18] acpi: add aml_sizeof

2015-09-02 Thread Igor Mammedov
On Fri, 14 Aug 2015 22:51:57 +0800
Xiao Guangrong  wrote:

> Implement SizeOf term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 9e89efc..a526eed 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1143,6 +1143,14 @@ Aml *aml_derefof(Aml *arg)
>  return var;
>  }
>  
> +/* ACPI 6.0: 20.2.5.4 Type 2 Opcodes Encoding: DefSizeOf */
Ditto, refer to the first spec revision where it appeared.

> +Aml *aml_sizeof(Aml *arg)
> +{
> +Aml *var = aml_opcode(0x87 /* SizeOfOp */);
> +aml_append(var, arg);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 21dc5e9..6b591ab 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -276,6 +276,7 @@ Aml *aml_varpackage(uint32_t num_elements);
>  Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
> +Aml *aml_sizeof(Aml *arg);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,



Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-30 Thread Igor Mammedov
On Thu, 30 Jul 2015 09:33:57 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jul 30, 2015 at 08:26:03AM +0200, Igor Mammedov wrote:
  On Wed, 29 Jul 2015 18:28:26 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
although now there is vhost module max_mem_regions option
to set custom limit it doesn't help for default setups,
since it requires administrator manually set a higher
limit on each host. Which complicates servers deployments
and management.
Rise limit to the same value as KVM has (509 slots max),
so that default deployments would work out of box.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
Users that would want to lock down vhost could still
use max_mem_regions option to set lower limit, but
I expect it would be minority.
   
   I'm not inclined to merge this.
   
   Once we change this we can't take it back. It's not a decision
   to be taken lightly.
In addition, you already gave out control over the limit by allowing
it to be raised via a module parameter. Raising the default is just a way
to reduce pain for users that would try to use more than 64
slots.

  considering that continuous HVA idea has failed, why would you
  want to take limit back in the future if we rise it now?
 
 I'm not sure.
 
 I think you merely demonstrated it's a big change for userspace -
 not that it's unfeasible.
It's not a big change but rather an ugly one, being unportable,
enforcing unnecessary (and not really reasonable) restrictions
on memory backends and changing the memory unplug management workflow
depending on whether HVA is used or not.

 Alternatively, if we want an unlimited size table, we should keep it
 in userspace memory.
This patch doesn't propose an unlimited table size.
With the proposed limit we are talking about at most order-4 allocations
that can fall back to vmalloc if needed. And it makes vhost consistent
with KVM's limit, which has a similar table.
The proposed limit also neatly works around corner cases of
existing old userspace that can't cope when it hits the limit.


 
   
   And memory hotplug users are a minority.  Out of these, users with a
   heavily fragmented PA space due to hotplug abuse are an even smaller
   minority.
   
---
 include/uapi/linux/vhost.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 2511954..92657bf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -140,7 +140,7 @@ struct vhost_memory {
 #define VHOST_MEM_MAX_NREGIONS_NONE 0
 /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
  * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS 
support. */
-#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
 
 /* VHOST_NET specific defines */
 
-- 
1.8.3.1



Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-30 Thread Igor Mammedov
On Thu, 30 Jul 2015 09:33:57 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jul 30, 2015 at 08:26:03AM +0200, Igor Mammedov wrote:
  On Wed, 29 Jul 2015 18:28:26 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
although now there is vhost module max_mem_regions option
to set custom limit it doesn't help for default setups,
since it requires administrator manually set a higher
limit on each host. Which complicates servers deployments
and management.
Rise limit to the same value as KVM has (509 slots max),
so that default deployments would work out of box.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
Users that would want to lock down vhost could still
use max_mem_regions option to set lower limit, but
I expect it would be minority.
   
   I'm not inclined to merge this.
   
   Once we change this we can't take it back. It's not a decision
   to be taken lightly.
  considering that continuous HVA idea has failed, why would you
  want to take limit back in the future if we rise it now?
 
 I'm not sure.
 
 I think you merely demonstrated it's a big change for userspace -
 not that it's unfeasible.
 
 Alternatively, if we want an unlimited size table, we should keep it
 in userspace memory.
btw:
if the table were a simple array and the kernel did an inefficient linear scan
to do the translation, then I guess we could use userspace memory.

But I'm afraid we can't trust userspace in the case of a more elaborate
structure. Even if it's just a binary search over a sorted array,
it would be possible for userspace to hang a kernel thread in
translate_desc() by providing a corrupted or wrongly sorted table.
And we can't afford table validation on the hot path.


 
   
   And memory hotplug users are a minority.  Out of these, users with a
   heavily fragmented PA space due to hotplug abuse are an even smaller
   minority.
   
---
 include/uapi/linux/vhost.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 2511954..92657bf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -140,7 +140,7 @@ struct vhost_memory {
 #define VHOST_MEM_MAX_NREGIONS_NONE 0
 /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
  * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS 
support. */
-#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
 
 /* VHOST_NET specific defines */
 
-- 
1.8.3.1



Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-30 Thread Igor Mammedov
On Wed, 29 Jul 2015 18:28:26 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
  although now there is vhost module max_mem_regions option
  to set custom limit it doesn't help for default setups,
  since it requires administrator manually set a higher
  limit on each host. Which complicates servers deployments
  and management.
  Rise limit to the same value as KVM has (509 slots max),
  so that default deployments would work out of box.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
  ---
  PS:
  Users that would want to lock down vhost could still
  use max_mem_regions option to set lower limit, but
  I expect it would be minority.
 
 I'm not inclined to merge this.
 
 Once we change this we can't take it back. It's not a decision
 to be taken lightly.
Considering that the continuous-HVA idea has failed, why would you
want to take the limit back in the future if we raise it now?

 
 And memory hotplug users are a minority.  Out of these, users with a
 heavily fragmented PA space due to hotplug abuse are an even smaller
 minority.
 
  ---
   include/uapi/linux/vhost.h | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
  index 2511954..92657bf 100644
  --- a/include/uapi/linux/vhost.h
  +++ b/include/uapi/linux/vhost.h
  @@ -140,7 +140,7 @@ struct vhost_memory {
   #define VHOST_MEM_MAX_NREGIONS_NONE 0
   /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
* for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
  -#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
  +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
   
   /* VHOST_NET specific defines */
   
  -- 
  1.8.3.1



Re: [PATCH 1/2] vhost: add ioctl to query nregions upper limit

2015-07-29 Thread Igor Mammedov
On Wed, 29 Jul 2015 17:43:17 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jul 29, 2015 at 04:29:22PM +0200, Igor Mammedov wrote:
  From: Michael S. Tsirkin m...@redhat.com
  
  Userspace currently simply tries to give vhost as many regions
  as it happens to have, but you only have the mem table
  when you have initialized a large part of VM, so graceful
  failure is very hard to support.
  
  The result is that userspace tends to fail catastrophically.
  
  Instead, add a new ioctl so userspace can find out how much kernel
  supports, up front. This returns a positive value that we commit to.
  
  Also, document our contract with legacy userspace: when running on an
  old kernel, you get -1 and you can assume at least 64 slots.  Since 0
  value's left unused, let's make that mean that the current userspace
  behaviour (trial and error) is required, just in case we want it back.
 
 What's wrong with reading the module parameter value? It's there in
 sysfs ...
For most cases it would work, but a distro doesn't have to mount
sysfs under /sys, so it burdens the app with discovering
where sysfs is mounted and which module to look up the parameter in.

So IMHO, sysfs is the more human-oriented interface,
while an ioctl is a more stable API for apps.
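
E.g. with sysfs an app would have to do the equivalent of

    cat /sys/module/vhost/parameters/max_mem_regions

(assuming sysfs is mounted at /sys and the parameter lives in the vhost
module), while with the new ioctl it's a single call on the vhost fd the
app already has open:

    int limit = ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS);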

 
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  Signed-off-by: Igor Mammedov imamm...@redhat.com
  ---
   drivers/vhost/vhost.c  |  7 ++-
   include/uapi/linux/vhost.h | 17 -
   2 files changed, 22 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index eec2f11..76dc0cf 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -30,7 +30,7 @@
   
    #include "vhost.h"
   
  -static ushort max_mem_regions = 64;
  +static ushort max_mem_regions = VHOST_MEM_MAX_NREGIONS_DEFAULT;
   module_param(max_mem_regions, ushort, 0444);
   MODULE_PARM_DESC(max_mem_regions,
   "Maximum number of memory regions in memory map. (default: 64)");
  @@ -944,6 +944,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
  ioctl, void __user *argp)
  long r;
  int i, fd;
   
  +   if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
  +   r = max_mem_regions;
  +   goto done;
  +   }
  +
  /* If you are not the owner, you can become one */
  if (ioctl == VHOST_SET_OWNER) {
  r = vhost_dev_set_owner(d);
  diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
  index ab373191..2511954 100644
  --- a/include/uapi/linux/vhost.h
  +++ b/include/uapi/linux/vhost.h
  @@ -80,7 +80,7 @@ struct vhost_memory {
* Allows subsequent call to VHOST_OWNER_SET to succeed. */
   #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
   
  -/* Set up/modify memory layout */
  +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. 
  */
   #define VHOST_SET_MEM_TABLE_IOW(VHOST_VIRTIO, 0x03, struct 
  vhost_memory)
   
   /* Write logging setup. */
  @@ -127,6 +127,21 @@ struct vhost_memory {
   /* Set eventfd to signal an error */
   #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct 
  vhost_vring_file)
   
  +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
  + * Returns:
   + * 0 < value <= MAX_INT - gives the upper limit, higher values will fail
  + * 0 - there's no static limit: try and see if it works
  + * -1 - on failure
  + */
  +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
  +
  +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
  + * try and it'll work if you are lucky. */
  +#define VHOST_MEM_MAX_NREGIONS_NONE 0
  +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
  + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
  +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
  +
   /* VHOST_NET specific defines */
   
   /* Attach virtio net ring to a raw socket, or tap device.
  -- 
  1.8.3.1



[PATCH 0/2] vhost: add ioctl to query nregions limit and rise default limit

2015-07-29 Thread Igor Mammedov

Igor Mammedov (1):
  vhost: increase default limit of nregions from 64 to 509

Michael S. Tsirkin (1):
  vhost: add ioctl to query nregions upper limit

 drivers/vhost/vhost.c  |  7 ++-
 include/uapi/linux/vhost.h | 17 -
 2 files changed, 22 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[PATCH 1/2] vhost: add ioctl to query nregions upper limit

2015-07-29 Thread Igor Mammedov
From: Michael S. Tsirkin m...@redhat.com

Userspace currently simply tries to give vhost as many regions
as it happens to have, but you only have the mem table
when you have initialized a large part of VM, so graceful
failure is very hard to support.

The result is that userspace tends to fail catastrophically.

Instead, add a new ioctl so userspace can find out how much kernel
supports, up front. This returns a positive value that we commit to.

Also, document our contract with legacy userspace: when running on an
old kernel, you get -1 and you can assume at least 64 slots.  Since 0
value's left unused, let's make that mean that the current userspace
behaviour (trial and error) is required, just in case we want it back.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c  |  7 ++-
 include/uapi/linux/vhost.h | 17 -
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..76dc0cf 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 
 #include "vhost.h"
 
-static ushort max_mem_regions = 64;
+static ushort max_mem_regions = VHOST_MEM_MAX_NREGIONS_DEFAULT;
 module_param(max_mem_regions, ushort, 0444);
 MODULE_PARM_DESC(max_mem_regions,
 "Maximum number of memory regions in memory map. (default: 64)");
@@ -944,6 +944,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *argp)
long r;
int i, fd;
 
+   if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
+   r = max_mem_regions;
+   goto done;
+   }
+
/* If you are not the owner, you can become one */
if (ioctl == VHOST_SET_OWNER) {
r = vhost_dev_set_owner(d);
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index ab373191..2511954 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -80,7 +80,7 @@ struct vhost_memory {
  * Allows subsequent call to VHOST_OWNER_SET to succeed. */
 #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
 
-/* Set up/modify memory layout */
+/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. */
 #define VHOST_SET_MEM_TABLE_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
 
 /* Write logging setup. */
@@ -127,6 +127,21 @@ struct vhost_memory {
 /* Set eventfd to signal an error */
 #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
 
+/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
+ * Returns:
+ * 0 < value <= MAX_INT - gives the upper limit, higher values will fail
+ * 0 - there's no static limit: try and see if it works
+ * -1 - on failure
+ */
+#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
+
+/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
+ * try and it'll work if you are lucky. */
+#define VHOST_MEM_MAX_NREGIONS_NONE 0
+/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
+ * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+
 /* VHOST_NET specific defines */
 
 /* Attach virtio net ring to a raw socket, or tap device.
-- 
1.8.3.1



[PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-29 Thread Igor Mammedov
although now there is vhost module max_mem_regions option
to set custom limit it doesn't help for default setups,
since it requires administrator manually set a higher
limit on each host. Which complicates servers deployments
and management.
Rise limit to the same value as KVM has (509 slots max),
so that default deployments would work out of box.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
Users that would want to lock down vhost could still
use max_mem_regions option to set lower limit, but
I expect it would be minority.
---
 include/uapi/linux/vhost.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 2511954..92657bf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -140,7 +140,7 @@ struct vhost_memory {
 #define VHOST_MEM_MAX_NREGIONS_NONE 0
 /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
  * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
-#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
 
 /* VHOST_NET specific defines */
 
-- 
1.8.3.1



Re: [PATCH v4 2/2] vhost: add max_mem_regions module parameter

2015-07-16 Thread Igor Mammedov
On Thu,  2 Jul 2015 15:08:11 +0200
Igor Mammedov imamm...@redhat.com wrote:

 it became possible to use a bigger amount of memory
 slots, which is used by memory hotplug for
 registering hotplugged memory.
 However QEMU crashes if it's used with more than ~60
 pc-dimm devices and vhost-net enabled since host kernel
 in module vhost-net refuses to accept more than 64
 memory regions.
 
 Allow to tweak limit via max_mem_regions module paramemter
 with default value set to 64 slots.
Michael,

What was the reason not to raise the default?
However much I think about it, I can't come up with one.

Making it a module option doesn't make much sense,
since old userspace will crash on new kernels anyway
until admins learn that there is a module option
to raise the limit.

Raising the default limit on par with KVM's would help
to avoid those crashes and would allow dropping the
unnecessary option, reducing user confusion.

 
 Signed-off-by: Igor Mammedov imamm...@redhat.com
 ---
  drivers/vhost/vhost.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 6488011..9a68e2e 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -29,8 +29,12 @@
  
  #include "vhost.h"
  
 +static ushort max_mem_regions = 64;
 +module_param(max_mem_regions, ushort, 0444);
 +MODULE_PARM_DESC(max_mem_regions,
  + "Maximum number of memory regions in memory map. (default: 64)");
 +
  enum {
 - VHOST_MEMORY_MAX_NREGIONS = 64,
   VHOST_MEMORY_F_LOG = 0x1,
  };
  
 @@ -696,7 +700,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
 vhost_memory __user *m)
   return -EFAULT;
   if (mem.padding)
   return -EOPNOTSUPP;
  - if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
  + if (mem.nregions > max_mem_regions)
    return -E2BIG;
    newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
   if (!newmem)



[PATCH] fixup! vhost: extend memory regions allocation to vmalloc

2015-07-15 Thread Igor Mammedov
Callers of vhost_kvzalloc() expect the same behaviour on
allocation error as from kmalloc/vmalloc, i.e. a NULL return
value. So just return the value returned by vzalloc() instead of
returning ERR_PTR(-ENOMEM).

issue introduced by
  4de7255f7d2be5e51664c6ac6011ffd6e5463571 in vhost-next tree

Spotted-by: Dan Carpenter dan.carpen...@oracle.com
Suggested-by: Julia Lawall julia.law...@lip6.fr
Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a9fe859..3702487 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -683,11 +683,8 @@ static void *vhost_kvzalloc(unsigned long size)
 {
void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
 
-   if (!n) {
+   if (!n)
n = vzalloc(size);
-   if (!n)
-   return ERR_PTR(-ENOMEM);
-   }
return n;
 }
 
-- 
1.8.3.1



Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Igor Mammedov
On Thu, 09 Jul 2015 09:02:38 +0200
Laszlo Ersek ler...@redhat.com wrote:

 On 07/09/15 00:42, Bandan Das wrote:
  
  If a Linux guest is assigned more memory than is supported
  by the host processor, the guest is unable to boot. That
  is expected, however, there's no message indicating the user
  what went wrong. This change prints a message to stderr if
  KVM has the corresponding capability.
  
  Reported-by: Laszlo Ersek ler...@redhat.com
  Signed-off-by: Bandan Das b...@redhat.com
  ---
   linux-headers/linux/kvm.h | 1 +
   target-i386/kvm.c | 6 ++
   2 files changed, 7 insertions(+)
  
  diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
  index 3bac873..6afad49 100644
  --- a/linux-headers/linux/kvm.h
  +++ b/linux-headers/linux/kvm.h
  @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
   #define KVM_CAP_DISABLE_QUIRKS 116
   #define KVM_CAP_X86_SMM 117
   #define KVM_CAP_MULTI_ADDRESS_SPACE 118
  +#define KVM_CAP_PHY_ADDR_WIDTH 119
   
   #ifdef KVM_CAP_IRQ_ROUTING
   
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index 066d03d..66e3448 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
   uint64_t shadow_mem;
   int ret;
   struct utsname utsname;
  +int max_phys_bits;
   
   ret = kvm_get_supported_msrs(s);
    if (ret < 0) {
  @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
   }
   }
   
  +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
   +if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
   +fprintf(stderr, "Warning: The amount of memory assigned to the guest "
   +"is more than that supported by the host CPU(s). Guest may be unstable.\n");
  +
   if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
   smram_machine_done.notify = register_smram_listener;
   qemu_add_machine_init_done_notifier(smram_machine_done);
  
 
 First, see my comments on the KVM patch.
 
 Second, ram_size is not the right thing to compare. What should be
 checked is whether the highest guest-physical address that maps to RAM
 can be represented in the address width of the host processor (and only
 if EPT is enabled, but that sub-condition belongs to the KVM patch).
 
 Note that this is not the same as the check written in the patch. For
 example, if you assume a 32-bit PCI hole with size 1 GB, then a total
 guest RAM of size 63 GB will result in the highest guest-phys memory
  address being 0xF_FFFF_FFFF, which just fits into 36 bits.
 
 Correspondingly, the above code would not print the warning for
 
   -m $((63 * 1024 + 1))
 
 on my laptop (which has address sizes   : 36 bits physical, ...), even
 though such a guest would not boot for me (with EPT enabled).
 
 Please see
 
 http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447
 
 So, ram_size in the controlling expression should be replaced with
  maximum_guest_ram_address (which should be inclusive, and the <= relop
 should be preserved).
Also, with memory hotplug turned on, we should check that the end of the
hotplug memory area is below the limit, i.e.:

  pcms->hotplug_memory.base + hotplug_mem_size < 1ULL << max_phys_bits

 
 Thanks
 Laszlo
 



Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Igor Mammedov
On Wed, 08 Jul 2015 18:42:01 -0400
Bandan Das b...@redhat.com wrote:

 
 If a Linux guest is assigned more memory than is supported
 by the host processor, the guest is unable to boot. That
 is expected, however, there's no message indicating the user
 what went wrong. This change prints a message to stderr if
 KVM has the corresponding capability.
 
 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  linux-headers/linux/kvm.h | 1 +
  target-i386/kvm.c | 6 ++
  2 files changed, 7 insertions(+)
 
 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index 3bac873..6afad49 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 066d03d..66e3448 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  uint64_t shadow_mem;
  int ret;
  struct utsname utsname;
 +int max_phys_bits;
  
  ret = kvm_get_supported_msrs(s);
   if (ret < 0) {
 @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  }
  }
  
 +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
max_phys_bits seems generic enough and could be applied to other targets
as well.

making it a property of the machine would make accessing/manipulating it easier.
define a default value for the machine/TCG mode, and when KVM is enabled
it would override/set its own limit.

then any board could easily access machine->max_gpa to make board-specific
checks.

 +if (max_phys_bits && (1ULL << max_phys_bits) <= ram_size)
 +    fprintf(stderr, "Warning: The amount of memory assigned to the guest "
 +            "is more than that supported by the host CPU(s). Guest may be "
 +            "unstable.\n");
 +
  if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
  smram_machine_done.notify = register_smram_listener;
  qemu_add_machine_init_done_notifier(smram_machine_done);
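
A rough sketch of the machine-property idea suggested above; MachineState::max_gpa,
DEFAULT_PHYS_BITS and pc_max_used_gpa() are hypothetical names used only to
illustrate the flow, not existing QEMU API:

  /* generic default, set by machine/TCG code */
  machine->max_gpa = 1ULL << DEFAULT_PHYS_BITS;

  /* KVM overrides it with the host limit when the capability is present */
  if (kvm_enabled()) {
      int bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
      if (bits > 0) {
          machine->max_gpa = 1ULL << bits;
      }
  }

  /* any board can then perform a board-specific check */
  if (pc_max_used_gpa(pcms) >= machine->max_gpa) {
      fprintf(stderr, "Warning: guest address space exceeds the host "
              "physical address width. Guest may be unstable.\n");
  }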



Re: [PATCH v4 0/2] vhost: support more than 64 memory regions

2015-07-08 Thread Igor Mammedov
On Thu,  2 Jul 2015 15:08:09 +0200
Igor Mammedov imamm...@redhat.com wrote:

 changes since v3:
   * rebased on top of vhost-next branch
 changes since v2:
   * drop cache patches for now as suggested
   * add max_mem_regions module parameter instead of unconditionally
 increasing limit
   * drop bsearch patch since it's already queued
 
 References to previous versions:
 v2: https://lkml.org/lkml/2015/6/17/276
 v1: http://www.spinics.net/lists/kvm/msg117654.html
 
 Series allows to tweak vhost's memory regions count limit.
 
 It fixes VM crashing on memory hotplug due to vhost refusing
 accepting more than 64 memory regions with max_mem_regions
 set to more than 262 slots in default QEMU configuration.
 
 Igor Mammedov (2):
   vhost: extend memory regions allocation to vmalloc
   vhost: add max_mem_regions module parameter
 
  drivers/vhost/vhost.c | 28 ++--
  1 file changed, 22 insertions(+), 6 deletions(-)
 

ping


Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot

2015-07-07 Thread Igor Mammedov
On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang zhang.zhanghaili...@huawei.com wrote:

 On 2015/7/7 19:23, Igor Mammedov wrote:
  On Mon, 6 Jul 2015 17:59:10 +0800
  zhanghailiang zhang.zhanghaili...@huawei.com wrote:
 
  On 2015/7/6 16:45, Paolo Bonzini wrote:
 
 
  On 06/07/2015 09:54, zhanghailiang wrote:
 
From host, we found that QEMU vcpu1 thread and vcpu7 thread were not
  consuming any cpu (Should be in idle state),
  All of the VCPUs' stacks on the host look like below:
 
  [a07089b5] kvm_vcpu_block+0x65/0xa0 [kvm]
  [a071c7c1] __vcpu_run+0xd1/0x260 [kvm]
  [a071d508] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
  [a0709cee] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
  [8116be8b] do_vfs_ioctl+0x8b/0x3b0
  [8116c251] sys_ioctl+0xa1/0xb0
  [81468092] system_call_fastpath+0x16/0x1b
  [2ab9fe1f99a7] 0x2ab9fe1f99a7
  [] 0x
 
  We looked into the kernel codes that could lead to the above 'Stuck'
  warning,
  in current upstream there isn't any printk(...Stuck...) left since that 
  code path
  has been reworked.
  I've often seen this on over-committed host during guest CPUs up/down 
  torture test.
  Could you update guest kernel to upstream and see if issue reproduces?
 
 
 Hmm, Unfortunately, it is very hard to reproduce, and we are still trying to 
 reproduce it.
 
 For your test case, is it a kernel bug?
 Or has any related patch that could solve your test problem been merged into
 upstream?
I don't remember all prerequisite patches but you should be able to find
  http://marc.info/?l=linux-kernelm=140326703108009w=2
  x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
and then look for dependencies.


 
 Thanks,
 zhanghailiang
 
  and found that the only possibility is that the emulation of the 'cpuid' instruction in
  kvm/qemu has something wrong.
  But since we can't reproduce this problem, we are not quite sure.
  Is it possible that the cpuid emulation in kvm/qemu has some bug?
 
  Can you explain the relationship to the cpuid emulation?  What do the
  traces say about vcpus 1 and 7?
 
  OK, we searched the VM's kernel codes with the 'Stuck' message, and  it is 
  located in
  do_boot_cpu(). It's in BSP context, the call process is:
  BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() ->
  do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger APs.
  It will wait 5s for the APs to start up; if some AP does not start up normally, it
  will print 'CPU%d Stuck' or 'CPU%d: Not responding'.
 
  If it prints 'Stuck', it means the AP has received the SIPI interrupt and 
  begins to execute the code
  'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places 
  before smp_callin()(smpboot.c).
  The following is the startup process of the BSP and APs.
  BSP:
  start_kernel()
  ->smp_init()
    ->smp_boot_cpus()
      ->do_boot_cpu()
        ->start_ip = trampoline_address(); // set the address that the AP will go to execute
        ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
        ->for (timeout = 0; timeout < 5; timeout++)
              if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check if the AP started up or not

  APs:
  ENTRY(trampoline_data) (trampoline_64.S)
  ->ENTRY(secondary_startup_64) (head_64.S)
    ->start_secondary() (smpboot.c)
      ->cpu_init();
      ->smp_callin();
        ->cpumask_set_cpu(cpuid, cpu_callin_mask); <-- Note: if the AP
  comes here, the BSP will not print the error message.
 
 From the above call process, we can be sure that the AP has been stuck
  between trampoline_data and the cpumask_set_cpu() in
  smp_callin(). We looked through these code paths carefully, and only found a
  'hlt' instruction that could block the process.
  It is located in trampoline_data():
 
  ENTRY(trampoline_data)
...
 
 call    verify_cpu  # Verify the cpu supports long mode
 testl   %eax, %eax  # Check for return code
 jnz no_longmode
 
...
 
  no_longmode:
 hlt
 jmp no_longmode
 
  For the process verify_cpu(),
  we can only find the 'cpuid' sensitive instruction that could lead to a VM exit
  from non-root mode.
  This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading
  to the failure in verify_cpu.

 From the message in the VM, we know something is wrong with vcpu1 and vcpu7.
  [5.060042] CPU1: Stuck ??
  [   10.170815] CPU7: Stuck ??
  [   10.171648] Brought up 6 CPUs
 
  Besides, the follow is the cpus message got from host.
  80FF72F5-FF6D-E411-A8C8-00821800:/home/fsp/hrg # virsh 
  qemu-monitor-command instance-000
  * CPU #0: pc=0x7f64160c683d thread_id=68570
  CPU #1: pc=0x810301f1 (halted) thread_id=68573
  CPU #2: pc=0x810301e2 (halted) thread_id=68575
  CPU #3: pc=0x810301e2 (halted) thread_id=68576
  CPU #4: pc=0x810301e2 (halted) thread_id=68577

Re: [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot

2015-07-07 Thread Igor Mammedov
On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang zhang.zhanghaili...@huawei.com wrote:

 On 2015/7/6 16:45, Paolo Bonzini wrote:
 
 
  On 06/07/2015 09:54, zhanghailiang wrote:
 
   From host, we found that QEMU vcpu1 thread and vcpu7 thread were not
  consuming any cpu (Should be in idle state),
  All of the VCPUs' stacks on the host look like below:
 
  [a07089b5] kvm_vcpu_block+0x65/0xa0 [kvm]
  [a071c7c1] __vcpu_run+0xd1/0x260 [kvm]
  [a071d508] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
  [a0709cee] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
  [8116be8b] do_vfs_ioctl+0x8b/0x3b0
  [8116c251] sys_ioctl+0xa1/0xb0
  [81468092] system_call_fastpath+0x16/0x1b
  [2ab9fe1f99a7] 0x2ab9fe1f99a7
  [] 0x
 
  We looked into the kernel codes that could lead to the above 'Stuck'
  warning,
in current upstream there isn't any printk(...Stuck...) left since that code 
path
has been reworked.
I've often seen this on over-committed host during guest CPUs up/down torture 
test.
Could you update guest kernel to upstream and see if issue reproduces?

  and found that the only possibility is that the emulation of the 'cpuid' instruction in
  kvm/qemu has something wrong.
  But since we can't reproduce this problem, we are not quite sure.
  Is it possible that the cpuid emulation in kvm/qemu has some bug?
 
  Can you explain the relationship to the cpuid emulation?  What do the
  traces say about vcpus 1 and 7?
 
 OK, we searched the VM's kernel codes with the 'Stuck' message, and  it is 
 located in
 do_boot_cpu(). It's in BSP context, the call process is:
 BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu()
 -> wakeup_secondary_via_INIT() to trigger APs.
 It will wait 5s for the APs to start up; if some AP does not start up normally, it will
 print 'CPU%d Stuck' or 'CPU%d: Not responding'.
 
 If it prints 'Stuck', it means the AP has received the SIPI interrupt and 
 begins to execute the code
 'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places 
 before smp_callin()(smpboot.c).
 The following is the startup process of the BSP and APs.
 BSP:
 start_kernel()
 ->smp_init()
   ->smp_boot_cpus()
     ->do_boot_cpu()
       ->start_ip = trampoline_address(); // set the address that the AP will go to execute
       ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
       ->for (timeout = 0; timeout < 5; timeout++)
             if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check if the AP started up or not

 APs:
 ENTRY(trampoline_data) (trampoline_64.S)
 ->ENTRY(secondary_startup_64) (head_64.S)
   ->start_secondary() (smpboot.c)
     ->cpu_init();
     ->smp_callin();
       ->cpumask_set_cpu(cpuid, cpu_callin_mask); <-- Note: if the AP
 comes here, the BSP will not print the error message.
 
  From the above call process, we can be sure that the AP has been stuck between
 trampoline_data and the cpumask_set_cpu() in
 smp_callin(). We looked through these code paths carefully, and only found a
 'hlt' instruction that could block the process.
 It is located in trampoline_data():
 
 ENTRY(trampoline_data)
  ...
 
   call    verify_cpu  # Verify the cpu supports long mode
   testl   %eax, %eax  # Check for return code
   jnz no_longmode
 
  ...
 
 no_longmode:
   hlt
   jmp no_longmode
 
 For the process verify_cpu(),
 we can only find the 'cpuid' sensitive instruction that could lead to a VM exit from
 non-root mode.
 This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading to
 the failure in verify_cpu.

  From the message in the VM, we know something is wrong with vcpu1 and vcpu7.
 [5.060042] CPU1: Stuck ??
 [   10.170815] CPU7: Stuck ??
 [   10.171648] Brought up 6 CPUs
 
 Besides, the follow is the cpus message got from host.
 80FF72F5-FF6D-E411-A8C8-00821800:/home/fsp/hrg # virsh 
 qemu-monitor-command instance-000
 * CPU #0: pc=0x7f64160c683d thread_id=68570
CPU #1: pc=0x810301f1 (halted) thread_id=68573
CPU #2: pc=0x810301e2 (halted) thread_id=68575
CPU #3: pc=0x810301e2 (halted) thread_id=68576
CPU #4: pc=0x810301e2 (halted) thread_id=68577
CPU #5: pc=0x810301e2 (halted) thread_id=68578
CPU #6: pc=0x810301e2 (halted) thread_id=68583
CPU #7: pc=0x810301f1 (halted) thread_id=68584
 
 Oh, i also forgot to mention in the above message that, we have bond each 
 vCPU to different physical CPU in
 host.
 
 Thanks,
 zhanghailiang
 
 
 
 



[PATCH v4 2/2] vhost: add max_mem_regions module parameter

2015-07-02 Thread Igor Mammedov
it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net enabled since host kernel
in module vhost-net refuses to accept more than 64
memory regions.

Allow tweaking the limit via the max_mem_regions module parameter,
with the default value set to 64 slots.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6488011..9a68e2e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,8 +29,12 @@
 
 #include vhost.h
 
+static ushort max_mem_regions = 64;
+module_param(max_mem_regions, ushort, 0444);
+MODULE_PARM_DESC(max_mem_regions,
+   "Maximum number of memory regions in memory map. (default: 64)");
+
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
@@ -696,7 +700,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EFAULT;
if (mem.padding)
return -EOPNOTSUPP;
-   if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+   if (mem.nregions > max_mem_regions)
 return -E2BIG;
 newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
-- 
1.8.3.1
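
The parameter is read-only at runtime (perm 0444), so it has to be set at module
load time, e.g. something like modprobe vhost max_mem_regions=509. A management
tool could read the effective limit back through the usual sysfs exposure of
module parameters; the sketch below assumes that path and is only illustrative:

  #include <stdio.h>

  /* Illustrative only: read the current vhost region limit via sysfs,
   * falling back to the historical 64 if the parameter is not exposed. */
  static int vhost_read_max_mem_regions(void)
  {
      FILE *f = fopen("/sys/module/vhost/parameters/max_mem_regions", "r");
      int val = 64;

      if (f) {
          if (fscanf(f, "%d", &val) != 1)
              val = 64;
          fclose(f);
      }
      return val;
  }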



[PATCH v4 1/2] vhost: extend memory regions allocation to vmalloc

2015-07-02 Thread Igor Mammedov
with large number of memory regions we could end up with
high order allocations and kmalloc could fail if
host is under memory pressure.
Considering that memory regions array is used on hot path
try harder to allocate using kmalloc and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at QEMU side solution to reduce amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large amount of memory
regions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 71bb468..6488011 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -544,7 +544,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 fput(dev->log_file);
 dev->log_file = NULL;
 /* No one will access memory at this point */
-   kfree(dev->memory);
+   kvfree(dev->memory);
 dev->memory = NULL;
 WARN_ON(!list_empty(&dev->work_list));
 if (dev->worker) {
@@ -674,6 +674,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const 
void *p2)
return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user 
*m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -686,7 +698,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EOPNOTSUPP;
 if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
 return -E2BIG;
-   newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
@@ -700,7 +712,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
 oldmem = d->memory;
@@ -712,7 +724,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
 d->vqs[i]->memory = newmem;
 mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH v4 0/2] vhost: support more than 64 memory regions

2015-07-02 Thread Igor Mammedov
changes since v3:
  * rebased on top of vhost-next branch
changes since v2:
  * drop cache patches for now as suggested
  * add max_mem_regions module parameter instead of unconditionally
increasing limit
  * drop bsearch patch since it's already queued

References to previous versions:
v2: https://lkml.org/lkml/2015/6/17/276
v1: http://www.spinics.net/lists/kvm/msg117654.html

The series allows tweaking vhost's memory regions count limit.

It fixes the VM crashing on memory hotplug due to vhost refusing
to accept more than 64 memory regions, with max_mem_regions
set to more than 262 slots in the default QEMU configuration.

Igor Mammedov (2):
  vhost: extend memory regions allocation to vmalloc
  vhost: add max_mem_regions module parameter

 drivers/vhost/vhost.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

-- 
1.8.3.1



[PATCH v3 1/2] vhost: extend memory regions allocation to vmalloc

2015-07-01 Thread Igor Mammedov
with large number of memory regions we could end up with
high order allocations and kmalloc could fail if
host is under memory pressure.
Considering that memory regions array is used on hot path
try harder to allocate using kmalloc and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at QEMU side solution to reduce amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large amount of memory
regions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f1e07b8..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 fput(dev->log_file);
 dev->log_file = NULL;
 /* No one will access memory at this point */
-   kfree(dev->memory);
+   kvfree(dev->memory);
 dev->memory = NULL;
 WARN_ON(!list_empty(&dev->work_list));
 if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const 
void *p2)
return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user 
*m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,21 +625,21 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EOPNOTSUPP;
 if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
 return -E2BIG;
-   newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
memcpy(newmem, mem, size);
 if (copy_from_user(newmem->regions, m->regions,
    mem.nregions * sizeof *m->regions)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
 sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
 oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
 d->vqs[i]->memory = newmem;
 mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH v3 2/2] vhost: add max_mem_regions module parameter

2015-07-01 Thread Igor Mammedov
it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net enabled since host kernel
in module vhost-net refuses to accept more than 64
memory regions.

Allow tweaking the limit via the max_mem_regions module parameter,
with the default value set to 64 slots.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..5905cd7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,8 +29,12 @@
 
 #include vhost.h
 
+static ushort max_mem_regions = 64;
+module_param(max_mem_regions, ushort, 0444);
+MODULE_PARM_DESC(max_mem_regions,
+   "Maximum number of memory regions in memory map. (default: 64)");
+
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
@@ -623,7 +627,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EFAULT;
if (mem.padding)
return -EOPNOTSUPP;
-   if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+   if (mem.nregions > max_mem_regions)
 return -E2BIG;
 newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
-- 
1.8.3.1



[PATCH v3 0/2] vhost: support more than 64 memory regions

2015-07-01 Thread Igor Mammedov
changes since v2:
  * drop cache patches for now as suggested
  * add max_mem_regions module parameter instead of unconditionally
increasing limit
  * drop bsearch patch since it's already queued

References to previous versions:
v2: https://lkml.org/lkml/2015/6/17/276
v1: http://www.spinics.net/lists/kvm/msg117654.html

The series allows tweaking vhost's memory regions count limit.

It fixes the VM crashing on memory hotplug due to vhost refusing
to accept more than 64 memory regions, with max_mem_regions
set to more than 262 slots in the default QEMU configuration.

Igor Mammedov (2):
  vhost: extend memory regions allocation to vmalloc
  vhost: add max_mem_regions module parameter

 drivers/vhost/vhost.c | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

-- 
1.8.3.1



Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-25 Thread Igor Mammedov
On Wed, 24 Jun 2015 17:08:56 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 24, 2015 at 04:52:29PM +0200, Igor Mammedov wrote:
  On Wed, 24 Jun 2015 16:17:46 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 24, 2015 at 04:07:27PM +0200, Igor Mammedov wrote:
On Wed, 24 Jun 2015 15:49:27 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 Userspace currently simply tries to give vhost as many regions
 as it happens to have, but you only have the mem table
 when you have initialized a large part of VM, so graceful
 failure is very hard to support.
 
 The result is that userspace tends to fail catastrophically.
 
 Instead, add a new ioctl so userspace can find out how much
 kernel supports, up front. This returns a positive value that
 we commit to.
 
 Also, document our contract with legacy userspace: when
 running on an old kernel, you get -1 and you can assume at
 least 64 slots.  Since 0 value's left unused, let's make that
 mean that the current userspace behaviour (trial and error)
 is required, just in case we want it back.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Cc: Igor Mammedov imamm...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 ---
  include/uapi/linux/vhost.h | 17 -
  drivers/vhost/vhost.c  |  5 +
  2 files changed, 21 insertions(+), 1 deletion(-)
 
 diff --git a/include/uapi/linux/vhost.h
 b/include/uapi/linux/vhost.h index ab373191..f71fa6d 100644
 --- a/include/uapi/linux/vhost.h
 +++ b/include/uapi/linux/vhost.h
 @@ -80,7 +80,7 @@ struct vhost_memory {
   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
  
 -/* Set up/modify memory layout */
 +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. */
  #define VHOST_SET_MEM_TABLE   _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
 
  /* Write logging setup. */
 @@ -127,6 +127,21 @@ struct vhost_memory {
  /* Set eventfd to signal an error */
  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
 +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
 + * Returns:
 + *   0 < value <= MAX_INT - gives the upper limit, higher values will fail
 + *   0 - there's no static limit: try and see if it works
 + *   -1 - on failure
 + */
 +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
 +
 +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
 + * try and it'll work if you are lucky. */
 +#define VHOST_MEM_MAX_NREGIONS_NONE 0
is it needed? we always have a limit,
or don't have the IOCTL => -1 => old try-and-see way

 +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
 + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
 +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
^^^ not used below,
if it's for legacy then perhaps s/DEFAULT/LEGACY/ 
   
   The assumption was that userspace detecting old kernels will just
   use 64, this means we do want a flag to get the old way.
   
   OTOH if you won't think it's useful, let me know.
  this header will be synced into QEMU's tree so that we could use
  this define there, isn't it? IMHO then _LEGACY is more exact
  description of macro.
  
  As for 0 return value, -1 is just fine for detecting old kernels
  (i.e. try and see if it works), so 0 looks unnecessary but it
  doesn't in any way hurt either. For me limit or -1 is enough to try
  fix userspace.
 
 OK.
 Do you want to try now before I do v2?

I've just tried, idea to check limit is unusable in this case.
here is a link to a patch that implements it:
https://github.com/imammedo/qemu/commits/vhost_slot_limit_check

slots count is changing dynamically depending on used devices
and more importantly guest OS could change slots count during
its runtime when during managing devices it could trigger
repartitioning of current memory table as device's memory regions
mapped into address space.

That leads to 2 different values of used slots at guest startup
time and after guest booted or after hotplug.

In my case the guest could be started with max 58 DIMMs coldplugged,
but after boot 3 more slots are freed and it's possible to hotadd
3 more DIMMs. That however leads to the guest that can't be migrated
to since by QEMU design all hotplugged devices should be present
at target's startup time i.e. 60 DIMMs total and that obviously
goes above vhost limit at that time.
Other issue with it is that QEMU could report only current
limit to mgmt tools, so they can't know for sure how many slots
exactly they can allow user to set when creating VM and will have to
guess or create a VM with unusable/under provisioned slots.

We have a similar limit check

Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-24 Thread Igor Mammedov
On Wed, 24 Jun 2015 15:49:27 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 Userspace currently simply tries to give vhost as many regions
 as it happens to have, but you only have the mem table
 when you have initialized a large part of VM, so graceful
 failure is very hard to support.
 
 The result is that userspace tends to fail catastrophically.
 
 Instead, add a new ioctl so userspace can find out how much kernel
 supports, up front. This returns a positive value that we commit to.
 
 Also, document our contract with legacy userspace: when running on an
 old kernel, you get -1 and you can assume at least 64 slots.  Since 0
 value's left unused, let's make that mean that the current userspace
 behaviour (trial and error) is required, just in case we want it back.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Cc: Igor Mammedov imamm...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 ---
  include/uapi/linux/vhost.h | 17 -
  drivers/vhost/vhost.c  |  5 +
  2 files changed, 21 insertions(+), 1 deletion(-)
 
 diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
 index ab373191..f71fa6d 100644
 --- a/include/uapi/linux/vhost.h
 +++ b/include/uapi/linux/vhost.h
 @@ -80,7 +80,7 @@ struct vhost_memory {
   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
  
 -/* Set up/modify memory layout */
 +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. */
  #define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
  
  /* Write logging setup. */
 @@ -127,6 +127,21 @@ struct vhost_memory {
  /* Set eventfd to signal an error */
  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
  
 +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
 + * Returns:
 + *   0 < value <= MAX_INT - gives the upper limit, higher values will fail
 + *   0 - there's no static limit: try and see if it works
 + *   -1 - on failure
 + */
 +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
 +
 +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
 + * try and it'll work if you are lucky. */
 +#define VHOST_MEM_MAX_NREGIONS_NONE 0
is it needed? we always have a limit,
or don't have the IOCTL => -1 => old try-and-see way

 +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
 + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
 +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
^^^ not used below,
if it's for legacy then perhaps s/DEFAULT/LEGACY/ 

 +
  /* VHOST_NET specific defines */
  
  /* Attach virtio net ring to a raw socket, or tap device.
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 9e8e004..3b68f9d 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -917,6 +917,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
 ioctl, void __user *argp)
   long r;
   int i, fd;
  
 + if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
 + r = VHOST_MEMORY_MAX_NREGIONS;
 + goto done;
 + }
 +
   /* If you are not the owner, you can become one */
   if (ioctl == VHOST_SET_OWNER) {
   r = vhost_dev_set_owner(d);
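
For illustration, the userspace side of the contract proposed above might look
like the sketch below; it assumes the header additions from this RFC (which may
never have been merged) and simplifies fd setup and error handling:

  #include <sys/ioctl.h>
  #include <limits.h>
  #include <linux/vhost.h>  /* would carry VHOST_GET_MEM_MAX_NREGIONS if merged */

  /* Illustrative only: negotiate the memory region limit with the kernel. */
  static int vhost_query_mem_regions_limit(int vhost_fd)
  {
      int limit = ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS);

      if (limit < 0)       /* old kernel without the ioctl: legacy contract */
          return 64;
      if (limit == 0)      /* VHOST_MEM_MAX_NREGIONS_NONE: no static limit */
          return INT_MAX;  /* caller falls back to trial and error */
      return limit;
  }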



Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-24 Thread Igor Mammedov
On Wed, 24 Jun 2015 16:17:46 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 24, 2015 at 04:07:27PM +0200, Igor Mammedov wrote:
  On Wed, 24 Jun 2015 15:49:27 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   Userspace currently simply tries to give vhost as many regions
   as it happens to have, but you only have the mem table
   when you have initialized a large part of VM, so graceful
   failure is very hard to support.
   
   The result is that userspace tends to fail catastrophically.
   
   Instead, add a new ioctl so userspace can find out how much kernel
   supports, up front. This returns a positive value that we commit to.
   
   Also, document our contract with legacy userspace: when running on an
   old kernel, you get -1 and you can assume at least 64 slots.  Since 0
   value's left unused, let's make that mean that the current userspace
   behaviour (trial and error) is required, just in case we want it back.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   Cc: Igor Mammedov imamm...@redhat.com
   Cc: Paolo Bonzini pbonz...@redhat.com
   ---
include/uapi/linux/vhost.h | 17 -
drivers/vhost/vhost.c  |  5 +
2 files changed, 21 insertions(+), 1 deletion(-)
   
   diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
   index ab373191..f71fa6d 100644
   --- a/include/uapi/linux/vhost.h
   +++ b/include/uapi/linux/vhost.h
   @@ -80,7 +80,7 @@ struct vhost_memory {
 * Allows subsequent call to VHOST_OWNER_SET to succeed. */
#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)

   -/* Set up/modify memory layout */
   +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS 
   below. */
#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct 
   vhost_memory)

/* Write logging setup. */
   @@ -127,6 +127,21 @@ struct vhost_memory {
/* Set eventfd to signal an error */
#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct 
   vhost_vring_file)

   +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
   + * Returns:
   + *   0 < value <= MAX_INT - gives the upper limit, higher values 
   will fail
   + *   0 - there's no static limit: try and see if it works
   + *   -1 - on failure
   + */
   +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
   +
   +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static 
   limit:
   + * try and it'll work if you are lucky. */
   +#define VHOST_MEM_MAX_NREGIONS_NONE 0
  is it needed? we always have a limit,
  or don't have the IOCTL => -1 => old try-and-see way
  
   +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
   + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. 
   */
   +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
  ^^^ not used below,
  if it's for legacy then perhaps s/DEFAULT/LEGACY/ 
 
 The assumption was that userspace detecting old kernels will just use 64,
 this means we do want a flag to get the old way.
 
 OTOH if you won't think it's useful, let me know.
this header will be synced into QEMU's tree so that we could use this define 
there,
isn't it? IMHO then _LEGACY is more exact description of macro.

As for 0 return value, -1 is just fine for detecting old kernels (i.e. try and 
see if it works), so 0 looks unnecessary but it doesn't in any way hurt either.
For me limit or -1 is enough to try fix userspace.

 
   +
/* VHOST_NET specific defines */

/* Attach virtio net ring to a raw socket, or tap device.
   diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
   index 9e8e004..3b68f9d 100644
   --- a/drivers/vhost/vhost.c
   +++ b/drivers/vhost/vhost.c
   @@ -917,6 +917,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned 
   int ioctl, void __user *argp)
 long r;
 int i, fd;

   + if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
   + r = VHOST_MEMORY_MAX_NREGIONS;
   + goto done;
   + }
   +
 /* If you are not the owner, you can become one */
 if (ioctl == VHOST_SET_OWNER) {
 r = vhost_dev_set_owner(d);



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-22 Thread Igor Mammedov
On Fri, 19 Jun 2015 18:33:39 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
  
  
  On 19/06/2015 18:20, Michael S. Tsirkin wrote:
We could, but I/O is just an example.  It can be I/O, a network ring,
whatever.  We cannot audit all address_space_map uses.

  
   No need to audit them all: defer device_add using an hva range until
   address_space_unmap drops using hvas in range drops reference count to
   0.
  
  That could be forever.  You certainly don't want to lockup the monitor
  forever just because a device model isn't too friendly to memory hot-unplug.
 
 We can defer the addition, no need to lockup the monitor.
 
  That's why you need to audit them (also, it's perfectly in the device
  model's right to use address_space_unmap this way: it's the guest that's
  buggy and leaves a dangling reference to a region before unplugging it).
  
  Paolo
 
 Then maybe it's not too bad that the guest will crash because the memory
 was unmapped.
So far HVA is unusable even if we make this assumption and let the guest crash.
virtio_net doesn't work with it anyway:
translation of GPA to HVA for descriptors works as expected (correctly),
but the vhost+HVA hack backed virtio still can't send/receive packets.

That's why I prefer to merge the kernel solution first, as a stable solution
that doesn't introduce any issues, and work on the userspace approach on
top of that.

Hopefully it could be done but we still would need time
to iron out side effects/issues it causes or could cause so that
fix became stable enough for production.



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 13:41:22 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
  Lets leave decision upto users instead of making them live with
  crashing guests.
 
 Come on, let's fix it in userspace.
I'm not abandoning userspace approach either but it might take time
to implement in robust manner as it's much more complex and has much
more places to backfire than a straightforward kernel fix which will
work for both old userspace and a new one.



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 11:50:22 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 18:30:02 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
On Wed, 17 Jun 2015 17:38:40 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 16:32:02 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:


On 17/06/2015 15:13, Michael S. Tsirkin wrote:
   Considering userspace can be malicious, I guess yes.
  I don't think it's a valid concern in this case,
  setting limit back from 509 to 64 will not help here in any
  way, userspace still can create as many vhost instances as
  it needs to consume memory it desires.
 
 Not really since vhost char device isn't world-accessible.
 It's typically opened by a priveledged tool, the fd is
 then passed to an unpriveledged userspace, or permissions
 dropped.

Then what's the concern anyway?

Paolo
   
   Each fd now ties up 16K of kernel memory.  It didn't use to, so
   priveledged tool could safely give the unpriveledged userspace
   a ton of these fds.
  if privileged tool gives out unlimited amount of fds then it
  doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
  
 
 Of course it does not give out unlimited fds, there's a way
 for the sysadmin to specify the number of fds. Look at how libvirt
 uses vhost, it should become clear I think.
then it just means that tool has to take into account a new limits
to partition host in sensible manner.
   
   Meanwhile old tools are vulnerable to OOM attacks.
  I've chatted with libvirt folks, it doesn't care about how much memory
  vhost would consume nor do any host capacity planning in that regard.
 
 Exactly, it's up to host admin.
 
  But lets assume that there are tools that do this so
  how about instead of hardcoding limit make it a module parameter
  with default set to 64. That would allow users to set higher limit
  if they need it and nor regress old tools. it will also give tools
  interface for reading limit from vhost module.
 
 And now you need to choose between security and functionality :(
There is no conflict here and it's not about choosing.
If admin has a method to estimate guest memory footprint
to do capacity partitioning then he would need to redo
partitioning, taking into account the new footprint, when
he/she raises the limit manually.

(BTW libvirt has tried and reverted patches that were trying to
predict required memory, admin might be able to do it manually
better, but it's another topic how to do it and it's not related
to this thread)

Lets leave decision upto users instead of making them live with
crashing guests.

 
   
Exposing limit as module parameter might be of help to tool for
getting/setting it in a way it needs.
   



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 16:47:33 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
  
  
  On 18/06/2015 15:19, Michael S. Tsirkin wrote:
   On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
  
  
   On 18/06/2015 13:41, Michael S. Tsirkin wrote:
   On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
   Lets leave decision upto users instead of making them live with
   crashing guests.
  
   Come on, let's fix it in userspace.
  
   It's not trivial to fix it in userspace.  Since QEMU uses RCU there
   isn't a single memory map to use for a linear gpa-hva map.
   
   Could you elaborate?
   
   I'm confused by this mention of RCU.
   You use RCU for accesses to the memory map, correct?
   So memory map itself is a write side operation, as such all you need to
   do is take some kind of lock to prevent conflicting with other memory
   maps, do rcu sync under this lock.
  
  You're right, the problem isn't directly related to RCU.  RCU would be
  easy to handle by using synchronize_rcu instead of call_rcu.  While I
  identified an RCU-related problem with Igor's patches, it's much more
  entrenched.
  
  RAM can be used by asynchronous operations while the VM runs, between
  address_space_map and address_space_unmap.  It is possible and common to
  have a quiescent state between the map and unmap, and a memory map
  change can happen in the middle of this.  Normally this is not a
  problem, because changes to the memory map do not make the hva go away
  (memory regions are reference counted).
 
 Right, so you want mmap(MAP_NORESERVE) when that reference
 count becomes 0.
 
  However, with Igor's patches a memory_region_del_subregion will cause a
  mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
  
  I guess one way to do it would be to alias the same page in two places,
  one for use by vhost and one for use by everything else.  However, the
  kernel does not provide the means to do this kind of aliasing for
  anonymous mmaps.
  
  Paolo
 
 Basically pages go away on munmap, so won't simple
   lock
   munmap
   mmap(MAP_NORESERVE)
   unlock
 do the trick?
at what time are you suggesting to do this?





Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:30:02 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 17:38:40 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
On Wed, 17 Jun 2015 16:32:02 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
  
  
  On 17/06/2015 15:13, Michael S. Tsirkin wrote:
 Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case,
setting limit back from 509 to 64 will not help here in any
way, userspace still can create as many vhost instances as
it needs to consume memory it desires.
   
   Not really since vhost char device isn't world-accessible.
   It's typically opened by a priveledged tool, the fd is
   then passed to an unpriveledged userspace, or permissions
   dropped.
  
  Then what's the concern anyway?
  
  Paolo
 
 Each fd now ties up 16K of kernel memory.  It didn't use to, so
 priveledged tool could safely give the unpriveledged userspace
 a ton of these fds.
if privileged tool gives out unlimited amount of fds then it
doesn't matter whether fd ties 4K or 16K, host still could be DoSed.

   
   Of course it does not give out unlimited fds, there's a way
   for the sysadmin to specify the number of fds. Look at how libvirt
   uses vhost, it should become clear I think.
  then it just means that tool has to take into account a new limits
  to partition host in sensible manner.
 
 Meanwhile old tools are vulnerable to OOM attacks.
I've chatted with libvirt folks, it doesn't care about how much memory
vhost would consume nor do any host capacity planning in that regard.

But lets assume that there are tools that do this so
how about instead of hardcoding limit make it a module parameter
with default set to 64. That would allow users to set higher limit
if they need it and nor regress old tools. it will also give tools
interface for reading limit from vhost module.

 
  Exposing limit as module parameter might be of help to tool for
  getting/setting it in a way it needs.
 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:30:02 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 17:38:40 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
On Wed, 17 Jun 2015 16:32:02 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
  
  
  On 17/06/2015 15:13, Michael S. Tsirkin wrote:
 Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case,
setting limit back from 509 to 64 will not help here in
any way, userspace still can create as many vhost
instances as it needs to consume memory it desires.
   
   Not really since vhost char device isn't world-accessible.
   It's typically opened by a priveledged tool, the fd is
   then passed to an unpriveledged userspace, or permissions
   dropped.
  
  Then what's the concern anyway?
  
  Paolo
 
 Each fd now ties up 16K of kernel memory.  It didn't use to,
 so priveledged tool could safely give the unpriveledged
 userspace a ton of these fds.
if privileged tool gives out unlimited amount of fds then it
doesn't matter whether fd ties 4K or 16K, host still could be
DoSed.

   
   Of course it does not give out unlimited fds, there's a way
   for the sysadmin to specify the number of fds. Look at how libvirt
   uses vhost, it should become clear I think.
  then it just means that tool has to take into account a new limits
  to partition host in sensible manner.
 
 Meanwhile old tools are vulnerable to OOM attacks.
Let's leave the old limit by default and allow overriding it via a module
parameter; that way old tools won't be affected and new tools
could set the limit the way they need.
That will accommodate current slot hungry userspace and a new one with
continuous HVA and won't regress old tools.

 
  Exposing limit as module parameter might be of help to tool for
  getting/setting it in a way it needs.
 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:47:18 +0200
Paolo Bonzini pbonz...@redhat.com wrote:

 
 
 On 17/06/2015 18:41, Michael S. Tsirkin wrote:
  On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
 
 
  On 17/06/2015 18:34, Michael S. Tsirkin wrote:
  On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
 
 
  On 17/06/2015 18:30, Michael S. Tsirkin wrote:
  Meanwhile old tools are vulnerable to OOM attacks.
 
  For each vhost device there will be likely one tap interface,
  and I suspect that it takes way, way more than 16KB of memory.
 
  That's not true. We have a vhost device per queue, all queues
  are part of a single tap device.
 
  s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
  
  That's up to you as a kvm maintainer :)
 
 Not easy, when the CPU alone requires three (albeit non-consecutive)
 pages for the VMCS, the APIC access page and the EPT root.
 
  People are already concerned about vhost device
  memory usage, I'm not happy to define our user/kernel interface
  in a way that forces even more memory to be used up.
 
 So, the questions to ask are:
 
 1) What is the memory usage like immediately after vhost is brought
 up, apart from these 16K?
 
 2) Is there anything in vhost that allocates a user-controllable
 amount of memory?
 
 3) What is the size of the data structures that support one virtqueue
 (there are two of them)?  Does it depend on the size of the
 virtqueues?
 
 4) Would it make sense to share memory regions between multiple vhost
 devices?  Would it be hard to implement?  It would also make memory
 operations O(1) rather than O(#cpus).
 
 Paolo

in addition to that could vhost share memmap with KVM i.e. use its
memslots instead of duplicating it?


Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 12:46:09 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 12:11:09 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
On Wed, 17 Jun 2015 09:39:06 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 08:34:26 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
On Tue, 16 Jun 2015 23:14:20 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
  since commit
   1d4e7e3 kvm: x86: increase user memory slots to 509
  
  it became possible to use a bigger amount of memory
  slots, which is used by memory hotplug for
  registering hotplugged memory.
  However QEMU crashes if it's used with more than ~60
  pc-dimm devices and vhost-net since host kernel
  in module vhost-net refuses to accept more than 65
  memory regions.
  
  Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
 
 It was 64, not 65.
 
  to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
 
 Still thinking about this: can you reorder this to
 be the last patch in the series please?
sure

 
 Also - 509?
userspace memory slots in terms of KVM, I made it match
KVM's allotment of memory slots for userspace side.
   
   Maybe KVM has its reasons for this #. I don't see
   why we need to match this exactly.
  np, I can cap it at safe 300 slots but it's unlikely that it
  would take cut off 1 extra hop since it's capped by QEMU
  at 256+[initial fragmented memory]
 
 But what's the point? We allocate 32 bytes per slot.
 300*32 = 9600 which is more than 8K, so we are doing
 an order-3 allocation anyway.
 If we could cap it at 8K (256 slots) that would make sense
 since we could avoid wasting vmalloc space.
256 is amount of hotpluggable slots  and there is no way
to predict how initial memory would be fragmented
(i.e. amount of slots it would take), if we guess wrong
we are back to square one with crashing userspace.
So I'd stay consistent with KVM's limit 509 since
it's only limit, i.e. not actual amount of allocated slots.

 I'm still not very happy with the whole approach,
 giving userspace ability allocate 4 whole pages
 of kernel memory like this.
I'm working in parallel so that userspace won't take so
many slots but it won't prevent its current versions
crashing due to kernel limitation.
   
   Right but at least it's not a regression. If we promise userspace to
   support a ton of regions, we can't take it back later, and I'm concerned
   about the memory usage.
   
   I think it's already safe to merge the binary lookup patches, and maybe
   cache and vmalloc, so that the remaining patch will be small.
  it isn't a regression with switching to binary search and increasing
  slots to 509 either; performance wise it's more on the improvement side.
  And I was thinking about memory usage as well, that's why I've dropped
  faster radix tree in favor of more compact array, can't do better
  on kernel side of fix.
  
  Yes we will give userspace to ability to use more slots/and lock up
  more memory if it's not able to consolidate memory regions but
  that leaves an option for user to run guest with vhost performance
  vs crashing it at runtime.
 
 Crashing is entirely QEMU's own doing in not handling
 the error gracefully.
and that's hard to fix (handle error gracefully) the way it's implemented now.

  
  userspace/targets that could consolidate memory regions should
  do so and I'm working on that as well but that doesn't mean
  that users shouldn't have a choice.
 
 It's a fairly unusual corner case, I'm not yet
 convinced we need to quickly add support to it when just waiting a bit
 longer will get us an equivalent (or even more efficient) fix in
 userspace.
memory hotplug support in QEMU has been released for quite
a long time already and there are users that use it, so a fix in
a future QEMU won't make it work with their distros.
So I wouldn't say that it is a fairly unusual corner case.

 
  So far it's kernel limitation and this patch fixes crashes
  that users see now, with the rest of patches enabling performance
  not to regress.
 
 When I say regression I refer to an option to limit the array
 size again after userspace started using the larger size.
Is there a need to do so?

Userspace that cares about memory footprint won't use many slots
keeping

Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 08:31:23 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 12:19:15AM +0200, Igor Mammedov wrote:
  On Tue, 16 Jun 2015 23:16:07 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
Series extends vhost to support upto 509 memory regions,
and adds some vhost:translate_desc() performance improvemnts
so it won't regress when memslots are increased to 509.

It fixes running VM crashing during memory hotplug due
to vhost refusing accepting more than 64 memory regions.

It's only host kernel side fix to make it work with QEMU
versions that support memory hotplug. But I'll continue
to work on QEMU side solution to reduce amount of memory
regions to make things even better.
   
   I'm concerned userspace work will be harder, in particular,
   performance gains will be harder to measure.
  it appears so, so far.
  
   How about a flag to disable caching?
  I've tried to measure the cost of a cache miss but without much luck;
  the difference between the version with the cache and with caching removed
  was within the margin of error (±10ns) (i.e. not measurable on my
  5min/10*10^6 test workload).
 
 Confused. I thought it was very much measureable.
 So why add a cache if you can't measure its effect?
I haven't been able to measure the immediate delta between function
start/end with precision better than 10ns; perhaps the method used
(systemtap) is to blame.
But it's still possible to measure it indirectly, like the 2% from 5/5.

 
  Also I'm concerned that adding an extra fetch+branch for flag
  checking will make things worse for the likely path of a cache hit,
  so I'd avoid it if possible.
  
  Or do you mean a simple global per-module flag to disable it and
  wrap the thing in a static key so that it will be a cheap jump to skip
  the cache?
 
 Something like this, yes.
ok, will do.
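
A minimal sketch of that idea, assuming a boolean module parameter and a
static key guarding the cached-region fast path (names here are just
placeholders for the sketch):

    /* sketch: load-time parameter plus static key to make the cache skippable */
    static struct static_key translation_cache_key = STATIC_KEY_INIT_TRUE;
    static bool translation_cache = true;
    module_param(translation_cache, bool, 0444);

    static int __init vhost_init(void)
    {
            /* the key is flipped once at module load, so the hot path
             * only pays for a single jump/NOP */
            if (!translation_cache)
                    static_key_slow_dec(&translation_cache_key);
            return 0;
    }

    /* in find_region(): consult the per-VQ cached slot only when enabled */
    if (static_key_true(&translation_cache_key)) {
            reg = mem->regions + *cached_reg;
            if (likely(addr >= reg->guest_phys_addr &&
                       reg->guest_phys_addr + reg->memory_size > addr))
                    return reg;
    }

With something like that the cache could be disabled at module load time by
setting translation_cache=0.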

 
Performance-wise, for a guest with (in my case) 3 memory regions
and netperf's UDP_RR workload, translate_desc() execution
time as a share of the total workload is:

Memory  |1G RAM|cached|non cached
regions #   |  3   |  53  |  53

upstream| 0.3% |  -   | 3.5%

this series | 0.2% | 0.5% | 0.7%

where the non cached column reflects a thrashing workload
with constant cache misses. More details on timing in
respective patches.

Igor Mammedov (5):
  vhost: use binary search instead of linear in find_region()
  vhost: extend memory regions allocation to vmalloc
  vhost: support upto 509 memory regions
  vhost: add per VQ memory region caching
  vhost: translate_desc: optimization for desc.len < region size

 drivers/vhost/vhost.c | 95 +--
 drivers/vhost/vhost.h |  1 +
 2 files changed, 71 insertions(+), 25 deletions(-)

-- 
1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 13:51:56 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
So far it's kernel limitation and this patch fixes crashes
that users see now, with the rest of patches enabling performance
not to regress.
   
   When I say regression I refer to an option to limit the array
   size again after userspace started using the larger size.
  Is there a need to do so?
 
 Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case;
setting the limit back from 509 to 64 will not help here in any way,
userspace can still create as many vhost instances as it needs
to consume the memory it desires.

 
  Userspace that cares about memory footprint won't use many slots,
  keeping it low, and userspace that can't do without many slots
  or doesn't care will have a bigger memory footprint.
 
 We really can't trust userspace to do the right thing though.
 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 08:34:26 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
  On Tue, 16 Jun 2015 23:14:20 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net since host kernel
in module vhost-net refuses to accept more than 65
memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
   
   It was 64, not 65.
   
to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.

Signed-off-by: Igor Mammedov imamm...@redhat.com
   
   Still thinking about this: can you reorder this to
   be the last patch in the series please?
  sure
  
   
   Also - 509?
  userspace memory slots in terms of KVM, I made it match
  KVM's allotment of memory slots for userspace side.
 
 Maybe KVM has its reasons for this #. I don't see
 why we need to match this exactly.
np, I can cap it at safe 300 slots but it's unlikely that it
would take cut off 1 extra hop since it's capped by QEMU
at 256+[initial fragmented memory]

 
   I think if we are changing this, it'd be nice to
   create a way for userspace to discover the support
   and the # of regions supported.
  That was my first idea before extending KVM's memslots
  to teach kernel to tell qemu this number so that QEMU
  at least would be able to check if new memory slot could
  be added but I was redirected to a more simple solution
  of just extending vs everdoing things.
  Currently QEMU supports upto ~250 memslots so 509
  is about twice high we need it so it should work for near
  future
 
 Yes but old kernels are still around. Would be nice if you
 can detect them.
 
  but eventually we might still teach kernel and QEMU
  to make things more robust.
 
 A new ioctl would be easy to add, I think it's a good
 idea generally.
I can try to do something like this on top of this series.
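
A purely hypothetical sketch of what such a query could look like
(VHOST_GET_MEM_SLOTS is an invented name/number for illustration, not an
existing vhost ioctl):

    /* hypothetical uapi addition */
    #define VHOST_GET_MEM_SLOTS _IOR(VHOST_VIRTIO, 0x70, __u32)

    /* hypothetical handling in vhost_dev_ioctl() */
    case VHOST_GET_MEM_SLOTS: {
            __u32 slots = VHOST_MEMORY_MAX_NREGIONS;

            if (copy_to_user(argp, &slots, sizeof(slots)))
                    r = -EFAULT;
            else
                    r = 0;
            break;
    }

Userspace could then query the limit once at setup time and refuse to plug in
a memory slot the in-kernel vhost would reject, instead of failing
VHOST_SET_MEM_TABLE at hotplug time.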

 
   
   
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 #include "vhost.h"
 
 enum {
-	VHOST_MEMORY_MAX_NREGIONS = 64,
+	VHOST_MEMORY_MAX_NREGIONS = 509,
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-- 
1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 12:11:09 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 09:39:06 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
On Wed, 17 Jun 2015 08:34:26 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
  On Tue, 16 Jun 2015 23:14:20 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net since host kernel
in module vhost-net refuses to accept more than 65
memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
   
   It was 64, not 65.
   
to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.

Signed-off-by: Igor Mammedov imamm...@redhat.com
   
   Still thinking about this: can you reorder this to
   be the last patch in the series please?
  sure
  
   
   Also - 509?
  userspace memory slots in terms of KVM, I made it match
  KVM's allotment of memory slots for userspace side.
 
 Maybe KVM has its reasons for this #. I don't see
 why we need to match this exactly.
np, I can cap it at safe 300 slots but it's unlikely that it
would take cut off 1 extra hop since it's capped by QEMU
at 256+[initial fragmented memory]
   
   But what's the point? We allocate 32 bytes per slot.
   300*32 = 9600 which is more than 8K, so we are doing
   an order-3 allocation anyway.
   If we could cap it at 8K (256 slots) that would make sense
   since we could avoid wasting vmalloc space.
  256 is amount of hotpluggable slots  and there is no way
  to predict how initial memory would be fragmented
  (i.e. amount of slots it would take), if we guess wrong
  we are back to square one with crashing userspace.
  So I'd stay consistent with KVM's limit 509 since
  it's only limit, i.e. not actual amount of allocated slots.
  
   I'm still not very happy with the whole approach,
   giving userspace ability allocate 4 whole pages
   of kernel memory like this.
  I'm working in parallel so that userspace won't take so
  many slots but it won't prevent its current versions
  crashing due to kernel limitation.
 
 Right but at least it's not a regression. If we promise userspace to
 support a ton of regions, we can't take it back later, and I'm concerned
 about the memory usage.
 
 I think it's already safe to merge the binary lookup patches, and maybe
 cache and vmalloc, so that the remaining patch will be small.
it isn't a regression with switching to binary search and increasing
slots to 509 either; performance-wise it's more on the improvement side.
And I was thinking about memory usage as well; that's why I've dropped
the faster radix tree in favor of a more compact array. I can't do better
on the kernel side of the fix.

Yes, we will give userspace the ability to use more slots (and lock up
more memory) if it's not able to consolidate memory regions, but
that leaves the user an option to run the guest with vhost performance
vs crashing it at runtime.

userspace/targets that could consolidate memory regions should
do so and I'm working on that as well but that doesn't mean
that users shouldn't have a choice.
So far it's kernel limitation and this patch fixes crashes
that users see now, with the rest of patches enabling performance
not to regress.

 
   
   I think if we are changing this, it'd be nice to
   create a way for userspace to discover the support
   and the # of regions supported.
  That was my first idea before extending KVM's memslots
  to teach kernel to tell qemu this number so that QEMU
  at least would be able to check if new memory slot could
  be added but I was redirected to a more simple solution
  of just extending vs everdoing things.
  Currently QEMU supports upto ~250 memslots so 509
  is about twice high we need it so it should work for near
  future
 
 Yes but old kernels are still around. Would be nice if you
 can detect them.
 
  but eventually we might still teach kernel and QEMU
  to make things more robust.
 
 A new ioctl would be easy to add, I think it's a good
 idea generally.
I can try to do something like this on top of this series.

 
   
   
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index

Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 09:39:06 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 08:34:26 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
On Tue, 16 Jun 2015 23:14:20 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
  since commit
   1d4e7e3 kvm: x86: increase user memory slots to 509
  
  it became possible to use a bigger amount of memory
  slots, which is used by memory hotplug for
  registering hotplugged memory.
  However QEMU crashes if it's used with more than ~60
  pc-dimm devices and vhost-net since host kernel
  in module vhost-net refuses to accept more than 65
  memory regions.
  
  Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
 
 It was 64, not 65.
 
  to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
 
 Still thinking about this: can you reorder this to
 be the last patch in the series please?
sure

 
 Also - 509?
userspace memory slots in terms of KVM, I made it match
KVM's allotment of memory slots for userspace side.
   
   Maybe KVM has its reasons for this #. I don't see
   why we need to match this exactly.
  np, I can cap it at safe 300 slots but it's unlikely that it
  would take cut off 1 extra hop since it's capped by QEMU
  at 256+[initial fragmented memory]
 
 But what's the point? We allocate 32 bytes per slot.
 300*32 = 9600 which is more than 8K, so we are doing
 an order-3 allocation anyway.
 If we could cap it at 8K (256 slots) that would make sense
 since we could avoid wasting vmalloc space.
256 is the number of hotpluggable slots and there is no way
to predict how fragmented the initial memory will be
(i.e. how many slots it will take); if we guess wrong
we are back to square one with crashing userspace.
So I'd stay consistent with KVM's limit of 509, since
it's only a limit, i.e. not the actual number of allocated slots.

 I'm still not very happy with the whole approach,
 giving userspace the ability to allocate 4 whole pages
 of kernel memory like this.
I'm working in parallel so that userspace won't take so
many slots, but that won't prevent its current versions from
crashing due to the kernel limitation.

 
 I think if we are changing this, it'd be nice to
 create a way for userspace to discover the support
 and the # of regions supported.
That was my first idea before extending KVM's memslots
to teach kernel to tell qemu this number so that QEMU
at least would be able to check if new memory slot could
be added but I was redirected to a more simple solution
of just extending vs everdoing things.
Currently QEMU supports upto ~250 memslots so 509
is about twice high we need it so it should work for near
future
   
   Yes but old kernels are still around. Would be nice if you
   can detect them.
   
but eventually we might still teach kernel and QEMU
to make things more robust.
   
   A new ioctl would be easy to add, I think it's a good
   idea generally.
  I can try to do something like this on top of this series.
  
   
 
 
  ---
   drivers/vhost/vhost.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 99931a0..6a18c92 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -30,7 +30,7 @@
   #include "vhost.h"
   
   enum {
  -	VHOST_MEMORY_MAX_NREGIONS = 64,
  +	VHOST_MEMORY_MAX_NREGIONS = 509,
   	VHOST_MEMORY_F_LOG = 0x1,
   };
   
  -- 
  1.8.3.1



[PATCH v2 0/6] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
Ref to previous version discussion:
[PATCH 0/5] vhost: support upto 509 memory regions
http://www.spinics.net/lists/kvm/msg117654.html

Changelog v1->v2:
  * fix spelling errors
  * move "vhost: support upto 509 memory regions" to the end of the queue
  * move kvfree() from 1/6 to 2/6 where it belongs
  * add vhost module parameter to enable/disable translation caching

Series extends vhost to support up to 509 memory regions,
and adds some vhost:translate_desc() performance improvements
so it won't regress when memslots are increased to 509.

It fixes a running VM crashing during memory hotplug due
to vhost refusing to accept more than 64 memory regions.

It's only a host kernel side fix to make it work with QEMU
versions that support memory hotplug. But I'll continue
to work on a QEMU side solution to reduce the amount of memory
regions to make things even better.

Performance-wise, for a guest with (in my case) 3 memory regions
and netperf's UDP_RR workload, translate_desc() execution
time as a share of the total workload is:

Memory  |1G RAM|cached|non cached
regions #   |  3   |  53  |  53

upstream| 0.3% |  -   | 3.5%

this series | 0.2% | 0.5% | 0.7%

where the non cached column reflects a thrashing workload
with constant cache misses. More details on timing in
respective patches.

Igor Mammedov (6):
  vhost: use binary search instead of linear in find_region()
  vhost: extend memory regions allocation to vmalloc
  vhost: add per VQ memory region caching
  vhost: translate_desc: optimization for desc.len < region size
  vhost: add 'translation_cache' module parameter
  vhost: support upto 509 memory regions

 drivers/vhost/vhost.c | 105 ++
 drivers/vhost/vhost.h |   1 +
 2 files changed, 82 insertions(+), 24 deletions(-)

-- 
1.8.3.1



[PATCH v2 2/6] vhost: extend memory regions allocation to vmalloc

2015-06-17 Thread Igor Mammedov
with a large number of memory regions we could end up with
high order allocations and kmalloc could fail if
the host is under memory pressure.
Considering that the memory regions array is used on the hot path,
try harder to allocate using kmalloc and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at a QEMU side solution to reduce the amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large amount of memory
regions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f1e07b8..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kfree(dev->memory);
+	kvfree(dev->memory);
 	dev->memory = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
 	return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,21 +625,21 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EOPNOTSUPP;
 	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
 		return -E2BIG;
-	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+	newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
 	memcpy(newmem, &mem, size);
 	if (copy_from_user(newmem->regions, m->regions,
 			   mem.nregions * sizeof *m->regions)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
 	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
 		vhost_memory_reg_sort_cmp, NULL);
 
 	if (!memory_access_ok(d, newmem, 0)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
 	oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		d->vqs[i]->memory = newmem;
 		mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH v2 3/6] vhost: add per VQ memory region caching

2015-06-17 Thread Igor Mammedov
that brings down translate_desc() cost to around 210ns
if accessed descriptors are from the same memory region.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
that's what netperf/iperf workloads were during testing.
---
 drivers/vhost/vhost.c | 16 +---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..5c39a1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -200,6 +200,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	vq->cached_reg = 0;
 }
 
 static int vhost_worker(void *data)
@@ -649,6 +650,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
 		d->vqs[i]->memory = newmem;
+		d->vqs[i]->cached_reg = 0;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
 	kvfree(oldmem);
@@ -936,11 +938,17 @@ done:
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
+						     __u64 addr, __u32 len,
+						     int *cached_reg)
 {
 	const struct vhost_memory_region *reg;
 	int start = 0, end = mem->nregions;
 
+	reg = mem->regions + *cached_reg;
+	if (likely(addr >= reg->guest_phys_addr &&
+		reg->guest_phys_addr + reg->memory_size > addr))
+		return reg;
+
 	while (start < end) {
 		int slot = start + (end - start) / 2;
 		reg = mem->regions + slot;
@@ -952,8 +960,10 @@ static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 
 	reg = mem->regions + start;
 	if (addr >= reg->guest_phys_addr &&
-	    reg->guest_phys_addr + reg->memory_size > addr)
+	    reg->guest_phys_addr + reg->memory_size > addr) {
+		*cached_reg = start;
 		return reg;
+	}
 	return NULL;
 }
 
@@ -1107,7 +1117,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
+		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
 			break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..68bd00f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,6 +106,7 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+   int cached_reg;
 };
 
 struct vhost_dev {
-- 
1.8.3.1



[PATCH v2 5/6] vhost: add 'translation_cache' module parameter

2015-06-17 Thread Igor Mammedov
by default translation of virtqueue descriptors is done with
caching enabled, but caching only adds extra cost
in cases of a thrashing workload where the majority of descriptors
are translated to different memory regions.
So add an option to exclude the cache miss cost for such cases.

Performance with caching enabled for a sequential workload
doesn't seem to be affected much vs the version without the static key switch,
i.e. still the same 0.2% of total time with the key (NOPs) consuming
5ms on a 5min workload.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
I don't have a test case for a thrashing workload though, but the jmp
instruction adds up ~6ms (55M instructions), minus the excluded caching
of around 24ms, on a 5min workload.
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5bcb323..78290b7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,6 +29,13 @@
 
 #include "vhost.h"
 
+struct static_key translation_cache_key = STATIC_KEY_INIT_TRUE;
+static bool translation_cache = true;
+module_param(translation_cache, bool, 0444);
+MODULE_PARM_DESC(translation_cache,
+	"Enables/disables virtqueue descriptor translation caching,"
+	" Set to 0 to disable. (default: 1)");
+
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
 	VHOST_MEMORY_F_LOG = 0x1,
@@ -944,10 +951,12 @@ static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 	const struct vhost_memory_region *reg;
 	int start = 0, end = mem->nregions;
 
-	reg = mem->regions + *cached_reg;
-	if (likely(addr >= reg->guest_phys_addr &&
-		reg->guest_phys_addr + reg->memory_size > addr))
-		return reg;
+	if (static_key_true(&translation_cache_key)) {
+		reg = mem->regions + *cached_reg;
+		if (likely(addr >= reg->guest_phys_addr &&
+			reg->guest_phys_addr + reg->memory_size > addr))
+			return reg;
+	}
 
 	while (start < end) {
 		int slot = start + (end - start) / 2;
@@ -1612,6 +1621,9 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
 static int __init vhost_init(void)
 {
+   if (!translation_cache)
+		static_key_slow_dec(&translation_cache_key);
+
return 0;
 }
 
-- 
1.8.3.1



[PATCH v2 1/6] vhost: use binary search instead of linear in find_region()

2015-06-17 Thread Igor Mammedov
For default region layouts performance stays the same
as linear search i.e. it takes around 210ns average for
translate_desc() that inlines find_region().

But it scales better with larger amount of regions,
235ns BS vs 300ns LS with 55 memory regions
and it will be about the same values when allowed number
of slots is increased to 509 like it has been done in KVM.
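
(For scale: with the regions kept sorted, a binary search over 509 entries
needs at most about log2(509), i.e. 9 comparisons per lookup, while a linear
scan can touch up to 509 entries.)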

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
v2:
  move kvfree() to 2/2 where it belongs
---
 drivers/vhost/vhost.c | 36 +++-
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..f1e07b8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 #include <linux/module.h>
+#include <linux/sort.h>
 
 #include "vhost.h"
 
@@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
+{
+	const struct vhost_memory_region *r1 = p1, *r2 = p2;
+	if (r1->guest_phys_addr < r2->guest_phys_addr)
+		return 1;
+	if (r1->guest_phys_addr > r2->guest_phys_addr)
+		return -1;
+	return 0;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -612,6 +623,8 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		kfree(newmem);
 		return -EFAULT;
 	}
+	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
+		vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
kfree(newmem);
@@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 __u64 addr, __u32 len)
 {
-	struct vhost_memory_region *reg;
-	int i;
+	const struct vhost_memory_region *reg;
+	int start = 0, end = mem->nregions;
 
-	/* linear search is not brilliant, but we really have on the order of 6
-	 * regions in practice */
-	for (i = 0; i < mem->nregions; ++i) {
-		reg = mem->regions + i;
-		if (reg->guest_phys_addr <= addr &&
-		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
-			return reg;
+	while (start < end) {
+		int slot = start + (end - start) / 2;
+		reg = mem->regions + slot;
+		if (addr >= reg->guest_phys_addr)
+			end = slot;
+		else
+			start = slot + 1;
 	}
+
+	reg = mem->regions + start;
+	if (addr >= reg->guest_phys_addr &&
+	    reg->guest_phys_addr + reg->memory_size > addr)
+		return reg;
 	return NULL;
 }
 
-- 
1.8.3.1



[PATCH v2 4/6] vhost: translate_desc: optimization for desc.len < region size

2015-06-17 Thread Igor Mammedov
when translating descriptors they are typically smaller than the
memory region that holds them and are translated into 1 iov
entry, so it's not necessary to check the remaining length
twice and calculate the used length and next address
in such cases.

replace the remaining-length and 'size' increment branches
with a single remaining-length check and execute the
next-iov steps only when needed.

It saves a tiny 2% of translate_desc() execution time.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
I'm not sure if iov_size > 0 is always true, if it's not
then better to drop this patch.
---
 drivers/vhost/vhost.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5c39a1e..5bcb323 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -,12 +,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 	int ret = 0;
 
 	mem = vq->memory;
-	while ((u64)len > s) {
+	do {
 		u64 size;
-		if (unlikely(ret >= iov_size)) {
-			ret = -ENOBUFS;
-			break;
-		}
 		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
@@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		}
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
-		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
+		++ret;
+		if (likely((u64)len - s < size)) {
+			_iov->iov_len = (u64)len - s;
+			break;
+		}
+
+		if (unlikely(ret >= iov_size)) {
+			ret = -ENOBUFS;
+			break;
+		}
+		_iov->iov_len = size;
 		s += size;
 		addr += size;
-		++ret;
-	}
+	} while (1);
 
return ret;
 }
-- 
1.8.3.1



[PATCH v2 6/6] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net since host kernel
in module vhost-net refuses to accept more than 64
memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS limit from 64 to 509
to match KVM_USER_MEM_SLOTS to fix issue for vhost-net
and current QEMU versions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 78290b7..e93023e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,7 +37,7 @@ MODULE_PARM_DESC(translation_cache,
 	" Set to 0 to disable. (default: 1)");
 
 enum {
-	VHOST_MEMORY_MAX_NREGIONS = 64,
+	VHOST_MEMORY_MAX_NREGIONS = 509,
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-- 
1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 16:32:02 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
  
  
  On 17/06/2015 15:13, Michael S. Tsirkin wrote:
 Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case,
setting limit back from 509 to 64 will not help here in any way,
userspace still can create as many vhost instances as it needs
to consume memory it desires.
   
   Not really since vhost char device isn't world-accessible.
    It's typically opened by a privileged tool, the fd is
    then passed to an unprivileged userspace, or permissions dropped.
  
  Then what's the concern anyway?
  
  Paolo
 
 Each fd now ties up 16K of kernel memory.  It didn't use to, so
  privileged tool could safely give the unprivileged userspace
 a ton of these fds.
if privileged tool gives out unlimited amount of fds then it
doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
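
(For reference, using the 32 bytes per region figure quoted earlier in the
thread, the two sizes being compared work out roughly as:

    64 regions  * 32 B =  2048 B  -> rounds up to a 4 KiB allocation
    509 regions * 32 B = 16288 B  -> about 16 KiB, i.e. 4 pages

per vhost fd whose memory map is fully populated.)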




Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 17:38:40 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
  On Wed, 17 Jun 2015 16:32:02 +0200
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:


On 17/06/2015 15:13, Michael S. Tsirkin wrote:
   Considering userspace can be malicious, I guess yes.
  I don't think it's a valid concern in this case,
  setting limit back from 509 to 64 will not help here in any
  way, userspace still can create as many vhost instances as
  it needs to consume memory it desires.
 
 Not really since vhost char device isn't world-accessible.
 It's typically opened by a priveledged tool, the fd is
 then passed to an unpriveledged userspace, or permissions
 dropped.

Then what's the concern anyway?

Paolo
   
   Each fd now ties up 16K of kernel memory.  It didn't use to, so
   priveledged tool could safely give the unpriveledged userspace
   a ton of these fds.
  if privileged tool gives out unlimited amount of fds then it
  doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
  
 
 Of course it does not give out unlimited fds, there's a way
 for the sysadmin to specify the number of fds. Look at how libvirt
 uses vhost, it should become clear I think.
then it just means that the tool has to take the new limits into account
to partition the host in a sensible manner.
Exposing the limit as a module parameter might be of help to the tool for
getting/setting it in the way it needs.
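
A rough sketch of that, assuming the limit is turned from the current enum
constant into a variable (max_mem_regions is an illustrative name):

    /* sketch: expose the region limit as a module parameter so management
     * tools can read (or set) it via /sys/module/vhost/parameters/ */
    static ushort max_mem_regions = 64;
    module_param(max_mem_regions, ushort, 0444);
    MODULE_PARM_DESC(max_mem_regions,
            "Maximum number of memory regions in memory map. (default: 64)");

vhost_set_memory() would then check mem.nregions against max_mem_regions
instead of VHOST_MEMORY_MAX_NREGIONS.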


[PATCH 1/5] vhost: use binary search instead of linear in find_region()

2015-06-16 Thread Igor Mammedov
For default region layouts performance stays the same
as linear search i.e. it takes around 210ns average for
translate_desc() that inlines find_region().

But it scales better with larger amount of regions,
235ns BS vs 300ns LS with 55 memory regions
and it will be about the same values when allowed number
of slots is increased to 509 like it has been done in kvm.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 38 --
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..a22f8c3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 #include <linux/module.h>
+#include <linux/sort.h>
 
 #include "vhost.h"
 
@@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
+{
+	const struct vhost_memory_region *r1 = p1, *r2 = p2;
+	if (r1->guest_phys_addr < r2->guest_phys_addr)
+		return 1;
+	if (r1->guest_phys_addr > r2->guest_phys_addr)
+		return -1;
+	return 0;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	memcpy(newmem, &mem, size);
 	if (copy_from_user(newmem->regions, m->regions,
 			   mem.nregions * sizeof *m->regions)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
+	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
+		vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
kfree(newmem);
@@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 __u64 addr, __u32 len)
 {
-	struct vhost_memory_region *reg;
-	int i;
+	const struct vhost_memory_region *reg;
+	int start = 0, end = mem->nregions;
 
-	/* linear search is not brilliant, but we really have on the order of 6
-	 * regions in practice */
-	for (i = 0; i < mem->nregions; ++i) {
-		reg = mem->regions + i;
-		if (reg->guest_phys_addr <= addr &&
-		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
-			return reg;
+	while (start < end) {
+		int slot = start + (end - start) / 2;
+		reg = mem->regions + slot;
+		if (addr >= reg->guest_phys_addr)
+			end = slot;
+		else
+			start = slot + 1;
 	}
+
+	reg = mem->regions + start;
+	if (addr >= reg->guest_phys_addr &&
+	    reg->guest_phys_addr + reg->memory_size > addr)
+		return reg;
 	return NULL;
 }
 
-- 
1.8.3.1



[PATCH 5/5] vhost: translate_desc: optimization for desc.len < region size

2015-06-16 Thread Igor Mammedov
when translating descriptors they are typically smaller than the
memory region that holds them and are translated into 1 iov
entry, so it's not necessary to check the remaining length
twice and calculate the used length and next address
in such cases.

so replace the remaining-length and 'size' increment branches
with a single remaining-length check and execute the
next-iov steps only when needed.

It saves a tiny 2% of translate_desc() execution time.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
I'm not sure if iov_size > 0 is always true, if it's not
then better to drop this patch.
---
 drivers/vhost/vhost.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 68c1c88..84c457d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -,12 +,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 	int ret = 0;
 
 	mem = vq->memory;
-	while ((u64)len > s) {
+	do {
 		u64 size;
-		if (unlikely(ret >= iov_size)) {
-			ret = -ENOBUFS;
-			break;
-		}
 		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
@@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		}
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
-		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
+		++ret;
+		if (likely((u64)len - s < size)) {
+			_iov->iov_len = (u64)len - s;
+			break;
+		}
+
+		if (unlikely(ret >= iov_size)) {
+			ret = -ENOBUFS;
+			break;
+		}
+		_iov->iov_len = size;
 		s += size;
 		addr += size;
-		++ret;
-	}
+	} while (1);
 
return ret;
 }
-- 
1.8.3.1



[PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
Series extends vhost to support up to 509 memory regions,
and adds some vhost:translate_desc() performance improvements
so it won't regress when memslots are increased to 509.

It fixes a running VM crashing during memory hotplug due
to vhost refusing to accept more than 64 memory regions.

It's only a host kernel side fix to make it work with QEMU
versions that support memory hotplug. But I'll continue
to work on a QEMU side solution to reduce the amount of memory
regions to make things even better.

Performance-wise, for a guest with (in my case) 3 memory regions
and netperf's UDP_RR workload, translate_desc() execution
time as a share of the total workload is:

Memory  |1G RAM|cached|non cached
regions #   |  3   |  53  |  53

upstream| 0.3% |  -   | 3.5%

this series | 0.2% | 0.5% | 0.7%

where the non cached column reflects a thrashing workload
with constant cache misses. More details on timing in
respective patches.

Igor Mammedov (5):
  vhost: use binary search instead of linear in find_region()
  vhost: extend memory regions allocation to vmalloc
  vhost: support upto 509 memory regions
  vhost: add per VQ memory region caching
  vhost: translate_desc: optimization for desc.len < region size

 drivers/vhost/vhost.c | 95 +--
 drivers/vhost/vhost.h |  1 +
 2 files changed, 71 insertions(+), 25 deletions(-)

-- 
1.8.3.1



[PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net since host kernel
in module vhost-net refuses to accept more than 65
memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 #include "vhost.h"
 
 enum {
-	VHOST_MEMORY_MAX_NREGIONS = 64,
+	VHOST_MEMORY_MAX_NREGIONS = 509,
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-- 
1.8.3.1



[PATCH 2/5] vhost: extend memory regions allocation to vmalloc

2015-06-16 Thread Igor Mammedov
with a large number of memory regions we could end up with
high order allocations and kmalloc could fail if
the host is under memory pressure.
Considering that the memory regions array is used on the hot path,
try harder to allocate using kmalloc and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at a QEMU side solution to reduce the amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large amount of memory
regions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a22f8c3..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kfree(dev->memory);
+	kvfree(dev->memory);
 	dev->memory = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
 	return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,7 +625,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EOPNOTSUPP;
 	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
 		return -E2BIG;
-	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+	newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
@@ -627,7 +639,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		vhost_memory_reg_sort_cmp, NULL);
 
 	if (!memory_access_ok(d, newmem, 0)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
 	oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		d->vqs[i]->memory = newmem;
 		mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH 4/5] vhost: add per VQ memory region caching

2015-06-16 Thread Igor Mammedov
that brings down translate_desc() cost to around 210ns
if accessed descriptors are from the same memory region.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
that's what netperf/iperf workloads were during testing.
---
 drivers/vhost/vhost.c | 16 +---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a18c92..68c1c88 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -200,6 +200,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	vq->cached_reg = 0;
 }
 
 static int vhost_worker(void *data)
@@ -649,6 +650,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
 		d->vqs[i]->memory = newmem;
+		d->vqs[i]->cached_reg = 0;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
 	kvfree(oldmem);
@@ -936,11 +938,17 @@ done:
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
+						     __u64 addr, __u32 len,
+						     int *cached_reg)
 {
 	const struct vhost_memory_region *reg;
 	int start = 0, end = mem->nregions;
 
+	reg = mem->regions + *cached_reg;
+	if (likely(addr >= reg->guest_phys_addr &&
+		reg->guest_phys_addr + reg->memory_size > addr))
+		return reg;
+
 	while (start < end) {
 		int slot = start + (end - start) / 2;
 		reg = mem->regions + slot;
@@ -952,8 +960,10 @@ static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 
 	reg = mem->regions + start;
 	if (addr >= reg->guest_phys_addr &&
-	    reg->guest_phys_addr + reg->memory_size > addr)
+	    reg->guest_phys_addr + reg->memory_size > addr) {
+		*cached_reg = start;
 		return reg;
+	}
 	return NULL;
 }
 
@@ -1107,7 +1117,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
+		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
 			break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..68bd00f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,6 +106,7 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+   int cached_reg;
 };
 
 struct vhost_dev {
-- 
1.8.3.1



Re: [PATCH 1/5] vhost: use binary search instead of linear in find_region()

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:07:24 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:35PM +0200, Igor Mammedov wrote:
  For default region layouts performance stays the same
  as linear search i.e. it takes around 210ns average for
  translate_desc() that inlines find_region().
  
  But it scales better with larger amount of regions,
  235ns BS vs 300ns LS with 55 memory regions
  and it will be about the same values when allowed number
  of slots is increased to 509 like it has been done in kvm.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
  ---
   drivers/vhost/vhost.c | 38 --
   1 file changed, 28 insertions(+), 10 deletions(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 2ee2826..a22f8c3 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -25,6 +25,7 @@
   #include <linux/kthread.h>
   #include <linux/cgroup.h>
   #include <linux/module.h>
  +#include <linux/sort.h>
   
   #include "vhost.h"
   
  @@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
   }
   EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
   
  +static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
  +{
  +	const struct vhost_memory_region *r1 = p1, *r2 = p2;
  +	if (r1->guest_phys_addr < r2->guest_phys_addr)
  +		return 1;
  +	if (r1->guest_phys_addr > r2->guest_phys_addr)
  +		return -1;
  +	return 0;
  +}
  +
   static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   {
   	struct vhost_memory mem, *newmem, *oldmem;
  @@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   	memcpy(newmem, &mem, size);
   	if (copy_from_user(newmem->regions, m->regions,
   			   mem.nregions * sizeof *m->regions)) {
  -		kfree(newmem);
  +		kvfree(newmem);
   		return -EFAULT;
   	}
 
 What's this doing here?
oops, it sneaked in from 2/5 when I was splitting patches.
I'll fix it up.

 
  +	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
  +		vhost_memory_reg_sort_cmp, NULL);
   
   	if (!memory_access_ok(d, newmem, 0)) {
   		kfree(newmem);
  @@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
   static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
   						     __u64 addr, __u32 len)
   {
  -	struct vhost_memory_region *reg;
  -	int i;
  +	const struct vhost_memory_region *reg;
  +	int start = 0, end = mem->nregions;
   
  -	/* linear search is not brilliant, but we really have on the order of 6
  -	 * regions in practice */
  -	for (i = 0; i < mem->nregions; ++i) {
  -		reg = mem->regions + i;
  -		if (reg->guest_phys_addr <= addr &&
  -		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
  -			return reg;
  +	while (start < end) {
  +		int slot = start + (end - start) / 2;
  +		reg = mem->regions + slot;
  +		if (addr >= reg->guest_phys_addr)
  +			end = slot;
  +		else
  +			start = slot + 1;
   	}
  +
  +	reg = mem->regions + start;
  +	if (addr >= reg->guest_phys_addr &&
  +	    reg->guest_phys_addr + reg->memory_size > addr)
  +		return reg;
   	return NULL;
   }
   
  -- 
  1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:14:20 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
  since commit
   1d4e7e3 kvm: x86: increase user memory slots to 509
  
  it became possible to use a bigger amount of memory
  slots, which is used by memory hotplug for
  registering hotplugged memory.
  However QEMU crashes if it's used with more than ~60
  pc-dimm devices and vhost-net since host kernel
  in module vhost-net refuses to accept more than 65
  memory regions.
  
  Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
 
 It was 64, not 65.
 
  to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
 
 Still thinking about this: can you reorder this to
 be the last patch in the series please?
sure

 
 Also - 509?
userspace memory slots in terms of KVM, I made it match
KVM's allotment of memory slots for userspace side.

 I think if we are changing this, it'd be nice to
 create a way for userspace to discover the support
 and the # of regions supported.
That was my first idea before extending KVM's memslots:
to teach the kernel to tell qemu this number so that QEMU
at least would be able to check if a new memory slot could
be added, but I was redirected to a more simple solution
of just extending vs overdoing things.
Currently QEMU supports up to ~250 memslots, so 509
is about twice as high as we need, so it should work for the near
future, but eventually we might still teach the kernel and QEMU
to make things more robust.

 
 
  ---
   drivers/vhost/vhost.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 99931a0..6a18c92 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -30,7 +30,7 @@
   #include "vhost.h"
   
   enum {
  -	VHOST_MEMORY_MAX_NREGIONS = 64,
  +	VHOST_MEMORY_MAX_NREGIONS = 509,
   	VHOST_MEMORY_F_LOG = 0x1,
   };
   
  -- 
  1.8.3.1



Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:16:07 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
  Series extends vhost to support upto 509 memory regions,
  and adds some vhost:translate_desc() performance improvemnts
  so it won't regress when memslots are increased to 509.
  
  It fixes running VM crashing during memory hotplug due
  to vhost refusing accepting more than 64 memory regions.
  
  It's only host kernel side fix to make it work with QEMU
  versions that support memory hotplug. But I'll continue
  to work on QEMU side solution to reduce amount of memory
  regions to make things even better.
 
 I'm concerned userspace work will be harder, in particular,
 performance gains will be harder to measure.
it appears so, so far.

 How about a flag to disable caching?
I've tried to measure the cost of a cache miss but without much luck;
the difference between the version with the cache and with caching removed
was within the margin of error (±10ns) (i.e. not measurable on my
5min/10*10^6 test workload).
Also I'm concerned that adding an extra fetch+branch for flag
checking will make things worse for the likely path of a cache hit,
so I'd avoid it if possible.

Or do you mean a simple global per-module flag to disable it and
wrap the thing in a static key so that it will be a cheap jump to skip
the cache?
 
  Performance wise for guest with (in my case 3 memory regions)
  and netperf's UDP_RR workload translate_desc() execution
  time from total workload takes:
  
  Memory  |1G RAM|cached|non cached
  regions #   |  3   |  53  |  53
  
  upstream| 0.3% |  -   | 3.5%
  
  this series | 0.2% | 0.5% | 0.7%
  
  where non cached column reflects trashing wokload
  with constant cache miss. More details on timing in
  respective patches.
  
  Igor Mammedov (5):
vhost: use binary search instead of linear in find_region()
vhost: extend memory regions allocation to vmalloc
vhost: support upto 509 memory regions
vhost: add per VQ memory region caching
    vhost: translate_desc: optimization for desc.len < region size
  
    drivers/vhost/vhost.c | 95 +--
    drivers/vhost/vhost.h |  1 +
    2 files changed, 71 insertions(+), 25 deletions(-)
  
  -- 
  1.8.3.1


