Re: [Qemu-devel] [PATCH v3] hw/arm/aspeed: Unlock SCU when running kernel

2017-11-15 Thread Cédric Le Goater
On 11/14/2017 01:20 PM, Joel Stanley wrote:
> The ASPEED hardware contains a lock register for the SCU that disables
> any writes to the SCU when it is locked. The machine comes up with the
> lock enabled, but on all known hardware u-boot will unlock it and leave
> it unlocked when loading the kernel.
> 
> This means the kernel expects the SCU to be unlocked. When booting from
> an emulated ROM the normal u-boot unlock path is executed. Things don't
> go well when booting using the -kernel command line, as u-boot does not
> run first.
> 
> Change behaviour so that when a kernel is passed to the machine, set the
> reset value of the SCU to be unlocked.
> 
> Signed-off-by: Joel Stanley 

Reviewed-by: Cédric Le Goater 

Thanks,

C.


> ---
> v3:
>  - remove unused member from AspeedBoardConfig
> v2:
>  - use the correct value for the key
>  - rename it ASPEED_SCU_PROT_KEY
> ---
>  hw/arm/aspeed.c  | 9 +
>  hw/arm/aspeed_soc.c  | 2 ++
>  hw/misc/aspeed_scu.c | 5 +++--
>  include/hw/misc/aspeed_scu.h | 3 +++
>  4 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
> index ab895ad490af..7088c907bd23 100644
> --- a/hw/arm/aspeed.c
> +++ b/hw/arm/aspeed.c
> @@ -186,6 +186,15 @@ static void aspeed_board_init(MachineState *machine,
>  &error_abort);
>  object_property_set_int(OBJECT(&bmc->soc), cfg->num_cs, "num-cs",
>  &error_abort);
> +if (machine->kernel_filename) {
> +/*
> + * When booting with a -kernel command line there is no u-boot
> + * that runs to unlock the SCU. In this case set the default to
> + * be unlocked as the kernel expects
> + */
> +object_property_set_int(OBJECT(&bmc->soc), ASPEED_SCU_PROT_KEY,
> +"hw-prot-key", &error_abort);
> +}
>  object_property_set_bool(OBJECT(&bmc->soc), true, "realized",
>   &error_abort);
>  
> diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c
> index 5aa3d2ddd9cd..c83b7e207b86 100644
> --- a/hw/arm/aspeed_soc.c
> +++ b/hw/arm/aspeed_soc.c
> @@ -154,6 +154,8 @@ static void aspeed_soc_init(Object *obj)
>"hw-strap1", &error_abort);
>  object_property_add_alias(obj, "hw-strap2", OBJECT(&s->scu),
>"hw-strap2", &error_abort);
> +object_property_add_alias(obj, "hw-prot-key", OBJECT(&s->scu),
> +  "hw-prot-key", &error_abort);
>  
>  object_initialize(&s->fmc, sizeof(s->fmc), sc->info->fmc_typename);
>  object_property_add_child(obj, "fmc", OBJECT(&s->fmc), NULL);
> diff --git a/hw/misc/aspeed_scu.c b/hw/misc/aspeed_scu.c
> index 95022d3607ad..74537ce9755d 100644
> --- a/hw/misc/aspeed_scu.c
> +++ b/hw/misc/aspeed_scu.c
> @@ -85,7 +85,6 @@
>  #define BMC_REV  TO_REG(0x19C)
>  #define BMC_DEV_ID   TO_REG(0x1A4)
>  
> -#define PROT_KEY_UNLOCK 0x1688A8A8
>  #define SCU_IO_REGION_SIZE 0x1000
>  
>  static const uint32_t ast2400_a0_resets[ASPEED_SCU_NR_REGS] = {
> @@ -192,7 +191,7 @@ static void aspeed_scu_write(void *opaque, hwaddr offset, 
> uint64_t data,
>  }
>  
>  if (reg > PROT_KEY && reg < CPU2_BASE_SEG1 &&
> -s->regs[PROT_KEY] != PROT_KEY_UNLOCK) {
> +s->regs[PROT_KEY] != ASPEED_SCU_PROT_KEY) {
>  qemu_log_mask(LOG_GUEST_ERROR, "%s: SCU is locked!\n", __func__);
>  return;
>  }
> @@ -246,6 +245,7 @@ static void aspeed_scu_reset(DeviceState *dev)
>  s->regs[SILICON_REV] = s->silicon_rev;
>  s->regs[HW_STRAP1] = s->hw_strap1;
>  s->regs[HW_STRAP2] = s->hw_strap2;
> +s->regs[PROT_KEY] = s->hw_prot_key;
>  }
>  
>  static uint32_t aspeed_silicon_revs[] = {
> @@ -299,6 +299,7 @@ static Property aspeed_scu_properties[] = {
>  DEFINE_PROP_UINT32("silicon-rev", AspeedSCUState, silicon_rev, 0),
>  DEFINE_PROP_UINT32("hw-strap1", AspeedSCUState, hw_strap1, 0),
>  DEFINE_PROP_UINT32("hw-strap2", AspeedSCUState, hw_strap2, 0),
> +DEFINE_PROP_UINT32("hw-prot-key", AspeedSCUState, hw_prot_key, 0),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/hw/misc/aspeed_scu.h b/include/hw/misc/aspeed_scu.h
> index bd4ac013f997..d70cc0aeca61 100644
> --- a/include/hw/misc/aspeed_scu.h
> +++ b/include/hw/misc/aspeed_scu.h
> @@ -29,6 +29,7 @@ typedef struct AspeedSCUState {
>  uint32_t silicon_rev;
>  uint32_t hw_strap1;
>  uint32_t hw_strap2;
> +uint32_t hw_prot_key;
>  } AspeedSCUState;
>  
>  #define AST2400_A0_SILICON_REV   0x02000303U
> @@ -38,6 +39,8 @@ typedef struct AspeedSCUState {
>  
>  extern bool is_supported_silicon_rev(uint32_t silicon_rev);
>  
> +#define ASPEED_SCU_PROT_KEY  0x1688A8A8
> +
>  /*
>   * Extracted from Aspeed SDK v00.03.21. Fixes and extra definitions
>   * were added.
> 




Re: [Qemu-devel] HAXM is now open source

2017-11-15 Thread Yu Ning



On 11/15/2017 3:13, John Snow wrote:


On 11/14/2017 06:09 AM, Thomas Huth wrote:


That's great news! I hope this all will help to promote QEMU on Windows
and macOS quite a bit!

However, during the past months, I noticed a couple of times that users
ask on IRC or the qemu-discuss mailing list how they could accelerate
their QEMU on Windows - and they are running only in TCG mode when you
ask how they start QEMU. So it seems like there is not much knowledge
about "--accel hax" in the public yet. Maybe you could write a nice blog
post for the QEMU blog or something similar that explains how to use
HAXM with QEMU on Windows for the normal users? Or maybe make it more
prominent in the QEMU wiki? (e.g. the main page only mentions KVM and
Xen, but not HAXM)

A blog post advertising this new development would be an absolute
miracle for linking to people who are just getting started with QEMU on
Windows.

(It would also be really good for idiots like me, who do not use Windows
for anything other than playing video games and sometimes forget that it
is capable of doing other things.)


Thanks for the encouragement.

I think I can start by writing a small section that can be added to an 
existing document, and later expanded to a blog article. But I haven't 
found a suitable place for it on the QEMU wiki.


Does it make sense to add the proposed piece to the QEMU user manual 
(qemu-doc.texi), under the Quick Start section (2.2)? The user manual is 
published on Stefan's website and referred to by qemu.org:


https://qemu.weilnetz.de/doc/qemu-doc.html

so I think it's a popular resource among users.

BTW, I assume most Windows users begin their QEMU journey from this page 
(also credit to Stefan):


https://qemu.weilnetz.de/

and I've verified the hax accelerator module is built into the latest 
binaries there (at least the W64 ones).


-Yu



Re: [Qemu-devel] [PATCH for-2.12 v3 05/11] spapr: introduce an IRQ allocator using a bitmap

2017-11-15 Thread Cédric Le Goater
On 11/14/2017 04:28 PM, Greg Kurz wrote:
> On Tue, 14 Nov 2017 11:54:53 +
> Cédric Le Goater  wrote:
> 
>> On 11/14/2017 09:42 AM, Greg Kurz wrote:
>>> On Fri, 10 Nov 2017 15:20:11 +
>>> Cédric Le Goater  wrote:
>>>   
 Let's define a new set of XICSFabric IRQ operations for the latest
 pseries machine. These simply use a a bitmap 'irq_map' as a IRQ number
 allocator.

 The previous pseries machines keep the old set of IRQ operations using
 the ICSIRQState array.

 Signed-off-by: Cédric Le Goater 
 ---

  Changes since v2 :

  - introduced a second set of XICSFabric IRQ operations for older
pseries machines

  hw/ppc/spapr.c | 76 
 ++
  include/hw/ppc/spapr.h |  3 ++
  2 files changed, 74 insertions(+), 5 deletions(-)

 diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
 index 4bdceb45a14f..4ef0b73559ca 100644
 --- a/hw/ppc/spapr.c
 +++ b/hw/ppc/spapr.c
 @@ -1681,6 +1681,22 @@ static const VMStateDescription 
 vmstate_spapr_patb_entry = {
  },
  };
  
 +static bool spapr_irq_map_needed(void *opaque)
 +{
 +return true;  
>>>
>>> I see that the next patch adds some code to avoid sending the
>>> bitmap if it doesn't contain state, but I guess you should also
>>> explicitly have this function to return false for older machine
>>> types (see remark below).
>>>   
 +}
 +
 +static const VMStateDescription vmstate_spapr_irq_map = {
 +.name = "spapr_irq_map",
 +.version_id = 0,
 +.minimum_version_id = 0,
 +.needed = spapr_irq_map_needed,
 +.fields = (VMStateField[]) {
 +VMSTATE_BITMAP(irq_map, sPAPRMachineState, 0, nr_irqs),
 +VMSTATE_END_OF_LIST()
 +},
 +};
 +
  static const VMStateDescription vmstate_spapr = {
  .name = "spapr",
  .version_id = 3,
 @@ -1700,6 +1716,7 @@ static const VMStateDescription vmstate_spapr = {
  &vmstate_spapr_ov5_cas,
  &vmstate_spapr_patb_entry,
  &vmstate_spapr_pending_events,
 +&vmstate_spapr_irq_map,
  NULL
  }
  };
 @@ -2337,8 +2354,12 @@ static void ppc_spapr_init(MachineState *machine)
  /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
  load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
  
 +/* Initialize the IRQ allocator */
 +spapr->nr_irqs  = XICS_IRQS_SPAPR;
> 
> BTW, is this constant for the machine lifetime ? If so, maybe it should go
> to sPAPRMachineClass.

For Xive, we will be increasing the value of 'nr_irqs' with the number 
of max_cpus to handle the IPIs.

C.


> 
 +spapr->irq_map  = bitmap_new(spapr->nr_irqs);
 +  
>>>
>>> I think you should introduce a sPAPRMachineClass::has_irq_bitmap boolean
>>> so that the bitmap is only allocated for newer machine types. And you should
>>> then use this flag in spapr_irq_map_needed() above.  
>>
>> yes. I can add a boot to be more explicit on the use of the bitmap.
>>
>> Thanks,
>>
>> C. 
>>
>>
>>>
>>> Apart from that, the rest of the patch looks good.
>>>   
  /* Set up Interrupt Controller before we create the VCPUs */
 -xics_system_init(machine, XICS_IRQS_SPAPR, &error_fatal);
 +xics_system_init(machine, spapr->nr_irqs, &error_fatal);
  
  /* Set up containers for ibm,client-architecture-support negotiated 
 options
   */
 @@ -3560,7 +3581,7 @@ static int ics_find_free_block(ICSState *ics, int 
 num, int alignnum)
  return -1;
  }
  
 -static bool spapr_irq_test(XICSFabric *xi, int irq)
 +static bool spapr_irq_test_2_11(XICSFabric *xi, int irq)
  {
  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
  ICSState *ics = spapr->ics;
 @@ -3569,7 +3590,7 @@ static bool spapr_irq_test(XICSFabric *xi, int irq)
  return !ICS_IRQ_FREE(ics, srcno);
  }
  
 -static int spapr_irq_alloc_block(XICSFabric *xi, int count, int align)
 +static int spapr_irq_alloc_block_2_11(XICSFabric *xi, int count, int 
 align)
  {
  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
  ICSState *ics = spapr->ics;
 @@ -3583,7 +3604,7 @@ static int spapr_irq_alloc_block(XICSFabric *xi, int 
 count, int align)
  return srcno + ics->offset;
  }
  
 -static void spapr_irq_free_block(XICSFabric *xi, int irq, int num)
 +static void spapr_irq_free_block_2_11(XICSFabric *xi, int irq, int num)
  {
  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
  ICSState *ics = spapr->ics;
 @@ -3601,6 +3622,46 @@ static void spapr_irq_free_block(XICSFabric *xi, 
 int irq, int num)
  }
  }
  
 +static bool spapr_irq_test(XICSFabric *xi, int irq)
 +{
 +sPAPRMachineState *spapr = SPAPR_MACHIN

Re: [Qemu-devel] [PATCH] qapi: block-core: Clarify events emitted by 'block-job-cancel'

2017-11-15 Thread Kashyap Chamarthy
On Tue, Nov 14, 2017 at 11:26:59AM -0800, no-re...@patchew.org wrote:
> Hi,
> 
> This series failed automatic build test. Please find the testing commands and
> their output below. If you have docker installed, you can probably reproduce 
> it
> locally.

[...]

> In file included from /tmp/qemu-test/src/qapi-schema.json:85:
> In file included from /tmp/qemu-test/src/qapi/block.json:7:
> /tmp/qemu-test/src/qapi/block-core.json:2081:1: '@device:' can't follow 
> 'Note' section

Hmm, I don't know the syntax enough to see why the 'Note' section is not
allowed.  But it's not strictly needed at all.  So I'll just do:

s/Note:/It is worth noting/

> make: *** [qapi-visit.h] Error 1
> make: *** Deleting file `qapi-visit.h'
> make: *** Waiting for unfinished jobs

[...]

-- 
/kashyap



Re: [Qemu-devel] [PATCH v3 for-2.11 1/3] tpm_emulator: Add a caching layer for the TPM Established flag

2017-11-15 Thread Valluri, Amarnath
On Wed, 2017-11-15 at 04:47 +0100, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Nov 15, 2017 at 2:16 AM, Stefan Berger
>  wrote:
> > 
> > On 11/14/2017 06:40 PM, Marc-André Lureau wrote:
> > > 
> > > 
> > > Hi
> > > 
> > > On Tue, Nov 14, 2017 at 10:52 PM, Stefan Berger
> > >  wrote:
> > > > 
> > > > 
> > > > Add a caching layer for the TPM established flag so that we
> > > > don't
> > > > need to go to the emulator every time the flag is read by
> > > > accessing
> > > > the REG_ACCESS register.
> > > What's the impact? Isn't this just a "small" optimization? Iotw,
> > > why
> > > is this for-2.11?
> > 
> > The TIS has a register that contains this flag and that's being
> > polled quite
> > frequently. So it generates a lot of traffic to the emulator. This
> > caching
> > layer gets rid of most of the traffic.
> I didn't notice any problem when doing my tests, I guess Amarnath
> niether. perhaps it's best to delay for after 2.11.

Yes, I could see there is little frequent(21 calls to backend) to get
establish flag, and this patch will help to avoid them. But i still
agree with Marc and tag this as "small" optimization.

- Amarnath


[Qemu-devel] [PATCH v2] qapi: block-core: Clarify events emitted by 'block-job-cancel'

2017-11-15 Thread Kashyap Chamarthy
When you cancel an in-progress live block operation with QMP
`block-job-cancel`, it emits the event: BLOCK_JOB_CANCELLED.  However,
when `block-job-cancel` is issued after `drive-mirror` has indicated (by
emitting the event BLOCK_JOB_READY) that the source and destination
remain synchronized:

[...] # Snip `drive-mirror` invocation & outputs
{
  "execute":"block-job-cancel",
  "arguments":{
"device":"virtio0"
  }
}

{"return": {}}

It (`block-job-cancel`) will counterintuitively emit the event
'BLOCK_JOB_COMPLETED':

{
  "timestamp":{
"seconds":1510678024,
"microseconds":526240
  },
  "event":"BLOCK_JOB_COMPLETED",
  "data":{
"device":"virtio0",
"len":41126400,
"offset":41126400,
"speed":0,
"type":"mirror"
  }
}

But this is expected behaviour, where the _COMPLETED event indicates
that synchronization has successfully ended (and the destination has a
point-in-time copy, which is at the time of cancel).

So add a small note to this effect.  (Thanks: Max Reitz for reminding
me of this on IRC.)

Signed-off-by: Kashyap Chamarthy 
---
Changes in v2:
 - "Note:" seems to be a special construct in Patchew, my usage caused a
build failure.  So do: s/Note:/Note that/
 - Add the missing 'Signed-off-by'
---
 qapi/block-core.json | 8 
 1 file changed, 8 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 
ab96e348e6317bb42769ae20f4a4519bac02e93a..4ecfd1fbc5bc231c69da15d489c44e3e8b0706a5
 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2065,6 +2065,14 @@
 # BLOCK_JOB_CANCELLED event.  Before that happens the job is still visible when
 # enumerated using query-block-jobs.
 #
+# Note that the 'block-job-cancel' command will emit the event
+# BLOCK_JOB_COMPLETED if you issue it ('block-job-cancel') after 'drive-mirror'
+# has indicated (by emitting the event BLOCK_JOB_READY) that the source and
+# destination remain synchronized.  In this case, the BLOCK_JOB_COMPLETED
+# event indicates that synchronization (from 'drive-mirror') has successfully
+# ended and the destination now has a point-in-time copy, which is at the time
+# of cancel.
+#
 # For streaming, the image file retains its backing file unless the streaming
 # operation happens to complete just as it is being cancelled.  A new streaming
 # operation can be started at a later time to finish copying all data from the
-- 
2.9.5




Re: [Qemu-devel] [PATCH for-2.11] qcow2: Fix overly broad madvise()

2017-11-15 Thread Alberto Garcia
On Tue 14 Nov 2017 07:41:27 PM CET, Max Reitz wrote:
> @mem_size and @offset are both size_t, thus subtracting them from one
> another will just return a big size_t if mem_size < offset -- even more
> obvious here because the result is stored in another size_t.
>
> Checking that result to be positive is therefore not sufficient to
> excluse the case that offset > mem_size.  Thus, we currently sometimes
> issue an madvise() over a very large address range.
>
> This is triggered by iotest 163, but with -m64, this does not result in
> tangible problems.  But with -m32, this test produces three segfaults,
> all of which are fixed by this patch.
>
> Signed-off-by: Max Reitz 

Oh, I guess this happens when the page size is larger than the cluster
size? Otherwise I don't see how...

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-devel] [RFC v3 01/27] char-io: fix possible race on IOWatchPoll

2017-11-15 Thread Stefan Hajnoczi
On Tue, Nov 14, 2017 at 07:31:10PM +0800, Peter Xu wrote:
> On Tue, Nov 14, 2017 at 10:32:19AM +, Stefan Hajnoczi wrote:
> > On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> > > On Mon, Nov 13, 2017 at 04:52:11PM +, Stefan Hajnoczi wrote:
> > > > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > > > This is not a problem if we are only having one single loop thread 
> > > > > like
> > > > > before.  However, after per-monitor thread is introduced, this is not
> > > > > true any more, and the race can happen.
> > > > > 
> > > > > The race can be triggered with "make check -j8" sometimes:
> > > > 
> > > > Please mention a specific test case that fails.
> > > 
> > > It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> > > that in next post.
> > > 
> > > > 
> > > > > 
> > > > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > > > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > > > 
> > > > > This patch keeps the reference for the watch object when creating in
> > > > > io_add_watch_poll(), so that the object will never be released in the
> > > > > context main loop, especially when the context loop is running in
> > > > > another standalone thread.  Meanwhile, when we want to remove the 
> > > > > watch
> > > > > object, we always first detach the watch object from its owner 
> > > > > context,
> > > > > then we continue with the cleanup.
> > > > > 
> > > > > Without this patch, calling io_remove_watch_poll() in main loop thread
> > > > > is not thread-safe, since the other per-monitor thread may be 
> > > > > modifying
> > > > > the watch object at the same time.
> > > > > 
> > > > > Reviewed-by: Marc-André Lureau 
> > > > > Signed-off-by: Peter Xu 
> > > > > ---
> > > > >  chardev/char-io.c | 16 ++--
> > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > > > index f81052481a..50b5bac704 100644
> > > > > --- a/chardev/char-io.c
> > > > > +++ b/chardev/char-io.c
> > > > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > > > >  g_free(name);
> > > > >  
> > > > >  g_source_attach(&iwp->parent, context);
> > > > > -g_source_unref(&iwp->parent);
> > > > >  return (GSource *)iwp;
> > > > >  }
> > > > >  
> > > > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource 
> > > > > *source)
> > > > >  IOWatchPoll *iwp;
> > > > >  
> > > > >  iwp = io_watch_poll_from_source(source);
> > > > > +
> > > > > +/*
> > > > > + * Here the order of destruction really matters.  We need to 
> > > > > first
> > > > > + * detach the IOWatchPoll object from the context (which may 
> > > > > still
> > > > > + * be running in another loop thread), only after that could we
> > > > > + * continue to operate on iwp->src, or there may be race 
> > > > > condition
> > > > > + * between current thread and the context loop thread.
> > > > > + *
> > > > > + * Let's blame the glib bug mentioned in commit 2b316774f6
> > > > > + * ("qemu-char: do not operate on sources from finalize
> > > > > + * callbacks") for this extra complexity.
> > > > 
> > > > I don't understand how this bug is to blame.  Isn't the problem here a
> > > > race condition between two QEMU threads?
> > > 
> > > Yes, it is.
> > > 
> > > The problem is, we won't have the race condition if glib does not have
> > > that bug mentioned.  Then the thread running GMainContext will have
> > > full control of iwp->src destruction, and destruction of it would be
> > > fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> > > Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> > > explicitly from main thread before quitting (see [1] below, the whole
> > > if clause).
> > > 
> > > > 
> > > > Why are two threads accessing the watch at the same time?
> > > 
> > > Here is how I understand:
> > > 
> > > Firstly we need to tackle with that bug, by an explicit destruction of
> > > iwp->src below; meanwhile when we are destroying it, the GMainContext
> > > can still be running somewhere (it's not happening in current series
> > > since I stopped iothread earlier than this point, however it can still
> > > happen if in the future we don't do that), then we possibly want this
> > > patch.
> > > 
> > > Again, without this patch, current series should work; however I do
> > > hope this patch can be in, in case someday we want to provide complete
> > > thread safety for Chardevs (now it is not really thread-safe).
> > 
> > You said qtests fail with "Assertion `iwp->src == NULL' failed" but then
> > you said "without this patch, current series should work".  How do you
> > reproduce the failure if it doesn't occur?
> 
> Actually it occurs in some old versions, but not in current version.
> Current version destroys the iothread earlier (as Dan suggested), so
> it can avoid the issue.  Sorry for not bein

Re: [Qemu-devel] [PATCH] tcg: Record code_gen_buffer address for user-only memory helpers

2017-11-15 Thread Richard Henderson
On 11/14/2017 05:09 PM, Alex Bennée wrote:
>> -/* Now we have a real cpu fault.  Since this is the exact location of
>> - * the exception, we must undo the adjustment done by cpu_restore_state
>> - * for handling call return addresses.  */
>> -cpu_restore_state(cpu, pc + GETPC_ADJ);
>> +/* Now we have a real cpu fault.  */
>> +cpu_restore_state(cpu, pc);
> 
> I can't help thinking when we get it wrong we should be doing something
> here, maybe a LOG_UNIMP? Otherwise we silently fail or at least the
> user-space falls off a cliff later.

Oh we silently get it wrong in so many ways.  E.g. zero callers of
cpu_restore_state_from_tb check its return status.  Anyway, I think this sort
of cleanup has to wait til next cycle.


r~



Re: [Qemu-devel] [RFC v3 00/27] QMP: out-of-band (OOB) execution support

2017-11-15 Thread Stefan Hajnoczi
On Mon, Nov 06, 2017 at 09:08:00PM +0800, Peter Xu wrote:
> On Mon, Nov 06, 2017 at 02:12:17AM -0800, no-re...@patchew.org wrote:
> > Hi,
> > 
> > This series seems to have some coding style problems. See output below for
> > more information:
> > 
> > Subject: [Qemu-devel] [RFC v3 00/27] QMP: out-of-band (OOB) execution 
> > support
> > Type: series
> > Message-id: 20171106094643.14881-1-pet...@redhat.com
> > 
> > === TEST SCRIPT BEGIN ===
> > #!/bin/bash
> > 
> > BASE=base
> > n=1
> > total=$(git log --oneline $BASE.. | wc -l)
> > failed=0
> > 
> > git config --local diff.renamelimit 0
> > git config --local diff.renames True
> > 
> > commits="$(git log --format=%H --reverse $BASE..)"
> > for c in $commits; do
> > echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
> > if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; 
> > then
> > failed=1
> > echo
> > fi
> > n=$((n+1))
> > done
> > 
> > exit $failed
> > === TEST SCRIPT END ===
> > 
> > Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> > From https://github.com/patchew-project/qemu
> >  * [new tag]   patchew/20171106094643.14881-1-pet...@redhat.com 
> > -> patchew/20171106094643.14881-1-pet...@redhat.com
> > Switched to a new branch 'test'
> > 5525c4e791 tests: qmp-test: add oob test
> > ccc9e4c399 tests: qmp-test: verify command batching
> > 7f45b4c6c0 docs: update QMP documents for OOB commands
> > 58cfe877d2 monitor: enable IO thread for (qmp & !mux) typed
> > 5e1d56ce74 qmp: isolate responses into io thread
> > aef4275536 qmp: let migrate-incoming allow out-of-band
> > 5e68beacf3 qmp: support out-of-band (oob) execution
> > 43c7215a30 qapi: introduce new cmd option "allow-oob"
> > e11127ba4b monitor: send event when request queue full
> > 45cef4f7e4 qmp: add new event "request-dropped"
> > 485da28be1 monitor: separate QMP parser and dispatcher
> > 4892fe9ca2 monitor: let monitor_{suspend|resume} thread safe
> > 1b86166d9c monitor: introduce monitor_qmp_respond()
> > 0f48093add qmp: introduce some capability helpers
> > 8d3f33043d qmp: negociate QMP capabilities
> > 023b386d0e qmp: introduce QMPCapability
> > 2bde5ca8ce monitor: allow to use IO thread for parsing
> > f4cc112f80 monitor: create monitor dedicate iothread
> > 3590fdc1d4 monitor: let mon_list be tail queue
> > 11c818a9ac monitor: unify global init
> > 36d3efb87d monitor: move the cur_mon hack deeper for QMP
> > bf3e493a86 qjson: add "opaque" field to JSONMessageParser
> > 17367fe7a1 monitor: move skip_flush into monitor_data_init
> > 0c98d4baa4 qobject: let object_property_get_str() use new API
> > aa4b973dd5 qobject: introduce qobject_get_try_str()
> > 981ccebc1e qobject: introduce qstring_get_try_str()
> > d40ba38085 char-io: fix possible race on IOWatchPoll
> > 
> > === OUTPUT BEGIN ===
> > Checking PATCH 1/27: char-io: fix possible race on IOWatchPoll...
> > Checking PATCH 2/27: qobject: introduce qstring_get_try_str()...
> > Checking PATCH 3/27: qobject: introduce qobject_get_try_str()...
> > Checking PATCH 4/27: qobject: let object_property_get_str() use new API...
> > Checking PATCH 5/27: monitor: move skip_flush into monitor_data_init...
> > Checking PATCH 6/27: qjson: add "opaque" field to JSONMessageParser...
> > Checking PATCH 7/27: monitor: move the cur_mon hack deeper for QMP...
> > Checking PATCH 8/27: monitor: unify global init...
> > Checking PATCH 9/27: monitor: let mon_list be tail queue...
> > Checking PATCH 10/27: monitor: create monitor dedicate iothread...
> > Checking PATCH 11/27: monitor: allow to use IO thread for parsing...
> > Checking PATCH 12/27: qmp: introduce QMPCapability...
> > Checking PATCH 13/27: qmp: negociate QMP capabilities...
> > Checking PATCH 14/27: qmp: introduce some capability helpers...
> > Checking PATCH 15/27: monitor: introduce monitor_qmp_respond()...
> > Checking PATCH 16/27: monitor: let monitor_{suspend|resume} thread safe...
> > ERROR: braces {} are necessary for all arms of this statement
> > #28: FILE: monitor.c:4014:
> > +if (atomic_dec_fetch(&mon->suspend_cnt) == 0)
> > [...]
> > 
> > ERROR: Missing Signed-off-by: line(s)
> 
> Will add it in next post.  I still haven't found a good way to let
> magit add this line for me every time automatically.

If you don't want to type "git commit -s" all the time you could use the
"format.signOff = on" git-config(1) variable.


signature.asc
Description: PGP signature


Re: [Qemu-devel] Abnormal observation during migration: too many "write-not-dirty" pages

2017-11-15 Thread Juan Quintela
"Chunguang Li"  wrote:
> Hi all! 

Hi

Sorry for the delay, I was on vacation an still getting up to speed.

> I got a very abnormal observation for the VM migration. I found that many 
> pages marked as dirty during
> migration are "not really dirty", which is, their content are the same as the 
> old version. 

I think your test is quite good, and I am also ashamed that 80% of
"false" dirty pages is really a lot.

> I did the migration experiment like this: 
>
> During the setup phase of migration, first I suspended the VM. Then I copied 
> all the pages within the guest
> physical address space to a memory buffer as large as the guest memory size. 
> After that, the dirty tracking
> began and I resumed the VM. Besides, at the end
> of each iteration, I also suspended the VM temporarily. During the 
> suspension, I compared the content of all
> the pages marked as dirty in this iteration byte-by-byte with their former 
> copies inside the buffer. If the
> content of one page was the same as its former copy, I recorded it as a 
> "write-not-dirty" page (the page is
> written exactly with the same content as the old version). Otherwise, I 
> replaced this page in the buffer with
> the new content, for the possible comparison in the future. After the reset 
> of the dirty bitmap, I resumed the
> VM. Thus, I obtain the proportion of the write-not-dirty pages within all the 
> pages marked as dirty for each
> pre-copy iteration. 


vhost and friends could make a small difference here, but in general,
this approach should be ok.

> I repeated this experiment with 15 workloads, which are 11 CPU2006 
> benchmarks, Memcached server,
> kernel compilation, playing a video, and an idle VM. The CPU2006 benchmarks 
> and Memcached are
> write-intensive workloads. So almost all of them did not converge to 
> stop-copy. 

That is the impressive part, 15 workloads.  Thanks for taking the effor.

BTW, do you have your qemu changes handy, just to be able to test
locally, and "review" how do you measure things.


> Startlingly, the proportions of the write-not-dirty pages are quite high. 
> Memcached and three CPU2006
> benchmarks(zeusmp, mcf and bzip2) have the most high proportions. Their 
> proportions of the write-not-dirty
> pages within all the dirty pages are as high as 45%-80%.

Or the workload does really stupid things like:

a = 0;
a = 1;
a = 0;

This makes no sense at all.

Just in case, could you try to test this with xbzrle?  It should go well
with this use case (but you need to get a big enough buffer to cache
enough memory).


> The proportions of the other workloads are about
> 5%-20%, which are also abnormal. According to my intuition, the proportion of 
> write-not-dirty pages should be
> far less than these numbers. I think it should be quite a particular case 
> that one page is written with exactly
> the same content as the former data. 

I agree with that.

> Besides, the zero pages are not counted for all the results. Because I think 
> codes like memset() may write
> large area of pages to zero pages, which are already zero pages before. 
>
> I excluded some possible unknown reasons with the machine hardware, because I 
> repeated the experiments
> with two sets of different machines. Then I guessed it might be related with 
> the huge page feature. However,
> the result was the same when I turned the huge page feature off in the OS. 

Huge page could have caused that.  Remember that we have transparent
huge pages.  I have to look at that code.

> Now there are only two possible reasons in my opinion. 
>
> First, there is some bugs in the KVM kernel dirty tracking mechanism. It may 
> mark some pages that do not
> receive write request as dirty. 

That is a posibilty.

> Second, there is some bugs in the OS running inside the VM. It may issue some 
> unnecessary write
> requests. 
>
> What do you think about this abnormal phenomenon? Any advice or possible 
> reasons or even guesses? I
> appreciate any responses, because it has confused me for a long time. Thank 
> you.

I would like to reproduce this.

Thanks for bringing this to our attention.

Later, Juan.



Re: [Qemu-devel] [PATCH v3] xen-disk: use an IOThread per instance

2017-11-15 Thread Paul Durrant
Anthony, Stefano,

  Ping?

> -Original Message-
> From: Paul Durrant [mailto:paul.durr...@citrix.com]
> Sent: 07 November 2017 10:47
> To: qemu-devel@nongnu.org; xen-de...@lists.xenproject.org
> Cc: Paul Durrant ; Stefano Stabellini
> ; Anthony Perard ;
> Kevin Wolf ; Max Reitz 
> Subject: [PATCH v3] xen-disk: use an IOThread per instance
> 
> This patch allocates an IOThread object for each xen_disk instance and
> sets the AIO context appropriately on connect. This allows processing
> of I/O to proceed in parallel.
> 
> The patch also adds tracepoints into xen_disk to make it possible to
> follow the state transtions of an instance in the log.
> 
> Signed-off-by: Paul Durrant 
> ---
> Cc: Stefano Stabellini 
> Cc: Anthony Perard 
> Cc: Kevin Wolf 
> Cc: Max Reitz 
> 
> v3:
>  - Use new iothread_create/destroy() functions
> 
> v2:
>  - explicitly acquire and release AIO context in qemu_aio_complete() and
>blk_bh()
> ---
>  hw/block/trace-events |  7 +++
>  hw/block/xen_disk.c   | 53
> ---
>  2 files changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index cb6767b3ee..962a3bfa24 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -10,3 +10,10 @@ virtio_blk_submit_multireq(void *vdev, void *mrb, int
> start, int num_reqs, uint6
>  # hw/block/hd-geometry.c
>  hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p
> LCHS %d %d %d"
>  hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs,
> int trans) "blk %p CHS %u %u %u trans %d"
> +
> +# hw/block/xen_disk.c
> +xen_disk_alloc(char *name) "%s"
> +xen_disk_init(char *name) "%s"
> +xen_disk_connect(char *name) "%s"
> +xen_disk_disconnect(char *name) "%s"
> +xen_disk_free(char *name) "%s"
> diff --git a/hw/block/xen_disk.c b/hw/block/xen_disk.c
> index e431bd89e8..f74fcd42d1 100644
> --- a/hw/block/xen_disk.c
> +++ b/hw/block/xen_disk.c
> @@ -27,10 +27,12 @@
>  #include "hw/xen/xen_backend.h"
>  #include "xen_blkif.h"
>  #include "sysemu/blockdev.h"
> +#include "sysemu/iothread.h"
>  #include "sysemu/block-backend.h"
>  #include "qapi/error.h"
>  #include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qstring.h"
> +#include "trace.h"
> 
>  /* - */
> 
> @@ -125,6 +127,9 @@ struct XenBlkDev {
>  DriveInfo   *dinfo;
>  BlockBackend*blk;
>  QEMUBH  *bh;
> +
> +IOThread*iothread;
> +AioContext  *ctx;
>  };
> 
>  /* - */
> @@ -596,9 +601,12 @@ static int ioreq_runio_qemu_aio(struct ioreq
> *ioreq);
>  static void qemu_aio_complete(void *opaque, int ret)
>  {
>  struct ioreq *ioreq = opaque;
> +struct XenBlkDev *blkdev = ioreq->blkdev;
> +
> +aio_context_acquire(blkdev->ctx);
> 
>  if (ret != 0) {
> -xen_pv_printf(&ioreq->blkdev->xendev, 0, "%s I/O error\n",
> +xen_pv_printf(&blkdev->xendev, 0, "%s I/O error\n",
>ioreq->req.operation == BLKIF_OP_READ ? "read" : 
> "write");
>  ioreq->aio_errors++;
>  }
> @@ -607,10 +615,10 @@ static void qemu_aio_complete(void *opaque, int
> ret)
>  if (ioreq->presync) {
>  ioreq->presync = 0;
>  ioreq_runio_qemu_aio(ioreq);
> -return;
> +goto done;
>  }
>  if (ioreq->aio_inflight > 0) {
> -return;
> +goto done;
>  }
> 
>  if (xen_feature_grant_copy) {
> @@ -647,16 +655,19 @@ static void qemu_aio_complete(void *opaque, int
> ret)
>  }
>  case BLKIF_OP_READ:
>  if (ioreq->status == BLKIF_RSP_OKAY) {
> -block_acct_done(blk_get_stats(ioreq->blkdev->blk), &ioreq->acct);
> +block_acct_done(blk_get_stats(blkdev->blk), &ioreq->acct);
>  } else {
> -block_acct_failed(blk_get_stats(ioreq->blkdev->blk), 
> &ioreq->acct);
> +block_acct_failed(blk_get_stats(blkdev->blk), &ioreq->acct);
>  }
>  break;
>  case BLKIF_OP_DISCARD:
>  default:
>  break;
>  }
> -qemu_bh_schedule(ioreq->blkdev->bh);
> +qemu_bh_schedule(blkdev->bh);
> +
> +done:
> +aio_context_release(blkdev->ctx);
>  }
> 
>  static bool blk_split_discard(struct ioreq *ioreq, blkif_sector_t
> sector_number,
> @@ -913,17 +924,29 @@ static void blk_handle_requests(struct XenBlkDev
> *blkdev)
>  static void blk_bh(void *opaque)
>  {
>  struct XenBlkDev *blkdev = opaque;
> +
> +aio_context_acquire(blkdev->ctx);
>  blk_handle_requests(blkdev);
> +aio_context_release(blkdev->ctx);
>  }
> 
>  static void blk_alloc(struct XenDevice *xendev)
>  {
>  struct XenBlkDev *blkdev = container_of(xendev, struct XenBlkDev,
> xendev);
> +Error *err = NULL;
> +
> +trace_xen_disk_alloc(xendev->name);
> 
>  QLIST_INIT(&blkdev->inflight);
>  QLIST_INIT(&blkdev->finish

Re: [Qemu-devel] [RFC v3 01/27] char-io: fix possible race on IOWatchPoll

2017-11-15 Thread Peter Xu
On Wed, Nov 15, 2017 at 09:37:40AM +, Stefan Hajnoczi wrote:
> On Tue, Nov 14, 2017 at 07:31:10PM +0800, Peter Xu wrote:
> > On Tue, Nov 14, 2017 at 10:32:19AM +, Stefan Hajnoczi wrote:
> > > On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> > > > On Mon, Nov 13, 2017 at 04:52:11PM +, Stefan Hajnoczi wrote:
> > > > > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > > > > This is not a problem if we are only having one single loop thread 
> > > > > > like
> > > > > > before.  However, after per-monitor thread is introduced, this is 
> > > > > > not
> > > > > > true any more, and the race can happen.
> > > > > > 
> > > > > > The race can be triggered with "make check -j8" sometimes:
> > > > > 
> > > > > Please mention a specific test case that fails.
> > > > 
> > > > It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> > > > that in next post.
> > > > 
> > > > > 
> > > > > > 
> > > > > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > > > > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > > > > 
> > > > > > This patch keeps the reference for the watch object when creating in
> > > > > > io_add_watch_poll(), so that the object will never be released in 
> > > > > > the
> > > > > > context main loop, especially when the context loop is running in
> > > > > > another standalone thread.  Meanwhile, when we want to remove the 
> > > > > > watch
> > > > > > object, we always first detach the watch object from its owner 
> > > > > > context,
> > > > > > then we continue with the cleanup.
> > > > > > 
> > > > > > Without this patch, calling io_remove_watch_poll() in main loop 
> > > > > > thread
> > > > > > is not thread-safe, since the other per-monitor thread may be 
> > > > > > modifying
> > > > > > the watch object at the same time.
> > > > > > 
> > > > > > Reviewed-by: Marc-André Lureau 
> > > > > > Signed-off-by: Peter Xu 
> > > > > > ---
> > > > > >  chardev/char-io.c | 16 ++--
> > > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > > > > index f81052481a..50b5bac704 100644
> > > > > > --- a/chardev/char-io.c
> > > > > > +++ b/chardev/char-io.c
> > > > > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > > > > >  g_free(name);
> > > > > >  
> > > > > >  g_source_attach(&iwp->parent, context);
> > > > > > -g_source_unref(&iwp->parent);
> > > > > >  return (GSource *)iwp;
> > > > > >  }
> > > > > >  
> > > > > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource 
> > > > > > *source)
> > > > > >  IOWatchPoll *iwp;
> > > > > >  
> > > > > >  iwp = io_watch_poll_from_source(source);
> > > > > > +
> > > > > > +/*
> > > > > > + * Here the order of destruction really matters.  We need to 
> > > > > > first
> > > > > > + * detach the IOWatchPoll object from the context (which may 
> > > > > > still
> > > > > > + * be running in another loop thread), only after that could we
> > > > > > + * continue to operate on iwp->src, or there may be race 
> > > > > > condition
> > > > > > + * between current thread and the context loop thread.
> > > > > > + *
> > > > > > + * Let's blame the glib bug mentioned in commit 2b316774f6
> > > > > > + * ("qemu-char: do not operate on sources from finalize
> > > > > > + * callbacks") for this extra complexity.
> > > > > 
> > > > > I don't understand how this bug is to blame.  Isn't the problem here a
> > > > > race condition between two QEMU threads?
> > > > 
> > > > Yes, it is.
> > > > 
> > > > The problem is, we won't have the race condition if glib does not have
> > > > that bug mentioned.  Then the thread running GMainContext will have
> > > > full control of iwp->src destruction, and destruction of it would be
> > > > fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> > > > Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> > > > explicitly from main thread before quitting (see [1] below, the whole
> > > > if clause).
> > > > 
> > > > > 
> > > > > Why are two threads accessing the watch at the same time?
> > > > 
> > > > Here is how I understand:
> > > > 
> > > > Firstly we need to tackle with that bug, by an explicit destruction of
> > > > iwp->src below; meanwhile when we are destroying it, the GMainContext
> > > > can still be running somewhere (it's not happening in current series
> > > > since I stopped iothread earlier than this point, however it can still
> > > > happen if in the future we don't do that), then we possibly want this
> > > > patch.
> > > > 
> > > > Again, without this patch, current series should work; however I do
> > > > hope this patch can be in, in case someday we want to provide complete
> > > > thread safety for Chardevs (now it is not really thread-safe).
> > > 
> > > You said qtests fail with "Assertion `iwp->src == NULL' failed" but then

Re: [Qemu-devel] [PATCH 0/7] s390x/pci: Improve zPCI to cover more cases

2017-11-15 Thread Pierre Morel

On 13/11/2017 18:13, Cornelia Huck wrote:

On Tue,  7 Nov 2017 18:24:32 +0100
Pierre Morel  wrote:


Right now the PCI support is very limited (e.g. pass through of a
host vfio device)
To enable features like virtio-pci several modifications needs to be
done.

Virtio-PCI uses subregions, which may eventually be discontinuous
inside bars instead of a single flat region.
The address offset being formerly calculated from the BAR base address
must be adapted to the subregions instead of to the single region.

This patch provides the new calculation for the three kind of BAR
access, zPCI STORE, zPCI LOAD and zPCI STORE BLOCK.

We use the opportunity to
  - enhance the fault detection for zPCI STORE and LOAD,
  - enhance the fault detection and to provide the maximum STORE BLOCK
block size, maxstbl, for zPCI STORE BLOCK
  - factor out part of the code used to calculate the offset and
access the BARs,
  - factor out the code for endianess conversion.


Pierre Morel (7):
   s390x/pci: factor out endianess conversion
   s390x/pci: rework PCI STORE
   s390x/pci: rework PCI LOAD
   s390x/pci: rework PCI STORE BLOCK
   s390x/pci: move the memory region read from pcilg
   s390x/pci: move the memory region write from pcistg
   s390x/pci: search for subregion inside the BARs

  hw/s390x/s390-pci-bus.h  |   1 +
  hw/s390x/s390-pci-inst.c | 250 ---
  hw/s390x/s390-pci-inst.h |   2 +-
  3 files changed, 153 insertions(+), 100 deletions(-)



I assume you'll send a v2?


Yes, I have a lot of mater to do this now.

Thanks for reviewing,

Pierre



I'll see if I can find some time to wire up pci in tcg (as this would
get us additional test coverage, especially regarding endianness), but
I won't complain should someone beat me to it.


Yes, would be fine to have it to increase test coverage.


--
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany




Re: [Qemu-devel] [PATCH 4/7] s390x/pci: rework PCI STORE BLOCK

2017-11-15 Thread Pierre Morel

On 13/11/2017 18:10, Cornelia Huck wrote:

On Mon, 13 Nov 2017 17:38:40 +0100
Pierre Morel  wrote:


On 13/11/2017 16:23, Cornelia Huck wrote:

On Tue,  7 Nov 2017 18:24:36 +0100
Pierre Morel  wrote:
   

Enhance the fault detection.

Add the maxstbl entry to both the Query PCI Function Group
response and the PCIBusDevice structure.

Initialize the maxstbl to 128 per default until we get
the actual data from the hardware.

Signed-off-by: Pierre Morel 
Reviewed-by: Yi Min Zhao 
---
   hw/s390x/s390-pci-bus.h  |  1 +
   hw/s390x/s390-pci-inst.c | 62 
+---
   hw/s390x/s390-pci-inst.h |  2 +-
   3 files changed, 40 insertions(+), 25 deletions(-)
  
   

@@ -662,22 +664,10 @@ int pcistb_service_call(S390CPU *cpu, uint8_t r1, uint8_t 
r3, uint64_t gaddr,
   fh = env->regs[r1] >> 32;
   pcias = (env->regs[r1] >> 16) & 0xf;
   len = env->regs[r1] & 0xff;
+offset = env->regs[r3];
   
-if (pcias > 5) {

-DPRINTF("pcistb invalid space\n");
-setcc(cpu, ZPCI_PCI_LS_ERR);
-s390_set_status_code(env, r1, ZPCI_PCI_ST_INVAL_AS);
-return 0;
-}
-
-switch (len) {
-case 16:
-case 32:
-case 64:
-case 128:
-break;
-default:
-program_interrupt(env, PGM_SPECIFICATION, 6);
+if (!(fh & FH_MASK_ENABLE)) {
+setcc(cpu, ZPCI_PCI_LS_INVAL_HANDLE);


So this means you move checking for the device before checking for the
parameters or the as.


Yes, this is clearly following the specifications that CC=3 has priority
over CC=1 or CC=2.


OK, it would then make sense to mention in the patch description that
you fixed up the precedence as well.


I will do
thanks





By the way I find that defining ZPCI_PCI_LS_INVAL_HANDLE is obfuscating,
we have the information from the test we just made but we loose the
information about if it is a 1, 2 or 3 CC value.
May be in another patch?


Let's keep this for now, we can revisit that later.


OK





   

   return 0;
   }
   
@@ -689,12 +679,7 @@ int pcistb_service_call(S390CPU *cpu, uint8_t r1, uint8_t r3, uint64_t gaddr,

   }
   
   switch (pbdev->state) {

-case ZPCI_FS_RESERVED:
-case ZPCI_FS_STANDBY:
-case ZPCI_FS_DISABLED:
   case ZPCI_FS_PERMANENT_ERROR:
-setcc(cpu, ZPCI_PCI_LS_INVAL_HANDLE);
-return 0;
   case ZPCI_FS_ERROR:
   setcc(cpu, ZPCI_PCI_LS_ERR);
   s390_set_status_code(env, r1, ZPCI_PCI_ST_BLOCKED);
@@ -703,8 +688,33 @@ int pcistb_service_call(S390CPU *cpu, uint8_t r1, uint8_t 
r3, uint64_t gaddr,
   break;
   }
   
+if (pcias > 5) {

+DPRINTF("pcistb invalid space\n");
+setcc(cpu, ZPCI_PCI_LS_ERR);
+s390_set_status_code(env, r1, ZPCI_PCI_ST_INVAL_AS);
+return 0;
+}
+
+/* Verify the address, offset and length */
+/* offset must be a multiple of 8 */
+if (offset % 8) {
+goto addressing_error;
+}
+/* Length must be greater than 8, a multiple of 8, lower than maxstbl */
+if ((len <= 8) || (len % 8) || (len > pbdev->maxstbl)) {
+goto addressing_error;
+}
+/* Do not cross a 4K-byte boundary */
+if (((offset & 0xfff) + len) > 0x1000) {
+goto addressing_error;
+}
+/* Guest address must be double word aligned */
+if (gaddr & 0x07UL) {
+goto addressing_error;
+}


So the checks here are only evaluated if the instruction actually pokes
at a valid region?


hum, I did not find the precedence of execution for PCI STORE BLOCK.

My logic is that you must find a destination before you start reading
the source, so I would say it is the right way to do.
But the experience as already shown that my logic may not always be
compatible with the internals of S390x :)

However, the documentation specifies that if an error condition is
detected that precludes the *initiation* of the store operation a CC=1
is set.
The fact that the *initiation* is focused on enforce the idea I have on
the precedence between the low level operations.


OK, this interpretation makes sense. It might be a good idea to poke the
architecture authors if it is ambiguous, though :)


Yes.





   

+
   mr = pbdev->pdev->io_regions[pcias].memory;
-if (!memory_region_access_valid(mr, env->regs[r3], len, true)) {
+if (!memory_region_access_valid(mr, offset, len, true)) {
   program_interrupt(env, PGM_OPERAND, 6);
   return 0;
   }
@@ -714,9 +724,9 @@ int pcistb_service_call(S390CPU *cpu, uint8_t r1, uint8_t 
r3, uint64_t gaddr,
   }
   
   for (i = 0; i < len / 8; i++) {

-result = memory_region_dispatch_write(mr, env->regs[r3] + i * 8,
- ldq_p(buffer + i * 8), 8,
- MEMTXATTRS_UNSPECIFIED);
+result = memory_region_dispatch_write(mr, offset + i * 8,
+  ldq_p(buffer + i * 8), 8,
+  

Re: [Qemu-devel] Abnormal observation during migration: too many "write-not-dirty" pages

2017-11-15 Thread Dr. David Alan Gilbert
* Chunguang Li (lichungu...@hust.edu.cn) wrote:
> Hi all!
> 
> I got a very abnormal observation for the VM migration. I found that many 
> pages marked as dirty during migration are "not really dirty", which is, 
> their content are the same as the old version.
> 
> 
> 
> 
> I did the migration experiment like this:
> 
> During the setup phase of migration, first I suspended the VM. Then I copied 
> all the pages within the guest physical address space to a memory buffer as 
> large as the guest memory size. After that, the dirty tracking began and I 
> resumed the VM. Besides, at the end
> of each iteration, I also suspended the VM temporarily. During the 
> suspension, I compared the content of all the pages marked as dirty in this 
> iteration byte-by-byte with their former copies inside the buffer. If the 
> content of one page was the same as its former copy, I recorded it as a 
> "write-not-dirty" page (the page is written exactly with the same content as 
> the old version). Otherwise, I replaced this page in the buffer with the new 
> content, for the possible comparison in the future. After the reset of the 
> dirty bitmap, I resumed the VM. Thus, I obtain the proportion of the 
> write-not-dirty pages within all the pages marked as dirty for each pre-copy 
> iteration.
> 
> I repeated this experiment with 15 workloads, which are 11 CPU2006 
> benchmarks, Memcached server, kernel compilation, playing a video, and an 
> idle VM. The CPU2006 benchmarks and Memcached are write-intensive workloads. 
> So almost all of them did not converge to stop-copy.
> 
> 
> 
> 
> Startlingly, the proportions of the write-not-dirty pages are quite high. 
> Memcached and three CPU2006 benchmarks(zeusmp, mcf and bzip2) have the most 
> high proportions. Their proportions of the write-not-dirty pages within all 
> the dirty pages are as high as 45%-80%. The proportions of the other 
> workloads are about 5%-20%, which are also abnormal. According to my 
> intuition, the proportion of write-not-dirty pages should be far less than 
> these numbers. I think it should be quite a particular case that one page is 
> written with exactly the same content as the former data.
> 
> Besides, the zero pages are not counted for all the results. Because I think 
> codes like memset() may write large area of pages to zero pages, which are 
> already zero pages before.
> 
> 
> 
> 
> I excluded some possible unknown reasons with the machine hardware, because I 
> repeated the experiments with two sets of different machines. Then I guessed 
> it might be related with the huge page feature. However, the result was the 
> same when I turned the huge page feature off in the OS.
> 
> 
> 
> 
> Now there are only two possible reasons in my opinion. 
> 
> First, there is some bugs in the KVM kernel dirty tracking mechanism. It may 
> mark some pages that do not receive write request as dirty.
> 
> Second, there is some bugs in the OS running inside the VM. It may issue some 
> unnecessary write requests.
> 
> 
> What do you think about this abnormal phenomenon? Any advice or possible 
> reasons or even guesses? I appreciate any responses, because it has confused 
> me for a long time. Thank you.

Wasn't it you who pointed out last year the other possibility? - The
problem of false positives due to sync'ing the whole of memory and then
writing the data out, but some of the dirty pages were already written?

Dave

> 
> --
> Chunguang Li, Ph.D. Candidate
> Wuhan National Laboratory for Optoelectronics (WNLO)
> Huazhong University of Science & Technology (HUST)
> Wuhan, Hubei Prov., China
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC v3 27/27] tests: qmp-test: add oob test

2017-11-15 Thread Stefan Hajnoczi
On Mon, Nov 06, 2017 at 05:46:43PM +0800, Peter Xu wrote:
> +/*
> + * Try a time-consuming command, following by a OOB command, make
> + * sure we get OOB command before the time-consuming one (which is
> + * run in the parser).
> + *
> + * When writting up this test script, the only command that
> + * support OOB is migrate-incoming.  It's not the best command to
> + * test OOB but we don't really have a choice here.  We will check
> + * arriving order but not command errors, which does not really
> + * matter to us.
> + */
> +qmp_async("{ 'execute': 'dump-guest-memory',"
> +  "  'arguments': { 'paging': true, "
> +  " 'protocol': 'file:/dev/null' }, "
> +  "  'id': 'time-consuming-cmd'}");
> +qmp_async("{ 'execute': 'migrate-incoming', "
> +  "  'control': { 'run-oob': true }, "
> +  "  'id': 'oob-cmd' }");
> +
> +/* Ignore all events.  Wait for 2 acks */
> +while (acks < 2) {
> +resp = qmp_receive();
> +if (qdict_haskey(resp, "event")) {
> +/* Skip possible events */
> +continue;
> +}
> +cmd_id = qdict_get_str(resp, "id");
> +if (acks == 0) {
> +/* Need to receive OOB response first */
> +g_assert_cmpstr(cmd_id, ==, "oob-cmd");
> +} else if (acks == 1) {
> +g_assert_cmpstr(cmd_id, ==, "time-consuming-cmd");
> +}
> +acks++;
> +}

This test is non-deterministic.  The dump-guest-memory command could
complete first on a fast machine.

On a slow machine this test might take a long time...

Please introduce a test command that is deterministic.  For example,
when 'x-oob-test' is invoked without 'run-oob': true it waits until
invoked again, this time with 'run-oob': true.

We have similar interfaces in the block layer for controlling the order
in which parallel I/O requests are processed.  This allows test cases to
deterministically take specific code paths.


signature.asc
Description: PGP signature


[Qemu-devel] [PATCH] sga: stop using sgabios rom for 2.11+

2017-11-15 Thread Gerd Hoffmann
seabios 1.11 got builtin serial console support, so we don't need
sgabios any more.  "-machine graphics=off" should be used instead.
update sga.c accordingly.

Signed-off-by: Gerd Hoffmann 
---
 include/hw/compat.h |  4 
 hw/misc/sga.c   | 26 +-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/hw/compat.h b/include/hw/compat.h
index f96212c49c..1e3ad895db 100644
--- a/include/hw/compat.h
+++ b/include/hw/compat.h
@@ -14,6 +14,10 @@
 .driver   = "i82559a",\
 .property = "x-use-alt-device-id",\
 .value= "false",\
+},{\
+.driver   = "sga",\
+.property = "use-sgabios",\
+.value= "true",\
 },
 
 #define HW_COMPAT_2_9 \
diff --git a/hw/misc/sga.c b/hw/misc/sga.c
index 03b006d6f0..bcb4c98554 100644
--- a/hw/misc/sga.c
+++ b/hw/misc/sga.c
@@ -25,6 +25,7 @@
  *
  */
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "hw/pci/pci.h"
 #include "hw/i386/pc.h"
 #include "hw/loader.h"
@@ -37,18 +38,41 @@
 
 typedef struct ISASGAState {
 ISADevice parent_obj;
+bool use_sgabios;
 } ISASGAState;
 
 static void sga_realizefn(DeviceState *dev, Error **errp)
 {
-rom_add_vga(SGABIOS_FILENAME);
+MachineState *machine = MACHINE(qdev_get_machine());
+ISASGAState *sga = SGA(dev);
+
+if (sga->use_sgabios) {
+rom_add_vga(SGABIOS_FILENAME);
+} else if (machine->enable_graphics) {
+warn_report("sgabios is deprecated for your machine type, "
+"please use -machine graphics=off instead of "
+"-device sga");
+rom_add_vga(SGABIOS_FILENAME);
+} else {
+/*
+ * For this machine type we expect SeaBIOS to provide a
+ * serial console for -machine graphics=off.  No need to
+ * do anything here.
+ */
+}
 }
 
+static Property sga_properties[] = {
+DEFINE_PROP_BOOL("use-sgabios", ISASGAState, use_sgabios, false),
+DEFINE_PROP_END_OF_LIST(),
+};
+
 static void sga_class_initfn(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 
 set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+dc->props = sga_properties;
 dc->realize = sga_realizefn;
 dc->desc = "Serial Graphics Adapter";
 }
-- 
2.9.3




Re: [Qemu-devel] [PATCH v6] NUMA: Enable adding NUMA node implicitly

2017-11-15 Thread Igor Mammedov
On Wed, 15 Nov 2017 09:29:22 +0800
Dou Liyang  wrote:

> Hi Igor,
> 
> [...]
> >> +parse_numa_node(ms, &node, NULL);  
> > I get build break here:
> >
> > numa.c:451:13: error: too few arguments to function ‘parse_numa_node’
> >  parse_numa_node(ms, &node, NULL);
> >  
> 
> In upstream tree, your commit
> 
>cc001888b780 ("numa: fixup parsed NumaNodeOptions earlier")
> 
> removed a argument from parse_numa_node() recently. this definition
> of function becomes
> 
> static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>  Error **errp)
> 
> this patch is based on the upstream tree, parse_numa_node() should have
> three arguments.
> 
> I am not sure why you got this building failure log, can you tell me
> which branch did you test?
it looks like I've failed to update master branch and applied it to old master,
so after refetching and applying it again build and tests passes.

(Sorry for  noise :/)

> 
> Thanks,
>   dou
> 
> 
> 




Re: [Qemu-devel] [PATCH v6] NUMA: Enable adding NUMA node implicitly

2017-11-15 Thread Igor Mammedov
On Tue, 14 Nov 2017 10:34:01 +0800
Dou Liyang  wrote:

> Linux and Windows need ACPI SRAT table to make memory hotplug work properly,
> however currently QEMU doesn't create SRAT table if numa options aren't 
> present
> on CLI.
> 
> Which breaks both linux and windows guests in certain conditions:
>  * Windows: won't enable memory hotplug without SRAT table at all
>  * Linux: if QEMU is started with initial memory all below 4Gb and no SRAT 
> table
>present, guest kernel will use nommu DMA ops, which breaks 32bit hw drivers
>when memory is hotplugged and guest tries to use it with that drivers.
> 
> Fix above issues by automatically creating a numa node when QEMU is started 
> with
> memory hotplug enabled but without '-numa' options on CLI.
> (PS: auto-create numa node only for new machine types so not to break 
> migration).
> 
> Which would provide SRAT table to guests without explicit -numa options on CLI
> and would allow:
>  * Windows: to enable memory hotplug
>  * Linux: switch to SWIOTLB DMA ops, to bounce DMA transfers to 32bit 
> allocated
>buffers that legacy drivers/hw can handle.
> 
> [Rewritten by Igor]
> 
> Reported-by: Thadeu Lima de Souza Cascardo 
> Suggested-by: Igor Mammedov 
> Signed-off-by: Dou Liyang 
> Cc: Paolo Bonzini 
> Cc: Richard Henderson 
> Cc: Eduardo Habkost 
> Cc: "Michael S. Tsirkin" 
> Cc: Marcel Apfelbaum 
> Cc: Igor Mammedov 
> Cc: David Hildenbrand 
> Cc: Thomas Huth 
> Cc: Alistair Francis 
> Cc: Takao Indoh 
> Cc: Izumi Taku 
> ---
> changelog V5 --> V6:
>   - rebase it to avoid building failure
>   - test again
Reviewed-by: Igor Mammedov 


> ---
>  hw/i386/pc.c|  1 +
>  hw/i386/pc_piix.c   |  1 +
>  hw/i386/pc_q35.c|  1 +
>  include/hw/boards.h |  1 +
>  numa.c  | 21 -
>  vl.c|  3 +--
>  6 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e11a65b..156501c 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -2325,6 +2325,7 @@ static void pc_machine_class_init(ObjectClass *oc, void 
> *data)
>  mc->cpu_index_to_instance_props = pc_cpu_index_to_props;
>  mc->get_default_cpu_node_id = pc_get_default_cpu_node_id;
>  mc->possible_cpu_arch_ids = pc_possible_cpu_arch_ids;
> +mc->auto_enable_numa_with_memhp = true;
>  mc->has_hotpluggable_cpus = true;
>  mc->default_boot_order = "cad";
>  mc->hot_add_cpu = pc_hot_add_cpu;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index f79d5cb..5e47528 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -446,6 +446,7 @@ static void pc_i440fx_2_10_machine_options(MachineClass 
> *m)
>  m->is_default = 0;
>  m->alias = NULL;
>  SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
> +m->auto_enable_numa_with_memhp = false;
>  }
>  
>  DEFINE_I440FX_MACHINE(v2_10, "pc-i440fx-2.10", NULL,
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index da3ea60..d606004 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -318,6 +318,7 @@ static void pc_q35_2_10_machine_options(MachineClass *m)
>  m->alias = NULL;
>  SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
>  m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
> +m->auto_enable_numa_with_memhp = false;
>  }
>  
>  DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 191a5b3..f1077f1 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -192,6 +192,7 @@ struct MachineClass {
>  bool ignore_memory_transaction_failures;
>  int numa_mem_align_shift;
>  const char **valid_cpu_types;
> +bool auto_enable_numa_with_memhp;
>  void (*numa_auto_assign_ram)(MachineClass *mc, NodeInfo *nodes,
>   int nb_nodes, ram_addr_t size);
>  
> diff --git a/numa.c b/numa.c
> index 8d78d95..7151b24 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -216,6 +216,7 @@ static void parse_numa_node(MachineState *ms, 
> NumaNodeOptions *node,
>  }
>  numa_info[nodenr].present = true;
>  max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> +nb_numa_nodes++;
>  }
>  
>  static void parse_numa_distance(NumaDistOptions *dist, Error **errp)
> @@ -282,7 +283,6 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error 
> **errp)
>  if (err) {
>  goto end;
>  }
> -nb_numa_nodes++;
>  break;
>  case NUMA_OPTIONS_TYPE_DIST:
>  parse_numa_distance(&object->u.dist, &err);
> @@ -433,6 +433,25 @@ void parse_numa_opts(MachineState *ms)
>  exit(1);
>  }
>  
> +/*
> + * If memory hotplug is enabled (slots > 0) but without '-numa'
> + * options explicitly on CLI, guestes will break.
> + *
> + *   Windows: won't enable memory hotplug without SRAT table at all
> + *
> + *   Linux: if QEMU is started with initial memory all below 4Gb
> + *   and no SRAT table present, guest kernel will use nommu DMA ops,
> + *   

Re: [Qemu-devel] [RFC v3 18/27] qmp: add new event "request-dropped"

2017-11-15 Thread Stefan Hajnoczi
On Mon, Nov 06, 2017 at 05:46:34PM +0800, Peter Xu wrote:
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 531fd4c0db..650714da06 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -3222,3 +3222,38 @@
>  # Since: 2.11
>  ##
>  { 'command': 'watchdog-set-action', 'data' : {'action': 'WatchdogAction'} }
> +
> +##
> +# @RequestDropReason:
> +#
> +# Reasons that caused one request to be dropped.

Please use "command" consistently.  QMP does not call it not "request".

> +#
> +# @queue-full: the queue of request is full.
> +#
> +# Since: 2.12
> +##
> +{ 'enum': 'RequestDropReason',
> +  'data': ['queue-full' ] }
> +
> +##
> +# @REQUEST_DROPPED:
> +#
> +# Emitted when one QMP request is dropped due to some reason.

Please add:

  REQUEST_DROPPED is only emitted when the oob capability is enabled.

Rationale: old clients don't know about this event so they cannot be
expected to handle it!

> +#
> +# @id:If the original request contains an string-typed "id" field,
> +# it'll be put into this field.  Otherwise it'll be an empty
> +# string.

Please change:

  @id: The dropped command's string-typed "id" field.

Sending commands without the id field is likely to cause confusion since
there are cases where the client is unable to determine which command
was meant.  Since client code needs to be updated to enable the oob
capability anyway, we might as well require that clients always include
the id field with every command when the oob capability is enabled.
Please mention this requirement where the oob capability is documented.


signature.asc
Description: PGP signature


Re: [Qemu-devel] [Question] Qemu's Heap Becomes Very Large and Never Reduce Down

2017-11-15 Thread Stefan Hajnoczi
On Wed, Nov 15, 2017 at 03:14:52AM +, Xulei (Stone) wrote:
> Hi, guys
> 
> I met a strange problem, with qemu 2.8.1:
> qemu consumes too many heap memory after several operations and can not 
> release them anymore:
> hot pulg/unplug disk & net, vnc connect/disconnect, guestOS reboot, etc.
> 
> 
> 01a7a000-3b4efe000 rw-p  00:00 0 
> [heap]
> 
> Size:   15520272 kB
> 
> Rss:14421836 kB
> 
> Pss:14421836 kB
> 
> Shared_Clean:  0 kB
> 
> Shared_Dirty:  0 kB
> 
> Private_Clean:  1164 kB
> 
> Private_Dirty:  14420672 kB
> 
> Referenced:  7485624 kB
> 
> Anonymous:  14421836 kB
> 
> AnonHugePages: 34816 kB
> 
> Swap:1098140 kB
> 
> KernelPageSize:4 kB
> 
> MMUPageSize:   4 kB
> 
> Locked:0 kB
> 
> VmFlags: rd wr mr mw me ac sd
> 
> My steps are:
> 1) start several VMs all equipped only 8G memory;
> 2) random combining those operations mentioned above;
> 3) after few hours, qemu's Virt memory and RSS both grow too large and never 
> fall down;
> 
> After analysis via /proc/$pid/smaps, I found the VMA of pc.ram does not 
> occupy much
> memory but only becauses of heap section.
> 
> I guess that has some relations of glibc or qemu rcu_thread, but i can not 
> figure it out.
> Is there some patches can fix this problem or does somebody have any idea?

Please try qemu.git/master.

The malloc implementation (glibc, tcmalloc, jemalloc) probably makes a
difference since you are seeing heap growth.

The main question your report raises is that "random combining those
operations mentioned above" makes it hard to identify the operation that
leads to heap growth.  Can you run isolated tests that do only
hotplug/unplug *or* VNC connect/disconnect *or* guest OS reboot, not
everything together?

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [Qemu-block] [PATCH for-2.11] qcow2: Fix overly broad madvise()

2017-11-15 Thread Darren Kenny

FWIW,

Reviewed-by: Darren Kenny 

Thanks,

Darren.

On Tue, Nov 14, 2017 at 07:41:27PM +0100, Max Reitz wrote:

@mem_size and @offset are both size_t, thus subtracting them from one
another will just return a big size_t if mem_size < offset -- even more
obvious here because the result is stored in another size_t.

Checking that result to be positive is therefore not sufficient to
excluse the case that offset > mem_size.  Thus, we currently sometimes
issue an madvise() over a very large address range.

This is triggered by iotest 163, but with -m64, this does not result in
tangible problems.  But with -m32, this test produces three segfaults,
all of which are fixed by this patch.

Signed-off-by: Max Reitz 
---
block/qcow2-cache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 75746a7f43..5222a7b94d 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -73,7 +73,7 @@ static void qcow2_cache_table_release(BlockDriverState *bs, 
Qcow2Cache *c,
size_t mem_size = (size_t) s->cluster_size * num_tables;
size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
-if (length > 0) {
+if (mem_size > offset && length > 0) {
madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
}
#endif
--
2.13.6






Re: [Qemu-devel] [Qemu-block] [PATCH for-2.11] qcow2: Fix overly broad madvise()

2017-11-15 Thread Darren Kenny

Should have said that this is subject to the typo that Eric pointed
out, of course.

Thanks,

Darren.

On Wed, Nov 15, 2017 at 11:04:19AM +, Darren Kenny wrote:

FWIW,

Reviewed-by: Darren Kenny 

Thanks,

Darren.

On Tue, Nov 14, 2017 at 07:41:27PM +0100, Max Reitz wrote:

@mem_size and @offset are both size_t, thus subtracting them from one
another will just return a big size_t if mem_size < offset -- even more
obvious here because the result is stored in another size_t.

Checking that result to be positive is therefore not sufficient to
excluse the case that offset > mem_size.  Thus, we currently sometimes
issue an madvise() over a very large address range.

This is triggered by iotest 163, but with -m64, this does not result in
tangible problems.  But with -m32, this test produces three segfaults,
all of which are fixed by this patch.

Signed-off-by: Max Reitz 
---
block/qcow2-cache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 75746a7f43..5222a7b94d 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -73,7 +73,7 @@ static void qcow2_cache_table_release(BlockDriverState *bs, 
Qcow2Cache *c,
   size_t mem_size = (size_t) s->cluster_size * num_tables;
   size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
   size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
-if (length > 0) {
+if (mem_size > offset && length > 0) {
   madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
   }
#endif
--
2.13.6






Re: [Qemu-devel] [PATCH v2 2/2] Add new PCI ID for i82559a

2017-11-15 Thread Jason Wang



On 2017年11月15日 05:41, Stefan Weil wrote:

Am 06.11.2017 um 21:35 schrieb Mike Nawrocki:

Adds a new PCI ID for the i82559a (0x8086 0x1030) interface. Enables
this ID with a new property "use-alt-device-id" to preserve
compatibility.

Signed-off-by: Mike Nawrocki 
---
  hw/net/eepro100.c| 12 
  include/hw/pci/pci.h |  1 +
  qemu-options.hx  |  2 +-
  3 files changed, 14 insertions(+), 1 deletion(-)


Sorry that I missed this patch.
I think I should have an entry for eepro100 in MAINTAINERS.


Yes, please send a patch for this.

Thanks



Mike, which hardware uses i82559a with PCI device id 0x1030?

https://www.intel.com/content/www/us/en/support/articles/05612/network-and-i-o/ethernet-products.html
only lists devices with
0x1229.

Thanks,
Stefan





Re: [Qemu-devel] [PULL 7/8] Add new PCI ID for i82559a

2017-11-15 Thread Jason Wang



On 2017年11月15日 14:43, Stefan Weil wrote:

Hi,

I currently think that this patch is wrong and should be reverted.

It fixes a certain use case by hacking the PCI device id, but does
not model the way how that device id is set on the real hardware
correctly.

As far as I know, all i82559 have a default PCI device id of 0x1229.
It can be changed by the EEPROM configuration, but not all network
cards do have an EEPROM.

See for example this URL for more information:
http://zoo.cs.yale.edu/classes/cs422/2010/ref/82559_eeprom.pdf

The correct solution is modeling the EEPROM and allowing QEMU
users to provide an EEPROM‌ file.


Yes and unless there's new version of sepc that has different ID, I tend 
to revert this.


Thanks



Cheers
Stefan






Re: [Qemu-devel] [PATCH V5] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole

2017-11-15 Thread Marcel Apfelbaum

On 15/11/2017 4:16, Michael S. Tsirkin wrote:

On Mon, Nov 13, 2017 at 03:07:45PM +0200, Marcel Apfelbaum wrote:

On 11/11/2017 17:25, Marcel Apfelbaum wrote:

Currently there is no MMIO range over 4G
reserved for PCI hotplug. Since the 32bit PCI hole
depends on the number of cold-plugged PCI devices
and other factors, it is very possible is too small
to hotplug PCI devices with large BARs.

Fix it by reserving 2G for I4400FX chipset
in order to comply with older Win32 Guest OSes
and 32G for Q35 chipset.

Even if the new defaults of pci-hole64-size will appear in
"info qtree" also for older machines, the property was
not implemented so no changes will be visible to guests.

Note this is a regression since prev QEMU versions had
some range reserved for 64bit PCI hotplug.

Reviewed-by: Laszlo Ersek 
Reviewed-by: Gerd Hoffmann 
Signed-off-by: Marcel Apfelbaum 
---



Hi Michael,

Can you please merge the patch for QEMU 2.11
if you have no further comments?
I think is an important fix and it worth
having it in 2.11 .

Thanks,
Marcel




Hi Michael,


Just a note: this changes the DSDT to use
QWordMemory in the _CRS.

I expected it to cause trouble for old windows such as winXP
but it does not seem to cause them - at least not boot crashes.

Worth checking that 32 bit hotplug still works though.
Could you pls confirm?



I installed a fresh WinXpSp3 32bit image and I can confirm
there is no issue with both the installation and
PCI hotplug/hot-unplug of a E1000 NIC.

Thanks,
Marcel


V4 -> V5:
   - Renamed a local variable (Laszlo)
   - Added a comment to q35 props (Eduardo)

V3 -> V4:
   - Addressed Laszlo's comments:
  - Added defines for pci-hole64 default size props.
  - Rounded the hole64_end to 1G
  - Moved some info to commit message
   - Addressed Michael's comments:
  - Added more comments.
   - I kept Gerd's "review-by" tag since no functional changes were made.

V2 -> V3:
   - Addressed Gerd's and others comments and re-enabled the pci-hole64-size
 property defaulting it to 2G for I440FX and 32g for Q35.
   - Even if the new defaults of pci-hole64-size will appear in "info qtree"
 also for older machines, the property was not implemented so
 no changes will be visible to guests.

V1 -> V2:
   Addressed Igor's comments:
  - aligned the hole64 start to 1Gb
   (I think all the computations took care of it already,
but it can't hurt)
  - Init compat props to "off" instead of "false"

   hw/i386/pc.c  | 22 ++
   hw/pci-host/piix.c| 32 ++--
   hw/pci-host/q35.c | 42 +++---
   include/hw/i386/pc.h  | 10 +-
   include/hw/pci-host/q35.h |  1 +
   5 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e11a65b545..fafe5ba5cd 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1448,6 +1448,28 @@ void pc_memory_init(PCMachineState *pcms,
   pcms->ioapic_as = &address_space_memory;
   }
+/*
+ * The 64bit pci hole starts after "above 4G RAM" and
+ * potentially the space reserved for memory hotplug.
+ */
+uint64_t pc_pci_hole64_start(void)
+{
+PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
+PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+uint64_t hole64_start = 0;
+
+if (pcmc->has_reserved_memory && pcms->hotplug_memory.base) {
+hole64_start = pcms->hotplug_memory.base;
+if (!pcmc->broken_reserved_end) {
+hole64_start += memory_region_size(&pcms->hotplug_memory.mr);
+}
+} else {
+hole64_start = 0x1ULL + pcms->above_4g_mem_size;
+}
+
+return ROUND_UP(hole64_start, 1ULL << 30);
+}
+
   qemu_irq pc_allocate_cpu_irq(void)
   {
   return qemu_allocate_irq(pic_irq_request, NULL, 0);
diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index a7e2256870..a684a7cca9 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -50,6 +50,7 @@ typedef struct I440FXState {
   PCIHostState parent_obj;
   Range pci_hole;
   uint64_t pci_hole64_size;
+bool pci_hole64_fix;
   uint32_t short_root_bus;
   } I440FXState;
@@ -112,6 +113,9 @@ struct PCII440FXState {
   #define I440FX_PAM_SIZE 7
   #define I440FX_SMRAM0x72
+/* Keep it 2G to comply with older win32 guests */
+#define I440FX_PCI_HOST_HOLE64_SIZE_DEFAULT (1ULL << 31)
+
   /* Older coreboot versions (4.0 and older) read a config register that 
doesn't
* exist in real hardware, to get the RAM size from QEMU.
*/
@@ -238,29 +242,52 @@ static void i440fx_pcihost_get_pci_hole_end(Object *obj, 
Visitor *v,
   visit_type_uint32(v, name, &value, errp);
   }
+/*
+ * The 64bit PCI hole start is set by the Guest firmware
+ * as the address of the first 64bit PCI MEM resource.
+ * If no PCI device has resources on the 64bit area,
+ * the 64bit PCI hole will start after "over 4G RAM" and the
+ * reserved space for memory hotplug if any.
+ */
   s

[Qemu-devel] Effect of qemu-img convert -m and -W options

2017-11-15 Thread Richard W.M. Jones
[CC to qemu-devel since I'm obviously doing something wrong here,
I'm just not sure what.]

I was getting ready to add multiple threads to ‘qemu-img convert’ (the
longest part of v2v conversions) when I noticed that it had them
already!  (To be fair this was only added in February this year so no
wonder we didn't notice.)

To enable parallel convert we would need to use the ‘qemu-img convert
-W’ option (which allows out-of-order writes to the destination) and
‘qemu-img convert -m <#num-coroutines>’ option to select the
parallelism (defaults to 8).  The documentation refers to coroutines
but I verified from strace that it is using real threads.

I did some testing to see what effect this has.  For this I used a
large guest image which is approximately a third full of random data
(the rest being sparsely allocated):

  Source format:   raw
  Source virtual size: 100 GB
  Source used space:   31 GB  
  Target format:   qcow2
  Version: qemu-img-2.10.0-7.fc28.x86_64
  Conversion command:
rm -f /to/target
time qemu-img convert [-W] [-m ##] -f raw source.img -O qcow2 /to/target

Source and target are regular files on two different disks.  The test
machine is a Xeon E5 with 16 real cores.

--
  Non-preallocated output
  (times are in seconds)
   without -W   -W

-m 1153  -

-m 4155 157

-m 8 [default]  158 231

-m 16 [max] 166 166
--

The documentation for ‘-W’ notes that this is only recommended for
preallocated outputs (which the test above does not use).  So let's
try using a preallocated qcow2 output.

  Conversion command:
# the same target file is reused each time
time qemu-img convert -n [-W] [-m ##] -f raw source.img -O qcow2 /to/target

--
   Preallocated output
  (times are in seconds)
   without -W   -W

-m 1147  -

-m 4146 145

-m 8 [default]  146 199

-m 16 [max] 147 146
--

Based on this there seems to be some issue with the ‘-W’ option -- I
even thought I might have it backwards, but checking the code it does
seem like ‘-W’ enables (rather than disables) out of order writes.
Also some bizarre interaction between ‘-W’ and ‘-m 8’.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-devel] [PATCH v3] hw/arm/aspeed: Unlock SCU when running kernel

2017-11-15 Thread Peter Maydell
On 15 November 2017 at 08:18, Cédric Le Goater  wrote:
> On 11/14/2017 01:20 PM, Joel Stanley wrote:
>> The ASPEED hardware contains a lock register for the SCU that disables
>> any writes to the SCU when it is locked. The machine comes up with the
>> lock enabled, but on all known hardware u-boot will unlock it and leave
>> it unlocked when loading the kernel.
>>
>> This means the kernel expects the SCU to be unlocked. When booting from
>> an emulated ROM the normal u-boot unlock path is executed. Things don't
>> go well when booting using the -kernel command line, as u-boot does not
>> run first.
>>
>> Change behaviour so that when a kernel is passed to the machine, set the
>> reset value of the SCU to be unlocked.
>>
>> Signed-off-by: Joel Stanley 
>
> Reviewed-by: Cédric Le Goater 

Thanks. Should this go into 2.11 ? It sounds like the kind of
bugfix that ought to...

-- PMM



Re: [Qemu-devel] [PATCH for-2.11] target/arm: Report GICv3 sysregs present in ID registers if needed

2017-11-15 Thread Peter Maydell
On 15 November 2017 at 01:14, Alistair Francis  wrote:
> On Tue, Nov 7, 2017 at 7:01 AM, Peter Maydell  
> wrote:
>> The CPU ID registers ID_AA64PFR0_EL1, ID_PFR1_EL1 and ID_PFR1
>> have a field for reporting presence of GICv3 system registers.
>> We need to report this field correctly in order for Xen to
>> work as a guest inside QEMU emulation. We mustn't incorrectly
>> claim the sysregs exist when they don't, though, or Linux will
>> crash.
>>
>> Unfortunately the way we've designed the GICv3 emulation in QEMU
>> puts the system registers as part of the GICv3 device, which
>> may be created after the CPU proper has been realized. This
>> means that we don't know at the point when we define the ID
>> registers what the correct value is. Handle this by switching
>> them to calling a function at runtime to read the value, where
>> we can fill in the GIC field appropriately.
>>
>> Signed-off-by: Peter Maydell 
>
> Is this going to make it into 2.11?

Yes, it's supposed to go into 2.11 -- I think I just forgot
to put it in the rc1 pullreq, but it'll go in rc2.

thanks
-- PMM



Re: [Qemu-devel] [PATCH v3 for-2.11 1/3] tpm_emulator: Add a caching layer for the TPM Established flag

2017-11-15 Thread Stefan Berger

On 11/14/2017 10:47 PM, Marc-André Lureau wrote:

Hi

On Wed, Nov 15, 2017 at 2:16 AM, Stefan Berger
 wrote:

On 11/14/2017 06:40 PM, Marc-André Lureau wrote:

Hi

On Tue, Nov 14, 2017 at 10:52 PM, Stefan Berger
 wrote:

Add a caching layer for the TPM established flag so that we don't
need to go to the emulator every time the flag is read by accessing
the REG_ACCESS register.

What's the impact? Isn't this just a "small" optimization? Iotw, why
is this for-2.11?


The TIS has a register that contains this flag and that's being polled quite
frequently. So it generates a lot of traffic to the emulator. This caching
layer gets rid of most of the traffic.

I didn't notice any problem when doing my tests, I guess Amarnath
niether. perhaps it's best to delay for after 2.11.


So then let's delay it. Following a Reviewed-by I'd put it in the 
tpm-next queue then.


   Stefan




[Qemu-devel] building user interfaces as modules

2017-11-15 Thread Gerd Hoffmann
  Hi,

I'm trying to revamp the display initialization code:

  * First use a qapi type for the display options, so we can use
-blockdev style command line parsing for -display (not there yet,
but closer).
  * Second initialize all user interfaces the same way and build a
registry for UIs.  Drops lots of #ifdefs from vl.c
  * Finally this should make it *alot* easier to build user interfaces
as modules.

For SDL we already have sdl.mo defined in ui/Makefile.objs. so I tried
that first.  Just flipping CONFIG_SDL from 'y' to 'm' doesn't work
though.  sdl support isn't build, neither with nor without
--enable-modules.

Any clues?

thanks,
  Gerd

git branch:
  https://www.kraxel.org/cgit/qemu/log/?h=sirius/display-cmdline




[Qemu-devel] [PULL 3/4] tpm_tis: Return TPM_VERSION_UNSPEC in case of BE failure

2017-11-15 Thread Stefan Berger
In case the backend has a failure, such as the tpm_emulator's CMD_INIT
failing, the TIS goes into failure mode and does not respond to reads
or writes to MMIO registers. In this case we need to prevent the ACPI
table from being added and the straight-forward way is to indicate that
there's no known TPM version being used.

Signed-off-by: Stefan Berger 
Reviewed-by: Marc-André Lureau 
---
 hw/tpm/tpm_tis.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/tpm/tpm_tis.c b/hw/tpm/tpm_tis.c
index 7402528..fec2fc6 100644
--- a/hw/tpm/tpm_tis.c
+++ b/hw/tpm/tpm_tis.c
@@ -1008,6 +1008,10 @@ TPMVersion tpm_tis_get_tpm_version(Object *obj)
 {
 TPMState *s = TPM(obj);
 
+if (tpm_backend_had_startup_error(s->be_driver)) {
+return TPM_VERSION_UNSPEC;
+}
+
 return tpm_backend_get_tpm_version(s->be_driver);
 }
 
-- 
2.5.5




[Qemu-devel] [PULL 1/4] specs: Extend TPM spec with TPM emulator description

2017-11-15 Thread Stefan Berger
Following the recent extension of QEMU with a TPM emulator device,
update the specs describing for how to interact with the device.

The results of commands run inside a Linux VM are expected to be
similar to those when the TPM passthrough device is used, so we
just reuse that.

Signed-off-by: Stefan Berger 
Reviewed-by: Marc-André Lureau 
---
 docs/specs/tpm.txt | 79 ++
 1 file changed, 79 insertions(+)

diff --git a/docs/specs/tpm.txt b/docs/specs/tpm.txt
index 914daac..d1d7157 100644
--- a/docs/specs/tpm.txt
+++ b/docs/specs/tpm.txt
@@ -121,3 +121,82 @@ crw---. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
 PCR-00: 35 4E 3B CE 23 9F 38 59 ...
 ...
 PCR-23: 00 00 00 00 00 00 00 00 ...
+
+
+== The QEMU TPM emulator device ==
+
+The TPM emulator device uses an external TPM emulator called 'swtpm' for
+sending TPM commands to and receiving responses from. The swtpm program
+must have been started before trying to access it through the TPM emulator
+with QEMU.
+
+The TPM emulator implements a command channel for transferring TPM commands
+and responses as well as a control channel over which control commands can
+be sent. The specification for the control channel can be found here:
+
+https://github.com/stefanberger/swtpm/blob/master/man/man3/swtpm_ioctls.pod
+
+
+The control channel serves the purpose of resetting, initializing, and
+migrating the TPM state, among other things.
+
+The swtpm program behaves like a hardware TPM and therefore needs to be
+initialized by the firmware running inside the QEMU virtual machine.
+One necessary step for initializing the device is to send the TPM_Startup
+command to it. SeaBIOS, for example, has been instrumented to initialize
+a TPM 1.2 or TPM 2 device using this command.
+
+
+QEMU files related to the TPM emulator device:
+ - hw/tpm/tpm_emulator.c
+ - hw/tpm/tpm_util.c
+ - hw/tpm/tpm_util.h
+
+
+The following commands start the swtpm with a UnixIO control channel over
+a socket interface. They do not need to be run as root.
+
+mkdir /tmp/mytpm1
+swtpm socket --tpmstate dir=/tmp/mytpm1 \
+  --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+  --log level=20
+
+Command line to start QEMU with the TPM emulator device communicating with
+the swtpm:
+
+qemu-system-x86_64 -display sdl -enable-kvm \
+  -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+  -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+  -tpmdev emulator,id=tpm0,chardev=chrtpm \
+  -device tpm-tis,tpmdev=tpm0 test.img
+
+
+In case SeaBIOS is used as firmware, it should show the TPM menu item
+after entering the menu with 'ESC'.
+
+Select boot device:
+1. DVD/CD [ata1-0: QEMU DVD-ROM ATAPI-4 DVD/CD]
+[...]
+5. Legacy option rom
+
+t. TPM Configuration
+
+
+The following commands should result in similar output inside the VM with a
+Linux kernel that either has the TPM TIS driver built-in or available as a
+module:
+
+#> dmesg | grep -i tpm
+[0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
+
+#> dmesg | grep TCPA
+[0.00] ACPI: TCPA 0x03FFD191C 32 (v02 BOCHS  \
+BXPCTCPA 001 BXPC 0001)
+
+#> ls -l /dev/tpm*
+crw---. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+
+#> find /sys/devices/ | grep pcrs$ | xargs cat
+PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+...
+PCR-23: 00 00 00 00 00 00 00 00 ...
-- 
2.5.5




[Qemu-devel] [PULL 2/4] tpm-emulator: protect concurrent ctrl_chr access

2017-11-15 Thread Stefan Berger
From: Marc-André Lureau 

The control chardev is being used from the data thread to set the
locality of the next request. Altough the chr has a write mutex, we
may potentially read the reply from another thread request.

Add a mutex to protect from concurrent control commands.

Signed-off-by: Marc-André Lureau 
Reviewed-by: Stefan Berger 
Signed-off-by: Stefan Berger 
---
 hw/tpm/tpm_emulator.c | 44 
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/hw/tpm/tpm_emulator.c b/hw/tpm/tpm_emulator.c
index 9aaec8e..e1a6810 100644
--- a/hw/tpm/tpm_emulator.c
+++ b/hw/tpm/tpm_emulator.c
@@ -71,15 +71,21 @@ typedef struct TPMEmulator {
 ptm_cap caps; /* capabilities of the TPM */
 uint8_t cur_locty_number; /* last set locality */
 Error *migration_blocker;
+
+QemuMutex mutex;
 } TPMEmulator;
 
 
-static int tpm_emulator_ctrlcmd(CharBackend *dev, unsigned long cmd, void *msg,
+static int tpm_emulator_ctrlcmd(TPMEmulator *tpm, unsigned long cmd, void *msg,
 size_t msg_len_in, size_t msg_len_out)
 {
+CharBackend *dev = &tpm->ctrl_chr;
 uint32_t cmd_no = cpu_to_be32(cmd);
 ssize_t n = sizeof(uint32_t) + msg_len_in;
 uint8_t *buf = NULL;
+int ret = -1;
+
+qemu_mutex_lock(&tpm->mutex);
 
 buf = g_alloca(n);
 memcpy(buf, &cmd_no, sizeof(cmd_no));
@@ -87,17 +93,21 @@ static int tpm_emulator_ctrlcmd(CharBackend *dev, unsigned 
long cmd, void *msg,
 
 n = qemu_chr_fe_write_all(dev, buf, n);
 if (n <= 0) {
-return -1;
+goto end;
 }
 
 if (msg_len_out != 0) {
 n = qemu_chr_fe_read_all(dev, msg, msg_len_out);
 if (n <= 0) {
-return -1;
+goto end;
 }
 }
 
-return 0;
+ret = 0;
+
+end:
+qemu_mutex_unlock(&tpm->mutex);
+return ret;
 }
 
 static int tpm_emulator_unix_tx_bufs(TPMEmulator *tpm_emu,
@@ -154,7 +164,7 @@ static int tpm_emulator_set_locality(TPMEmulator *tpm_emu, 
uint8_t locty_number,
 
 DPRINTF("setting locality : 0x%x", locty_number);
 loc.u.req.loc = locty_number;
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_SET_LOCALITY, &loc,
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_SET_LOCALITY, &loc,
  sizeof(loc), sizeof(loc)) < 0) {
 error_setg(errp, "tpm-emulator: could not set locality : %s",
strerror(errno));
@@ -202,8 +212,8 @@ error:
 static int tpm_emulator_probe_caps(TPMEmulator *tpm_emu)
 {
 DPRINTF("%s", __func__);
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_GET_CAPABILITY,
- &tpm_emu->caps, 0, sizeof(tpm_emu->caps)) < 0) {
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_GET_CAPABILITY,
+ &tpm_emu->caps, 0, sizeof(tpm_emu->caps)) < 0) {
 error_report("tpm-emulator: probing failed : %s", strerror(errno));
 return -1;
 }
@@ -254,8 +264,8 @@ static int tpm_emulator_startup_tpm(TPMBackend *tb)
 ptm_res res;
 
 DPRINTF("%s", __func__);
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_INIT, &init, sizeof(init),
- sizeof(init)) < 0) {
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_INIT, &init, sizeof(init),
+ sizeof(init)) < 0) {
 error_report("tpm-emulator: could not send INIT: %s",
  strerror(errno));
 goto err_exit;
@@ -278,7 +288,7 @@ static bool 
tpm_emulator_get_tpm_established_flag(TPMBackend *tb)
 ptm_est est;
 
 DPRINTF("%s", __func__);
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_GET_TPMESTABLISHED, &est,
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_GET_TPMESTABLISHED, &est,
  0, sizeof(est)) < 0) {
 error_report("tpm-emulator: Could not get the TPM established flag: 
%s",
  strerror(errno));
@@ -302,7 +312,7 @@ static int 
tpm_emulator_reset_tpm_established_flag(TPMBackend *tb,
 }
 
 reset_est.u.req.loc = tpm_emu->cur_locty_number;
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_RESET_TPMESTABLISHED,
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_RESET_TPMESTABLISHED,
  &reset_est, sizeof(reset_est),
  sizeof(reset_est)) < 0) {
 error_report("tpm-emulator: Could not reset the establishment bit: %s",
@@ -330,7 +340,7 @@ static void tpm_emulator_cancel_cmd(TPMBackend *tb)
 return;
 }
 
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CMD_CANCEL_TPM_CMD, &res, 0,
+if (tpm_emulator_ctrlcmd(tpm_emu, CMD_CANCEL_TPM_CMD, &res, 0,
  sizeof(res)) < 0) {
 error_report("tpm-emulator: Could not cancel command: %s",
  strerror(errno));
@@ -378,8 +388,8 @@ static int tpm_emulator_prepare_data_fd(TPMEmulator 
*tpm_emu)
 
 qemu_chr_fe_set_msgfds(&tpm_emu->ctrl_chr, fds + 1, 1);
 
-if (tpm_emulator_ctrlcmd(&tpm_emu->ctrl_chr, CM

[Qemu-devel] [PULL 0/4] Merge tpm 2017/11/15 v1

2017-11-15 Thread Stefan Berger
This pull request is for 2.11 and extends documentation as well as fixes
bugs related to concurrency and failure mode.

The following changes since commit 4ffa88c99c54d2a30f79e3dbecec50b023eff1c8:

  Merge remote-tracking branch 
'remotes/berrange/tags/pull-qcrypto-2017-11-08-1' into staging (2017-11-10 
16:01:35 +)

are available in the git repository at:

  git://github.com/stefanberger/qemu-tpm.git tags/pull-tpm-2017-11-15-1

for you to fetch changes up to 6cd65969da57eaa9bdff07b80ecca2becce0597a:

  tpm_tis: Return 0 for every register in case of failure mode (2017-11-15 
06:47:35 -0500)


Merge tpm 2017/11/15 v1


Marc-André Lureau (1):
  tpm-emulator: protect concurrent ctrl_chr access

Stefan Berger (3):
  specs: Extend TPM spec with TPM emulator description
  tpm_tis: Return TPM_VERSION_UNSPEC in case of BE failure
  tpm_tis: Return 0 for every register in case of failure mode

 docs/specs/tpm.txt| 79 +++
 hw/tpm/tpm_emulator.c | 44 +---
 hw/tpm/tpm_tis.c  |  6 +++-
 3 files changed, 112 insertions(+), 17 deletions(-)


-- 
2.5.5




[Qemu-devel] [PULL 4/4] tpm_tis: Return 0 for every register in case of failure mode

2017-11-15 Thread Stefan Berger
Rather than returning ~0, return 0 for every register in case of failure
mode. The '0' is better to indicate that there's no device there. It avoids
SeaBIOS detecting a device and getting stuck on it trying to read and write
its registers.

Signed-off-by: Stefan Berger 
Reviewed-by: Marc-André Lureau 
---
 hw/tpm/tpm_tis.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/tpm/tpm_tis.c b/hw/tpm/tpm_tis.c
index fec2fc6..42d647d 100644
--- a/hw/tpm/tpm_tis.c
+++ b/hw/tpm/tpm_tis.c
@@ -545,7 +545,7 @@ static uint64_t tpm_tis_mmio_read(void *opaque, hwaddr addr,
 uint8_t v;
 
 if (tpm_backend_had_startup_error(s->be_driver)) {
-return val;
+return 0;
 }
 
 switch (offset) {
-- 
2.5.5




[Qemu-devel] [PULL for-2.11 3/3] target/arm: Fix GETPC usage in do_paired_cmpxchg64_l/be

2017-11-15 Thread Richard Henderson
Use of GETPC must be restricted to those functions that are
directly called from TCG generated code.

Reviewed-by: Alex Bennée 
Fixes: 2399d4e7cec22ecf1c51062d2ebfd45220dbaace
Signed-off-by: Richard Henderson 
---
 target/arm/helper-a64.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/target/arm/helper-a64.c b/target/arm/helper-a64.c
index 96a3ecf707..b84ebcae6e 100644
--- a/target/arm/helper-a64.c
+++ b/target/arm/helper-a64.c
@@ -432,9 +432,8 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, 
uint32_t bytes)
 /* Returns 0 on success; 1 otherwise.  */
 static uint64_t do_paired_cmpxchg64_le(CPUARMState *env, uint64_t addr,
uint64_t new_lo, uint64_t new_hi,
-   bool parallel)
+   bool parallel, uintptr_t ra)
 {
-uintptr_t ra = GETPC();
 Int128 oldv, cmpv, newv;
 bool success;
 
@@ -491,20 +490,19 @@ static uint64_t do_paired_cmpxchg64_le(CPUARMState *env, 
uint64_t addr,
 uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
   uint64_t new_lo, uint64_t new_hi)
 {
-return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, false);
+return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, false, GETPC());
 }
 
 uint64_t HELPER(paired_cmpxchg64_le_parallel)(CPUARMState *env, uint64_t addr,
   uint64_t new_lo, uint64_t new_hi)
 {
-return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, true);
+return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, true, GETPC());
 }
 
 static uint64_t do_paired_cmpxchg64_be(CPUARMState *env, uint64_t addr,
uint64_t new_lo, uint64_t new_hi,
-   bool parallel)
+   bool parallel, uintptr_t ra)
 {
-uintptr_t ra = GETPC();
 Int128 oldv, cmpv, newv;
 bool success;
 
@@ -561,11 +559,11 @@ static uint64_t do_paired_cmpxchg64_be(CPUARMState *env, 
uint64_t addr,
 uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
  uint64_t new_lo, uint64_t new_hi)
 {
-return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, false);
+return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, false, GETPC());
 }
 
 uint64_t HELPER(paired_cmpxchg64_be_parallel)(CPUARMState *env, uint64_t addr,
  uint64_t new_lo, uint64_t new_hi)
 {
-return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, true);
+return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, true, GETPC());
 }
-- 
2.13.6




[Qemu-devel] [PULL for-2.11 0/3] tcg: user-mode memory helper fixes

2017-11-15 Thread Richard Henderson
Fixes the issue that Peter found wrt javac on aarch64-linux-user.


r~


The following changes since commit 1fa0f627d03cd0d0755924247cafeb42969016bf:

  Update version for v2.11.0-rc1 release (2017-11-14 18:37:49 +)

are available in the git repository at:

  git://github.com/rth7680/qemu.git tags/pull-tcg-20171115

for you to fetch changes up to 3c5f9c3f35dd3b6d1d1cd68c9d4d86fc3c59c397:

  target/arm: Fix GETPC usage in do_paired_cmpxchg64_l/be (2017-11-15 10:34:33 
+0100)


User-mode memory helper fixes


Richard Henderson (3):
  tcg: Record code_gen_buffer address for user-only memory helpers
  target/arm: Use helper_retaddr in stxp helpers
  target/arm: Fix GETPC usage in do_paired_cmpxchg64_l/be

 accel/tcg/atomic_template.h   | 32 +
 include/exec/cpu_ldst.h   |  2 ++
 include/exec/cpu_ldst_useronly_template.h | 14 ++--
 accel/tcg/cputlb.c|  1 +
 accel/tcg/user-exec.c | 58 +--
 target/arm/helper-a64.c   | 20 ++-
 6 files changed, 99 insertions(+), 28 deletions(-)



[Qemu-devel] [PULL for-2.11 2/3] target/arm: Use helper_retaddr in stxp helpers

2017-11-15 Thread Richard Henderson
We use raw memory primitives along the !parallel_cpus paths in order to
simplify the endianness handling.  Because of that, we did not benefit
from the generic changes to cpu_ldst_user_only_template.h.

The simplest fix is to manipulate helper_retaddr here.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
 target/arm/helper-a64.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/target/arm/helper-a64.c b/target/arm/helper-a64.c
index d0e435ca4b..96a3ecf707 100644
--- a/target/arm/helper-a64.c
+++ b/target/arm/helper-a64.c
@@ -456,6 +456,8 @@ static uint64_t do_paired_cmpxchg64_le(CPUARMState *env, 
uint64_t addr,
 #ifdef CONFIG_USER_ONLY
 /* ??? Enforce alignment.  */
 uint64_t *haddr = g2h(addr);
+
+helper_retaddr = ra;
 o0 = ldq_le_p(haddr + 0);
 o1 = ldq_le_p(haddr + 1);
 oldv = int128_make128(o0, o1);
@@ -465,6 +467,7 @@ static uint64_t do_paired_cmpxchg64_le(CPUARMState *env, 
uint64_t addr,
 stq_le_p(haddr + 0, int128_getlo(newv));
 stq_le_p(haddr + 1, int128_gethi(newv));
 }
+helper_retaddr = 0;
 #else
 int mem_idx = cpu_mmu_index(env, false);
 TCGMemOpIdx oi0 = make_memop_idx(MO_LEQ | MO_ALIGN_16, mem_idx);
@@ -523,6 +526,8 @@ static uint64_t do_paired_cmpxchg64_be(CPUARMState *env, 
uint64_t addr,
 #ifdef CONFIG_USER_ONLY
 /* ??? Enforce alignment.  */
 uint64_t *haddr = g2h(addr);
+
+helper_retaddr = ra;
 o1 = ldq_be_p(haddr + 0);
 o0 = ldq_be_p(haddr + 1);
 oldv = int128_make128(o0, o1);
@@ -532,6 +537,7 @@ static uint64_t do_paired_cmpxchg64_be(CPUARMState *env, 
uint64_t addr,
 stq_be_p(haddr + 0, int128_gethi(newv));
 stq_be_p(haddr + 1, int128_getlo(newv));
 }
+helper_retaddr = 0;
 #else
 int mem_idx = cpu_mmu_index(env, false);
 TCGMemOpIdx oi0 = make_memop_idx(MO_BEQ | MO_ALIGN_16, mem_idx);
-- 
2.13.6




[Qemu-devel] [PULL for-2.11 1/3] tcg: Record code_gen_buffer address for user-only memory helpers

2017-11-15 Thread Richard Henderson
When we handle a signal from a fault within a user-only memory helper,
we cannot cpu_restore_state with the PC found within the signal frame.
Use a TLS variable, helper_retaddr, to record the unwind start point
to find the faulting guest insn.

Tested-by: Alex Bennée 
Reviewed-by: Alex Bennée 
Reported-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 accel/tcg/atomic_template.h   | 32 +
 include/exec/cpu_ldst.h   |  2 ++
 include/exec/cpu_ldst_useronly_template.h | 14 ++--
 accel/tcg/cputlb.c|  1 +
 accel/tcg/user-exec.c | 58 +--
 5 files changed, 87 insertions(+), 20 deletions(-)

diff --git a/accel/tcg/atomic_template.h b/accel/tcg/atomic_template.h
index b400b2a3d3..1c7c17526c 100644
--- a/accel/tcg/atomic_template.h
+++ b/accel/tcg/atomic_template.h
@@ -62,7 +62,9 @@ ABI_TYPE ATOMIC_NAME(cmpxchg)(CPUArchState *env, target_ulong 
addr,
   ABI_TYPE cmpv, ABI_TYPE newv EXTRA_ARGS)
 {
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-return atomic_cmpxchg__nocheck(haddr, cmpv, newv);
+DATA_TYPE ret = atomic_cmpxchg__nocheck(haddr, cmpv, newv);
+ATOMIC_MMU_CLEANUP;
+return ret;
 }
 
 #if DATA_SIZE >= 16
@@ -70,6 +72,7 @@ ABI_TYPE ATOMIC_NAME(ld)(CPUArchState *env, target_ulong addr 
EXTRA_ARGS)
 {
 DATA_TYPE val, *haddr = ATOMIC_MMU_LOOKUP;
 __atomic_load(haddr, &val, __ATOMIC_RELAXED);
+ATOMIC_MMU_CLEANUP;
 return val;
 }
 
@@ -78,13 +81,16 @@ void ATOMIC_NAME(st)(CPUArchState *env, target_ulong addr,
 {
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
 __atomic_store(haddr, &val, __ATOMIC_RELAXED);
+ATOMIC_MMU_CLEANUP;
 }
 #else
 ABI_TYPE ATOMIC_NAME(xchg)(CPUArchState *env, target_ulong addr,
ABI_TYPE val EXTRA_ARGS)
 {
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-return atomic_xchg__nocheck(haddr, val);
+DATA_TYPE ret = atomic_xchg__nocheck(haddr, val);
+ATOMIC_MMU_CLEANUP;
+return ret;
 }
 
 #define GEN_ATOMIC_HELPER(X)\
@@ -92,8 +98,10 @@ ABI_TYPE ATOMIC_NAME(X)(CPUArchState *env, target_ulong 
addr,   \
  ABI_TYPE val EXTRA_ARGS)   \
 {   \
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;   \
-return atomic_##X(haddr, val);  \
-}   \
+DATA_TYPE ret = atomic_##X(haddr, val); \
+ATOMIC_MMU_CLEANUP; \
+return ret; \
+}
 
 GEN_ATOMIC_HELPER(fetch_add)
 GEN_ATOMIC_HELPER(fetch_and)
@@ -123,7 +131,9 @@ ABI_TYPE ATOMIC_NAME(cmpxchg)(CPUArchState *env, 
target_ulong addr,
   ABI_TYPE cmpv, ABI_TYPE newv EXTRA_ARGS)
 {
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-return BSWAP(atomic_cmpxchg__nocheck(haddr, BSWAP(cmpv), BSWAP(newv)));
+DATA_TYPE ret = atomic_cmpxchg__nocheck(haddr, BSWAP(cmpv), BSWAP(newv));
+ATOMIC_MMU_CLEANUP;
+return BSWAP(ret);
 }
 
 #if DATA_SIZE >= 16
@@ -131,6 +141,7 @@ ABI_TYPE ATOMIC_NAME(ld)(CPUArchState *env, target_ulong 
addr EXTRA_ARGS)
 {
 DATA_TYPE val, *haddr = ATOMIC_MMU_LOOKUP;
 __atomic_load(haddr, &val, __ATOMIC_RELAXED);
+ATOMIC_MMU_CLEANUP;
 return BSWAP(val);
 }
 
@@ -140,13 +151,16 @@ void ATOMIC_NAME(st)(CPUArchState *env, target_ulong addr,
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
 val = BSWAP(val);
 __atomic_store(haddr, &val, __ATOMIC_RELAXED);
+ATOMIC_MMU_CLEANUP;
 }
 #else
 ABI_TYPE ATOMIC_NAME(xchg)(CPUArchState *env, target_ulong addr,
ABI_TYPE val EXTRA_ARGS)
 {
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-return BSWAP(atomic_xchg__nocheck(haddr, BSWAP(val)));
+ABI_TYPE ret = atomic_xchg__nocheck(haddr, BSWAP(val));
+ATOMIC_MMU_CLEANUP;
+return BSWAP(ret);
 }
 
 #define GEN_ATOMIC_HELPER(X)\
@@ -154,7 +168,9 @@ ABI_TYPE ATOMIC_NAME(X)(CPUArchState *env, target_ulong 
addr,   \
  ABI_TYPE val EXTRA_ARGS)   \
 {   \
 DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;   \
-return BSWAP(atomic_##X(haddr, BSWAP(val)));\
+DATA_TYPE ret = atomic_##X(haddr, BSWAP(val));  \
+ATOMIC_MMU_CLEANUP; \
+return BSWAP(ret);  \
 }
 
 GEN_ATOMIC_HELPER(fetch_and)
@@ -180,6 +196,7 @@ ABI_TYPE ATOMIC_NAME(fetch_add)(CPUArchState *env, 
target_ulong addr,
 sto = BSWAP(ret + val);
 ldn = atomic_cmpxchg__nocheck(haddr, ldo, sto);
 if (

[Qemu-devel] [PATCH] ps2: fix PS2Queue counter field type

2017-11-15 Thread P J P
From: Prasad J Pandit 

During Qemu guest migration, a destination process invokes ps2
post_load function. In that, if 'rptr' and 'count' fields were
tampered, it could lead to OOB access or infinite loop issues.
Change their type to 'uint8_t' so they remain within bounds.

Reported-by: Cyrille Chatras 
Signed-off-by: Prasad J Pandit 
---
 hw/input/ps2.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/hw/input/ps2.c b/hw/input/ps2.c
index f388a23c8e..ce4b167084 100644
--- a/hw/input/ps2.c
+++ b/hw/input/ps2.c
@@ -90,7 +90,7 @@ typedef struct {
 /* Keep the data array 256 bytes long, which compatibility
  with older qemu versions. */
 uint8_t data[256];
-int rptr, wptr, count;
+uint8_t rptr, wptr, count;
 } PS2Queue;
 
 struct PS2State {
@@ -1225,24 +1225,18 @@ static void ps2_common_reset(PS2State *s)
 static void ps2_common_post_load(PS2State *s)
 {
 PS2Queue *q = &s->queue;
-int size;
-int i;
-int tmp_data[PS2_QUEUE_SIZE];
+uint8_t i, size;
+uint8_t tmp_data[PS2_QUEUE_SIZE];
 
 /* set the useful data buffer queue size, < PS2_QUEUE_SIZE */
 size = q->count > PS2_QUEUE_SIZE ? 0 : q->count;
 
 /* move the queue elements to the start of data array */
-if (size > 0) {
-for (i = 0; i < size; i++) {
-/* move the queue elements to the temporary buffer */
-tmp_data[i] = q->data[q->rptr];
-if (++q->rptr == 256) {
-q->rptr = 0;
-}
-}
-memcpy(q->data, tmp_data, size);
+for (i = 0; i < size; i++) {
+tmp_data[i] = q->data[q->rptr++];
 }
+memcpy(q->data, tmp_data, size);
+
 /* reset rptr/wptr/count */
 q->rptr = 0;
 q->wptr = size;
@@ -1286,9 +1280,9 @@ static const VMStateDescription vmstate_ps2_common = {
 .minimum_version_id = 2,
 .fields = (VMStateField[]) {
 VMSTATE_INT32(write_cmd, PS2State),
-VMSTATE_INT32(queue.rptr, PS2State),
-VMSTATE_INT32(queue.wptr, PS2State),
-VMSTATE_INT32(queue.count, PS2State),
+VMSTATE_UINT8(queue.rptr, PS2State),
+VMSTATE_UINT8(queue.wptr, PS2State),
+VMSTATE_UINT8(queue.count, PS2State),
 VMSTATE_BUFFER(queue.data, PS2State),
 VMSTATE_END_OF_LIST()
 }
-- 
2.13.6




Re: [Qemu-devel] [PATCH] ps2: fix PS2Queue counter field type

2017-11-15 Thread Daniel P. Berrange
On Wed, Nov 15, 2017 at 06:16:02PM +0530, P J P wrote:
> From: Prasad J Pandit 
> 
> During Qemu guest migration, a destination process invokes ps2
> post_load function. In that, if 'rptr' and 'count' fields were
> tampered, it could lead to OOB access or infinite loop issues.
> Change their type to 'uint8_t' so they remain within bounds.

If you're concerned that someone is tampering with QEMU state
in transit during migration, then you're going to end up playing
whack-a-mole across the entire QEMU codebase IMHO. The answer
to the problem of tampering is to have encryption of the
migration data stream between both QEMU's. Thus QEMU on the
target merely has to trust QEMU on the source. If QEMU on the
source is itself compromised you've already lost and migration
won't make life any worse.

> 
> Reported-by: Cyrille Chatras 
> Signed-off-by: Prasad J Pandit 
> ---
>  hw/input/ps2.c | 26 ++
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/input/ps2.c b/hw/input/ps2.c
> index f388a23c8e..ce4b167084 100644
> --- a/hw/input/ps2.c
> +++ b/hw/input/ps2.c
> @@ -90,7 +90,7 @@ typedef struct {
>  /* Keep the data array 256 bytes long, which compatibility
>   with older qemu versions. */
>  uint8_t data[256];
> -int rptr, wptr, count;
> +uint8_t rptr, wptr, count;
>  } PS2Queue;
>  
>  struct PS2State {
> @@ -1225,24 +1225,18 @@ static void ps2_common_reset(PS2State *s)
>  static void ps2_common_post_load(PS2State *s)
>  {
>  PS2Queue *q = &s->queue;
> -int size;
> -int i;
> -int tmp_data[PS2_QUEUE_SIZE];
> +uint8_t i, size;
> +uint8_t tmp_data[PS2_QUEUE_SIZE];
>  
>  /* set the useful data buffer queue size, < PS2_QUEUE_SIZE */
>  size = q->count > PS2_QUEUE_SIZE ? 0 : q->count;
>  
>  /* move the queue elements to the start of data array */
> -if (size > 0) {
> -for (i = 0; i < size; i++) {
> -/* move the queue elements to the temporary buffer */
> -tmp_data[i] = q->data[q->rptr];
> -if (++q->rptr == 256) {
> -q->rptr = 0;
> -}
> -}
> -memcpy(q->data, tmp_data, size);
> +for (i = 0; i < size; i++) {
> +tmp_data[i] = q->data[q->rptr++];
>  }
> +memcpy(q->data, tmp_data, size);
> +
>  /* reset rptr/wptr/count */
>  q->rptr = 0;
>  q->wptr = size;
> @@ -1286,9 +1280,9 @@ static const VMStateDescription vmstate_ps2_common = {
>  .minimum_version_id = 2,
>  .fields = (VMStateField[]) {
>  VMSTATE_INT32(write_cmd, PS2State),
> -VMSTATE_INT32(queue.rptr, PS2State),
> -VMSTATE_INT32(queue.wptr, PS2State),
> -VMSTATE_INT32(queue.count, PS2State),
> +VMSTATE_UINT8(queue.rptr, PS2State),
> +VMSTATE_UINT8(queue.wptr, PS2State),
> +VMSTATE_UINT8(queue.count, PS2State),

This would surely break migration backwards compatibility. Any new QEMU
release accepting incoming migration from an old QEMU release will read
the wrong amount of data off the migration stream, and thus will ultimately
cause migraiton to fail.

>  VMSTATE_BUFFER(queue.data, PS2State),
>  VMSTATE_END_OF_LIST()
>  }

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



[Qemu-devel] KVM call for agenda -- 2017-11-21

2017-11-15 Thread Juan Quintela


Hi

Please, send any topic that you are interested in covering.

At the end of Monday I will send an email with the agenda or the
cancellation of the call, so hurry up.

After discussions on the QEMU Summit, we are going to have always open a
KVM call where you can add topics.

 Call details:

By popular demand, a google calendar public entry with it

  
https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

(Let me know if you have any problems with the calendar entry.  I just
gave up about getting right at the same time CEST, CET, EDT and DST).

If you need phone number details,  contact me privately

Thanks, Juan.



Re: [Qemu-devel] [PATCH v2 2/2] Add new PCI ID for i82559a

2017-11-15 Thread Michael Nawrocki

On 11/14/2017 04:41 PM, Stefan Weil wrote:

Am 06.11.2017 um 21:35 schrieb Mike Nawrocki:

Adds a new PCI ID for the i82559a (0x8086 0x1030) interface. Enables
this ID with a new property "use-alt-device-id" to preserve
compatibility.

Signed-off-by: Mike Nawrocki 
---
  hw/net/eepro100.c| 12 
  include/hw/pci/pci.h |  1 +
  qemu-options.hx  |  2 +-
  3 files changed, 14 insertions(+), 1 deletion(-)



Sorry that I missed this patch.
I think I should have an entry for eepro100 in MAINTAINERS.

Mike, which hardware uses i82559a with PCI device id 0x1030?

https://www.intel.com/content/www/us/en/support/articles/05612/network-and-i-o/ethernet-products.html
only lists devices with
0x1229.

Thanks,
Stefan



Hi Stefan,

I've got a VxWorks driver binary that explicitly looks for device ID 
0x1030 (which is admittedly not ideal). It seems like the "82559 
InBusiness 10/100" hardware uses this, though I've had trouble finding 
an official source. The following documents reference that ID:


https://pci-ids.ucw.cz/read/PC/8086/1030
http://ks.pams.ncsu.edu/pub/ncsuscyld/i386/misc/src/trees/hdstg2/modules/pcitable
https://cateee.net/lkddb/web-lkddb/E100.html

And I found a similar post on a different mailing list that might shed 
some light:

http://www.beowulf.org/pipermail/eepro100/2000-January/000760.html

It looks like the 8255x series of devices have a number of potential 
IDs; maybe a property to set a specific PCI device ID would work?


Thanks,
Mike



Re: [Qemu-devel] [PATCH v3] hw/arm/aspeed: Unlock SCU when running kernel

2017-11-15 Thread Cédric Le Goater
On 11/15/2017 11:56 AM, Peter Maydell wrote:
> On 15 November 2017 at 08:18, Cédric Le Goater  wrote:
>> On 11/14/2017 01:20 PM, Joel Stanley wrote:
>>> The ASPEED hardware contains a lock register for the SCU that disables
>>> any writes to the SCU when it is locked. The machine comes up with the
>>> lock enabled, but on all known hardware u-boot will unlock it and leave
>>> it unlocked when loading the kernel.
>>>
>>> This means the kernel expects the SCU to be unlocked. When booting from
>>> an emulated ROM the normal u-boot unlock path is executed. Things don't
>>> go well when booting using the -kernel command line, as u-boot does not
>>> run first.
>>>
>>> Change behaviour so that when a kernel is passed to the machine, set the
>>> reset value of the SCU to be unlocked.
>>>
>>> Signed-off-by: Joel Stanley 
>>
>> Reviewed-by: Cédric Le Goater 
> 
> Thanks. Should this go into 2.11 ? It sounds like the kind of
> bugfix that ought to...

Yes. Please.

Thanks,

C.



Re: [Qemu-devel] [PATCH] build-sys: restrict vmcoreinfo to fw_cfg+dma capable targets

2017-11-15 Thread Marc-André Lureau
Michael,

Could you pick this patch for 2.11 ?

thanks

On Mon, Nov 6, 2017 at 1:40 PM, Daniel Henrique Barboza
 wrote:
>
>
> On 11/06/2017 09:53 AM, Marc-André Lureau wrote:
>>
>> vmcoreinfo is built for all targets. However, it requires fw_cfg with
>> DMA operations support (write operation). Restrict vmcoreinfo exposure
>> to architectures that are supporting FW_CFG_DMA, that is arm-virt and
>> x86 only atm.
>>
>> Signed-off-by: Marc-André Lureau 
>> ---
>
>
> Reviewed-by: Daniel Henrique Barboza 
> Tested-by: Daniel Henrique Barboza 
>
>
>>   default-configs/arm-softmmu.mak| 2 ++
>>   default-configs/i386-softmmu.mak   | 1 +
>>   default-configs/x86_64-softmmu.mak | 1 +
>>   hw/misc/Makefile.objs  | 2 +-
>>   4 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/default-configs/arm-softmmu.mak
>> b/default-configs/arm-softmmu.mak
>> index 5059d134c8..d37edc4312 100644
>> --- a/default-configs/arm-softmmu.mak
>> +++ b/default-configs/arm-softmmu.mak
>> @@ -130,3 +130,5 @@ CONFIG_SMBIOS=y
>>   CONFIG_ASPEED_SOC=y
>>   CONFIG_GPIO_KEY=y
>>   CONFIG_MSF2=y
>> +
>> +CONFIG_FW_CFG_DMA=y
>> diff --git a/default-configs/i386-softmmu.mak
>> b/default-configs/i386-softmmu.mak
>> index a685c439e7..95ac4b464a 100644
>> --- a/default-configs/i386-softmmu.mak
>> +++ b/default-configs/i386-softmmu.mak
>> @@ -60,3 +60,4 @@ CONFIG_SMBIOS=y
>>   CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
>>   CONFIG_PXB=y
>>   CONFIG_ACPI_VMGENID=y
>> +CONFIG_FW_CFG_DMA=y
>> diff --git a/default-configs/x86_64-softmmu.mak
>> b/default-configs/x86_64-softmmu.mak
>> index ea69e8289e..0221236825 100644
>> --- a/default-configs/x86_64-softmmu.mak
>> +++ b/default-configs/x86_64-softmmu.mak
>> @@ -60,3 +60,4 @@ CONFIG_SMBIOS=y
>>   CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
>>   CONFIG_PXB=y
>>   CONFIG_ACPI_VMGENID=y
>> +CONFIG_FW_CFG_DMA=y
>> diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
>> index 19202d90cf..10c88a84b4 100644
>> --- a/hw/misc/Makefile.objs
>> +++ b/hw/misc/Makefile.objs
>> @@ -9,7 +9,7 @@ common-obj-$(CONFIG_PCI_TESTDEV) += pci-testdev.o
>>   common-obj-$(CONFIG_EDU) += edu.o
>> common-obj-y += unimp.o
>> -common-obj-y += vmcoreinfo.o
>> +common-obj-$(CONFIG_FW_CFG_DMA) += vmcoreinfo.o
>> obj-$(CONFIG_VMPORT) += vmport.o
>>
>
>
>



-- 
Marc-André Lureau



Re: [Qemu-devel] [Question] Qemu's Heap Becomes Very Large and Never Reduce Down

2017-11-15 Thread Paolo Bonzini
On 15/11/2017 04:14, Xulei (Stone) wrote:
> Hi, guys
> 
> I met a strange problem, with qemu 2.8.1:
> qemu consumes too many heap memory after several operations and can not
> release them anymore: 
> hot pulg/unplug disk & net, vnc connect/disconnect, guestOS reboot, etc.

Try with newer QEMU; until recently we used a lot of memory at startup,
but it was improved in 2.11.

However, I think Anthony also had a patch adding a malloc_trim call in
the RCU thread, which improved memory usage.  Anthony, do you know if
this is still necessary in 2.11?

Thanks,

Paolo



Re: [Qemu-devel] [PATCH] ps2: fix PS2Queue counter field type

2017-11-15 Thread Paolo Bonzini
On 15/11/2017 13:51, Daniel P. Berrange wrote:
> If you're concerned that someone is tampering with QEMU state
> in transit during migration, then you're going to end up playing
> whack-a-mole across the entire QEMU codebase IMHO. The answer
> to the problem of tampering is to have encryption of the
> migration data stream between both QEMU's. Thus QEMU on the
> target merely has to trust QEMU on the source. If QEMU on the
> source is itself compromised you've already lost and migration
> won't make life any worse.
> 

This is not entirely true.  A lot of such cases were fixed in the past,
especially when they could cause out-of-bounds access.  Someone could
provide a bad migration stream (e.g. as a fake bug report!), so
migration data should not be considered trusted.

However, PJP's patch breaks migration by changing a 4-byte field to
1-byte.  The correct fix is to range-check the fields in
ps2_common_post_load.

Thanks,

Paolo



Re: [Qemu-devel] [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2017-11-15 Thread Michael S. Tsirkin
On Wed, Nov 15, 2017 at 11:47:58AM +0800, Wei Wang wrote:
> On 11/15/2017 05:21 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 14, 2017 at 08:02:03PM +0800, Wei Wang wrote:
> > > On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
> > > > > - guest2host_cmd: written by the guest to ACK to the host about the
> > > > > commands that have been received. The host will clear the 
> > > > > corresponding
> > > > > bits on the host2guest_cmd register. The guest also uses this register
> > > > > to send commands to the host (e.g. when finish free page reporting).
> > > > I am not sure what is the role of guest2host_cmd. Reporting of
> > > > the correct cmd id seems sufficient indication that guest
> > > > received the start command. Not getting any more seems sufficient
> > > > to detect stop.
> > > > 
> > > I think the issue is when the host is waiting for the guest to report 
> > > pages,
> > > it does not know whether the guest is going to report more or the report 
> > > is
> > > done already. That's why we need a way to let the guest tell the host "the
> > > report is done, don't wait for more", then the host continues to the next
> > > step - sending the non-free pages to the destination. The following method
> > > is a conclusion of other comments, with some new thought. Please have a
> > > check if it is good.
> > config won't work well for this IMHO.
> > Writes to config register are hard to synchronize with the VQ.
> > For example, guest sends free pages, host says stop, meanwhile
> > guest sends stop for 1st set of pages.
> 
> I still don't see an issue with this. Please see below:
> (before jumping into the discussion, just make sure I've well explained this
> point: now host-to-guest commands are done via config, and guest-to-host
> commands are done via the free page vq)

This is fine by me actually. But right now you have guest to host
not going through vq, going through command register instead -
this is how sending stop to host seems to happen.
If you make it go through vq then I think all will be well.

> 
> Case: Host starts to request the reporting with cmd_id=1. Some time later,
> Host writes "stop" to config, meantime guest happens to finish the reporting
> and plan to actively send a "stop" command from the free_page_vq().
>   Essentially, this is like a sync between two threads - if we view
> the config interrupt handler as one thread, another is the free page
> reporting worker thread.
> 
> - what the config handler does is simply:
>   1.1:  WRITE_ONCE(vb->reporting_stop, true);
> 
> - what the reporting thread will do is
>   2.1:  WRITE_ONCE(vb->reporting_stop, true);
>   2.2:  send_stop_to_host_via_vq();
> 
> From the guest point of view, no matter 1.1 is executed first or 2.1 first,
> it doesn't make a difference to the end result - vb->reporting_stop is set.
> 
> From the host point of view, it knows that cmd_id=1 has truly stopped the
> reporting when it receives a "stop" sign via the vq.
> 
> 
> > How about adding a buffer with "stop" in the VQ instead?
> > Wastes a VQ entry which you will need to reserve for this
> > but is it a big deal?
> 
> The free page vq is guest-to-host direction.

Yes, for guest to host stop sign.

> Using it for host-to-guest
> requests will make it bidirectional, which will result in the same issue
> described before: https://lkml.org/lkml/2017/10/11/1009 (the first response)
> 
> On the other hand, I think adding another new vq for host-to-guest
> requesting doesn't make a difference in essence, compared to using config
> (same 1.1, 2.1, 2.2 above), but will be more complicated.

I agree with this. Host to guest can just incremenent the "free command id"
register.

> 
> > > Two new configuration registers in total:
> > > - cmd_reg: the command register, combined from the previous host2guest and
> > > guest2host. I think we can use the same register for host requesting and
> > > guest ACKing, since the guest writing will trap to QEMU, that is, all the
> > > writes to the register are performed in QEMU, and we can keep things work 
> > > in
> > > a correct way there.
> > > - cmd_id_reg: the sequence id of the free page report command.
> > > 
> > > -- free page report:
> > >  - host requests the guest to start reporting by "cmd_reg |
> > > REPORT_START";
> > >  - guest ACKs to the host about receiving the start reporting request 
> > > by
> > > "cmd_reg | REPORT_START", host will clear the flag bit once receiving the
> > > ACK.
> > >  - host requests the guest to stop reporting by "cmd_reg | 
> > > REPORT_STOP";
> > >  - guest ACKs to the host about receiving the stop reporting request 
> > > by
> > > "cmd_reg | REPORT_STOP", host will clear the flag once receiving the ACK.
> > >  - guest tells the host about the start of the reporting by writing 
> > > "cmd
> > > id" into an outbuf, which is added to the free page vq.
> > >  - guest tells the host about the end of the reporting

Re: [Qemu-devel] [PATCH] ps2: fix PS2Queue counter field type

2017-11-15 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> On 15/11/2017 13:51, Daniel P. Berrange wrote:
> > If you're concerned that someone is tampering with QEMU state
> > in transit during migration, then you're going to end up playing
> > whack-a-mole across the entire QEMU codebase IMHO. The answer
> > to the problem of tampering is to have encryption of the
> > migration data stream between both QEMU's. Thus QEMU on the
> > target merely has to trust QEMU on the source. If QEMU on the
> > source is itself compromised you've already lost and migration
> > won't make life any worse.
> > 
> 
> This is not entirely true.  A lot of such cases were fixed in the past,
> especially when they could cause out-of-bounds access.  Someone could
> provide a bad migration stream (e.g. as a fake bug report!), so
> migration data should not be considered trusted.

There's probably others to be honest; it's not something we've
traditionally been careful of.

> However, PJP's patch breaks migration by changing a 4-byte field to
> 1-byte.  The correct fix is to range-check the fields in
> ps2_common_post_load.

Agreed.

Dave

> Thanks,
> 
> Paolo
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] Abnormal observation during migration: too many "write-not-dirty" pages

2017-11-15 Thread Chunguang Li



> -原始邮件-
> 发件人: "Dr. David Alan Gilbert" 
> 发送时间: 2017-11-15 18:11:37 (星期三)
> 收件人: "Chunguang Li" 
> 抄送: qemu-devel@nongnu.org, quint...@redhat.com, amit.s...@redhat.com, 
> pbonz...@redhat.com, stefa...@redhat.com
> 主题: Re: [Qemu-devel] Abnormal observation during migration: too many 
> "write-not-dirty" pages
> 
> * Chunguang Li (lichungu...@hust.edu.cn) wrote:
> > Hi all!
> > 
> > I got a very abnormal observation for the VM migration. I found that many 
> > pages marked as dirty during migration are "not really dirty", which is, 
> > their content are the same as the old version.
> > 
> > 
> > 
> > 
> > I did the migration experiment like this:
> > 
> > During the setup phase of migration, first I suspended the VM. Then I 
> > copied all the pages within the guest physical address space to a memory 
> > buffer as large as the guest memory size. After that, the dirty tracking 
> > began and I resumed the VM. Besides, at the end
> > of each iteration, I also suspended the VM temporarily. During the 
> > suspension, I compared the content of all the pages marked as dirty in this 
> > iteration byte-by-byte with their former copies inside the buffer. If the 
> > content of one page was the same as its former copy, I recorded it as a 
> > "write-not-dirty" page (the page is written exactly with the same content 
> > as the old version). Otherwise, I replaced this page in the buffer with the 
> > new content, for the possible comparison in the future. After the reset of 
> > the dirty bitmap, I resumed the VM. Thus, I obtain the proportion of the 
> > write-not-dirty pages within all the pages marked as dirty for each 
> > pre-copy iteration.
> > 
> > I repeated this experiment with 15 workloads, which are 11 CPU2006 
> > benchmarks, Memcached server, kernel compilation, playing a video, and an 
> > idle VM. The CPU2006 benchmarks and Memcached are write-intensive 
> > workloads. So almost all of them did not converge to stop-copy.
> > 
> > 
> > 
> > 
> > Startlingly, the proportions of the write-not-dirty pages are quite high. 
> > Memcached and three CPU2006 benchmarks(zeusmp, mcf and bzip2) have the most 
> > high proportions. Their proportions of the write-not-dirty pages within all 
> > the dirty pages are as high as 45%-80%. The proportions of the other 
> > workloads are about 5%-20%, which are also abnormal. According to my 
> > intuition, the proportion of write-not-dirty pages should be far less than 
> > these numbers. I think it should be quite a particular case that one page 
> > is written with exactly the same content as the former data.
> > 
> > Besides, the zero pages are not counted for all the results. Because I 
> > think codes like memset() may write large area of pages to zero pages, 
> > which are already zero pages before.
> > 
> > 
> > 
> > 
> > I excluded some possible unknown reasons with the machine hardware, because 
> > I repeated the experiments with two sets of different machines. Then I 
> > guessed it might be related with the huge page feature. However, the result 
> > was the same when I turned the huge page feature off in the OS.
> > 
> > 
> > 
> > 
> > Now there are only two possible reasons in my opinion. 
> > 
> > First, there is some bugs in the KVM kernel dirty tracking mechanism. It 
> > may mark some pages that do not receive write request as dirty.
> > 
> > Second, there is some bugs in the OS running inside the VM. It may issue 
> > some unnecessary write requests.
> > 
> > 
> > What do you think about this abnormal phenomenon? Any advice or possible 
> > reasons or even guesses? I appreciate any responses, because it has 
> > confused me for a long time. Thank you.
> 
> Wasn't it you who pointed out last year the other possibility? - The
> problem of false positives due to sync'ing the whole of memory and then
> writing the data out, but some of the dirty pages were already written?
> 
> Dave

Yes, you remember that! It was me. After that, I did more analysis and 
experiments. I found that, in fact, both reasons contribute to the "fake dirty" 
pages (dirty pages that do not need to be resent, because their contents are 
the same as that in the target node). One is what I pointed out last year, 
which you have mentioned. The other reason is what I am talking about now, the 
"write-not-dirty" phenomenon.
In fact, according to my experiments results, the "wirte-not-dirty" is the main 
reason resulting to the "fake dirty" pages, while sync'ing the whole of memory 
contributes less.

Chunguang
> 
> > 
> > --
> > Chunguang Li, Ph.D. Candidate
> > Wuhan National Laboratory for Optoelectronics (WNLO)
> > Huazhong University of Science & Technology (HUST)
> > Wuhan, Hubei Prov., China
> > 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [PATCH] ps2: fix PS2Queue counter field type

2017-11-15 Thread Paolo Bonzini
On 15/11/2017 14:30, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonz...@redhat.com) wrote:
>> On 15/11/2017 13:51, Daniel P. Berrange wrote:
>>> If you're concerned that someone is tampering with QEMU state
>>> in transit during migration, then you're going to end up playing
>>> whack-a-mole across the entire QEMU codebase IMHO. The answer
>>> to the problem of tampering is to have encryption of the
>>> migration data stream between both QEMU's. Thus QEMU on the
>>> target merely has to trust QEMU on the source. If QEMU on the
>>> source is itself compromised you've already lost and migration
>>> won't make life any worse.
>>
>> This is not entirely true.  A lot of such cases were fixed in the past,
>> especially when they could cause out-of-bounds access.  Someone could
>> provide a bad migration stream (e.g. as a fake bug report!), so
>> migration data should not be considered trusted.
> 
> There's probably others to be honest; it's not something we've
> traditionally been careful of.

There was a flurry of fixes a while ago:

- CVE-2013-4149 to CVE-2013-4151
- CVE-2013-4526 to CVE-2013-4527
- CVE-2013-4529 to CVE-2013-4542
- CVE-2013-6399
- CVE-2014-0182

This one was introduced in 2.1, around the same time these others were
fixed, by commit 2858ab09e6 ("ps2: set ps/2 output buffer size as the
same as kernel", 2014-05-16).

Thanks,

Paolo

> 
>> However, PJP's patch breaks migration by changing a 4-byte field to
>> 1-byte.  The correct fix is to range-check the fields in
>> ps2_common_post_load.
> 
> Agreed.
> 
> Dave
> 
>> Thanks,
>>
>> Paolo
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 




Re: [Qemu-devel] [PATCH for-2.11] qcow2: Fix overly broad madvise()

2017-11-15 Thread Max Reitz
On 2017-11-15 10:09, Alberto Garcia wrote:
> On Tue 14 Nov 2017 07:41:27 PM CET, Max Reitz wrote:
>> @mem_size and @offset are both size_t, thus subtracting them from one
>> another will just return a big size_t if mem_size < offset -- even more
>> obvious here because the result is stored in another size_t.
>>
>> Checking that result to be positive is therefore not sufficient to
>> excluse the case that offset > mem_size.  Thus, we currently sometimes
>> issue an madvise() over a very large address range.
>>
>> This is triggered by iotest 163, but with -m64, this does not result in
>> tangible problems.  But with -m32, this test produces three segfaults,
>> all of which are fixed by this patch.
>>
>> Signed-off-by: Max Reitz 
> 
> Oh, I guess this happens when the page size is larger than the cluster
> size? Otherwise I don't see how...
> 
> Reviewed-by: Alberto Garcia 

Yes, the test uses 512 byte clusters.

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH] exec: Skip mru section if it's a partial page and not resolving subpage

2017-11-15 Thread Paolo Bonzini
On 14/11/2017 23:42, BALATON Zoltan wrote:
> This fixes a crash caused by picking the wrong memory region in
> address_space_lookup_region seen with client code accessing a device
> model that uses alias memory regions.
> 
> Signed-off-by: BALATON Zoltan 
> ---
>  exec.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/exec.c b/exec.c
> index 97a24a8..e5f2b9a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -413,6 +413,7 @@ static MemoryRegionSection 
> *address_space_lookup_region(AddressSpaceDispatch *d,
>  bool update;
>  
>  if (section && section != &d->map.sections[PHYS_SECTION_UNASSIGNED] &&
> +(resolve_subpage || !section->offset_within_region) &&
>  section_covers_addr(section, addr)) {
>  update = false;
>  } else {
> 

This is another possibility:

diff --git a/exec.c b/exec.c
index 97a24a875e..3bb9fcf257 100644
--- a/exec.c
+++ b/exec.c
@@ -410,22 +410,16 @@ static MemoryRegionSection 
*address_space_lookup_region(AddressSpaceDispatch *d,
 {
 MemoryRegionSection *section = atomic_read(&d->mru_section);
 subpage_t *subpage;
-bool update;
 
-if (section && section != &d->map.sections[PHYS_SECTION_UNASSIGNED] &&
-section_covers_addr(section, addr)) {
-update = false;
-} else {
+if (!section || section == &d->map.sections[PHYS_SECTION_UNASSIGNED] ||
+!section_covers_addr(section, addr)) {
 section = phys_page_find(d, addr);
-update = true;
+atomic_set(&d->mru_section, section);
 }
 if (resolve_subpage && section->mr->subpage) {
 subpage = container_of(section->mr, subpage_t, iomem);
 section = &d->map.sections[subpage->sub_section[SUBPAGE_IDX(addr)]];
 }
-if (update) {
-atomic_set(&d->mru_section, section);
-}
 return section;
 }
 

It will skip the expensive phys_page_find but not the cheap subpage lookup.
Does it work for you?

Paolo



Re: [Qemu-devel] HAXM is now open source

2017-11-15 Thread Paolo Bonzini
On 15/11/2017 09:25, Yu Ning wrote:
> 
> 
> On 11/15/2017 3:13, John Snow wrote:
>>
>> On 11/14/2017 06:09 AM, Thomas Huth wrote:
>>>
>>> That's great news! I hope this all will help to promote QEMU on Windows
>>> and macOS quite a bit!
>>>
>>> However, during the past months, I noticed a couple of times that users
>>> ask on IRC or the qemu-discuss mailing list how they could accelerate
>>> their QEMU on Windows - and they are running only in TCG mode when you
>>> ask how they start QEMU. So it seems like there is not much knowledge
>>> about "--accel hax" in the public yet. Maybe you could write a nice blog
>>> post for the QEMU blog or something similar that explains how to use
>>> HAXM with QEMU on Windows for the normal users? Or maybe make it more
>>> prominent in the QEMU wiki? (e.g. the main page only mentions KVM and
>>> Xen, but not HAXM)
>> A blog post advertising this new development would be an absolute
>> miracle for linking to people who are just getting started with QEMU on
>> Windows.
>>
>> (It would also be really good for idiots like me, who do not use Windows
>> for anything other than playing video games and sometimes forget that it
>> is capable of doing other things.)
> 
> Thanks for the encouragement.
> 
> I think I can start by writing a small section that can be added to an
> existing document, and later expanded to a blog article. But I haven't
> found a suitable place for it on the QEMU wiki.
> 
> Does it make sense to add the proposed piece to the QEMU user manual
> (qemu-doc.texi), under the Quick Start section (2.2)? The user manual is
> published on Stefan's website and referred to by qemu.org:
> 
> https://qemu.weilnetz.de/doc/qemu-doc.html

Unfortunately the "Quick Start" section is quite out of date (it doesn't
even mention KVM!) and is basically a huge list of options.  We are in
the process of improving QEMU documentation, so I think a blog post is
preferrable at this stage.

Thanks,

Paolo

> so I think it's a popular resource among users.
> 
> BTW, I assume most Windows users begin their QEMU journey from this page
> (also credit to Stefan):
> 
> https://qemu.weilnetz.de/
> 
> and I've verified the hax accelerator module is built into the latest
> binaries there (at least the W64 ones).
> 
> -Yu
> 




[Qemu-devel] [Bug 1728256] Re: Memory corruption in Windows 10 guest / amd64

2017-11-15 Thread adg
I've also had the exact symptoms and issues you've described. I have
also noticed that the VM would BSOD with the
CRITICAL_STRUCTURE_CORRUPTION message when the host system would read VM
memory from swap.

After disabling swap on the host system I've completely managed to
eliminate this BSOD issue. Hopefully it's also applicable to your system
so you can atleast figure out how to move forward.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1728256

Title:
  Memory corruption in Windows 10 guest / amd64

Status in QEMU:
  New

Bug description:
  I have a Win 10 Pro x64 guest inside a qemu/kvm running on an Arch x86_64 
host. The VM has a physical GPU passed through, as well as the physical USB 
controllers, as well as a dedicated SSD attached via SATA; you can find the 
complete libvirt xml here: https://pastebin.com/U1ZAXBNg
  I built qemu from source using the qemu-minimal-git AUR package; you can find 
the build script here: 
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=qemu-minimal-git (if you 
aren't familiar with Arch, this is essentially a bash script where build() and 
package() are run to build the files, and then install them into the $pkgdir to 
later tar them up.)

  Starting with qemu v2.10.0, Windows crashes randomly with a bluescreen
  about CRITICAL_STRUCTURE_CORRUPTION. I also tested the git heads
  f90ea7ba7c, 861cd431c9 and e822e81e35, before I went back to v2.9.0,
  which is running stable for over 50 hours right now.

  During my tests I found that locking the memory pages alleviates the
  problem somewhat, but never completely avoids it. However, with the
  crashes occuring randomly, that could as well be false conclusions; I
  had crashes within minutes after boot with that too.

  I will now start `git bisect`ing; if you have any other suggestions on
  what I could try or possible patches feel free to leave them with me.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1728256/+subscriptions



Re: [Qemu-devel] Abnormal observation during migration: too many "write-not-dirty" pages

2017-11-15 Thread Chunguang Li



> -Original Messages-
> From: "Juan Quintela" 
> Sent Time: 2017-11-15 17:45:44 (Wednesday)
> To: "Chunguang Li" 
> Cc: qemu-devel@nongnu.org, dgilb...@redhat.com, amit.s...@redhat.com, 
> pbonz...@redhat.com, stefa...@redhat.com
> Subject: Re: Abnormal observation during migration: too many 
> "write-not-dirty" pages
> 
> "Chunguang Li"  wrote:
> > Hi all! 
> 
> Hi
> 
> Sorry for the delay, I was on vacation an still getting up to speed.

Hi, Juan, thanks for your reply.

> 
> > I got a very abnormal observation for the VM migration. I found that many 
> > pages marked as dirty during
> > migration are "not really dirty", which is, their content are the same as 
> > the old version. 
> 
> I think your test is quite good, and I am also ashamed that 80% of
> "false" dirty pages is really a lot.
> 
> > I did the migration experiment like this: 
> >
> > During the setup phase of migration, first I suspended the VM. Then I 
> > copied all the pages within the guest
> > physical address space to a memory buffer as large as the guest memory 
> > size. After that, the dirty tracking
> > began and I resumed the VM. Besides, at the end
> > of each iteration, I also suspended the VM temporarily. During the 
> > suspension, I compared the content of all
> > the pages marked as dirty in this iteration byte-by-byte with their former 
> > copies inside the buffer. If the
> > content of one page was the same as its former copy, I recorded it as a 
> > "write-not-dirty" page (the page is
> > written exactly with the same content as the old version). Otherwise, I 
> > replaced this page in the buffer with
> > the new content, for the possible comparison in the future. After the reset 
> > of the dirty bitmap, I resumed the
> > VM. Thus, I obtain the proportion of the write-not-dirty pages within all 
> > the pages marked as dirty for each
> > pre-copy iteration. 
> 
> 
> vhost and friends could make a small difference here, but in general,
> this approach should be ok.
> 
> > I repeated this experiment with 15 workloads, which are 11 CPU2006 
> > benchmarks, Memcached server,
> > kernel compilation, playing a video, and an idle VM. The CPU2006 benchmarks 
> > and Memcached are
> > write-intensive workloads. So almost all of them did not converge to 
> > stop-copy. 
> 
> That is the impressive part, 15 workloads.  Thanks for taking the effor.
> 
> BTW, do you have your qemu changes handy, just to be able to test
> locally, and "review" how do you measure things.

Sorry, I do not have my changes handy. But don't worry, I will send them to you 
tomorrow morning. It's night here.

> 
> 
> > Startlingly, the proportions of the write-not-dirty pages are quite high. 
> > Memcached and three CPU2006
> > benchmarks(zeusmp, mcf and bzip2) have the most high proportions. Their 
> > proportions of the write-not-dirty
> > pages within all the dirty pages are as high as 45%-80%.
> 
> Or the workload does really stupid things like:
> 
> a = 0;
> a = 1;
> a = 0;
> 
> This makes no sense at all.
> 
> Just in case, could you try to test this with xbzrle?  It should go well
> with this use case (but you need to get a big enough buffer to cache
> enough memory).

In fact, I have tested these workloads (the 45%-80% ones) with xbzrle. And when 
the buffer is big enough, they really go well. While they do not converge to 
stop-copy before, now they finish migration quickly.

> 
> 
> > The proportions of the other workloads are about
> > 5%-20%, which are also abnormal. According to my intuition, the proportion 
> > of write-not-dirty pages should be
> > far less than these numbers. I think it should be quite a particular case 
> > that one page is written with exactly
> > the same content as the former data. 
> 
> I agree with that.
> 
> > Besides, the zero pages are not counted for all the results. Because I 
> > think codes like memset() may write
> > large area of pages to zero pages, which are already zero pages before. 
> >
> > I excluded some possible unknown reasons with the machine hardware, because 
> > I repeated the experiments
> > with two sets of different machines. Then I guessed it might be related 
> > with the huge page feature. However,
> > the result was the same when I turned the huge page feature off in the OS. 
> 
> Huge page could have caused that.  Remember that we have transparent
> huge pages.  I have to look at that code.

In fact, the results are the same no matter I turn on or turn off the 
transparent huge pages in the OS.

Later, Chunguang.

> 
> > Now there are only two possible reasons in my opinion. 
> >
> > First, there is some bugs in the KVM kernel dirty tracking mechanism. It 
> > may mark some pages that do not
> > receive write request as dirty. 
> 
> That is a posibilty.
> 
> > Second, there is some bugs in the OS running inside the VM. It may issue 
> > some unnecessary write
> > requests. 
> >
> > What do you think about this abnormal phenomenon? Any advice or possible 
> > reasons or even gu

Re: [Qemu-devel] Abnormal observation during migration: too many "write-not-dirty" pages

2017-11-15 Thread Dr. David Alan Gilbert
* Chunguang Li (lichungu...@hust.edu.cn) wrote:
> 
> 
> 
> > -原始邮件-
> > 发件人: "Dr. David Alan Gilbert" 
> > 发送时间: 2017-11-15 18:11:37 (星期三)
> > 收件人: "Chunguang Li" 
> > 抄送: qemu-devel@nongnu.org, quint...@redhat.com, amit.s...@redhat.com, 
> > pbonz...@redhat.com, stefa...@redhat.com
> > 主题: Re: [Qemu-devel] Abnormal observation during migration: too many 
> > "write-not-dirty" pages
> > 
> > * Chunguang Li (lichungu...@hust.edu.cn) wrote:
> > > Hi all!
> > > 
> > > I got a very abnormal observation for the VM migration. I found that many 
> > > pages marked as dirty during migration are "not really dirty", which is, 
> > > their content are the same as the old version.
> > > 
> > > 
> > > 
> > > 
> > > I did the migration experiment like this:
> > > 
> > > During the setup phase of migration, first I suspended the VM. Then I 
> > > copied all the pages within the guest physical address space to a memory 
> > > buffer as large as the guest memory size. After that, the dirty tracking 
> > > began and I resumed the VM. Besides, at the end
> > > of each iteration, I also suspended the VM temporarily. During the 
> > > suspension, I compared the content of all the pages marked as dirty in 
> > > this iteration byte-by-byte with their former copies inside the buffer. 
> > > If the content of one page was the same as its former copy, I recorded it 
> > > as a "write-not-dirty" page (the page is written exactly with the same 
> > > content as the old version). Otherwise, I replaced this page in the 
> > > buffer with the new content, for the possible comparison in the future. 
> > > After the reset of the dirty bitmap, I resumed the VM. Thus, I obtain the 
> > > proportion of the write-not-dirty pages within all the pages marked as 
> > > dirty for each pre-copy iteration.
> > > 
> > > I repeated this experiment with 15 workloads, which are 11 CPU2006 
> > > benchmarks, Memcached server, kernel compilation, playing a video, and an 
> > > idle VM. The CPU2006 benchmarks and Memcached are write-intensive 
> > > workloads. So almost all of them did not converge to stop-copy.
> > > 
> > > 
> > > 
> > > 
> > > Startlingly, the proportions of the write-not-dirty pages are quite high. 
> > > Memcached and three CPU2006 benchmarks(zeusmp, mcf and bzip2) have the 
> > > most high proportions. Their proportions of the write-not-dirty pages 
> > > within all the dirty pages are as high as 45%-80%. The proportions of the 
> > > other workloads are about 5%-20%, which are also abnormal. According to 
> > > my intuition, the proportion of write-not-dirty pages should be far less 
> > > than these numbers. I think it should be quite a particular case that one 
> > > page is written with exactly the same content as the former data.
> > > 
> > > Besides, the zero pages are not counted for all the results. Because I 
> > > think codes like memset() may write large area of pages to zero pages, 
> > > which are already zero pages before.
> > > 
> > > 
> > > 
> > > 
> > > I excluded some possible unknown reasons with the machine hardware, 
> > > because I repeated the experiments with two sets of different machines. 
> > > Then I guessed it might be related with the huge page feature. However, 
> > > the result was the same when I turned the huge page feature off in the OS.
> > > 
> > > 
> > > 
> > > 
> > > Now there are only two possible reasons in my opinion. 
> > > 
> > > First, there is some bugs in the KVM kernel dirty tracking mechanism. It 
> > > may mark some pages that do not receive write request as dirty.
> > > 
> > > Second, there is some bugs in the OS running inside the VM. It may issue 
> > > some unnecessary write requests.
> > > 
> > > 
> > > What do you think about this abnormal phenomenon? Any advice or possible 
> > > reasons or even guesses? I appreciate any responses, because it has 
> > > confused me for a long time. Thank you.
> > 
> > Wasn't it you who pointed out last year the other possibility? - The
> > problem of false positives due to sync'ing the whole of memory and then
> > writing the data out, but some of the dirty pages were already written?
> > 
> > Dave
> 
> Yes, you remember that!

Yes, I remember that, and my TODO list told me it was you :-)

> It was me. After that, I did more analysis and experiments. I found that, in 
> fact, both reasons contribute to the "fake dirty" pages (dirty pages that do 
> not need to be resent, because their contents are the same as that in the 
> target node). One is what I pointed out last year, which you have mentioned. 
> The other reason is what I am talking about now, the "write-not-dirty" 
> phenomenon.
> In fact, according to my experiments results, the "wirte-not-dirty" is the 
> main reason resulting to the "fake dirty" pages, while sync'ing the whole of 
> memory contributes less.

How do you differentiate between "fake dirty' and the syncing?

The cases where values change back to what they used to be seem
the most likely to me (e.g. locks/counts 

Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group

2017-11-15 Thread Shameerali Kolothum Thodi
Hi Alex,

> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, November 14, 2017 3:48 PM
> To: Zhuyijun 
> Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
> eric.au...@redhat.com; peter.mayd...@linaro.org; Shameerali Kolothum
> Thodi ; Zhaoshenglong
> 
> Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> reserved_region of device iommu group
> 
> On Tue, 14 Nov 2017 09:15:50 +0800
>  wrote:
> 
> > From: Zhu Yijun 
> >
> > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > window and PCI reserved window which has to be excluded from Guest iova
> allocations.
> >
> > However, If it falls within the Qemu default virtual memory address
> > space, then reserved regions may get allocated for a Guest VF DMA iova
> > and it will fail.
> >
> > So get those reserved regions in this patch and create some holes in
> > the Qemu ram address in next patchset.
> >
> > Signed-off-by: Zhu Yijun 
> > ---
> >  hw/vfio/common.c  | 67
> +++
> >  hw/vfio/pci.c |  2 ++
> >  hw/vfio/platform.c|  2 ++
> >  include/exec/memory.h |  7 +
> >  include/hw/vfio/vfio-common.h |  3 ++
> >  5 files changed, 81 insertions(+)
> 
> I generally prefer the vfio interface to be more self sufficient, if there are
> regions the IOMMU cannot map, we should be describing those via capabilities
> on the container through the vfio interface.  If we're just scraping together
> things from sysfs, the user can just as easily do that and provide an explicit
> memory map for the VM taking the devices into account. 

Ok. I was under the impression that the purpose of introducing the 
/sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions 
that are reserved(MSI or non-mappable) for Qemu or other apps to
make use of.  I think this was introduced as part of the "KVM/MSI passthrough
support on ARM" patch series. And if I remember correctly, Eric had 
an approach where the user space can retrieve all the reserved regions through
the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the 
sysfs interface.

May be I am missing something here.

> Of course having a
> device associated with a restricted memory map that conflicts with the default
> memory map for the VM implies that you can never support hot-add of such
> devices.

True.  Hot-add and migration will have issues on these platforms.

Thanks,
Shameer



[Qemu-devel] [Question] why need to start all queues in vhost_net_start

2017-11-15 Thread Longpeng(Mike)
Hi guys,

We got a BUG report from our testers yesterday, the testing scenario was
migrating a VM (Windows guest, *4 vcpus*, 4GB, vhost-user net: *7 queues*).

We found the cause reason, and we'll report the BUG or send a fix patch
to upstream if necessary( we haven't test the upstream yet, sorry... ).

We want to know why the vhost_net_start() must start *total queues* ( in our
VM there're 7 queues ) but not *the queues that current used* ( in our VM, guest
only uses the first 4 queues because it's limited by the number of vcpus) ?

Looking forward to your help, thx :)

-- 
Regards,
Longpeng



Re: [Qemu-devel] [Question] why need to start all queues in vhost_net_start

2017-11-15 Thread Jason Wang



On 2017年11月15日 22:55, Longpeng(Mike) wrote:

Hi guys,

We got a BUG report from our testers yesterday, the testing scenario was
migrating a VM (Windows guest, *4 vcpus*, 4GB, vhost-user net: *7 queues*).

We found the cause reason, and we'll report the BUG or send a fix patch
to upstream if necessary( we haven't test the upstream yet, sorry... ).


Could you explain this a little bit more?



We want to know why the vhost_net_start() must start *total queues* ( in our
VM there're 7 queues ) but not *the queues that current used* ( in our VM, guest
only uses the first 4 queues because it's limited by the number of vcpus) ?

Looking forward to your help, thx :)


Since the codes have been there for years and works well for kernel 
datapath. You should really explain what's wrong.


Thanks



Re: [Qemu-devel] [PATCH 2/5] qcow2: multiple clusters write compressed

2017-11-15 Thread Max Reitz
On 2017-11-14 11:16, Anton Nefedov wrote:
> From: Pavel Butsykin 
> 
> At the moment, qcow2_co_pwritev_compressed can process the requests size
> less than or equal to one cluster. This patch added possibility to write
> compressed data in the QCOW2 more than one cluster. The implementation
> is simple, we just split large requests into separate clusters and write
> using existing functionality. For unaligned requests we use a workaround
> and do write data without compression.
> 
> Signed-off-by: Anton Nefedov 
> ---
>  block/qcow2.c | 77 
> +++
>  1 file changed, 56 insertions(+), 21 deletions(-)

On one hand, it might be better to do this centrally somewhere in
block/io.c.  On the other, that would require more work because it would
probably mean having to introduce another field in BlockLimits, and it
wouldn't do much -- because qcow (v1) is, well, qcow v1...  And vmdk
seems to completely not care about this single cluster limitation.  So
for now we probably wouldn't even gain anything by doing this in block/io.c.

So long story short, it's OK to do this locally in qcow2, yes.

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 45c5651..3d5f17d 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -3325,11 +3325,9 @@ static int qcow2_truncate(BlockDriverState *bs, 
> int64_t offset,
>  return 0;
>  }
>  
> -/* XXX: put compressed sectors first, then all the cluster aligned
> -   tables to avoid losing bytes in alignment */
>  static coroutine_fn int
> -qcow2_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> -uint64_t bytes, QEMUIOVector *qiov)
> +qcow2_co_pwritev_cluster_compressed(BlockDriverState *bs, uint64_t offset,
> +uint64_t bytes, QEMUIOVector *qiov)
>  {
>  BDRVQcow2State *s = bs->opaque;
>  QEMUIOVector hd_qiov;
> @@ -3339,25 +3337,12 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, 
> uint64_t offset,
>  uint8_t *buf, *out_buf;
>  int64_t cluster_offset;
>  
> -if (bytes == 0) {
> -/* align end of file to a sector boundary to ease reading with
> -   sector based I/Os */
> -cluster_offset = bdrv_getlength(bs->file->bs);
> -if (cluster_offset < 0) {
> -return cluster_offset;
> -}
> -return bdrv_truncate(bs->file, cluster_offset, PREALLOC_MODE_OFF, 
> NULL);
> -}
> -
> -if (offset_into_cluster(s, offset)) {
> -return -EINVAL;
> -}
> +assert(bytes <= s->cluster_size);
> +assert(!offset_into_cluster(s, offset));
>  
>  buf = qemu_blockalign(bs, s->cluster_size);
> -if (bytes != s->cluster_size) {
> -if (bytes > s->cluster_size ||
> -offset + bytes != bs->total_sectors << BDRV_SECTOR_BITS)
> -{
> +if (bytes < s->cluster_size) {
> +if (offset + bytes != bs->total_sectors << BDRV_SECTOR_BITS) {
>  qemu_vfree(buf);
>  return -EINVAL;
>  }
> @@ -3437,6 +3422,56 @@ fail:
>  return ret;
>  }
>  
> +/* XXX: put compressed sectors first, then all the cluster aligned
> +   tables to avoid losing bytes in alignment */
> +static coroutine_fn int
> +qcow2_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> +uint64_t bytes, QEMUIOVector *qiov)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +QEMUIOVector hd_qiov;
> +uint64_t curr_off = 0;
> +int ret;
> +
> +if (bytes == 0) {
> +/* align end of file to a sector boundary to ease reading with
> +   sector based I/Os */
> +int64_t cluster_offset = bdrv_getlength(bs->file->bs);
> +if (cluster_offset < 0) {
> +return cluster_offset;
> +}
> +return bdrv_truncate(bs->file, cluster_offset, PREALLOC_MODE_OFF, 
> NULL);
> +}
> +
> +if (offset_into_cluster(s, offset)) {
> +return -EINVAL;
> +}
> +
> +qemu_iovec_init(&hd_qiov, qiov->niov);
> +do {
> +uint32_t chunk_size;
> +
> +qemu_iovec_reset(&hd_qiov);
> +chunk_size = MIN(bytes, s->cluster_size);
> +qemu_iovec_concat(&hd_qiov, qiov, curr_off, chunk_size);
> +
> +ret = qcow2_co_pwritev_cluster_compressed(bs, offset + curr_off,
> +  chunk_size, &hd_qiov);
> +if (ret == -ENOTSUP) {

Why this?  I mean, I can see the appeal, but then we should probably
actually return -ENOTSUP somewhere (e.g. for unaligned clusters and the
like) -- and we should not abort if offset_into_cluster(s, cluster) is
true, but we should write the header uncompressed and compress the main
bulk.

Max

> +ret = qcow2_co_pwritev(bs, offset + curr_off, chunk_size,
> +   &hd_qiov, 0);
> +}
> +if (ret < 0) {
> +break;
> +}
> +curr_off += chunk_size;
> +bytes -= chunk_size;
> +} while (bytes);
> 

Re: [Qemu-devel] [PATCH] fix scripts/update-linux-headers.sh here document

2017-11-15 Thread Christian Borntraeger

On 11/10/2017 10:03 AM, Gerd Hoffmann wrote:
> Signed-off-by: Gerd Hoffmann 

Tested-by: Christian Borntraeger 

> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 8b847e279b..e2b159aa3d 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -108,7 +108,7 @@ for arch in $ARCHLIST; do
>  if [ $arch = x86 ]; then
>  cat <<-EOF >"$output/include/standard-headers/asm-x86/hyperv.h"
>  /* this is a temporary placeholder until kvm_para.h stops including 
> it */
> -EOF
> +EOF
>  cp "$tmpdir/include/asm/unistd_32.h" "$output/linux-headers/asm-x86/"
>  cp "$tmpdir/include/asm/unistd_x32.h" 
> "$output/linux-headers/asm-x86/"
>  cp "$tmpdir/include/asm/unistd_64.h" "$output/linux-headers/asm-x86/"
> 




Re: [Qemu-devel] [PATCH 1/5] qcow2: reject unaligned offsets in write compressed

2017-11-15 Thread Max Reitz
On 2017-11-14 11:16, Anton Nefedov wrote:
> Misaligned compressed write is not supported.
> 
> Signed-off-by: Anton Nefedov 
> ---
>  block/qcow2.c | 4 
>  1 file changed, 4 insertions(+)

Thanks, applied to my block branch for 2.11:

https://github.com/XanClic/qemu/commits/block

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH for-2.12 v3 07/11] spapr: introduce an 'irq_base' number

2017-11-15 Thread Cédric Le Goater
On 11/14/2017 03:45 PM, Greg Kurz wrote:
> On Fri, 10 Nov 2017 15:20:13 +
> Cédric Le Goater  wrote:
> 
>> 'irq_base' is a base IRQ number which lets us allocate only the subset
>> of the IRQ numbers used on the sPAPR platform. It is sync with the
>> ICSState 'offset' attribute and this is slightly redundant. We could
>> also choose to waste some extra bytes (512) and allocate the whole
>> number space. To be discussed.
>>
>> But more important, it removes a dependency on the ICSState object of
>> the sPAPR machine which is required for XIVE.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  hw/ppc/spapr.c | 7 ---
>>  include/hw/ppc/spapr.h | 1 +
>>  2 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index bf0e5b4f815b..1cbbd7715a85 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2362,6 +2362,7 @@ static void ppc_spapr_init(MachineState *machine)
>>  /* Initialize the IRQ allocator */
>>  spapr->nr_irqs  = XICS_IRQS_SPAPR;
>>  spapr->irq_map  = bitmap_new(spapr->nr_irqs);
>> +spapr->irq_base = XICS_IRQ_BASE;
>>
> 
> Since this is a constant value, do we really need a machine-level value ?

no. I don't think either.   

But I would like to know why we are starting to allocate IRQ numbers 
at 4096 ? Only 2 is reserved fo IPIs. So that seems a little large. 
I have not found the reason though.


Also I am starting to think that we should probably segment the allocation 
per device like this is specified in the PAPR specs. Each device has one 
or more Bus Unit IDentifier (BUID) which acts as a prefix for the IRQ 
number. That would facilitate the IRQ numbering and fix some issues 
in migration when devices are hotplugged. I am thinking about phbs
mostly.

C.


> Especially now that all the code that needs it is in spapr.c, I guess it
> can directly use the macro, no ?
> 
>>  /* Set up Interrupt Controller before we create the VCPUs */
>>  xics_system_init(machine, spapr->nr_irqs, &error_fatal);
>> @@ -3630,7 +3631,7 @@ static void spapr_irq_free_block_2_11(XICSFabric *xi, 
>> int irq, int num)
>>  static bool spapr_irq_test(XICSFabric *xi, int irq)
>>  {
>>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>> -int srcno = irq - spapr->ics->offset;
>> +int srcno = irq - spapr->irq_base;
>>  
>>  return test_bit(srcno, spapr->irq_map);
>>  }
>> @@ -3656,13 +3657,13 @@ static int spapr_irq_alloc_block(XICSFabric *xi, int 
>> count, int align)
>>  }
>>  
>>  bitmap_set(spapr->irq_map, srcno, count);
>> -return srcno + spapr->ics->offset;
>> +return srcno + spapr->irq_base;
>>  }
>>  
>>  static void spapr_irq_free_block(XICSFabric *xi, int irq, int num)
>>  {
>>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>> -int srcno = irq - spapr->ics->offset;
>> +int srcno = irq - spapr->irq_base;
>>  
>>  bitmap_clear(spapr->irq_map, srcno, num);
>>  }
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 023436c32b2a..200667dcff9d 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -82,6 +82,7 @@ struct sPAPRMachineState {
>>  int32_t nr_irqs;
>>  unsigned long *irq_map;
>>  unsigned long *irq_map_ref;
>> +uint32_t irq_base;
>>  ICSState *ics;
>>  sPAPRRTCState rtc;
>>  
> 




Re: [Qemu-devel] [PATCH] fix scripts/update-linux-headers.sh here document

2017-11-15 Thread Paolo Bonzini
On 10/11/2017 10:03, Gerd Hoffmann wrote:
> Signed-off-by: Gerd Hoffmann 
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 8b847e279b..e2b159aa3d 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -108,7 +108,7 @@ for arch in $ARCHLIST; do
>  if [ $arch = x86 ]; then
>  cat <<-EOF >"$output/include/standard-headers/asm-x86/hyperv.h"
>  /* this is a temporary placeholder until kvm_para.h stops including 
> it */
> -EOF
> +EOF
>  cp "$tmpdir/include/asm/unistd_32.h" "$output/linux-headers/asm-x86/"
>  cp "$tmpdir/include/asm/unistd_x32.h" 
> "$output/linux-headers/asm-x86/"
>  cp "$tmpdir/include/asm/unistd_64.h" "$output/linux-headers/asm-x86/"
> 

Queued, thanks.

Paolo



Re: [Qemu-devel] [PATCH] Revert "docker: Enable features explicitly in test-full"

2017-11-15 Thread Paolo Bonzini
On 18/10/2017 10:20, Fam Zheng wrote:
> This reverts commit 5e8a7fe673ca5949bd51939ee36faaf3b1320de8.
> 
> It's hard to get all images to have all these packages, the usual
> "FEATURES" and "require" mechanism doesn't scale with so many features.
> With that change, the test basically only works in ubuntu.
> 
> Until a better way comes up, leave the feature enabling to ./configure
> detection.
> 
> But don't remove the "-e" removal.
> 
> Reported-by: Paolo Bonzini 
> Signed-off-by: Fam Zheng 
> ---
>  tests/docker/test-full | 79 
> +++---
>  1 file changed, 4 insertions(+), 75 deletions(-)
> 
> diff --git a/tests/docker/test-full b/tests/docker/test-full
> index 816d5a3eec..b4e42d25d7 100755
> --- a/tests/docker/test-full
> +++ b/tests/docker/test-full
> @@ -1,8 +1,8 @@
>  #!/bin/bash
>  #
> -# Compile all the targets with as many features enabled as possible
> +# Compile all the targets.
>  #
> -# Copyright 2016, 2017 Red Hat Inc.
> +# Copyright (c) 2016 Red Hat Inc.
>  #
>  # Authors:
>  #  Fam Zheng 
> @@ -13,77 +13,6 @@
>  
>  . common.rc
>  
> -cd "$BUILD_DIR" || exit 1
> +cd "$BUILD_DIR"
>  
> -build_qemu \
> ---enable-attr \
> ---enable-bluez \
> ---enable-brlapi \
> ---enable-bsd-user \
> ---enable-bzip2 \
> ---enable-cap-ng \
> ---enable-coroutine-pool \
> ---enable-crypto-afalg \
> ---enable-curl \
> ---enable-curses \
> ---enable-debug \
> ---enable-debug-info \
> ---enable-debug-tcg \
> ---enable-docs \
> ---enable-fdt \
> ---enable-gcrypt \
> ---enable-glusterfs \
> ---enable-gnutls \
> ---enable-gprof \
> ---enable-gtk \
> ---enable-guest-agent \
> ---enable-jemalloc \
> ---enable-kvm \
> ---enable-libiscsi \
> ---enable-libnfs \
> ---enable-libssh2 \
> ---enable-libusb \
> ---enable-linux-aio \
> ---enable-linux-user \
> ---enable-live-block-migration \
> ---enable-lzo \
> ---enable-modules \
> ---enable-numa \
> ---enable-opengl \
> ---enable-pie \
> ---enable-profiler \
> ---enable-qom-cast-debug \
> ---enable-rbd \
> ---enable-rdma \
> ---enable-replication \
> ---enable-sdl \
> ---enable-seccomp \
> ---enable-smartcard \
> ---enable-snappy \
> ---enable-spice \
> ---enable-stack-protector \
> ---enable-system \
> ---enable-tcg \
> ---enable-tcg-interpreter \
> ---enable-tools \
> ---enable-tpm \
> ---enable-trace-backend=ftrace \
> ---enable-usb-redir \
> ---enable-user \
> ---enable-vde \
> ---enable-vhost-net \
> ---enable-vhost-scsi \
> ---enable-vhost-user \
> ---enable-vhost-vsock \
> ---enable-virtfs \
> ---enable-vnc \
> ---enable-vnc-jpeg \
> ---enable-vnc-png \
> ---enable-vnc-sasl \
> ---enable-vte \
> ---enable-werror \
> ---enable-xen \
> ---enable-xen-pci-passthrough \
> ---enable-xen-pv-domain-build \
> ---enable-xfsctl \
> -&& make check $MAKEFLAGS && install_qemu
> +build_qemu && make check $MAKEFLAGS && install_qemu
> 

Was this never committed?

Thanks,

Paolo



[Qemu-devel] [PATCH] s390/kvm_virtio/linux-headers: remove traces of old virtio transport

2017-11-15 Thread Christian Borntraeger
We no longer support the old s390 transport, neither does the newest
Linux kernel. Remove it from the linux header script as well as the
s390x virtio code.  We still should handle the VIRTIO_NOTIFY hypercall,
to tolerate early printk on older guest kernels without an sclp console.
We continue to ignore these events.

Signed-off-by: Christian Borntraeger 
---
 hw/s390x/s390-virtio-hcall.h   |  6 ++-
 include/standard-headers/asm-s390/kvm_virtio.h | 64 --
 scripts/update-linux-headers.sh|  1 -
 3 files changed, 4 insertions(+), 67 deletions(-)
 delete mode 100644 include/standard-headers/asm-s390/kvm_virtio.h

diff --git a/hw/s390x/s390-virtio-hcall.h b/hw/s390x/s390-virtio-hcall.h
index cbc270e..9800c4b 100644
--- a/hw/s390x/s390-virtio-hcall.h
+++ b/hw/s390x/s390-virtio-hcall.h
@@ -1,7 +1,7 @@
 /*
  * Support for virtio hypercalls on s390x
  *
- * Copyright 2012 IBM Corp.
+ * Copyright IBM Corp. 2012, 2017
  * Author(s): Cornelia Huck 
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or (at
@@ -12,9 +12,11 @@
 #ifndef HW_S390_VIRTIO_HCALL_H
 #define HW_S390_VIRTIO_HCALL_H
 
-#include "standard-headers/asm-s390/kvm_virtio.h"
 #include "standard-headers/asm-s390/virtio-ccw.h"
 
+/* The only thing that we need from the old kvm_virtio.h file */
+#define KVM_S390_VIRTIO_NOTIFY 0
+
 typedef int (*s390_virtio_fn)(const uint64_t *args);
 void s390_register_virtio_hypercall(uint64_t code, s390_virtio_fn fn);
 int s390_virtio_hypercall(CPUS390XState *env);
diff --git a/include/standard-headers/asm-s390/kvm_virtio.h 
b/include/standard-headers/asm-s390/kvm_virtio.h
deleted file mode 100644
index daad324..000
--- a/include/standard-headers/asm-s390/kvm_virtio.h
+++ /dev/null
@@ -1,64 +0,0 @@
-/*
- * definition for virtio for kvm on s390
- *
- * Copyright IBM Corp. 2008
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License (version 2 only)
- * as published by the Free Software Foundation.
- *
- *Author(s): Christian Borntraeger 
- */
-
-#ifndef __KVM_S390_VIRTIO_H
-#define __KVM_S390_VIRTIO_H
-
-#include "standard-headers/linux/types.h"
-
-struct kvm_device_desc {
-   /* The device type: console, network, disk etc.  Type 0 terminates. */
-   uint8_t type;
-   /* The number of virtqueues (first in config array) */
-   uint8_t num_vq;
-   /*
-* The number of bytes of feature bits.  Multiply by 2: one for host
-* features and one for guest acknowledgements.
-*/
-   uint8_t feature_len;
-   /* The number of bytes of the config array after virtqueues. */
-   uint8_t config_len;
-   /* A status byte, written by the Guest. */
-   uint8_t status;
-   uint8_t config[0];
-};
-
-/*
- * This is how we expect the device configuration field for a virtqueue
- * to be laid out in config space.
- */
-struct kvm_vqconfig {
-   /* The token returned with an interrupt. Set by the guest */
-   uint64_t token;
-   /* The address of the virtio ring */
-   uint64_t address;
-   /* The number of entries in the virtio_ring */
-   uint16_t num;
-
-};
-
-#define KVM_S390_VIRTIO_NOTIFY 0
-#define KVM_S390_VIRTIO_RESET  1
-#define KVM_S390_VIRTIO_SET_STATUS 2
-
-/* The alignment to use between consumer and producer parts of vring.
- * This is pagesize for historical reasons. */
-#define KVM_S390_VIRTIO_RING_ALIGN 4096
-
-
-/* These values are supposed to be in ext_params on an interrupt */
-#define VIRTIO_PARAM_MASK  0xff
-#define VIRTIO_PARAM_VRING_INTERRUPT   0x0
-#define VIRTIO_PARAM_CONFIG_CHANGED0x1
-#define VIRTIO_PARAM_DEV_ADD   0x2
-
-#endif
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index ad80fe3..712588e 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -95,7 +95,6 @@ for arch in $ARCHLIST; do
 rm -rf "$output/include/standard-headers/asm-$arch"
 mkdir -p "$output/include/standard-headers/asm-$arch"
 if [ $arch = s390 ]; then
-cp_portable "$tmpdir/include/asm/kvm_virtio.h" 
"$output/include/standard-headers/asm-s390/"
 cp_portable "$tmpdir/include/asm/virtio-ccw.h" 
"$output/include/standard-headers/asm-s390/"
 fi
 if [ $arch = arm ]; then
-- 
2.9.4




Re: [Qemu-devel] [Qemu-block] segfault in parallel blockjobs (iotest 30)

2017-11-15 Thread Alberto Garcia
On Fri 10 Nov 2017 04:02:23 AM CET, Fam Zheng wrote:
>> > I'm thinking that perhaps we should add the pause point directly to
>> > block_job_defer_to_main_loop(), to prevent any block job from
>> > running the exit function when it's paused.
>> 
>> I was trying this and unfortunately this breaks the mirror job at
>> least (reproduced with a simple block-commit on the topmost node,
>> which uses commit_active_start() -> mirror_start_job()).
>> 
>> So what happens is that mirror_run() always calls
>> bdrv_drained_begin() before returning, pausing the block job. The
>> closing bdrv_drained_end() call is at the end of mirror_exit(),
>> already in the main loop.
>> 
>> So the mirror job is always calling block_job_defer_to_main_loop()
>> and mirror_exit() when the job is paused.
>
> FWIW, I think Max's report on 194 failures is related:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html
>
> so perhaps it's worth testing his patch too:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html

Well, that doesn't solve the original crash with parallel block jobs.
The root of the crash is that the mirror job manipulates the graph
_while being paused_, so the BlockReopenQueue in bdrv_reopen_multiple()
gets messed up, and pausing the jobs (commit 40840e419be31e) won't help.

I have the impression that one major source of headaches is the fact
that the reopen queue contains nodes that don't need to be reopened at
all. Ideally this should be detected early on in bdrv_reopen_queue(), so
there's no chance that the queue contains nodes used by a different
block job. If we had that then op blockers should be enough to prevent
these things. Or am I missing something?

Berto



[Qemu-devel] [Bug 1724570] Re: qemu-system-x86_64 generates ACPI tables with broken endianess when run on big-endian hosts

2017-11-15 Thread Thomas Huth
I think something like this should fix this issue:

diff a/tests/bios-tables-test.c b/tests/bios-tables-test.c
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -279,8 +279,19 @@ static void dump_aml_files(test_data *data, bool rebuild)
 }
 g_assert(fd >= 0);
 
+sdt->header.signature = cpu_to_le32(sdt->header.signature);
+sdt->header.length = cpu_to_le32(sdt->header.length);
+sdt->header.oem_revision = cpu_to_le32(sdt->header.oem_revision);
+sdt->header.asl_compiler_revision = 
cpu_to_le32(sdt->header.asl_compiler_revision);
+
 ret = qemu_write_full(fd, sdt, sizeof(AcpiTableHeader));
 g_assert(ret == sizeof(AcpiTableHeader));
+
+sdt->header.signature = le32_to_cpu(sdt->header.signature);
+sdt->header.length = le32_to_cpu(sdt->header.length);
+sdt->header.oem_revision = le32_to_cpu(sdt->header.oem_revision);
+sdt->header.asl_compiler_revision = 
le32_to_cpu(sdt->header.asl_compiler_revision);
+
 ret = qemu_write_full(fd, sdt->aml, sdt->aml_len);
 g_assert(ret == sdt->aml_len);

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1724570

Title:
  qemu-system-x86_64 generates ACPI tables with broken endianess when
  run on big-endian hosts

Status in QEMU:
  New

Bug description:
  The bios-tables-test always fails when run on a big-endian host, which
  has iasl installed. When it calls iasl to dumps the AML files into ASL
  files, iasl complains

  Intel ACPI Component Architecture
  ASL+ Optimizing Compiler/Disassembler version 20170831
  Copyright (c) 2000 - 2017 Intel Corporation

  Input file aml-4L677Y, Length 0x38 (56) bytes
  Table [TEPH] is too long for file - needs: 0x3800, remaining in file: 0x38
  Could not get ACPI tables from aml-4L677Y, AE_BAD_HEADER

  
  At first I thought this was an iasl bug, but the latest version of iasl in 
rawhide is ported to big endian.

  So I looked at the actual AML files that bios-tables-test extracts
  from the qemu-system-x86_64 memory space, when running on ppc64 host.
  These do indeed have different content from the AML files generated by
  qemu-system-x86_64 when running on an x86_64 host.

  
  eg the AML file for the HPET shows

  < 000   T   E   P   H nul nul nul   8 soh etx   B   O   C   H   S  sp
  <45544850380003014f4248432053
  < 020   B   X   P   C   H   P   E   T nul nul nul soh   B   X   P   C
  <5842435050485445010058424350
  < 040 nul nul nul soh soh   " ack nul nul nul nul nul nul nul   P   ~
  <0100a2018086fed0
  ---
  > 000   H   P   E   T   8 nul nul nul soh etx   B   O   C   H   S  sp
  >50485445003803014f4248432053
  > 020   B   X   P   C   H   P   E   T soh nul nul nul   B   X   P   C
  >5842435050485445000158424350
  > 040 soh nul nul nul soh   " ack nul nul nul nul nul nul nul   P   ~
  >0001a2018086fed0

  so not only is the table name inverted, but the lenght is inverted,
  and several fields later on are inverted too.

  Other AML files for APIC and DSDT show similar brokenness

  This is seen with QEMU 2.10.0

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1724570/+subscriptions



Re: [Qemu-devel] [PATCH for-2.12 v3 09/11] spapr: split the IRQ number space for LSI interrupts

2017-11-15 Thread Greg Kurz
On Fri, 10 Nov 2017 15:20:15 +
Cédric Le Goater  wrote:

> The type of an interrupt, MSI or LSI, is stored under the flag
> attribute of the ICSIRQState array. To reduce the use of this array
> and consequently of the ICSState object (This is needed to introduce
> the new XIVE model), we choose to split the IRQ number space of the
> machine in two: first the LSIs and then the MSIs.
> 
> This also has the benefit to keep the LSI IRQ numbers in a well known
> range which will be useful for PHB hotplug.
> 

Well... LSIs indeed land in a well known range, but it isn't enough for PHB
hotplug. Each PHB is uniquely identified by its 'index' property, and we
want each PHB to have fixed LSIs, so that they are invariant across migration.

> This change only applies to the latest pseries machines. Older
> machines still use the ICSIRQState array to define the IRQ type.
> 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Changes since v2 :
> 
>  - introduced a second set of XICSFabric IRQ operations for older
>pseries machines
> 
>  hw/intc/xics_spapr.c  |  6 +++---
>  hw/ppc/spapr.c| 33 +
>  include/hw/ppc/xics.h |  2 +-
>  3 files changed, 33 insertions(+), 8 deletions(-)pe-total-#msi
> 
> diff --git a/hw/intc/xics_spapr.c b/hw/intc/xics_spapr.c
> index de9e65d35247..b8e91aaf52bd 100644
> --- a/hw/intc/xics_spapr.c
> +++ b/hw/intc/xics_spapr.c
> @@ -260,7 +260,7 @@ int spapr_ics_alloc(ICSState *ics, int irq_hint, bool 
> lsi, Error **errp)
>  }
>  irq = irq_hint;
>  } else {
> -irq = xic->irq_alloc_block(ics->xics, 1, 1);
> +irq = xic->irq_alloc_block(ics->xics, 1, 1, lsi);
>  if (irq < 0) {
>  error_setg(errp, "can't allocate IRQ: no IRQ left");
>  return -1;
> @@ -297,9 +297,9 @@ int spapr_ics_alloc_block(ICSState *ics, int num, bool 
> lsi,
>  if (align) {
>  assert((num == 1) || (num == 2) || (num == 4) ||
> (num == 8) || (num == 16) || (num == 32));
> -first = xic->irq_alloc_block(ics->xics, num, num);
> +first = xic->irq_alloc_block(ics->xics, num, num, lsi);
>  } else {
> -first = xic->irq_alloc_block(ics->xics, num, 1);
> +first = xic->irq_alloc_block(ics->xics, num, 1, lsi);
>  }
>  if (first < 0) {
>  error_setg(errp, "can't find a free %d-IRQ block", num);
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index ce314fcf38db..f14eae6196cd 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3596,7 +3596,8 @@ static bool spapr_irq_test_2_11(XICSFabric *xi, int irq)
>  return !ICS_IRQ_FREE(ics, srcno);
>  }
>  
> -static int spapr_irq_alloc_block_2_11(XICSFabric *xi, int count, int align)
> +static int spapr_irq_alloc_block_2_11(XICSFabric *xi, int count, int align,
> +  bool lsi)
>  {
>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>  ICSState *ics = spapr->ics;
> @@ -3628,7 +3629,7 @@ static void spapr_irq_free_block_2_11(XICSFabric *xi, 
> int irq, int num)
>  }
>  }
>  
> -static bool spapr_irq_is_lsi(XICSFabric *xi, int irq)
> +static bool spapr_irq_is_lsi_2_11(XICSFabric *xi, int irq)
>  {
>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>  int srcno = irq - spapr->ics->offset;
> @@ -3644,10 +3645,21 @@ static bool spapr_irq_test(XICSFabric *xi, int irq)
>  return test_bit(srcno, spapr->irq_map);
>  }
>  
> -static int spapr_irq_alloc_block(XICSFabric *xi, int count, int align)
> +
> +/*
> + * Let's provision 4 LSIs per PHBs
> + */
> +#define SPAPR_MAX_LSI (SPAPR_MAX_PHBS * 4)
> +
> +/*
> + * Split the IRQ number space of the machine in two: first the LSIs
> + * and then the MSIs. This allows us to keep the LSI IRQ numbers in a
> + * well known range which is useful for PHB hotplug.
> + */
> +static int spapr_irq_alloc_block(XICSFabric *xi, int count, int align, bool 
> lsi)
>  {
>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
> -int start = 0;
> +int start = lsi ? 0 : SPAPR_MAX_LSI;
>  int srcno;
>  
>  /*
> @@ -3664,6 +3676,10 @@ static int spapr_irq_alloc_block(XICSFabric *xi, int 
> count, int align)
>  return -1;
>  }
>  
> +if (lsi && srcno >= SPAPR_MAX_LSI) {
> +return -1;
> +}
> +
>  bitmap_set(spapr->irq_map, srcno, count);
>  return srcno + spapr->irq_base;
>  }
> @@ -3676,6 +3692,14 @@ static void spapr_irq_free_block(XICSFabric *xi, int 
> irq, int num)
>  bitmap_clear(spapr->irq_map, srcno, num);
>  }
>  
> +static bool spapr_irq_is_lsi(XICSFabric *xi, int irq)
> +{
> +sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
> +int srcno = irq - spapr->irq_base;
> +
> +return (srcno >= 0) && (srcno < SPAPR_MAX_LSI);
> +}
> +
>  static void spapr_pic_print_info(InterruptStatsProvider *obj,
>   Monitor *mon)
>  {
> @@ -3860,6 +3884,7 @@ static void 
> spapr_machine_2_11_class_options(MachineClass *mc)
>  xic->irq_test = spapr_irq_test_2

Re: [Qemu-devel] [Question] why need to start all queues in vhost_net_start

2017-11-15 Thread Longpeng(Mike)
2017-11-15 23:05 GMT+08:00 Jason Wang :
>
>
> On 2017年11月15日 22:55, Longpeng(Mike) wrote:
>>
>> Hi guys,
>>
>> We got a BUG report from our testers yesterday, the testing scenario was
>> migrating a VM (Windows guest, *4 vcpus*, 4GB, vhost-user net: *7
>> queues*).
>>
>> We found the cause reason, and we'll report the BUG or send a fix patch
>> to upstream if necessary( we haven't test the upstream yet, sorry... ).
>
>
> Could you explain this a little bit more?
>
>>
>> We want to know why the vhost_net_start() must start *total queues* ( in
>> our
>> VM there're 7 queues ) but not *the queues that current used* ( in our VM,
>> guest
>> only uses the first 4 queues because it's limited by the number of vcpus)
>> ?
>>
>> Looking forward to your help, thx :)
>
>
> Since the codes have been there for years and works well for kernel
> datapath. You should really explain what's wrong.
>

OK. :)

In our scenario,  the Windows's virtio-net driver only use the first 4
queues and it
*only set desc/avail/used table for the first 4 queues*, so in QEMU
the desc/avail/
used of the last 3 queues are ZERO,  but unfortunately...
'''
vhost_net_start
  for (i = 0; i < total_queues; i++)
vhost_net_start_one
  vhost_dev_start
vhost_virtqueue_start
'''
In vhost_virtqueue_start(), it will calculate the HVA of
desc/avail/used table, so for last
3 queues, it will use ZERO as the GPA to calculate the HVA, and then
send the results
to the user-mode backend ( we use *vhost-user* ) by vhost_virtqueue_set_addr().

When the EVS get these address, it will update a *idx* which will be
treated as  vq's
last_avail_idx when virtio-net stop ( pls see vhost_virtqueue_stop() ).

So we get the following result after virtio-net stop:
  the desc/avail/used of the last 3 queues's vqs are all ZERO, but these vqs's
  last_avail_idx is NOT ZERO.

At last, virtio_load() reports an error:
'''
  if (!vdev->vq[i].vring.desc && vdev->vq[i].last_avail_idx) { // <--
will be TRUE
  error_report("VQ %d address 0x0 "
 "inconsistent with Host index 0x%x",
 i, vdev->vq[i].last_avail_idx);
return -1;
   }
'''

BTW, the problem won't appear if use Linux guest, because the Linux virtio-net
driver will set all 7 queues's desc/avail/used tables. And the problem
won't appear
if the VM use vhost-net, because vhost-net won't update *idx* in SET_ADDR ioctl.

Sorry for my pool English, Maybe I could describe the problem in Chinese for you
in private if necessary.


> Thanks


-- 
Regards,
Longpeng



[Qemu-devel] [qemu-img] support for XVA

2017-11-15 Thread Gandalf Corvotempesta
I'm thinking if support for XVA files could be added to qemu-img
The file-format is well known (it's just a tar archive) and there are scripts
that are able to convert an XVA file to a RAW image. (ie:
https://gist.github.com/miebach/0433947bcf053de23159)

Running these script on their own is very time consuming, as you have to
extract the XVA, convert any disk image from "single chunks" to a raw image
and then use qemu-img to convert from raw to qcow.

Maybe a native support will be able to skip some steps. (like the
conversion from
"single chunks" to raw and raw to qcow2)



[Qemu-devel] [Bug 1680679] Re: qemu cannot run twice

2017-11-15 Thread misairu
Does your Subsystem ID and Subsystem Vendor ID (of your GPU) show
correctly inside the WindowsVM?

It should be the same ID shown in your host. Otherwise that will trigger
the Code 43 error.

I once have this problem but now solve this by some vfio-pci option. Now
I have a laptop that passthrough my dGPU with OVMF working perfectly.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1680679

Title:
  qemu cannot run twice

Status in QEMU:
  Invalid

Bug description:
  After using qemu with gpu passthrough and then shutting down windows 7 
properly I cannot boot windows 7 a second time.
  Only a full reboot of linux fixes this issue.
  Qemu appears to corrupt something in linux when exiting.
  I get no error messages but windows 7 never finishes booting during the 2nd 
try.
  Apparently I do try to run vfiobind each time the script is run.
  Wondering if rerunning vfiobind can cause an issue?

  
  My specs:
  
-
  System:Host: GT70-2PE Kernel: 4.5.4-040504-generic x86_64 (64 bit gcc: 
5.3.1)
 Desktop: Cinnamon 3.2.7 (Gtk 3.18.9) Distro: Linux Mint 18.1 Serena
  Machine:   Mobo: Micro-Star model: MS-1763 v: REV:0.C Bios: American 
Megatrends v: E1763IMS.51B date: 01/29/2015
  CPU:   Quad core Intel Core i7-4810MQ (-HT-MCP-) cache: 6144 KB
 flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 22347
 clock speeds: max: 2801 MHz 1: 2801 MHz 2: 2801 MHz 3: 2801 MHz 4: 
2801 MHz 5: 2801 MHz 6: 2801 MHz
 7: 2801 MHz 8: 2801 MHz
  Graphics:  Card-1: Intel 4th Gen Core Processor Integrated Graphics 
Controller bus-ID: 00:02.0
 Card-2: NVIDIA GK104M [GeForce GTX 880M] bus-ID: 01:00.0
 Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
 Resolution: 1920x1080@60.02hz, 1920x1080@60.00hz
 GLX Renderer: Mesa DRI Intel Haswell Mobile GLX Version: 3.0 Mesa 
12.0.6 Direct Rendering: Yes

  
  My script:
  
---
  #!/bin/bash

  cd ~/qemu
  sudo ./up.sh tap0

  configfile=~/qemu/vfio-pci1.cfg

  vfiobind() {
  dev="$1"
  vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
  device=$(cat /sys/bus/pci/devices/$dev/device)
  if [ -e /sys/bus/pci/devices/$dev/driver ]; then
  echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
  fi
  echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id

  }

  modprobe vfio-pci

  cat $configfile | while read line;do
  echo $line | grep ^# >/dev/null 2>&1 && continue
  vfiobind $line
  done

  sudo qemu-system-x86_64 -machine type=q35,accel=kvm -cpu host,kvm=off \
  -smp 8,sockets=1,cores=4,threads=2 \
  -bios /usr/share/seabios/bios.bin \
  -serial none \
  -parallel none \
  -vga none \
  -m 4G \
  -mem-path /run/hugepages/kvm \
  -mem-prealloc \
  -balloon none \
  -rtc clock=host,base=localtime \
  -device 
ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
  -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
  -device virtio-scsi-pci,id=scsi \
  -drive 
id=disk0,if=virtio,cache=none,format=raw,file=/home/dad/qemu/windows7.img \
  -drive 
file=/home/dad/1TB-Backup/Iso/SP1ForWin7.iso,id=isocd,format=raw,if=none 
-device scsi-cd,drive=isocd \
  -net nic -net tap,ifname=tap0,script=no,downscript=no \
  -usbdevice host:413c:a503 \
  -usbdevice host:13fe:3100 \
  -usbdevice host:0bc2:ab21 \
  -boot menu=on \
  -boot order=c

  sudo ./down.sh tap0

  exit 0

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1680679/+subscriptions



Re: [Qemu-devel] [PATCH for-2.12 v3 09/11] spapr: split the IRQ number space for LSI interrupts

2017-11-15 Thread Cédric Le Goater
On 11/15/2017 03:52 PM, Greg Kurz wrote:
> On Fri, 10 Nov 2017 15:20:15 +
> Cédric Le Goater  wrote:
> 
>> The type of an interrupt, MSI or LSI, is stored under the flag
>> attribute of the ICSIRQState array. To reduce the use of this array
>> and consequently of the ICSState object (This is needed to introduce
>> the new XIVE model), we choose to split the IRQ number space of the
>> machine in two: first the LSIs and then the MSIs.
>>
>> This also has the benefit to keep the LSI IRQ numbers in a well known
>> range which will be useful for PHB hotplug.
>>
> 
> Well... LSIs indeed land in a well known range, but it isn't enough for PHB
> hotplug. Each PHB is uniquely identified by its 'index' property, and we
> want each PHB to have fixed LSIs, so that they are invariant across migration.

ok. 

So, as said in another email, we should think about segmenting
the allocation per device. At least for PHBs. This is specified in
the PAPR specs, each device has one or more Bus Unit IDentifier (BUID)
acting as a prefix for the IRQ number.

We could model that by using a specific range for each PHB in the 
overall IRQ number space, depending on some index. LSIs would be 
allocated at the beginning of this range, when the device is realized 
and MSIs later on when the guest starts.  

Identifying a LSI could be done using a mask on the IRQ number range 
of each PHB. It should be fast enough. I don't see other devices 
using LSIs under the sPAPR platform.


C.



>> This change only applies to the latest pseries machines. Older
>> machines still use the ICSIRQState array to define the IRQ type.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>
>>  Changes since v2 :
>>
>>  - introduced a second set of XICSFabric IRQ operations for older
>>pseries machines
>>
>>  hw/intc/xics_spapr.c  |  6 +++---
>>  hw/ppc/spapr.c| 33 +
>>  include/hw/ppc/xics.h |  2 +-
>>  3 files changed, 33 insertions(+), 8 deletions(-)pe-total-#msi
>>
>> diff --git a/hw/intc/xics_spapr.c b/hw/intc/xics_spapr.c
>> index de9e65d35247..b8e91aaf52bd 100644
>> --- a/hw/intc/xics_spapr.c
>> +++ b/hw/intc/xics_spapr.c
>> @@ -260,7 +260,7 @@ int spapr_ics_alloc(ICSState *ics, int irq_hint, bool 
>> lsi, Error **errp)
>>  }
>>  irq = irq_hint;
>>  } else {
>> -irq = xic->irq_alloc_block(ics->xics, 1, 1);
>> +irq = xic->irq_alloc_block(ics->xics, 1, 1, lsi);
>>  if (irq < 0) {
>>  error_setg(errp, "can't allocate IRQ: no IRQ left");
>>  return -1;
>> @@ -297,9 +297,9 @@ int spapr_ics_alloc_block(ICSState *ics, int num, bool 
>> lsi,
>>  if (align) {
>>  assert((num == 1) || (num == 2) || (num == 4) ||
>> (num == 8) || (num == 16) || (num == 32));
>> -first = xic->irq_alloc_block(ics->xics, num, num);
>> +first = xic->irq_alloc_block(ics->xics, num, num, lsi);
>>  } else {
>> -first = xic->irq_alloc_block(ics->xics, num, 1);
>> +first = xic->irq_alloc_block(ics->xics, num, 1, lsi);
>>  }
>>  if (first < 0) {
>>  error_setg(errp, "can't find a free %d-IRQ block", num);
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index ce314fcf38db..f14eae6196cd 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -3596,7 +3596,8 @@ static bool spapr_irq_test_2_11(XICSFabric *xi, int 
>> irq)
>>  return !ICS_IRQ_FREE(ics, srcno);
>>  }
>>  
>> -static int spapr_irq_alloc_block_2_11(XICSFabric *xi, int count, int align)
>> +static int spapr_irq_alloc_block_2_11(XICSFabric *xi, int count, int align,
>> +  bool lsi)
>>  {
>>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>>  ICSState *ics = spapr->ics;
>> @@ -3628,7 +3629,7 @@ static void spapr_irq_free_block_2_11(XICSFabric *xi, 
>> int irq, int num)
>>  }
>>  }
>>  
>> -static bool spapr_irq_is_lsi(XICSFabric *xi, int irq)
>> +static bool spapr_irq_is_lsi_2_11(XICSFabric *xi, int irq)
>>  {
>>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>>  int srcno = irq - spapr->ics->offset;
>> @@ -3644,10 +3645,21 @@ static bool spapr_irq_test(XICSFabric *xi, int irq)
>>  return test_bit(srcno, spapr->irq_map);
>>  }
>>  
>> -static int spapr_irq_alloc_block(XICSFabric *xi, int count, int align)
>> +
>> +/*
>> + * Let's provision 4 LSIs per PHBs
>> + */
>> +#define SPAPR_MAX_LSI (SPAPR_MAX_PHBS * 4)
>> +
>> +/*
>> + * Split the IRQ number space of the machine in two: first the LSIs
>> + * and then the MSIs. This allows us to keep the LSI IRQ numbers in a
>> + * well known range which is useful for PHB hotplug.
>> + */
>> +static int spapr_irq_alloc_block(XICSFabric *xi, int count, int align, bool 
>> lsi)
>>  {
>>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
>> -int start = 0;
>> +int start = lsi ? 0 : SPAPR_MAX_LSI;
>>  int srcno;
>>  
>>  /*
>> @@ -3664,6 +3676,10 @@ static int spapr_irq_alloc_block(XICSFabric *xi, int 
>> count, int align)
>> 

[Qemu-devel] [PATCH v3] ssh: switch from libssh2 to libssh

2017-11-15 Thread Pino Toscano
Rewrite the implementation of the ssh block driver to use libssh instead
of libssh2.  The libssh library has various advantages over libssh2:
- easier API for authentication (for example for using ssh-agent)
- easier API for known_hosts handling
- supports newer types of keys in known_hosts

Kerberos authentication can be enabled once the libssh bug for it [1] is
fixed.

The development version of libssh (i.e. the future 0.8.x) supports
fsync, so reuse the build time check for this.

[1] https://red.libssh.org/issues/242

Signed-off-by: Pino Toscano 
---

Changes from v2:
- used again an own fd
- fixed co_yield() implementation

Changes from v1:
- fixed jumbo packets writing
- fixed missing 'err' assignment
- fixed commit message

 block/Makefile.objs |   6 +-
 block/ssh.c | 494 
 configure   |  65 ---
 3 files changed, 259 insertions(+), 306 deletions(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 6eaf78a046..c0df21d0b4 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,7 +20,7 @@ block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
 block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_VXHS) += vxhs.o
-block-obj-$(CONFIG_LIBSSH2) += ssh.o
+block-obj-$(CONFIG_LIBSSH) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
@@ -41,8 +41,8 @@ rbd.o-libs := $(RBD_LIBS)
 gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
 gluster.o-libs := $(GLUSTERFS_LIBS)
 vxhs.o-libs:= $(VXHS_LIBS)
-ssh.o-cflags   := $(LIBSSH2_CFLAGS)
-ssh.o-libs := $(LIBSSH2_LIBS)
+ssh.o-cflags   := $(LIBSSH_CFLAGS)
+ssh.o-libs := $(LIBSSH_LIBS)
 block-obj-$(if $(CONFIG_BZIP2),m,n) += dmg-bz2.o
 dmg-bz2.o-libs := $(BZIP2_LIBS)
 qcow.o-libs:= -lz
diff --git a/block/ssh.c b/block/ssh.c
index b049a16eb9..9e96c9d684 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -24,8 +24,8 @@
 
 #include "qemu/osdep.h"
 
-#include 
-#include 
+#include 
+#include 
 
 #include "block/block_int.h"
 #include "qapi/error.h"
@@ -41,14 +41,12 @@
 /* DEBUG_SSH=1 enables the DPRINTF (debugging printf) statements in
  * this block driver code.
  *
- * TRACE_LIBSSH2= enables tracing in libssh2 itself.  Note
- * that this requires that libssh2 was specially compiled with the
- * `./configure --enable-debug' option, so most likely you will have
- * to compile it yourself.  The meaning of  is described
- * here: http://www.libssh2.org/libssh2_trace.html
+ * TRACE_LIBSSH= enables tracing in libssh itself.
+ * The meaning of  is described here:
+ * http://api.libssh.org/master/group__libssh__log.html
  */
 #define DEBUG_SSH 0
-#define TRACE_LIBSSH2 0 /* or try: LIBSSH2_TRACE_SFTP */
+#define TRACE_LIBSSH  0 /* see: SSH_LOG_* */
 
 #define DPRINTF(fmt, ...)   \
 do {\
@@ -64,18 +62,14 @@ typedef struct BDRVSSHState {
 
 /* SSH connection. */
 int sock; /* socket */
-LIBSSH2_SESSION *session; /* ssh session */
-LIBSSH2_SFTP *sftp;   /* sftp session */
-LIBSSH2_SFTP_HANDLE *sftp_handle; /* sftp remote file handle */
+ssh_session session;  /* ssh session */
+sftp_session sftp;/* sftp session */
+sftp_file sftp_handle;/* sftp remote file handle */
 
-/* See ssh_seek() function below. */
-int64_t offset;
-bool offset_op_read;
-
-/* File attributes at open.  We try to keep the .filesize field
+/* File attributes at open.  We try to keep the .size field
  * updated if it changes (eg by writing at the end of the file).
  */
-LIBSSH2_SFTP_ATTRIBUTES attrs;
+sftp_attributes attrs;
 
 InetSocketAddress *inet;
 
@@ -87,26 +81,23 @@ static void ssh_state_init(BDRVSSHState *s)
 {
 memset(s, 0, sizeof *s);
 s->sock = -1;
-s->offset = -1;
 qemu_co_mutex_init(&s->lock);
 }
 
 static void ssh_state_free(BDRVSSHState *s)
 {
+if (s->attrs) {
+sftp_attributes_free(s->attrs);
+}
 if (s->sftp_handle) {
-libssh2_sftp_close(s->sftp_handle);
+sftp_close(s->sftp_handle);
 }
 if (s->sftp) {
-libssh2_sftp_shutdown(s->sftp);
+sftp_free(s->sftp);
 }
 if (s->session) {
-libssh2_session_disconnect(s->session,
-   "from qemu ssh client: "
-   "user closed the connection");
-libssh2_session_free(s->session);
-}
-if (s->sock >= 0) {
-close(s->sock);
+ssh_disconnect(s->session);
+ssh_free(s->session);
 }
 }
 
@@ -121,13 +112,13 @@ session_error_setg(Error **errp, BDRVSSHState *s, const 
char *fs, ...)
 va_end(args);
 
 if (s->session) {
-char *ssh_err;
+const char *ssh_err;
 int ssh_err_code;
 
-/* This is not an errno. 

Re: [Qemu-devel] [PATCH 2/5] qcow2: multiple clusters write compressed

2017-11-15 Thread Anton Nefedov

On 15/11/2017 6:11 PM, Max Reitz wrote:

On 2017-11-14 11:16, Anton Nefedov wrote:

From: Pavel Butsykin 

At the moment, qcow2_co_pwritev_compressed can process the requests size
less than or equal to one cluster. This patch added possibility to write
compressed data in the QCOW2 more than one cluster. The implementation
is simple, we just split large requests into separate clusters and write
using existing functionality. For unaligned requests we use a workaround
and do write data without compression.

Signed-off-by: Anton Nefedov 
---
  block/qcow2.c | 77 +++
  1 file changed, 56 insertions(+), 21 deletions(-)


On one hand, it might be better to do this centrally somewhere in
block/io.c.  On the other, that would require more work because it would
probably mean having to introduce another field in BlockLimits, and it
wouldn't do much -- because qcow (v1) is, well, qcow v1...  And vmdk
seems to completely not care about this single cluster limitation.  So
for now we probably wouldn't even gain anything by doing this in block/io.c.

So long story short, it's OK to do this locally in qcow2, yes.



[..]


+qemu_iovec_reset(&hd_qiov);
+chunk_size = MIN(bytes, s->cluster_size);
+qemu_iovec_concat(&hd_qiov, qiov, curr_off, chunk_size);
+
+ret = qcow2_co_pwritev_cluster_compressed(bs, offset + curr_off,
+  chunk_size, &hd_qiov);
+if (ret == -ENOTSUP) {


Why this?  I mean, I can see the appeal, but then we should probably
actually return -ENOTSUP somewhere (e.g. for unaligned clusters and the
like) -- and we should not abort if offset_into_cluster(s, cluster) is
true, but we should write the header uncompressed and compress the main
bulk.

Max



Right, sorry, missed this part when porting the patch.

I think this needs to be removed (and the commit message fixed
accordingly).
Returning an error, rather than silently falling back to uncompressed
seems preferable to me. "Compressing writers" (backup, img convert and
now stream) are aware that they have to cluster-align, and if they fail
to do so that means there is an error somewhere.
If it won't return an error anymore, then the unaligned tail shouldn't
either. So we can end up 'letting' the callers send small unaligned
requests which will never get compressed.

/Anton


+ret = qcow2_co_pwritev(bs, offset + curr_off, chunk_size,
+   &hd_qiov, 0);




Re: [Qemu-devel] [PATCH 2/5] qcow2: multiple clusters write compressed

2017-11-15 Thread Max Reitz
On 2017-11-15 17:28, Anton Nefedov wrote:
> On 15/11/2017 6:11 PM, Max Reitz wrote:
>> On 2017-11-14 11:16, Anton Nefedov wrote:
>>> From: Pavel Butsykin 
>>>
>>> At the moment, qcow2_co_pwritev_compressed can process the requests size
>>> less than or equal to one cluster. This patch added possibility to write
>>> compressed data in the QCOW2 more than one cluster. The implementation
>>> is simple, we just split large requests into separate clusters and write
>>> using existing functionality. For unaligned requests we use a workaround
>>> and do write data without compression.
>>>
>>> Signed-off-by: Anton Nefedov 
>>> ---
>>>   block/qcow2.c | 77
>>> +++
>>>   1 file changed, 56 insertions(+), 21 deletions(-)
>>
>> On one hand, it might be better to do this centrally somewhere in
>> block/io.c.  On the other, that would require more work because it would
>> probably mean having to introduce another field in BlockLimits, and it
>> wouldn't do much -- because qcow (v1) is, well, qcow v1...  And vmdk
>> seems to completely not care about this single cluster limitation.  So
>> for now we probably wouldn't even gain anything by doing this in
>> block/io.c.
>>
>> So long story short, it's OK to do this locally in qcow2, yes.
>>
> 
> [..]
> 
>>> +    qemu_iovec_reset(&hd_qiov);
>>> +    chunk_size = MIN(bytes, s->cluster_size);
>>> +    qemu_iovec_concat(&hd_qiov, qiov, curr_off, chunk_size);
>>> +
>>> +    ret = qcow2_co_pwritev_cluster_compressed(bs, offset +
>>> curr_off,
>>> +  chunk_size,
>>> &hd_qiov);
>>> +    if (ret == -ENOTSUP) {
>>
>> Why this?  I mean, I can see the appeal, but then we should probably
>> actually return -ENOTSUP somewhere (e.g. for unaligned clusters and the
>> like) -- and we should not abort if offset_into_cluster(s, cluster) is
>> true, but we should write the header uncompressed and compress the main
>> bulk.
>>
>> Max
>>
> 
> Right, sorry, missed this part when porting the patch.
> 
> I think this needs to be removed (and the commit message fixed
> accordingly).
> Returning an error, rather than silently falling back to uncompressed
> seems preferable to me. "Compressing writers" (backup, img convert and
> now stream) are aware that they have to cluster-align, and if they fail
> to do so that means there is an error somewhere.

OK for me.

> If it won't return an error anymore, then the unaligned tail shouldn't
> either. So we can end up 'letting' the callers send small unaligned
> requests which will never get compressed.

Either way is fine.  It just looks to me like vmdk falls back to
uncompressed writes, so it's not like it would be completely new behavior...

(But I won't judge whether what vmdk does is a good idea.)

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [Qemu-block] segfault in parallel blockjobs (iotest 30)

2017-11-15 Thread Anton Nefedov



On 15/11/2017 6:42 PM, Alberto Garcia wrote:

On Fri 10 Nov 2017 04:02:23 AM CET, Fam Zheng wrote:

I'm thinking that perhaps we should add the pause point directly to
block_job_defer_to_main_loop(), to prevent any block job from
running the exit function when it's paused.


I was trying this and unfortunately this breaks the mirror job at
least (reproduced with a simple block-commit on the topmost node,
which uses commit_active_start() -> mirror_start_job()).

So what happens is that mirror_run() always calls
bdrv_drained_begin() before returning, pausing the block job. The
closing bdrv_drained_end() call is at the end of mirror_exit(),
already in the main loop.

So the mirror job is always calling block_job_defer_to_main_loop()
and mirror_exit() when the job is paused.


FWIW, I think Max's report on 194 failures is related:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html

so perhaps it's worth testing his patch too:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html


Well, that doesn't solve the original crash with parallel block jobs.
The root of the crash is that the mirror job manipulates the graph
_while being paused_, so the BlockReopenQueue in bdrv_reopen_multiple()
gets messed up, and pausing the jobs (commit 40840e419be31e) won't help.

I have the impression that one major source of headaches is the fact
that the reopen queue contains nodes that don't need to be reopened at
all. Ideally this should be detected early on in bdrv_reopen_queue(), so
there's no chance that the queue contains nodes used by a different
block job. If we had that then op blockers should be enough to prevent
these things. Or am I missing something?

Berto



After applying Max's patch I tried the similar approach; that is keep
BDSes referenced while they are in the reopen queue.
Now I get the stream job hanging. Somehow one blk_root_drained_begin()
is not paired with blk_root_drained_end(). So the job stays paused.
Didn't dig deeper yet, but at first glance the reduced reopen queue
won't help with this, as reopen drains all BDSes anyway (or can we avoid
that too?)

/Anton



Re: [Qemu-devel] [PATCH for-2.12 v3 07/11] spapr: introduce an 'irq_base' number

2017-11-15 Thread Greg Kurz
On Wed, 15 Nov 2017 15:24:08 +
Cédric Le Goater  wrote:

> On 11/14/2017 03:45 PM, Greg Kurz wrote:
> > On Fri, 10 Nov 2017 15:20:13 +
> > Cédric Le Goater  wrote:
> >   
> >> 'irq_base' is a base IRQ number which lets us allocate only the subset
> >> of the IRQ numbers used on the sPAPR platform. It is sync with the
> >> ICSState 'offset' attribute and this is slightly redundant. We could
> >> also choose to waste some extra bytes (512) and allocate the whole
> >> number space. To be discussed.
> >>
> >> But more important, it removes a dependency on the ICSState object of
> >> the sPAPR machine which is required for XIVE.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>  hw/ppc/spapr.c | 7 ---
> >>  include/hw/ppc/spapr.h | 1 +
> >>  2 files changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index bf0e5b4f815b..1cbbd7715a85 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -2362,6 +2362,7 @@ static void ppc_spapr_init(MachineState *machine)
> >>  /* Initialize the IRQ allocator */
> >>  spapr->nr_irqs  = XICS_IRQS_SPAPR;
> >>  spapr->irq_map  = bitmap_new(spapr->nr_irqs);
> >> +spapr->irq_base = XICS_IRQ_BASE;
> >>  
> > 
> > Since this is a constant value, do we really need a machine-level value ?  
> 
> no. I don't think either.   
> 
> But I would like to know why we are starting to allocate IRQ numbers 
> at 4096 ? Only 2 is reserved fo IPIs. So that seems a little large. 
> I have not found the reason though.
> 

Same here... I've tried to git blame/log and google qemu-devel archives
and couldn't find anything either.

> 
> Also I am starting to think that we should probably segment the allocation 
> per device like this is specified in the PAPR specs. Each device has one 
> or more Bus Unit IDentifier (BUID) which acts as a prefix for the IRQ 
> number. That would facilitate the IRQ numbering and fix some issues 
> in migration when devices are hotplugged. I am thinking about phbs
> mostly.

Makes sense. Also there's something we should clarify: we create one ICS for
the entire machine, able to handle XICS_IRQS_SPAPR (== 1024) irqs. But each PHB
advertises it can provide XICS_IRQS_SPAPR MSIs through the “ibm,pe-total-#msi”
DT prop... this looks wrong.

> 
> C.
> 
> 
> > Especially now that all the code that needs it is in spapr.c, I guess it
> > can directly use the macro, no ?
> >   
> >>  /* Set up Interrupt Controller before we create the VCPUs */
> >>  xics_system_init(machine, spapr->nr_irqs, &error_fatal);
> >> @@ -3630,7 +3631,7 @@ static void spapr_irq_free_block_2_11(XICSFabric 
> >> *xi, int irq, int num)
> >>  static bool spapr_irq_test(XICSFabric *xi, int irq)
> >>  {
> >>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
> >> -int srcno = irq - spapr->ics->offset;
> >> +int srcno = irq - spapr->irq_base;
> >>  
> >>  return test_bit(srcno, spapr->irq_map);
> >>  }
> >> @@ -3656,13 +3657,13 @@ static int spapr_irq_alloc_block(XICSFabric *xi, 
> >> int count, int align)
> >>  }
> >>  
> >>  bitmap_set(spapr->irq_map, srcno, count);
> >> -return srcno + spapr->ics->offset;
> >> +return srcno + spapr->irq_base;
> >>  }
> >>  
> >>  static void spapr_irq_free_block(XICSFabric *xi, int irq, int num)
> >>  {
> >>  sPAPRMachineState *spapr = SPAPR_MACHINE(xi);
> >> -int srcno = irq - spapr->ics->offset;
> >> +int srcno = irq - spapr->irq_base;
> >>  
> >>  bitmap_clear(spapr->irq_map, srcno, num);
> >>  }
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 023436c32b2a..200667dcff9d 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -82,6 +82,7 @@ struct sPAPRMachineState {
> >>  int32_t nr_irqs;
> >>  unsigned long *irq_map;
> >>  unsigned long *irq_map_ref;
> >> +uint32_t irq_base;
> >>  ICSState *ics;
> >>  sPAPRRTCState rtc;
> >>
> >   
> 




Re: [Qemu-devel] [PATCH] s390/kvm_virtio/linux-headers: remove traces of old virtio transport

2017-11-15 Thread Cornelia Huck
On Wed, 15 Nov 2017 16:42:23 +0100
Christian Borntraeger  wrote:

> We no longer support the old s390 transport, neither does the newest
> Linux kernel. Remove it from the linux header script as well as the
> s390x virtio code.  We still should handle the VIRTIO_NOTIFY hypercall,
> to tolerate early printk on older guest kernels without an sclp console.

Are there any such guests still around? Wouldn't they be unable to run
because of the missing old transport anyway?

> We continue to ignore these events.
> 
> Signed-off-by: Christian Borntraeger 
> ---
>  hw/s390x/s390-virtio-hcall.h   |  6 ++-
>  include/standard-headers/asm-s390/kvm_virtio.h | 64 
> --
>  scripts/update-linux-headers.sh|  1 -
>  3 files changed, 4 insertions(+), 67 deletions(-)
>  delete mode 100644 include/standard-headers/asm-s390/kvm_virtio.h



Re: [Qemu-devel] [PATCH v3] ssh: switch from libssh2 to libssh

2017-11-15 Thread Richard W.M. Jones
On Wed, Nov 15, 2017 at 05:26:48PM +0100, Pino Toscano wrote:
> Rewrite the implementation of the ssh block driver to use libssh instead
> of libssh2.  The libssh library has various advantages over libssh2:
> - easier API for authentication (for example for using ssh-agent)
> - easier API for known_hosts handling
> - supports newer types of keys in known_hosts
> 
> Kerberos authentication can be enabled once the libssh bug for it [1] is
> fixed.
> 
> The development version of libssh (i.e. the future 0.8.x) supports
> fsync, so reuse the build time check for this.
> 
> [1] https://red.libssh.org/issues/242
> 
> Signed-off-by: Pino Toscano 

I haven't reviewed this in detail, but I have tested it and I was able
to boot a VM remotely over ssh, do lots of heavy reads and writes, and
at no time did it hang/crash.  So:

Tested-by: Richard W.M. Jones 

I'll take a look at the code in more detail later.

Rich.

> Changes from v2:
> - used again an own fd
> - fixed co_yield() implementation
> 
> Changes from v1:
> - fixed jumbo packets writing
> - fixed missing 'err' assignment
> - fixed commit message
> 
>  block/Makefile.objs |   6 +-
>  block/ssh.c | 494 
> 
>  configure   |  65 ---
>  3 files changed, 259 insertions(+), 306 deletions(-)
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 6eaf78a046..c0df21d0b4 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -20,7 +20,7 @@ block-obj-$(CONFIG_CURL) += curl.o
>  block-obj-$(CONFIG_RBD) += rbd.o
>  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
>  block-obj-$(CONFIG_VXHS) += vxhs.o
> -block-obj-$(CONFIG_LIBSSH2) += ssh.o
> +block-obj-$(CONFIG_LIBSSH) += ssh.o
>  block-obj-y += accounting.o dirty-bitmap.o
>  block-obj-y += write-threshold.o
>  block-obj-y += backup.o
> @@ -41,8 +41,8 @@ rbd.o-libs := $(RBD_LIBS)
>  gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
>  gluster.o-libs := $(GLUSTERFS_LIBS)
>  vxhs.o-libs:= $(VXHS_LIBS)
> -ssh.o-cflags   := $(LIBSSH2_CFLAGS)
> -ssh.o-libs := $(LIBSSH2_LIBS)
> +ssh.o-cflags   := $(LIBSSH_CFLAGS)
> +ssh.o-libs := $(LIBSSH_LIBS)
>  block-obj-$(if $(CONFIG_BZIP2),m,n) += dmg-bz2.o
>  dmg-bz2.o-libs := $(BZIP2_LIBS)
>  qcow.o-libs:= -lz
> diff --git a/block/ssh.c b/block/ssh.c
> index b049a16eb9..9e96c9d684 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -24,8 +24,8 @@
>  
>  #include "qemu/osdep.h"
>  
> -#include 
> -#include 
> +#include 
> +#include 
>  
>  #include "block/block_int.h"
>  #include "qapi/error.h"
> @@ -41,14 +41,12 @@
>  /* DEBUG_SSH=1 enables the DPRINTF (debugging printf) statements in
>   * this block driver code.
>   *
> - * TRACE_LIBSSH2= enables tracing in libssh2 itself.  Note
> - * that this requires that libssh2 was specially compiled with the
> - * `./configure --enable-debug' option, so most likely you will have
> - * to compile it yourself.  The meaning of  is described
> - * here: http://www.libssh2.org/libssh2_trace.html
> + * TRACE_LIBSSH= enables tracing in libssh itself.
> + * The meaning of  is described here:
> + * http://api.libssh.org/master/group__libssh__log.html
>   */
>  #define DEBUG_SSH 0
> -#define TRACE_LIBSSH2 0 /* or try: LIBSSH2_TRACE_SFTP */
> +#define TRACE_LIBSSH  0 /* see: SSH_LOG_* */
>  
>  #define DPRINTF(fmt, ...)   \
>  do {\
> @@ -64,18 +62,14 @@ typedef struct BDRVSSHState {
>  
>  /* SSH connection. */
>  int sock; /* socket */
> -LIBSSH2_SESSION *session; /* ssh session */
> -LIBSSH2_SFTP *sftp;   /* sftp session */
> -LIBSSH2_SFTP_HANDLE *sftp_handle; /* sftp remote file handle */
> +ssh_session session;  /* ssh session */
> +sftp_session sftp;/* sftp session */
> +sftp_file sftp_handle;/* sftp remote file handle */
>  
> -/* See ssh_seek() function below. */
> -int64_t offset;
> -bool offset_op_read;
> -
> -/* File attributes at open.  We try to keep the .filesize field
> +/* File attributes at open.  We try to keep the .size field
>   * updated if it changes (eg by writing at the end of the file).
>   */
> -LIBSSH2_SFTP_ATTRIBUTES attrs;
> +sftp_attributes attrs;
>  
>  InetSocketAddress *inet;
>  
> @@ -87,26 +81,23 @@ static void ssh_state_init(BDRVSSHState *s)
>  {
>  memset(s, 0, sizeof *s);
>  s->sock = -1;
> -s->offset = -1;
>  qemu_co_mutex_init(&s->lock);
>  }
>  
>  static void ssh_state_free(BDRVSSHState *s)
>  {
> +if (s->attrs) {
> +sftp_attributes_free(s->attrs);
> +}
>  if (s->sftp_handle) {
> -libssh2_sftp_close(s->sftp_handle);
> +sftp_close(s->sftp_handle);
>  }
>  if (s->sftp) {
> -libssh2_sftp_shutdown(s->sftp);
> +sftp_free(s->sftp);
>  }
>  if (s->session) {
> -libs

Re: [Qemu-devel] [PATCH v6 3/5] fw_cfg: do DMA read operation

2017-11-15 Thread Michael S. Tsirkin
On Mon, Nov 13, 2017 at 08:29:56PM +0100, Marc-André Lureau wrote:
> Modify fw_cfg_read_blob() to use DMA if the device supports it.
> Return errors, because the operation may fail.
> 
> To avoid polling with unbound amount of time, the DMA operation is
> expected to complete within 200ms, or will return ETIME error.
> 
> We may want to switch all the *buf addresses to use only kmalloc'ed
> buffers (instead of using stack/image addresses with dma=false).
> 
> Signed-off-by: Marc-André Lureau 
> Reviewed-by: Gabriel Somlo 
> ---
>  drivers/firmware/qemu_fw_cfg.c | 154 
> -
>  1 file changed, 137 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
> index 1f3e8545dab7..2ac4cd869fe6 100644
> --- a/drivers/firmware/qemu_fw_cfg.c
> +++ b/drivers/firmware/qemu_fw_cfg.c
> @@ -33,6 +33,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  MODULE_AUTHOR("Gabriel L. Somlo ");
>  MODULE_DESCRIPTION("QEMU fw_cfg sysfs support");
> @@ -43,12 +45,26 @@ MODULE_LICENSE("GPL");
>  #define FW_CFG_ID 0x01
>  #define FW_CFG_FILE_DIR   0x19
>  
> +#define FW_CFG_VERSION_DMA 0x02
> +#define FW_CFG_DMA_CTL_ERROR   0x01
> +#define FW_CFG_DMA_CTL_READ0x02
> +#define FW_CFG_DMA_CTL_SKIP0x04
> +#define FW_CFG_DMA_CTL_SELECT  0x08
> +#define FW_CFG_DMA_CTL_WRITE   0x10
> +#define FW_CFG_DMA_TIMEOUT 200 /* ms */
> +
>  /* size in bytes of fw_cfg signature */
>  #define FW_CFG_SIG_SIZE 4
>  
>  /* fw_cfg "file name" is up to 56 characters (including terminating nul) */
>  #define FW_CFG_MAX_FILE_PATH 56
>  
> +/* platform device for dma mapping */
> +static struct device *dev;
> +
> +/* fw_cfg revision attribute, in /sys/firmware/qemu_fw_cfg top-level dir. */
> +static u32 fw_cfg_rev;
> +
>  /* fw_cfg file directory entry type */
>  struct fw_cfg_file {
>   u32 size;
> @@ -57,6 +73,12 @@ struct fw_cfg_file {
>   char name[FW_CFG_MAX_FILE_PATH];
>  };
>  
> +struct fw_cfg_dma {
> + u32 control;
> + u32 length;
> + u64 address;
> +} __packed;
> +
>  /* fw_cfg device i/o register addresses */
>  static bool fw_cfg_is_mmio;
>  static phys_addr_t fw_cfg_p_base;
> @@ -75,12 +97,93 @@ static inline u16 fw_cfg_sel_endianness(u16 key)
>   return fw_cfg_is_mmio ? cpu_to_be16(key) : cpu_to_le16(key);
>  }
>  
> +static inline bool fw_cfg_dma_enabled(void)
> +{
> + return fw_cfg_rev & FW_CFG_VERSION_DMA && fw_cfg_reg_dma;
> +}
> +
> +static bool fw_cfg_wait_for_control(struct fw_cfg_dma *d, unsigned long 
> timeout)
> +{
> + ktime_t start;
> + ktime_t stop;
> +
> + start = ktime_get();
> + stop = ktime_add(start, ms_to_ktime(timeout));
> +
> + do {
> + if ((be32_to_cpu(d->control) & ~FW_CFG_DMA_CTL_ERROR) == 0)
> + return true;
> +
> + usleep_range(50, 100);

BTW it's not nice that this is uninterruptible. I think we need a
variant of usleep_range that is interruptible and call that here.
Thoughts?


> + } while (ktime_before(ktime_get(), stop));
> +
> + return false;
> +}
> +
> +static ssize_t fw_cfg_dma_transfer(void *address, u32 length, u32 control)
> +{
> + dma_addr_t dma_addr = 0;
> + struct fw_cfg_dma *d;
> + dma_addr_t dma;
> + ssize_t ret = length;
> + enum dma_data_direction dir =
> + (control & FW_CFG_DMA_CTL_READ ? DMA_FROM_DEVICE : 0);
> +
> + if (address && length) {
> + dma_addr = dma_map_single(dev, address, length, dir);
> + if (dma_mapping_error(NULL, dma_addr)) {
> + WARN(1, "%s: failed to map address\n", __func__);
> + return -EFAULT;
> + }
> + }
> +
> + d = kmalloc(sizeof(*d), GFP_KERNEL | GFP_DMA);
> + if (!d) {
> + ret = -ENOMEM;
> + goto end;
> + }
> +
> + dma = dma_map_single(dev, d, sizeof(*d), DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(NULL, dma)) {
> + WARN(1, "%s: failed to map fw_cfg_dma\n", __func__);
> + ret = -EFAULT;
> + goto end;
> + }
> +
> + *d = (struct fw_cfg_dma) {
> + .address = cpu_to_be64(dma_addr),
> + .length = cpu_to_be32(length),
> + .control = cpu_to_be32(control)
> + };
> +
> + iowrite32be((u64)dma >> 32, fw_cfg_reg_dma);
> + iowrite32be(dma, fw_cfg_reg_dma + 4);
> +
> + if (!fw_cfg_wait_for_control(d, FW_CFG_DMA_TIMEOUT)) {
> + WARN(1, "%s: timeout", __func__);
> + ret = -ETIME;
> + } else if (be32_to_cpu(d->control) & FW_CFG_DMA_CTL_ERROR) {
> + ret = -EIO;
> + }
> +
> + dma_unmap_single(dev, dma, sizeof(*d), DMA_BIDIRECTIONAL);
> +
> +end:
> + kfree(d);
> + if (dma_addr)
> + dma_unmap_single(dev, dma_addr, length, dir);

So if host was delayed what this does is corrupt guest memory.
IMHO things are already bad, let's at least keep i

Re: [Qemu-devel] [qemu-img] support for XVA

2017-11-15 Thread Eric Blake
[adding libguestfs]

On 11/15/2017 09:52 AM, Gandalf Corvotempesta wrote:
> I'm thinking if support for XVA files could be added to qemu-img
> The file-format is well known (it's just a tar archive) and there are scripts
> that are able to convert an XVA file to a RAW image. (ie:
> https://gist.github.com/miebach/0433947bcf053de23159)
> 
> Running these script on their own is very time consuming, as you have to
> extract the XVA, convert any disk image from "single chunks" to a raw image
> and then use qemu-img to convert from raw to qcow.
> 
> Maybe a native support will be able to skip some steps. (like the
> conversion from
> "single chunks" to raw and raw to qcow2)

Another possibility might be writing an nbdkit plugin that can directly
read XVA, then you can connect qemu to the NBD server provided by
nbdkit, without having to teach qemu proper how to read the file.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [qemu-img] support for XVA

2017-11-15 Thread Gandalf Corvotempesta
XVA is a tar archive.
I don't think would be possible to directly use the image stored inside
without extracting and merging each chunks

Any random reads would be impossible to do, only a huge sequential dump to
build the raw image

Il 15 nov 2017 6:33 PM, "Eric Blake"  ha scritto:

[adding libguestfs]

On 11/15/2017 09:52 AM, Gandalf Corvotempesta wrote:
> I'm thinking if support for XVA files could be added to qemu-img
> The file-format is well known (it's just a tar archive) and there are
scripts
> that are able to convert an XVA file to a RAW image. (ie:
> https://gist.github.com/miebach/0433947bcf053de23159)
>
> Running these script on their own is very time consuming, as you have to
> extract the XVA, convert any disk image from "single chunks" to a raw
image
> and then use qemu-img to convert from raw to qcow.
>
> Maybe a native support will be able to skip some steps. (like the
> conversion from
> "single chunks" to raw and raw to qcow2)

Another possibility might be writing an nbdkit plugin that can directly
read XVA, then you can connect qemu to the NBD server provided by
nbdkit, without having to teach qemu proper how to read the file.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[Qemu-devel] [PATCH] iotests: Make 087 pass without AIO enabled

2017-11-15 Thread Max Reitz
If AIO has not been enabled in the qemu build that is to be tested, we
should skip the "aio=native without O_DIRECT" test instead of failing.

Signed-off-by: Max Reitz 
---
Cleber wanted to fix this in July with his "build configuration query
tool and conditional (qemu-io)test skip" series
(https://lists.gnu.org/archive/html/qemu-block/2017-07/msg01303.html),
but unfortunately there hasn't been any activity on that (as far as I
can see), so let's just solve it the simple way.
---
 tests/qemu-iotests/087 | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index 27ab6c5151..2561a14456 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -102,7 +102,14 @@ echo
 echo === aio=native without O_DIRECT ===
 echo
 
-run_qemu <

[Qemu-devel] [PULL 01/10] fix: unrealize virtio device if we fail to hotplug it

2017-11-15 Thread Michael S. Tsirkin
From: linzhecheng 

If we fail to hotplug virtio-blk device and then suspend
or shutdown VM, qemu is likely to crash.

Re-production steps:
1. Run VM named vm001
2. Create a virtio-blk.xml which contains wrong configurations:

  
  
  

3. Run command : virsh attach-device vm001 virtio-blk.xml
error: Failed to attach device from blk-scsi.xml
error: internal error: unable to execute QEMU command 'device_add': Please set 
scsi=off for virtio-blk devices in order to use virtio 1.0
it means hotplug virtio-blk device failed.
4. Suspend or shutdown VM will leads to qemu crash

Problem happens in virtio_vmstate_change which is called by
vm_state_notify:
vdev’s parent_bus is NULL, so qdev_get_parent_bus(DEVICE(vdev)) will crash.
virtio_vmstate_change is added to the list vm_change_state_head at 
virtio_blk_device_realize(virtio_init),
but after hotplug virtio-blk failed, virtio_vmstate_change will not be removed 
from vm_change_state_head.
Adding unrealize function of virtio-blk device can solve this problem.

Signed-off-by: linzhecheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio/virtio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5884ce3..ea532dc 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2491,6 +2491,7 @@ static void virtio_device_realize(DeviceState *dev, Error 
**errp)
 virtio_bus_device_plugged(vdev, &err);
 if (err != NULL) {
 error_propagate(errp, err);
+vdc->unrealize(dev, NULL);
 return;
 }
 
-- 
MST




[Qemu-devel] [PULL 00/10] pc, pci, virtio: fixes for rc1

2017-11-15 Thread Michael S. Tsirkin
The following changes since commit 1fa0f627d03cd0d0755924247cafeb42969016bf:

  Update version for v2.11.0-rc1 release (2017-11-14 18:37:49 +)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

for you to fetch changes up to 6ed0c1c3ab0925009d15cffb944d9ccb04718846:

  build-sys: restrict vmcoreinfo to fw_cfg+dma capable targets (2017-11-15 
20:13:13 +0200)


pc, pci, virtio: fixes for rc1

A bunch of fixes all over the place.

Signed-off-by: Michael S. Tsirkin 


Alexey Kardashevskiy (1):
  pci: Initialize pci_dev->name before use

Daniel P. Berrange (2):
  tests: report errors when iasl exits with non-zero status
  test: fix detection of errors from iasl

Dou Liyang (1):
  NUMA: Enable adding NUMA node implicitly

Marc-André Lureau (2):
  vmcoreinfo: put it in the 'misc' device category
  build-sys: restrict vmcoreinfo to fw_cfg+dma capable targets

Marcel Apfelbaum (2):
  hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
  hw/pcie-pci-bridge: restrict to X86 and ARM

Michael S. Tsirkin (1):
  tests/acpi-test-data: update _CRS in DSDT

linzhecheng (1):
  fix: unrealize virtio device if we fail to hotplug it

 default-configs/arm-softmmu.mak  |   2 ++
 default-configs/i386-softmmu.mak |   1 +
 default-configs/x86_64-softmmu.mak   |   1 +
 include/hw/boards.h  |   1 +
 include/hw/i386/pc.h |  10 -
 include/hw/pci-host/q35.h|   1 +
 hw/i386/pc.c |  23 +++
 hw/i386/pc_piix.c|   1 +
 hw/i386/pc_q35.c |   1 +
 hw/misc/vmcoreinfo.c |   1 +
 hw/pci-host/piix.c   |  32 --
 hw/pci-host/q35.c|  42 ---
 hw/pci/pci.c |   2 +-
 hw/virtio/virtio.c   |   1 +
 numa.c   |  21 +-
 tests/bios-tables-test.c |  26 --
 vl.c |   3 +--
 hw/misc/Makefile.objs|   2 +-
 hw/pci-bridge/Makefile.objs  |   4 ++--
 tests/acpi-test-data/pc/DSDT | Bin 5098 -> 5144 bytes
 tests/acpi-test-data/pc/DSDT.bridge  | Bin 6957 -> 7003 bytes
 tests/acpi-test-data/pc/DSDT.cphp| Bin 5561 -> 5607 bytes
 tests/acpi-test-data/pc/DSDT.ipmikcs | Bin 5170 -> 5216 bytes
 tests/acpi-test-data/pc/DSDT.memhp   | Bin 6463 -> 6509 bytes
 tests/acpi-test-data/q35/DSDT| Bin 7782 -> 7828 bytes
 tests/acpi-test-data/q35/DSDT.bridge | Bin 7799 -> 7845 bytes
 tests/acpi-test-data/q35/DSDT.cphp   | Bin 8245 -> 8291 bytes
 tests/acpi-test-data/q35/DSDT.ipmibt | Bin 7857 -> 7903 bytes
 tests/acpi-test-data/q35/DSDT.memhp  | Bin 9147 -> 9193 bytes
 29 files changed, 155 insertions(+), 20 deletions(-)




[Qemu-devel] [PULL 04/10] test: fix detection of errors from iasl

2017-11-15 Thread Michael S. Tsirkin
From: "Daniel P. Berrange" 

The conditional looking for errors while loading asl files would ignore
errors from loading the expected data, if the actual data succeeded.

Signed-off-by: Daniel P. Berrange 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 tests/bios-tables-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index ee441f1..be05e8b 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -458,7 +458,7 @@ static void test_acpi_asl(test_data *data)
 exp_asl = normalize_asl(exp_sdt->asl);
 
 /* TODO: check for warnings */
-g_assert(!err || exp_err);
+g_assert(!err && !exp_err);
 
 if (g_strcmp0(asl->str, exp_asl->str)) {
 if (exp_err) {
-- 
MST




[Qemu-devel] [PULL 06/10] hw/pcie-pci-bridge: restrict to X86 and ARM

2017-11-15 Thread Michael S. Tsirkin
From: Marcel Apfelbaum 

The PCIE-PCI bridge is specific to "pure" PCIe systems
(on QEMU we have X86 and ARM), it does not make sense to
have it in other archs.

Reported-by: Thomas Huth 
Signed-off-by: Marcel Apfelbaum 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Reviewed-by: Cornelia Huck 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Tested-by: Yongbok Kim 
---
 hw/pci-bridge/Makefile.objs | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci-bridge/Makefile.objs b/hw/pci-bridge/Makefile.objs
index 666db37..1b05023 100644
--- a/hw/pci-bridge/Makefile.objs
+++ b/hw/pci-bridge/Makefile.objs
@@ -1,5 +1,5 @@
-common-obj-y += pci_bridge_dev.o pcie_pci_bridge.o
-common-obj-$(CONFIG_PCIE_PORT) += pcie_root_port.o gen_pcie_root_port.o
+common-obj-y += pci_bridge_dev.o
+common-obj-$(CONFIG_PCIE_PORT) += pcie_root_port.o gen_pcie_root_port.o 
pcie_pci_bridge.o
 common-obj-$(CONFIG_PXB) += pci_expander_bridge.o
 common-obj-$(CONFIG_XIO3130) += xio3130_upstream.o xio3130_downstream.o
 common-obj-$(CONFIG_IOH3420) += ioh3420.o
-- 
MST




[Qemu-devel] [PULL 03/10] tests: report errors when iasl exits with non-zero status

2017-11-15 Thread Michael S. Tsirkin
From: "Daniel P. Berrange" 

If iasl exits with non-zero status, the test unhelpfully just reports
that the AML did not match, because the data files it thought iasl
generated do not exist. This adds an explicit check for the exit status
of iasl and prints stderr if it was non-zero. Thus gives us a fighting
chance of diagnosing why iasl failed.

Signed-off-by: Daniel P. Berrange 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 tests/bios-tables-test.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/tests/bios-tables-test.c b/tests/bios-tables-test.c
index 564da45..ee441f1 100644
--- a/tests/bios-tables-test.c
+++ b/tests/bios-tables-test.c
@@ -304,6 +304,7 @@ static bool load_asl(GArray *sdts, AcpiSdtTable *sdt)
 gchar *out, *out_err;
 gboolean ret;
 int i;
+int status;
 
 fd = g_file_open_tmp("asl-XX.dsl", &sdt->asl_file, &error);
 g_assert_no_error(error);
@@ -324,14 +325,25 @@ static bool load_asl(GArray *sdts, AcpiSdtTable *sdt)
 g_string_append_printf(command_line, "-d %s", sdt->aml_file);
 
 /* pass 'out' and 'out_err' in order to be redirected */
-ret = g_spawn_command_line_sync(command_line->str, &out, &out_err, NULL, 
&error);
+ret = g_spawn_command_line_sync(command_line->str, &out, &out_err, 
&status, &error);
 g_assert_no_error(error);
 if (ret) {
-ret = g_file_get_contents(sdt->asl_file, (gchar **)&sdt->asl,
-  &sdt->asl_len, &error);
-g_assert(ret);
-g_assert_no_error(error);
-ret = (sdt->asl_len > 0);
+if (status != 0) {
+g_printerr("'%s' exited with status %d", command_line->str, 
status);
+if (!g_str_equal(out, "")) {
+g_printerr("%s", out);
+}
+if (!g_str_equal(out_err, "")) {
+g_printerr("%s", out_err);
+}
+ret = FALSE;
+} else {
+ret = g_file_get_contents(sdt->asl_file, (gchar **)&sdt->asl,
+  &sdt->asl_len, &error);
+g_assert(ret);
+g_assert_no_error(error);
+ret = (sdt->asl_len > 0);
+}
 }
 
 g_free(out);
-- 
MST




[Qemu-devel] [PULL 02/10] pci: Initialize pci_dev->name before use

2017-11-15 Thread Michael S. Tsirkin
From: Alexey Kardashevskiy 

This moves pci_dev->name initialization earlier so
pci_dev->bus_master_as could get a name instead of an empty string.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Peter Xu 
Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 5ed3c8d..b2d139b 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1030,6 +1030,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
 
 pci_dev->devfn = devfn;
 pci_dev->requester_id_cache = pci_req_id_cache_get(pci_dev);
+pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
 
 memory_region_init(&pci_dev->bus_master_container_region, OBJECT(pci_dev),
"bus master container", UINT64_MAX);
@@ -1039,7 +1040,6 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
 if (qdev_hotplug) {
 pci_init_bus_master(pci_dev);
 }
-pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
 pci_dev->irq_state = 0;
 pci_config_alloc(pci_dev);
 
-- 
MST




[Qemu-devel] [PULL 05/10] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole

2017-11-15 Thread Michael S. Tsirkin
From: Marcel Apfelbaum 

Currently there is no MMIO range over 4G
reserved for PCI hotplug. Since the 32bit PCI hole
depends on the number of cold-plugged PCI devices
and other factors, it is very possible is too small
to hotplug PCI devices with large BARs.

Fix it by reserving 2G for I4400FX chipset
in order to comply with older Win32 Guest OSes
and 32G for Q35 chipset.

Even if the new defaults of pci-hole64-size will appear in
"info qtree" also for older machines, the property was
not implemented so no changes will be visible to guests.

Note this is a regression since prev QEMU versions had
some range reserved for 64bit PCI hotplug.

Reviewed-by: Laszlo Ersek 
Reviewed-by: Gerd Hoffmann 
Signed-off-by: Marcel Apfelbaum 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/i386/pc.h  | 10 +-
 include/hw/pci-host/q35.h |  1 +
 hw/i386/pc.c  | 22 ++
 hw/pci-host/piix.c| 32 ++--
 hw/pci-host/q35.c | 42 +++---
 5 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 087d184..ef438bd 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -238,7 +238,6 @@ void pc_guest_info_init(PCMachineState *pcms);
 #define PCI_HOST_PROP_PCI_HOLE64_SIZE  "pci-hole64-size"
 #define PCI_HOST_BELOW_4G_MEM_SIZE "below-4g-mem-size"
 #define PCI_HOST_ABOVE_4G_MEM_SIZE "above-4g-mem-size"
-#define DEFAULT_PCI_HOLE64_SIZE (~0x0ULL)
 
 
 void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
@@ -249,6 +248,7 @@ void pc_memory_init(PCMachineState *pcms,
 MemoryRegion *system_memory,
 MemoryRegion *rom_memory,
 MemoryRegion **ram_memory);
+uint64_t pc_pci_hole64_start(void);
 qemu_irq pc_allocate_cpu_irq(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(ISABus *isa_bus, qemu_irq *gsi,
@@ -375,6 +375,14 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 .driver   = TYPE_X86_CPU,\
 .property = "x-hv-max-vps",\
 .value= "0x40",\
+},{\
+.driver   = "i440FX-pcihost",\
+.property = "x-pci-hole64-fix",\
+.value= "off",\
+},{\
+.driver   = "q35-pcihost",\
+.property = "x-pci-hole64-fix",\
+.value= "off",\
 },
 
 #define PC_COMPAT_2_9 \
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index 58983c0..8f4ddde 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -68,6 +68,7 @@ typedef struct Q35PCIHost {
 PCIExpressHost parent_obj;
 /*< public >*/
 
+bool pci_hole64_fix;
 MCHPCIState mch;
 } Q35PCIHost;
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e11a65b..fafe5ba 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1448,6 +1448,28 @@ void pc_memory_init(PCMachineState *pcms,
 pcms->ioapic_as = &address_space_memory;
 }
 
+/*
+ * The 64bit pci hole starts after "above 4G RAM" and
+ * potentially the space reserved for memory hotplug.
+ */
+uint64_t pc_pci_hole64_start(void)
+{
+PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
+PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+uint64_t hole64_start = 0;
+
+if (pcmc->has_reserved_memory && pcms->hotplug_memory.base) {
+hole64_start = pcms->hotplug_memory.base;
+if (!pcmc->broken_reserved_end) {
+hole64_start += memory_region_size(&pcms->hotplug_memory.mr);
+}
+} else {
+hole64_start = 0x1ULL + pcms->above_4g_mem_size;
+}
+
+return ROUND_UP(hole64_start, 1ULL << 30);
+}
+
 qemu_irq pc_allocate_cpu_irq(void)
 {
 return qemu_allocate_irq(pic_irq_request, NULL, 0);
diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index a7e2256..a684a7c 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -50,6 +50,7 @@ typedef struct I440FXState {
 PCIHostState parent_obj;
 Range pci_hole;
 uint64_t pci_hole64_size;
+bool pci_hole64_fix;
 uint32_t short_root_bus;
 } I440FXState;
 
@@ -112,6 +113,9 @@ struct PCII440FXState {
 #define I440FX_PAM_SIZE 7
 #define I440FX_SMRAM0x72
 
+/* Keep it 2G to comply with older win32 guests */
+#define I440FX_PCI_HOST_HOLE64_SIZE_DEFAULT (1ULL << 31)
+
 /* Older coreboot versions (4.0 and older) read a config register that doesn't
  * exist in real hardware, to get the RAM size from QEMU.
  */
@@ -238,29 +242,52 @@ static void i440fx_pcihost_get_pci_hole_end(Object *obj, 
Visitor *v,
 visit_type_uint32(v, name, &value, errp);
 }
 
+/*
+ * The 64bit PCI hole start is set by the Guest firmware
+ * as the address of the first 64bit PCI MEM resource.
+ * If no PCI device has resources on the 64bit area,
+ * the 64bit PCI hole will start after "over 4G RAM" and the
+ * reserved space for memory hotplug if any.
+ */
 static void i440fx_pcihost_get_

[Qemu-devel] [PULL 07/10] tests/acpi-test-data: update _CRS in DSDT

2017-11-15 Thread Michael S. Tsirkin
commit dadf988e81b15065ac1d6dbaf4b87b5b80c7b670
hw/pci-host: Fix x86 Host Bridges 64bit PCI hole

Added a 64 bit hole to _CRS of PCI0.
Update the expected files accordingly.

Signed-off-by: Michael S. Tsirkin 
---
 tests/acpi-test-data/pc/DSDT | Bin 5098 -> 5144 bytes
 tests/acpi-test-data/pc/DSDT.bridge  | Bin 6957 -> 7003 bytes
 tests/acpi-test-data/pc/DSDT.cphp| Bin 5561 -> 5607 bytes
 tests/acpi-test-data/pc/DSDT.ipmikcs | Bin 5170 -> 5216 bytes
 tests/acpi-test-data/pc/DSDT.memhp   | Bin 6463 -> 6509 bytes
 tests/acpi-test-data/q35/DSDT| Bin 7782 -> 7828 bytes
 tests/acpi-test-data/q35/DSDT.bridge | Bin 7799 -> 7845 bytes
 tests/acpi-test-data/q35/DSDT.cphp   | Bin 8245 -> 8291 bytes
 tests/acpi-test-data/q35/DSDT.ipmibt | Bin 7857 -> 7903 bytes
 tests/acpi-test-data/q35/DSDT.memhp  | Bin 9147 -> 9193 bytes
 10 files changed, 0 insertions(+), 0 deletions(-)

diff --git a/tests/acpi-test-data/pc/DSDT b/tests/acpi-test-data/pc/DSDT
index 
15c3135d65f168a91edfdc3471ea1d3f012a824f..99f05a502752d9dbac38fdd93f1ebb79b4564fb4
 100644
GIT binary patch
delta 98
zcmaE*K0|}cCD5uv2`1v!?+^ymL^npaU1zoXPdvIeJ~%
d3=BNX3`l?x$o~KTe?5ps0u3Pc=GWZ+*#RA&6M_H$

delta 51
zcmbQC@k*V`CDxV5j&1XHNr;c;}#CK__;uyvg<4Ih!SU
H{<8xBdNvM&

diff --git a/tests/acpi-test-data/pc/DSDT.bridge 
b/tests/acpi-test-data/pc/DSDT.bridge
index 
d38586c95bf31f0212279a2505efd8e2fd321ccc..cf23343e6402421f09da5d09f72811108fbd2661
 100644
GIT binary patch
delta 98
zcmZ2$cH4~0CD&9Awg_yJmb6N3N%

delta 51
zcmca@w$_ZxCDxV5j&1XHNr;c;}#CK__;uyvg<4Ih!SU
GocIB3sttqy

diff --git a/tests/acpi-test-data/pc/DSDT.cphp 
b/tests/acpi-test-data/pc/DSDT.cphp
index 
2dd70bf9520406c36c3684714bb714e536a28d20..c99c49f43705e99d1e0a8ba19d44145dfa63d009
 100644
GIT binary patch
delta 98
zcmdm~{al;NCD&e7}A
eW?)i<76CD5kr=)BV5j&1XHNr;c;}#CK?g3bIg_slxV5j&1XHNr;c;}#CK__;uyvbJtayI7)
H{$~dOd}0qS

diff --git a/tests/acpi-test-data/q35/DSDT b/tests/acpi-test-data/q35/DSDT
index 
a080e2ace20ce9b88d5a61078d8caa0262617eed..aa402cca667f82ed0a2dc4969508d8f6e38ad910
 100644
GIT binary patch
delta 97
zcmaE6GsTw6CDJsB}io*2FOV5j&1XHNr;c;}#CK__;uyvhG$awgA^
G{R04T6c3I7

diff --git a/tests/acpi-test-data/q35/DSDT.bridge 
b/tests/acpi-test-data/q35/DSDT.bridge
index 
31a76732e563dde32e6d976670baa732a6b91807..fc3e79c583ababf5615e76ba2f7bc3df1483abb4
 100644
GIT binary patch
delta 98
zcmexvv(%Q$CD0XqQbM(5j
b85nq&8IS-Yko_MBIFQ6ZOb~zaR@pxQS6~zP

delta 50
zcmca_yU~`*CD

[Qemu-devel] [PULL 08/10] NUMA: Enable adding NUMA node implicitly

2017-11-15 Thread Michael S. Tsirkin
From: Dou Liyang 

Linux and Windows need ACPI SRAT table to make memory hotplug work properly,
however currently QEMU doesn't create SRAT table if numa options aren't present
on CLI.

Which breaks both linux and windows guests in certain conditions:
 * Windows: won't enable memory hotplug without SRAT table at all
 * Linux: if QEMU is started with initial memory all below 4Gb and no SRAT table
   present, guest kernel will use nommu DMA ops, which breaks 32bit hw drivers
   when memory is hotplugged and guest tries to use it with that drivers.

Fix above issues by automatically creating a numa node when QEMU is started with
memory hotplug enabled but without '-numa' options on CLI.
(PS: auto-create numa node only for new machine types so not to break 
migration).

Which would provide SRAT table to guests without explicit -numa options on CLI
and would allow:
 * Windows: to enable memory hotplug
 * Linux: switch to SWIOTLB DMA ops, to bounce DMA transfers to 32bit allocated
   buffers that legacy drivers/hw can handle.

[Rewritten by Igor]

Reported-by: Thadeu Lima de Souza Cascardo 
Suggested-by: Igor Mammedov 
Signed-off-by: Dou Liyang 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Cc: Igor Mammedov 
Cc: David Hildenbrand 
Cc: Thomas Huth 
Cc: Alistair Francis 
Cc: Takao Indoh 
Cc: Izumi Taku 
Reviewed-by: Igor Mammedov 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/boards.h |  1 +
 hw/i386/pc.c|  1 +
 hw/i386/pc_piix.c   |  1 +
 hw/i386/pc_q35.c|  1 +
 numa.c  | 21 -
 vl.c|  3 +--
 6 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 62f160e..156b16f 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -197,6 +197,7 @@ struct MachineClass {
 bool ignore_memory_transaction_failures;
 int numa_mem_align_shift;
 const char **valid_cpu_types;
+bool auto_enable_numa_with_memhp;
 void (*numa_auto_assign_ram)(MachineClass *mc, NodeInfo *nodes,
  int nb_nodes, ram_addr_t size);
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index fafe5ba..c3afe5b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -2347,6 +2347,7 @@ static void pc_machine_class_init(ObjectClass *oc, void 
*data)
 mc->cpu_index_to_instance_props = pc_cpu_index_to_props;
 mc->get_default_cpu_node_id = pc_get_default_cpu_node_id;
 mc->possible_cpu_arch_ids = pc_possible_cpu_arch_ids;
+mc->auto_enable_numa_with_memhp = true;
 mc->has_hotpluggable_cpus = true;
 mc->default_boot_order = "cad";
 mc->hot_add_cpu = pc_hot_add_cpu;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index f79d5cb..5e47528 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -446,6 +446,7 @@ static void pc_i440fx_2_10_machine_options(MachineClass *m)
 m->is_default = 0;
 m->alias = NULL;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
+m->auto_enable_numa_with_memhp = false;
 }
 
 DEFINE_I440FX_MACHINE(v2_10, "pc-i440fx-2.10", NULL,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index da3ea60..d606004 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -318,6 +318,7 @@ static void pc_q35_2_10_machine_options(MachineClass *m)
 m->alias = NULL;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
+m->auto_enable_numa_with_memhp = false;
 }
 
 DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
diff --git a/numa.c b/numa.c
index 8d78d95..7151b24 100644
--- a/numa.c
+++ b/numa.c
@@ -216,6 +216,7 @@ static void parse_numa_node(MachineState *ms, 
NumaNodeOptions *node,
 }
 numa_info[nodenr].present = true;
 max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
+nb_numa_nodes++;
 }
 
 static void parse_numa_distance(NumaDistOptions *dist, Error **errp)
@@ -282,7 +283,6 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error 
**errp)
 if (err) {
 goto end;
 }
-nb_numa_nodes++;
 break;
 case NUMA_OPTIONS_TYPE_DIST:
 parse_numa_distance(&object->u.dist, &err);
@@ -433,6 +433,25 @@ void parse_numa_opts(MachineState *ms)
 exit(1);
 }
 
+/*
+ * If memory hotplug is enabled (slots > 0) but without '-numa'
+ * options explicitly on CLI, guestes will break.
+ *
+ *   Windows: won't enable memory hotplug without SRAT table at all
+ *
+ *   Linux: if QEMU is started with initial memory all below 4Gb
+ *   and no SRAT table present, guest kernel will use nommu DMA ops,
+ *   which breaks 32bit hw drivers when memory is hotplugged and
+ *   guest tries to use it with that drivers.
+ *
+ * Enable NUMA implicitly by adding a new NUMA node automatically.
+ */
+if (ms->ram_slots > 0 && nb_numa_nodes == 0 &&
+mc->auto_enable_numa_with_memhp) {
+NumaNodeOptions node = {

[Qemu-devel] [PULL 09/10] vmcoreinfo: put it in the 'misc' device category

2017-11-15 Thread Michael S. Tsirkin
From: Marc-André Lureau 

Signed-off-by: Marc-André Lureau 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/misc/vmcoreinfo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/misc/vmcoreinfo.c b/hw/misc/vmcoreinfo.c
index a618e12..31db57a 100644
--- a/hw/misc/vmcoreinfo.c
+++ b/hw/misc/vmcoreinfo.c
@@ -79,6 +79,7 @@ static void vmcoreinfo_device_class_init(ObjectClass *klass, 
void *data)
 dc->vmsd = &vmstate_vmcoreinfo;
 dc->realize = vmcoreinfo_realize;
 dc->hotpluggable = false;
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 }
 
 static const TypeInfo vmcoreinfo_device_info = {
-- 
MST




Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting reserved_region of device iommu group

2017-11-15 Thread Alex Williamson
On Wed, 15 Nov 2017 09:49:41 +
Shameerali Kolothum Thodi  wrote:

> Hi Alex,
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, November 14, 2017 3:48 PM
> > To: Zhuyijun 
> > Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
> > eric.au...@redhat.com; peter.mayd...@linaro.org; Shameerali Kolothum
> > Thodi ; Zhaoshenglong
> > 
> > Subject: Re: [Qemu-devel] [RFC 1/5] hw/vfio: Add function for getting
> > reserved_region of device iommu group
> > 
> > On Tue, 14 Nov 2017 09:15:50 +0800
> >  wrote:
> >   
> > > From: Zhu Yijun 
> > >
> > > With kernel 4.11, iommu/smmu will populate the MSI IOVA reserved
> > > window and PCI reserved window which has to be excluded from Guest iova  
> > allocations.  
> > >
> > > However, If it falls within the Qemu default virtual memory address
> > > space, then reserved regions may get allocated for a Guest VF DMA iova
> > > and it will fail.
> > >
> > > So get those reserved regions in this patch and create some holes in
> > > the Qemu ram address in next patchset.
> > >
> > > Signed-off-by: Zhu Yijun 
> > > ---
> > >  hw/vfio/common.c  | 67  
> > +++  
> > >  hw/vfio/pci.c |  2 ++
> > >  hw/vfio/platform.c|  2 ++
> > >  include/exec/memory.h |  7 +
> > >  include/hw/vfio/vfio-common.h |  3 ++
> > >  5 files changed, 81 insertions(+)  
> > 
> > I generally prefer the vfio interface to be more self sufficient, if there 
> > are
> > regions the IOMMU cannot map, we should be describing those via capabilities
> > on the container through the vfio interface.  If we're just scraping 
> > together
> > things from sysfs, the user can just as easily do that and provide an 
> > explicit
> > memory map for the VM taking the devices into account.   
> 
> Ok. I was under the impression that the purpose of introducing the 
> /sys/kernel/iommu_groups/reserved_regions was to get the IOVA regions 
> that are reserved(MSI or non-mappable) for Qemu or other apps to
> make use of.  I think this was introduced as part of the "KVM/MSI passthrough
> support on ARM" patch series. And if I remember correctly, Eric had 
> an approach where the user space can retrieve all the reserved regions through
> the VFIO_IOMMU_GET_INFO ioctl and later this idea was replaced with the 
> sysfs interface.
> 
> May be I am missing something here.

And sysfs is a good interface if the user wants to use it to configure
the VM in a way that's compatible with a device.  For instance, in your
case, a user could evaluate these reserved regions across all devices in
a system, or even across an entire cluster, and instantiate the VM with
a memory map compatible with hotplugging any of those evaluated
devices (QEMU implementation of allowing the user to do this TBD).
Having the vfio device evaluate these reserved regions only helps in
the cold-plug case.  So the proposed solution is limited in scope and
doesn't address similar needs on other platforms.  There is value to
verifying that a device's IOVA space is compatible with a VM memory map
and modifying the memory map on cold-plug or rejecting the device on
hot-plug, but isn't that why we have an ioctl within vfio to expose
information about the IOMMU?  Why take the path of allowing QEMU to
rummage through sysfs files outside of vfio, implying additional
security and access concerns, rather than filling the gap within the
vifo API?  Thanks,

Alex



[Qemu-devel] [PULL 10/10] build-sys: restrict vmcoreinfo to fw_cfg+dma capable targets

2017-11-15 Thread Michael S. Tsirkin
From: Marc-André Lureau 

vmcoreinfo is built for all targets. However, it requires fw_cfg with
DMA operations support (write operation). Restrict vmcoreinfo exposure
to architectures that are supporting FW_CFG_DMA, that is arm-virt and
x86 only atm.

Signed-off-by: Marc-André Lureau 
Reviewed-by: Thomas Huth 
Reviewed-by: Daniel Henrique Barboza 
Tested-by: Daniel Henrique Barboza 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 default-configs/arm-softmmu.mak| 2 ++
 default-configs/i386-softmmu.mak   | 1 +
 default-configs/x86_64-softmmu.mak | 1 +
 hw/misc/Makefile.objs  | 2 +-
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/default-configs/arm-softmmu.mak b/default-configs/arm-softmmu.mak
index 5059d13..d37edc4 100644
--- a/default-configs/arm-softmmu.mak
+++ b/default-configs/arm-softmmu.mak
@@ -130,3 +130,5 @@ CONFIG_SMBIOS=y
 CONFIG_ASPEED_SOC=y
 CONFIG_GPIO_KEY=y
 CONFIG_MSF2=y
+
+CONFIG_FW_CFG_DMA=y
diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index a685c43..95ac4b4 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -60,3 +60,4 @@ CONFIG_SMBIOS=y
 CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
 CONFIG_PXB=y
 CONFIG_ACPI_VMGENID=y
+CONFIG_FW_CFG_DMA=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index ea69e82..0221236 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -60,3 +60,4 @@ CONFIG_SMBIOS=y
 CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
 CONFIG_PXB=y
 CONFIG_ACPI_VMGENID=y
+CONFIG_FW_CFG_DMA=y
diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
index 19202d9..10c88a8 100644
--- a/hw/misc/Makefile.objs
+++ b/hw/misc/Makefile.objs
@@ -9,7 +9,7 @@ common-obj-$(CONFIG_PCI_TESTDEV) += pci-testdev.o
 common-obj-$(CONFIG_EDU) += edu.o
 
 common-obj-y += unimp.o
-common-obj-y += vmcoreinfo.o
+common-obj-$(CONFIG_FW_CFG_DMA) += vmcoreinfo.o
 
 obj-$(CONFIG_VMPORT) += vmport.o
 
-- 
MST




Re: [Qemu-devel] [PATCH 3/5] block: support compressed write for copy-on-read

2017-11-15 Thread Max Reitz
On 2017-11-14 11:16, Anton Nefedov wrote:
> Signed-off-by: Anton Nefedov 
> ---
>  block/io.c | 30 --
>  block/trace-events |  2 +-
>  2 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 3d5ef2c..93c6b24 100644
> --- a/block/io.c
> +++ b/block/io.c

[...]

> @@ -1209,6 +1220,13 @@ int coroutine_fn bdrv_co_preadv(BdrvChild *child,
>  return ret;
>  }
>  
> +/* write compressed only makes sense with copy on read */
> +if ((flags & BDRV_REQ_WRITE_COMPRESSED) &&
> +!(flags & BDRV_REQ_COPY_ON_READ))
> +{
> +return -EINVAL;
> +}
> +

I think the assertion in bdrv_aligned_preadv() should be enough, but
either way:

Reviewed-by: Max Reitz 

>  bdrv_inc_in_flight(bs);
>  
>  /* Don't do copy-on-read if we read data before write operation */



signature.asc
Description: OpenPGP digital signature


  1   2   >